linux-kernel.vger.kernel.org archive mirror
* objrmap and vmtruncate
@ 2003-04-04 14:34 Hugh Dickins
  2003-04-04 16:14 ` William Lee Irwin III
  2003-04-04 18:54 ` Andrew Morton
  0 siblings, 2 replies; 105+ messages in thread
From: Hugh Dickins @ 2003-04-04 14:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave McCracken, linux-kernel, linux-mm

I see you're going for locking the page around page_convert_anon,
to guard page->mapping against truncation.  Nice thought,
but the words "tip" and "iceberg" spring to mind.

Truncating a sys_remap_file_pages file?  You're the first to
begin to consider such an absurd possibility: vmtruncate_list
still believes vm_pgoff tells it what needs to be done.

I propose that we don't change vmtruncate_list, zap_page_range, ...
at all for this: let it unmap inappropriate pages, even from a
VM_LOCKED vma, that's just a price userspace pays for the
privilege of truncating a sys_remap_file_pages file.

But truncate_inode_pages should check page_mapped, and if so
try_to_unmap with a force flag to attack even VM_LOCKED vmas.
Sadly, if page_table_lock is held, it won't be able to unmap:
leave those for shrink_list?  But that won't find them once
page->mapping gone: page_convert_anon from here too?
What about invalidate_inode_pages2?
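
Roughly the shape I have in mind, as a sketch only (the "force"
argument to try_to_unmap does not exist today, and the call site is
paraphrased from memory rather than quoted):

        /*
         * Hypothetical sketch: inside truncate_inode_pages(), with the
         * page locked, before it is removed from the page cache.
         * "force" would tell try_to_unmap() to attack even VM_LOCKED
         * vmas.
         */
        if (page_mapped(page)) {
                pte_chain_lock(page);
                if (try_to_unmap(page, 1 /* force */) != SWAP_SUCCESS) {
                        /*
                         * Couldn't unmap (e.g. page_table_lock held
                         * elsewhere): leave the page for shrink_list,
                         * or page_convert_anon it here so rmap can
                         * still find it once ->mapping is gone.
                         */
                }
                pte_chain_unlock(page);
        }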

This will also cover some of the racy pages, which another cpu
found in the cache before vmtruncate started, but inserted into
page table after vmtruncate_list passed that way; but it won't
cover those racy pages which were found before, but are not yet
put into the page table (e.g. those where your page_convert_anon
bailed because page->mapping is now NULL).  Worth adding checks
for? But I don't think we have absolute locking against this.

Various places in rmap.c where !page->mapping is considered a
BUG(), but you've now drawn attention to the fact it may get
vmtruncated at any moment.  Easy to remove those BUG()s.

Consider page_add_rmap of page with NULL (or swapper_space)
mapping as Anon?  In which case move all the SetPageAnon stuff
inside rmap.c, and do ClearPageAnon inside there too?
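
Sketched, if we went that way (hand-waved, just to show the shape):

        /*
         * In page_add_rmap(): treat a page with no file mapping, or one
         * already in the swap cache, as anonymous instead of BUG()ing.
         */
        if (!page->mapping || PageSwapCache(page))
                SetPageAnon(page);

with the matching ClearPageAnon done in page_remove_rmap() when the
last pte goes away.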

Or, stop resetting page->mapping to NULL when we remove from
page cache?  So objrmap can still find the pages even though
find_get_page etc. cannot.

Sorry, off to replace my "?" key, it's worn out.

Hugh



* Re: objrmap and vmtruncate
  2003-04-04 14:34 objrmap and vmtruncate Hugh Dickins
@ 2003-04-04 16:14 ` William Lee Irwin III
  2003-04-04 16:29   ` Hugh Dickins
  2003-04-04 18:54 ` Andrew Morton
  1 sibling, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-04 16:14 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Dave McCracken, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 03:34:48PM +0100, Hugh Dickins wrote:
> I see you're going for locking the page around page_convert_anon,
> to guard page->mapping against truncation.  Nice thought,
> but the words "tip" and "iceberg" spring to mind.
> Truncating a sys_remap_file_pages file?  You're the first to
> begin to consider such an absurd possibility: vmtruncate_list
> still believes vm_pgoff tells it what needs to be done.
> I propose that we don't change vmtruncate_list, zap_page_range, ...
> at all for this: let it unmap inappropriate pages, even from a
> VM_LOCKED vma, that's just a price userspace pays for the
> privilege of truncating a sys_remap_file_pages file.

Hmm, aren't the file offset calculations wrong for sys_remap_file_pages()
even before objrmap?


-- wli


* Re: objrmap and vmtruncate
  2003-04-04 16:14 ` William Lee Irwin III
@ 2003-04-04 16:29   ` Hugh Dickins
  0 siblings, 0 replies; 105+ messages in thread
From: Hugh Dickins @ 2003-04-04 16:29 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Dave McCracken, linux-kernel, linux-mm

On Fri, 4 Apr 2003, William Lee Irwin III wrote:
> 
> Hmm, aren't the file offset calculations wrong for sys_remap_file_pages()
> even before objrmap?

Yes - objrmap merely makes it difficult to find the missed pages later on.



* Re: objrmap and vmtruncate
  2003-04-04 14:34 objrmap and vmtruncate Hugh Dickins
  2003-04-04 16:14 ` William Lee Irwin III
@ 2003-04-04 18:54 ` Andrew Morton
  2003-04-04 21:43   ` Hugh Dickins
  2003-04-04 21:45   ` Andrea Arcangeli
  1 sibling, 2 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-04 18:54 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: dmccr, linux-kernel, linux-mm

Hugh Dickins <hugh@veritas.com> wrote:
>
> Truncating a sys_remap_file_pages file?  You're the first to
> begin to consider such an absurd possibility: vmtruncate_list
> still believes vm_pgoff tells it what needs to be done.

Well I knew mincore() was bust for nonlinear mappings.  Never thought about
truncate.

> I propose that we don't change vmtruncate_list, zap_page_range, ...
> at all for this: let it unmap inappropriate pages, even from a
> VM_LOCKED vma, that's just a price userspace pays for the
> privilege of truncating a sys_remap_file_pages file.
> 
> But truncate_inode_pages should check page_mapped, and if so
> try_to_unmap with a force flag to attack even VM_LOCKED vmas.
> Sadly, if page_table_lock is held, it won't be able to unmap:
> leave those for shrink_list?  But that won't find them once
> page->mapping gone: page_convert_anon from here too?
> What about invalidate_inode_pages2?
> 
> This will also cover some of the racy pages, which another cpu
> found in the cache before vmtruncate started, but inserted into
> page table after vmtruncate_list passed that way; but it won't
> cover those racy pages which were found before, but are not yet
> put into the page table (e.g. those where your page_convert_anon
> bailed because page->mapping is now NULL).  Worth adding checks
> for? But I don't think we have absolute locking against this.

How about we just don't do the SIGBUS thing at all for nonlinear mappings? 
Any pages outside i_size which are mapped into a nonlinear mapping become
anonymous.

We'd need vm_flags:VM_NONLINEAR.
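
Sketch of what I mean (VM_NONLINEAR doesn't exist yet, and the exact
spot in the nopage/truncate path is hand-waved):

        /* hypothetical new vma flag, set by sys_remap_file_pages() */
        #define VM_NONLINEAR    0x00800000      /* whichever vm_flags bit is free */

        /*
         * Where we would normally unmap, or SIGBUS on, a page that is
         * now beyond i_size:
         */
        if (vma->vm_flags & VM_NONLINEAR) {
                /* leave it mapped; it simply becomes anonymous */
                SetPageAnon(page);
        } else {
                /* linear vma: zap the pte / raise SIGBUS as today */
        }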

> Various places in rmap.c where !page->mapping is considered a
> BUG(), but you've now drawn attention to the fact it may get
> vmtruncated at any moment.  Easy to remove those BUG()s.

Well not really.  page_referenced_obj() is racy wrt truncate and will deref
null.  We're back to locking the pages in refill_inactive_zone().  There is
no other way of stabilising ->mapping.

Probably a trylock in page_referenced_obj() would suit.
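
Something like this, say (sketch; page_referenced_obj_locked() is a
made-up name for the existing i_mmap walk):

        /* in page_referenced_obj(), instead of BUG() on !page->mapping: */
        if (TestSetPageLocked(page))
                return 1;       /* can't stabilise ->mapping, call it referenced */
        if (page->mapping)
                referenced = page_referenced_obj_locked(page);
        unlock_page(page);
        return referenced;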

btw,
        if (PageSwapCache(page))
                BUG();

is that safe against your weird tmpfs address_space swizzling?



* Re: objrmap and vmtruncate
  2003-04-04 18:54 ` Andrew Morton
@ 2003-04-04 21:43   ` Hugh Dickins
  2003-04-04 21:45   ` Andrea Arcangeli
  1 sibling, 0 replies; 105+ messages in thread
From: Hugh Dickins @ 2003-04-04 21:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dmccr, linux-kernel, linux-mm

On Fri, 4 Apr 2003, Andrew Morton wrote:
> 
> How about we just don't do the SIGBUS thing at all for nonlinear mappings? 
> Any pages outside i_size which are mapped into a nonlinear mapping become
> anonymous.
> 
> We'd need vm_flags:VM_NONLINEAR.

That's an idea, it sounds plausible, but I'll need to think more.

I'm not convinced there won't be difficulties around the corner
going that way.  For example, it's difficult to do sensible page
accounting if the vma is shared writable, but parts of it can go
private without warning.  It actually introduces a new category
of page (or perhaps legitimizes what already exists as a rare,
forgotten category of outlaw page).

Also (sob, sob) that's a little inconvenient for anonymous objrmap
(such pages may be shared outside of the anonmm).  Neither of which
rules out the idea, but they do hint that it might prove awkward in
other ways too.

> > Various places in rmap.c where !page->mapping is considered a
> > BUG(), but you've now drawn attention to the fact it may get
> > vmtruncated at any moment.  Easy to remove those BUG()s.
> 
> Well not really.  page_referenced_obj() is racy wrt truncate and will deref
> null.  We're back to locking the pages in refill_inactive_zone().  There is
> no other way of stabilising ->mapping.
> 
> Probably a trylock in page_referenced_obj() would suit.

I didn't get you at first, but now I see it.  Shame it's taken us
so long to notice that.  I think there is another way, but it's not
necessarily preferable: I suggested before that truncate_inode_pages
should forcibly try_to_unmap if it sees a page_mapped page (either
from sys_remap_file_pages or racing nopage) - for that it would
have to take the pte_chain_lock, wouldn't that give the required
serialization against page_referenced_obj?

> btw,
>         if (PageSwapCache(page))
>                 BUG();
> 
> is that safe against your weird tmpfs address_space swizzling?

Yes, it's safe against my weird swizzling, because it's against the
rules for a tmpfs page to have swap identity while it's mapped into
an mm - BUG_ON(page_mapped(page)) in shmem_writepage.

But I don't think it's safe against truncation nulling page->mapping,
then shrink_list doing add_to_swap later.  Probably a SetPageAnon in
add_to_swap would fix all rmap.c's PageSwapCache BUG()s.
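
i.e. something like this in add_to_swap(), sketch only (the surrounding
lines are paraphrased, not quoted):

        /*
         * Once the page has swap identity, rmap.c should treat it as
         * anon even though truncation may already have cleared its
         * ->mapping.
         */
        if (!add_to_swap_cache(page, entry)) {
                SetPageAnon(page);      /* the proposed addition */
                SetPageUptodate(page);
                SetPageDirty(page);
                return 1;
        }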

Hugh



* Re: objrmap and vmtruncate
  2003-04-04 18:54 ` Andrew Morton
  2003-04-04 21:43   ` Hugh Dickins
@ 2003-04-04 21:45   ` Andrea Arcangeli
  2003-04-04 21:58     ` Benjamin LaHaise
  2003-04-04 23:07     ` Andrew Morton
  1 sibling, 2 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-04 21:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 10:54:17AM -0800, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> >
> > Truncating a sys_remap_file_pages file?  You're the first to
> > begin to consider such an absurd possibility: vmtruncate_list
> > still believes vm_pgoff tells it what needs to be done.
> 
> Well I knew mincore() was bust for nonlinear mappings.  Never thought about
> truncate.

IMHO sys_remap_file_pages and the nonlinear mapping are a hack and
should be dropped eventually. I mean it's not too bad but it's a mere
workaround for:

1) lack of 64bit address space that will be fixed
2) lack of O(log(N)) mmap, that will be fixed too

1) and 2) are the only reasons why there's huge interest in such a syscall
right now. So I don't like it too much and I'm not convinced it was
right to merge it in 2.5 given 2) is a software problem and I have the
design to fix it with an rbtree extension, and 1) is a hardware problem
that will be fixed very soon. The API is not too bad but there is a
reason we have the vma for all other mappings.

Maybe I'm missing something, I'm curious to hear what you think and what
other cases need this syscall even after 1) and 2) are fixed.

Andrea


* Re: objrmap and vmtruncate
  2003-04-04 21:45   ` Andrea Arcangeli
@ 2003-04-04 21:58     ` Benjamin LaHaise
  2003-04-04 23:07     ` Andrew Morton
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin LaHaise @ 2003-04-04 21:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Hugh Dickins, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 11:45:47PM +0200, Andrea Arcangeli wrote:
> Maybe I'm missing something, I'm curious to hear what you think and what
> other cases need this syscall even after 1) and 2) are fixed.

It's useful for UML and emulators that simulate page tables too.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>


* Re: objrmap and vmtruncate
  2003-04-04 21:45   ` Andrea Arcangeli
  2003-04-04 21:58     ` Benjamin LaHaise
@ 2003-04-04 23:07     ` Andrew Morton
  2003-04-05  0:03       ` Andrea Arcangeli
  2003-04-05  3:53       ` Rik van Riel
  1 sibling, 2 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-04 23:07 UTC (permalink / raw)
  To: Andrea Arcangeli, Ingo Molnar; +Cc: hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Fri, Apr 04, 2003 at 10:54:17AM -0800, Andrew Morton wrote:
> > Hugh Dickins <hugh@veritas.com> wrote:
> > >
> > > Truncating a sys_remap_file_pages file?  You're the first to
> > > begin to consider such an absurd possibility: vmtruncate_list
> > > still believes vm_pgoff tells it what needs to be done.
> > 
> > Well I knew mincore() was bust for nonlinear mappings.  Never thought about
> > truncate.
> 
> IMHO sys_remap_file_pages and the nonlinear mapping are a hack and
> should be dropped eventually.

It has created exceptional situations which are rather tying our hands in
other areas.

> I mean it's not too bad but it's a mere
> workaround for:
> 
> 1) lack of 64bit address space that will be fixed
> 2) lack of O(log(N)) mmap, that will be fixed too

Yes, mmap() overhead due to the linear search, VMA space consumption,
additional TLB invalidations and additional faults.  The latter could be
fixed up via MAP_PREFAULT and are independent of nonlinearity.

Here's Ingo's original summary:

- really complex remappings (used by databases or virtualizing
  applications) create a *huge* amount of vmas - and vma's are per-process
  which puts a really big load on kernel memory allocations, especially on
  32-bit systems. I've seen applications that had a mapping setup that
  generated 128 *thousand* vmas per process, causing lots of problems.

- setting up separate mappings is expensive, causes one pagefault per page
  and also causes TLB flushes.

- even on 64-bit systems, when mapping really large (terabyte size) and
  really sparse files, sparse mappings can be a disadvantage - in the
  worst-case there can be as much as 1 more pagetable page allocated for
  every file page that is mapped in.

> 1) and 2) are the only reasons why there's huge interest in such a syscall
> right now. So I don't like it too much and I'm not convinced it was
> right to merge it in 2.5 given 2) is a software problem and I have the
> design to fix it with an rbtree extension, and 1) is a hardware problem
> that will be fixed very soon. The API is not too bad but there is a
> reason we have the vma for all other mappings.
> 
> Maybe I'm missing something, I'm curious to hear what you think and what
> other cases need this syscall even after 1) and 2) are fixed.

I think that's right - the system call is very specialised and is targeted at
solving problems which have been encountered in a small number of
applications, but important ones.

Right now, I do not feel that we are going to be able to come up with an
acceptably simple VM which has both nonlinear mappings and objrmap.



* Re: objrmap and vmtruncate
  2003-04-04 23:07     ` Andrew Morton
@ 2003-04-05  0:03       ` Andrea Arcangeli
  2003-04-05  0:31         ` Andrew Morton
  2003-04-05  3:53       ` Rik van Riel
  1 sibling, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05  0:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, hugh, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 03:07:44PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Fri, Apr 04, 2003 at 10:54:17AM -0800, Andrew Morton wrote:
> > > Hugh Dickins <hugh@veritas.com> wrote:
> > > >
> > > > Truncating a sys_remap_file_pages file?  You're the first to
> > > > begin to consider such an absurd possibility: vmtruncate_list
> > > > still believes vm_pgoff tells it what needs to be done.
> > > 
> > > Well I knew mincore() was bust for nonlinear mappings.  Never thought about
> > > truncate.
> > 
> > IMHO sys_remap_file_pages and the nonlinear mapping are a hack and
> > should be dropped eventually.
> 
> It has created exceptional situations which are rather tying our hands in
> other areas.
> 
> > I mean it's not too bad but it's a mere
> > workaround for:
> > 
> > 1) lack of 64bit address space that will be fixed
> > 2) lack of O(log(N)) mmap, that will be fixed too
> 
> Yes, mmap() overhead due to the linear search, VMA space consumption,
> additional TLB invalidations and additional faults.  The latter could be
> fixed up via MAP_PREFAULT and are independent of nonlinearity.
> 
> Here's Ingo's original summary:
> 
> - really complex remappings (used by databases or virtualizing
>   applications) create a *huge* amount of vmas - and vma's are per-process
>   which puts a really big load on kernel memory allocations, especially on
>   32-bit systems. I've seen applications that had a mapping setup that
>   generated 128 *thousand* vmas per process, causing lots of problems.

the current max map count is 64k, so I'm not sure how he can have seen
128k; I have to assume the kernel was hacked for it.

> 
> - setting up separate mappings is expensive, causes one pagefault per page
>   and also causes TLB flushes.
> 
> - even on 64-bit systems, when mapping really large (terabyte size) and
>   really sparse files, sparse mappings can be a disadvantage - in the
>   worst-case there can be as much as 1 more pagetable page allocated for
					      ^^^^^^^^^
>   every file page that is mapped in.

he certainly means 1 vma, not 1 pagetable.

> > 1) and 2) are the only reasons why there's huge interest in such a syscall
> > right now. So I don't like it too much and I'm not convinced it was
> > right to merge it in 2.5 given 2) is a software problem and I have the
> > design to fix it with an rbtree extension, and 1) is a hardware problem
> > that will be fixed very soon. The API is not too bad but there is a
> > reason we have the vma for all other mappings.
> > 
> > Maybe I'm missing something, I'm curious to hear what you think and what
> > other cases need this syscall even after 1) and 2) are fixed.
> 
> I think that's right - the system call is very specialised and is targeted at
> solving problems which have been encountered in a small number of
> applications, but important ones.
> 
> Right now, I do not feel that we are going to be able to come up with an
> acceptably simple VM which has both nonlinear mappings and objrmap.

That's basically my point: if you can afford to allocate the regular
rmap, then you could afford to allocate the vma.  And while some of
those apps are important, the important ones are solved by the 64bit
address space.  The others would better be fixed so that they don't eat
all those vmas.  I'm also aware of one single critical app that uses
thousands of vmas not for the same purpose as the ones that will
definitely be fixed by the 64bit address space.  But that important app
is hurt badly by get_unmapped_area only, not at all by the rest of the
vma load and rbtree lookups.  And with remap_file_pages you _lose_ the
get_unmapped_area feature completely.  If you are ok with
remap_file_pages, then you could as well use MAP_FIXED and your mmap
would just run in O(log(N)), dropping completely the get_unmapped_area
load that is the only real offender.  The problem is they don't want
that, they use get_unmapped_area today, and what they want IMHO is my
new design for the O(log(N)) mmap w/o MAP_FIXED and w/o addr-hint,
_not_ remap_file_pages.  So they can still use get_unmapped_area and
it'll run as fast as MAP_FIXED: we could completely avoid any
additional lookup on the rbtree and just use the single lookup
checkpoint that MAP_FIXED is already using, to verify and insert in a
single walk of the tree.  This is technically doable as far as I can
tell, and I'm going to implement it.
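
To make the single-walk idea concrete, here is a tiny standalone sketch
(not kernel code, all names made up) of the kind of augmentation I mean:
each node remembers the largest free gap anywhere in its subtree, so one
descent both finds a big enough hole and the place to insert:

#include <stddef.h>

struct area {
        unsigned long start, end;       /* this mapping */
        unsigned long gap_before;       /* hole between the previous mapping and this one */
        unsigned long max_gap;          /* largest hole anywhere in this subtree */
        struct area *left, *right;
};

/*
 * Find the lowest mapping whose preceding hole is at least "len" bytes,
 * descending only into subtrees whose max_gap says such a hole exists.
 * One O(log(N)) walk, the same walk an insert would do.
 */
static struct area *find_gap(struct area *node, unsigned long len)
{
        while (node) {
                if (node->left && node->left->max_gap >= len) {
                        node = node->left;      /* a low enough hole exists on the left */
                        continue;
                }
                if (node->gap_before >= len)
                        return node;            /* the hole just below this mapping fits */
                if (node->right && node->right->max_gap >= len)
                        node = node->right;
                else
                        node = NULL;            /* no hole of that size in this subtree */
        }
        return NULL;
}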

So I definitely vote to drop remap_file_pages.  I don't have the
O(log(N)) get_unmapped_area ready right now, that will be tricky, but
it's definitely doable and it's the next thing I'll work on for 2.5 on
my part.  In the meantime people could use MAP_FIXED (or at least the
hint); if they can't use MAP_FIXED they can't use remap_file_pages
either, period.  In fact using MAP_FIXED is an order of magnitude simpler.

the worst part IMHO is that it screws up the vma making the vma->vm_file
totally wrong for the pages in the vma. The only way to keep it is:

1) to allow it only in a special vma called VM_NONLINEAR allocated via
   mmap previously

2) such a magic vma will have a null vm_file and it will be
   totally ignored by the VM

3) the remap_file_pages would then need to be enabled via a sysctl for
   security reasons (can pin indefinite amounts of ram)

4) no issue with sigbus

5) if a truncate happens in a file, just leave the page mapped, the
   reference count will leave it allocated and outside the pagecache,
   and the possibly dirty page will be freed during the zap_page_range
   of the munmap done on the VM_NONLINEAR vma

In the current form it seems totally broken, and at the very least the
above should be done to fix it, but I vote to drop it entirely since it
isn't worth the complexity IMHO.

For my part I will do all I can to make 64bit run so much faster
than any remap_file_pages setup that people won't want to use
remap_file_pages anyway.

Andrea


* Re: objrmap and vmtruncate
  2003-04-05  0:03       ` Andrea Arcangeli
@ 2003-04-05  0:31         ` Andrew Morton
  2003-04-05  1:31           ` Andrea Arcangeli
  2003-04-05  2:13           ` Martin J. Bligh
  0 siblings, 2 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-05  0:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> the worst part IMHO is that it screws up the vma making the vma->vm_file
> totally wrong for the pages in the vma.

Not sure what you mean here.  All pages in the vma are backed by the file at
vm_file.  It is vm_pgoff which is meaningless.

As for your other concerns: yes, I hear you.  I suspect something will have
to give.  Ingo has a better feel for the problems which this code is solving
and hopefully he can comment.

Perhaps it is useful to itemise the problems which we're trying to solve here:

- ZONE_NORMAL consumption by pte_chains

  Solved by objrmap and presumably page clustering.

- ZONE_NORMAL consumption by VMAs

  Solved by remap_file_pages.  Neither objrmap nor page clustering will
  help here.

- pte_chain setup and teardown CPU cost.

  objrmap does not seem to help.  Page clustering might, but is unlikely to
  be enabled on the machines which actually care about the overhead.

- get_unmapped_area() search complexity.

  Solved by remap_file_pages and by as-yet unimplemented algorithmic rework.

- pagefault frequency and TLB invalidation cost.

  Solved by MAP_POPULATE, could also be solved by MAP_PREFAULT (but it's
  not really a demonstrated problem).

Anything else?


So looking at the above, remap_file_pages() actually has pretty good
coverage.
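
(For the MAP_POPULATE item, the userspace side is nothing more than an
extra mmap flag.  Illustrative fragment only, error handling kept
minimal:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Prefault the whole mapping at mmap() time so that first touches take
 * no minor faults (one of the costs itemised above). */
int main(int argc, char **argv)
{
        struct stat st;
        int fd;
        char *p;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;
        p = mmap(NULL, st.st_size, PROT_READ,
                 MAP_SHARED | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        /* the page tables are already filled in; this read takes no fault */
        return p[0] == 0;
}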



* Re: objrmap and vmtruncate
  2003-04-05  0:31         ` Andrew Morton
@ 2003-04-05  1:31           ` Andrea Arcangeli
  2003-04-05  1:52             ` Benjamin LaHaise
  2003-04-05  2:06             ` Andrew Morton
  2003-04-05  2:13           ` Martin J. Bligh
  1 sibling, 2 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05  1:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, hugh, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 04:31:54PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > the worst part IMHO is that it screws up the vma making the vma->vm_file
> > totally wrong for the pages in the vma.
> 
> Not sure what you mean here.  All pages in the vma are backed by the file at
> vm_file.  It is vm_pgoff which is meaningless.

vm_pgoff, sure. But thanks for pointing this out, that's one more reason
to change the API so you can map from different files as well.  Quite
frankly I thought that was already the case; I thought it was the most
useful thing for the non-workaround 32bit case.  Some apps even have
different segments of shm.  If you've got 64bit you can do a huge mmap
of a huge file and touch only what you need.  You don't need to mmap
sparse, you just need to mmap it all and touch it sparsely.  In light of
this the current API sounds totally useless on a 64bit arch, not just
for the most important apps: it basically only saves address space, not
vmas.  You can save vmas if you can map more than one file, as I
expected.  So IMHO an fd should be passed to the syscall.  I think the
important app using get_unmapped_area is using different files.  If it's
not using different files, 64bit can't get any advantage from
remap_file_pages anyway; in fact remap_file_pages is a wasteful
additional syscall on 64bit archs, where saving address space is
worthless.

> As for your other concerns: yes, I hear you.  I suspect something will have
> to give.  Ingo has a better feel for the problems which this code is solving
> and hopefully he can comment.
> 
> Perhaps it is useful to itemise the problems which we're trying to solve here:
> 
> - ZONE_NORMAL consumption by pte_chains
> 
>   Solved by objrmap and presumably page clustering.

yes, but this has nothing to do with remap_file_pages IMHO.

> - ZONE_NORMAL consumption by VMAs
> 
>   Solved by remap_file_pages.  Neither objrmap nor page clustering will
>   help here.

This is not significant for 64bit apps, and even today with my tree you
can do as much as 32G of shm for a database on a 32bit arch, with
all the shm in bigpages.  (Note: the original bigpages patch was buggy
and crashed over 4G of bigpages, and I debugged and fixed it.)

But of course it'll run faster w/o any vma at all.  The point, though,
is that with that amount of tlb and pagetable thrashing it isn't going
to work too well anyway, and 64G has other problems that may be
addressed in 2.7.  (And during 2.7 we'll do the softpagesize more to
decrease the frequency of page faults for 64bit archs than for 32bit
archs.)

> - pte_chain setup and teardown CPU cost.
> 
>   objrmap does not seem to help.  Page clustering might, but is unlikely to
>   be enabled on the machines which actually care about the overhead.

Again, this has nothing to do with this.

Basically you're saying "we benchmark with the database, and running the
database slowly or failing with oom is a showstopper, so rather than
dropping the rmap waste we just work around rmap so it won't hurt in the
critical app".  So what? And all the other poor apps using mmap? What
are you doing for them? Still let them suffer from rmap? Are you going
to port all of userspace to nonlinear vmas so you won't suffer from rmap?

I mean, I can't buy rmap-related arguments about remap_file_pages in any
way.  rmap (note: not objrmap) is a waste, and those apps just show it
more.  Claiming that you need remap_file_pages to hide the rmap waste
sounds pointless to me.

> - get_unmapped_area() search complexity.
> 
>   Solved by remap_file_pages and by as-yet unimplemented algorithmic rework.

what is this "yet unimplemented algorithmic rework". I never heard of
it.  It sounds like reinventing the wheel and making it fast for
remap_file_pages and not for mmap.  My future get_unmapped_area will work
for regular mmap, not just for remap_file_pages.

I think remap_file_pages, if it stays, should obey all the rules in my
previous email plus two new ones that I forgot:

6) pass a file descriptor so it makes some minor sense on a 64bit arch
7) all MM syscalls but mmap and munmap must return -EINVAL if they
   deal with a VM_NONLINEAR vma

 7) was implicit of course.

If remap_file_pages stays, it has to be a kind of bypass to set up
pagetables from userspace with the vm paging disabled in the kernel.
Only mmap and munmap must work on the nonlinear vma; all other
operations must refuse to work, returning -EINVAL.  Nothing more.  Any
attempt to make it smart sounds like overdesign.  The "yet unimplemented
algorithmic rework" has to be in userspace, not kernel space, so make it
a userspace library if anything.  The whole point of remap_file_pages is
to let userspace control the pagetables, so it may as well manage
them completely; that should be faster too by avoiding entering the
kernel.  Making life easier for the remap_file_pages users doesn't make
sense to me.

> - pagefault frequency and TLB invalidation cost.
> 
>   Solved by MAP_POPULATE, could also be solved by MAP_PREFAULT (but it's
>   not really a demonstrated problem).

The real issues are with shmfs, and with shmfs we need largepages
anyway. Largepages do prefaulting by design ;)

Also consider this significant factor: the larger the shmfs the smaller
the nonlinear 1G window will be and the higher the thrashing. With 32G of
bigpages the remap_file_pages will thrash like crazy, generating an order
of magnitude more "window misses". I mean 32bit are just pushed at
the limit today regardless of the lack of remap_file_pages. For example, if
you don't use largepages, going past 16G of shm is going to be detrimental.
The cost of the mmap doesn't sound like the showstopper.

> Anything else?
> 
> 
> So looking at the above, remap_file_pages() actually has pretty good
> coverage.

If it is changed according to my 7 points then I can live with it, but I
still vote for removing it.  I know perfectly well where it is needed on
the 32bit archs today, it will be visible in the benchmarks, but it
shouldn't make an order of magnitude of difference and it's unlikely you
will get a huge benefit in the 32G case because of the tlb thrashing.
In fact I suspect largepages make a difference there because they walk
only 2 levels too, not only because of the larger tlb.  Here are all the
points for clarity:


1) to allow remap_file_pages to work only inside a special vma called
   VM_NONLINEAR allocated via mmap previously

2) such a magic vma will have a null vm_file and null vm_pgoff and it
   will be totally ignored by the VM paging algorithm

3) the remap_file_pages would then need to be enabled via a sysctl for
   security reasons (can pin indefinite amounts of ram)
   (both mmap(VM_NONLINEAR) and remap_file_pages have to return -EPERM
   or -EINVAL at your option, if the sysctl isn't set to 1)

4) no issue with sigbus, all ram will be pinned hard in the pagetables

5) if a truncate happens in a file, just leave the page mapped, the
   reference count will leave it allocated and outside the pagecache,
   and the possibly dirty page will be freed during the zap_page_range
   of the munmap done on the VM_NONLINEAR vma

6) pass a file descriptor to remap_file_pages too, so it makes some
   minor sense on a 64bit arch

7) all MM syscalls but mmap and munmap, must return -EINVAL if they're
   dealing in any way with a VM_NONLINEAR vma 


you could try to avoid the need of the sysctl by teaching the vm to
unmap such a vma, but I don't think it's worth it and I'm sure those apps
prefer to have the stuff pinned anyway w/o the risk of sigbus and w/o
the need of mlock and it looks cleaner to me to avoid any mess with the
vm and long term nobody will care about this sysctl since 64bit will run
so much faster w/o any remap_file_pages and tlb flushes running at all

If you're ok with the above 7 points then it will look like a sane
speedup hack, even if it's still not worthwhile IMHO.  I'm not going to
implement the 7 points; if it were me having to do the work I would
delete it for now, and I would finish the O(log(N)) mmap before
thinking about reintroducing remap_file_pages.

Andrea


* Re: objrmap and vmtruncate
  2003-04-05  1:31           ` Andrea Arcangeli
@ 2003-04-05  1:52             ` Benjamin LaHaise
  2003-04-05  2:22               ` Andrea Arcangeli
  2003-04-05  2:06             ` Andrew Morton
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin LaHaise @ 2003-04-05  1:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 03:31:43AM +0200, Andrea Arcangeli wrote:
> Also consider this significant factor: the larger the shmfs the smaller
> the nonlinear 1G window will be and the higher the thrashing. With 32G of
> bigpages the remap_file_pages will thrash like crazy, generating an order
> of magnitude more "window misses". I mean 32bit are just pushed at
> the limit today regardless of the lack of remap_file_pages. For example, if
> you don't use largepages, going past 16G of shm is going to be detrimental.
> The cost of the mmap doesn't sound like the showstopper.

You're guessing here.  At least for Oracle, that behaviour is dependent on
the locality of accesses.  Given that each user has their own process you 
can bet there is a fair amount of locality to their transactions.

> you could try to avoid the need of the sysctl by teaching the vm to
> unmap such a vma, but I don't think it's worth it and I'm sure those apps
> prefer to have the stuff pinned anyway w/o the risk of sigbus and w/o
> the need of mlock and it looks cleaner to me to avoid any mess with the
> vm and long term nobody will care about this sysctl since 64bit will run
> so much faster w/o any remap_file_pages and tlb flushes running at all

It is still useful for things outside of the pure databases on 32 bits 
realm.  Consider a fast bochs running 32 bit apps on a 64 bit machine -- 
should it have to deal with the overhead of zillions of vmas for emulating 
page tables?

If anything, I think we should be moving in the direction of doing more 
along the lines of remap_file_pages: things like executables might as well 
keep their state in page tables since we never discard them and instead 
toss the vma out the window.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>


* Re: objrmap and vmtruncate
  2003-04-05  1:31           ` Andrea Arcangeli
  2003-04-05  1:52             ` Benjamin LaHaise
@ 2003-04-05  2:06             ` Andrew Morton
  2003-04-05  2:24               ` Andrea Arcangeli
  1 sibling, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-05  2:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> > - get_unmapped_area() search complexity.
> > 
> >   Solved by remap_file_pages and by as-yet unimplemented algorithmic rework.
> 
> what is this "yet unimplemented algorithmic rework".

I was referring to your planned mmap speedup.  I should have said 'or', not
'and'.



* Re: objrmap and vmtruncate
  2003-04-05  0:31         ` Andrew Morton
  2003-04-05  1:31           ` Andrea Arcangeli
@ 2003-04-05  2:13           ` Martin J. Bligh
  2003-04-05  2:44             ` Andrea Arcangeli
  2003-04-05  3:22             ` Andrew Morton
  1 sibling, 2 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05  2:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: mingo, hugh, dmccr, linux-kernel, linux-mm

> Perhaps it is useful to itemise the problems which we're trying to solve here:
> 
> - ZONE_NORMAL consumption by pte_chains
> 
>   Solved by objrmap and presumably page clustering.
> 
> - ZONE_NORMAL consumption by VMAs
> 
>   Solved by remap_file_pages.  Neither objrmap nor page clustering will
>   help here.

I'm not convinced that we can't do something with nonlinear mappings for
this ... we just need to keep a list of linear areas within the nonlinear
vmas, and use that to do the objrmap stuff with. Dave and I talked about
this yesterday ... we both had different terminology, but I think the
same underlying fundamental concept ... I was calling them "sub-vmas"
for each linear region within the nonlinear space. 

The fundamental problem I came to (and I think Dave had the same problem) 
is that I couldn't see what problem remap_file_pages was trying to solve,
so it was tricky to see if we'd cause the same thing or not. sub-vmas
could certainly be a lot smaller, but we weren't thinking of 128K of the
damned things, so ... the other thing is of course the setup and teardown
time ... but there could be a btree or something for the structure.

Of course, if we did this, it would get rid of the whole conversion
to and from object based stuff ;-) I think Dave had some other bright
idea on this too, but I don't recall what it was ;-(

> - pte_chain setup and teardown CPU cost.
> 
>   objrmap does not seem to help.  Page clustering might, but is unlikely to
>   be enabled on the machines which actually care about the overhead.

eh? Not sure what you mean by that. It helped massively ...
diffprofile from kernbench showed:

     -4666   -74.9% page_add_rmap
    -10666   -92.0% page_remove_rmap

I'd say that about an 85% reduction in cost is pretty damned fine ;-)
And that was about a 20% overall reduction in the system time for the
test too ... that was all for partial objrmap (file backed, not anon).

M.


* Re: objrmap and vmtruncate
  2003-04-05  1:52             ` Benjamin LaHaise
@ 2003-04-05  2:22               ` Andrea Arcangeli
  2003-04-05 10:01                 ` Jamie Lokier
  0 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05  2:22 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, mingo, hugh, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 08:52:48PM -0500, Benjamin LaHaise wrote:
> On Sat, Apr 05, 2003 at 03:31:43AM +0200, Andrea Arcangeli wrote:
> > Also consider this significant factor: the larger the shmfs the smaller
> > the nonlinear 1G window will be and the higher the thrashing. With 32G of
> > bigpages the remap_file_pages will thrash like crazy, generating an order
> > of magnitude more "window misses". I mean 32bit are just pushed at
> > the limit today regardless of the lack of remap_file_pages. For example, if
> > you don't use largepages, going past 16G of shm is going to be detrimental.
> > The cost of the mmap doesn't sound like the showstopper.
> 
> You're guessing here.  At least for Oracle, that behaviour is dependent on 
> the locality of accesses.  Given that each user has their own process you 
> can bet there is a fair amount of locality to their transactions.
> 

I'm definitely not guessing about the largepage factor; you'd better
drop some ram and run with largepages.  I have to guess about
remap_file_pages only because that's not backported yet (thankfully, due
to its insane api).  But if largepages make such a huge difference, mmap
can't be the big cost under such tlb-thrashing scenarios.  Largepages
shouldn't affect the mmap frequency at all.

Sure, the locality exists, but if you didn't need a moving window you
wouldn't need the vlm, and with 32G of shm vs a 512M window your
thrashing will be an order of magnitude higher than with 1G of shm,
obviously.

I'm guessing but I'm guessing based on non-guesses.

However, I'm not questioning that remap_file_pages will help; it
obviously will.  I just don't think it's worthwhile enough, and I don't
see mmap as the big cost: the big cost is the pagetable mangling and tlb
flushing that will have to happen anyway, regardless of whether you
overwrite the vma with an mmap or you call remap_file_pages.

> > you could try to avoid the need of the sysctl by teaching the vm to
> > unmap such a vma, but I don't think it's worth it and I'm sure those apps
> > prefer to have the stuff pinned anyway w/o the risk of sigbus and w/o
> > the need of mlock and it looks cleaner to me to avoid any mess with the
> > vm and long term nobody will care about this sysctl since 64bit will run
> > so much faster w/o any remap_file_pages and tlb flushes running at all
> 
> It is still useful for things outside of the pure databases on 32 bits 
> realm.  Consider a fast bochs running 32 bit apps on a 64 bit machine -- 
> should it have to deal with the overhead of zillions of vmas for emulating 
> page tables?

I can't understand this very well, so it may be my fault, but it doesn't
make any sense to me.  I don't know how bochs works, but for certain you
won't get any help from the API of remap_file_pages implemented in
2.5.66 on a 64bit arch.

If you think you can get any benefit, then I tell you: rather than using
remap_file_pages, just go ahead and mmap the whole file for me, as large
as it is; likely you're dealing with a 32bit address space, so it will be
a mere 4G.  I doubt you're dealing with 1 terabyte files with bochs,
which is by definition a 32bit thing.

Map it all with mmap, and access it sparsely.  Then you have
remap_file_pages on the 64bit archs for free, w/o special syscalls and
w/o any sigbus handling; the kernel will do the paging for you, to swap
and back into the right place in ram, w/o passing through a slower
userspace signal.
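
i.e. on 64bit the whole thing collapses to one plain mmap.  Illustrative
fragment only, error handling kept minimal:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map the whole file once and touch it sparsely; the kernel pages the
 * touched parts in and out.  No remap_file_pages, no sigbus handler,
 * no window management.
 */
static char *map_whole_file(const char *path, size_t *size)
{
        struct stat st;
        char *p;
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return NULL;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return NULL;
        }
        *size = st.st_size;
        p = mmap(NULL, *size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);      /* the mapping keeps the file alive */
        return p == MAP_FAILED ? NULL : p;
}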

I can't see any useful application of the current API of
remap_file_pages on a 64bit arch, but it's possible I'm missing
something.  And no, I don't mind wasting 3.8G of address space in the
bochs process, since I still have some petabytes of it unused.

> If anything, I think we should be moving in the direction of doing more 
> along the lines of remap_file_pages: things like executables might as well 
> keep their state in page tables since we never discard them and instead 
> toss the vma out the window.

I'm sorry, but I don't understand this very well.  Could you
elaborate? What state do you want to put in the pagetables? Are you
talking about the pagetables of the cpu or simulated ones in userspace?

Andrea


* Re: objrmap and vmtruncate
  2003-04-05  2:06             ` Andrew Morton
@ 2003-04-05  2:24               ` Andrea Arcangeli
  0 siblings, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05  2:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, hugh, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 06:06:20PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > > - get_unmapped_area() search complexity.
> > > 
> > >   Solved by remap_file_pages and by as-yet unimplemented algorithmic rework.
> > 
> > what is this "yet unimplemented algorithmic rework".
> 
> I was referring to your planned mmap speedup.  I should have said 'or', not
> 'and'.

Oh I see, thanks for the clarification.  My plan was to do it only for
mmap, not to let it work for remap_file_pages too, to avoid any
additional kernel complexity.  I think remap_file_pages (if it stays)
should be a piece of magic for 32bit archs to allow mangling the
pagetables from userspace, so I'm not very interested in mixing the mmap
layer in any way with it (beyond teaching all MM functions to stay away
from a VM_NONLINEAR vma).

Andrea


* Re: objrmap and vmtruncate
  2003-04-05  2:13           ` Martin J. Bligh
@ 2003-04-05  2:44             ` Andrea Arcangeli
  2003-04-05  3:24               ` Andrew Morton
                                 ` (3 more replies)
  2003-04-05  3:22             ` Andrew Morton
  1 sibling, 4 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05  2:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, mingo, hugh, dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 06:13:52PM -0800, Martin J. Bligh wrote:
> > Perhaps it is useful to itemise the problems which we're trying to solve here:
> > 
> > - ZONE_NORMAL consumption by pte_chains
> > 
> >   Solved by objrmap and presumably page clustering.
> > 
> > - ZONE_NORMAL consumption by VMAs
> > 
> >   Solved by remap_file_pages.  Neither objrmap nor page clustering will
> >   help here.
> 
> I'm not convinced that we can't do something with nonlinear mappings for
> this ... we just need to keep a list of linear areas within the nonlinear
> vmas, and use that to do the objrmap stuff with. Dave and I talked about
> this yesterday ... we both had different terminology, but I think the
> same underlying fundamental concept ... I was calling them "sub-vmas"
> for each linear region within the nonlinear space. 

that's wasted memory IMHO, if you need nonlinear, you don't want to
waste further metadata, you only want to pin pages in the pagetables,
the 'window' over the pagecache (incidentally shm)

the vm shouldn't know about it.

> The fundamental problem I came to (and I think Dave had the same problem) 
> is that I couldn't see what problem remap_file_pages was trying to solve,

Oh that's clear, it's only the avoidance of the mmap calls that walk
the rbtree with many vmas allocated. Which is another reason for not
having any kind of metadata associated with the pages attached to the
nonlinear vma. Tracking linearity inside the non-linearity doesn't
sound worthwhile.

remap_file_pages isn't a regular API; it's a 32bit hack to mangle
pagetables and attach pages into them hard, due to the lack of address
space that prevents you from mapping the whole file at once.
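
For reference, the usage pattern in question looks roughly like this
(illustrative only, error handling omitted; depending on the libc the
call may have to go through syscall()):

#define _GNU_SOURCE
#include <sys/mman.h>

#define WINDOW_SIZE     (512UL << 20)   /* the 512M "window" discussed here */
#define PAGE_SZ         4096UL

static char *window;

/* One big shared mapping is set up once... */
static int window_init(int fd)
{
        window = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
        return window == MAP_FAILED ? -1 : 0;
}

/* ...then individual file pages are steered into window slots without
 * creating new vmas or calling mmap again.  prot must be 0 here. */
static int window_map(unsigned long slot, unsigned long file_pgoff)
{
        return remap_file_pages(window + slot * PAGE_SZ, PAGE_SZ,
                                0, file_pgoff, 0);
}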

It should pin stuff into ram and be enabled by a sysctl, and not be used
on 64bit archs, which can map it all at once in a cleaner way that also
allows efficient swapping etc...

> so it was tricky to see if we'd cause the same thing or not. sub-vmas
> could certainly be a lot smaller, but we weren't thinking of 128K of the
> damned things, so ... the other thing is of course the setup and teardown
> time ... but there could be a btree or something for the structure.
> 
> Of course, if we did this, it would get rid of the whole conversion
> to and from object based stuff ;-) I think Dave had some other bright
> idea on this too, but I don't recall what it was ;-(
> 
> > - pte_chain setup and teardown CPU cost.
> > 
> >   objrmap does not seem to help.  Page clustering might, but is unlikely to
> >   be enabled on the machines which actually care about the overhead.
> 
> eh? Not sure what you mean by that. It helped massively ...
> diffprofile from kernbench showed:

Indeed. objrmap is the only way to avoid the big rmap waste. In fact I'm
not even convinced about the hybrid approach, rmap should be avoided even
for the anon pages. And the swap cpu doesn't matter, as long as we can
reach pagetables in linear time that's fine, doesn't matter how many
fixed cycles it takes. Only the complexity factor matters, and objrmap
takes care of it just fine.

> 
>      -4666   -74.9% page_add_rmap
>     -10666   -92.0% page_remove_rmap
> 
> I'd say that about an 85% reduction in cost is pretty damned fine ;-)
> And that was about a 20% overall reduction in the system time for the
> test too ... that was all for partial objrmap (file backed, not anon).
> 
> M.


Andrea


* Re: objrmap and vmtruncate
  2003-04-05  2:13           ` Martin J. Bligh
  2003-04-05  2:44             ` Andrea Arcangeli
@ 2003-04-05  3:22             ` Andrew Morton
  2003-04-05  3:35               ` Martin J. Bligh
  1 sibling, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-05  3:22 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> >   objrmap does not seem to help.  Page clustering might, but is unlikely to
> >   be enabled on the machines which actually care about the overhead.
> 
> eh? Not sure what you mean by that. It helped massively ...
> diffprofile from kernbench showed:
> 
>      -4666   -74.9% page_add_rmap
>     -10666   -92.0% page_remove_rmap
> 
> I'd say that about an 85% reduction in cost is pretty damned fine ;-)
> And that was about a 20% overall reduction in the system time for the
> test too ... that was all for partial objrmap (file backed, not anon).
> 

In the test I use (my patch management scripts, which is basically bash
forking its brains out) objrmap reclaims only 30-50% of the rmap CPU
overhead.

Maybe you had a very high sharing level.


* Re: objrmap and vmtruncate
  2003-04-05  2:44             ` Andrea Arcangeli
@ 2003-04-05  3:24               ` Andrew Morton
  2003-04-05 12:06                 ` Andrew Morton
  2003-04-05  3:45               ` objrmap and vmtruncate Martin J. Bligh
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-05  3:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> Indeed. objrmap is the only way to avoid the big rmap waste. In fact I'm
> not even convinced about the hybrid approach, rmap should be avoided even
> for the anon pages. And the swap cpu doesn't matter, as long as we can
> reach pagetables in linear time that's fine, doesn't matter how many
> fixed cycles it takes. Only the complexity factor matters, and objrmap
> takes care of it just fine.

Well not really.

Consider the case where 100 processes each own 100 vma's against the same
file.

To unmap a page with objrmap we need to search those 10,000 vma's (10000
cachelines).  With full rmap we need to search only 100 pte_chain slots (3 to
33 cachelines).  That's an enormous difference.  It happens for *each* page.

And, worse, we have the same cost when searching for referenced bits in the
pagetables.  Nobody has written an "exploit" for this yet, but it's there.

Possibly we should defer the assembly of the pte chain until a page hits the
tail of the LRU.  That's an awkward time to be allocating memory though.  We
could perhaps fall back to the vma walk if pte_chain allocation starts to
endanger the page reserves.
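
In control-flow terms, something like this (pure sketch, none of these
helpers exist):

        /*
         * Build the pte_chain lazily, only when the page reaches the
         * tail of the inactive list, and fall back to the objrmap vma
         * walk if allocating chain entries would dip into the reserves.
         */
        if (!page_has_pte_chain(page)) {                /* hypothetical test */
                if (pte_chain_alloc_would_be_safe())    /* hypothetical reserve check */
                        build_pte_chain_from_vmas(page);/* hypothetical one-off vma walk */
                else
                        referenced = page_referenced_obj(page);
        }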



* Re: objrmap and vmtruncate
  2003-04-05  3:22             ` Andrew Morton
@ 2003-04-05  3:35               ` Martin J. Bligh
  0 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05  3:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

>> >   objrmap does not seem to help.  Page clustering might, but is unlikely to
>> >   be enabled on the machines which actually care about the overhead.
>> 
>> eh? Not sure what you mean by that. It helped massively ...
>> diffprofile from kernbench showed:
>> 
>>      -4666   -74.9% page_add_rmap
>>     -10666   -92.0% page_remove_rmap
>> 
>> I'd say that about an 85% reduction in cost is pretty damned fine ;-)
>> And that was about a 20% overall reduction in the system time for the
>> test too ... that was all for partial objrmap (file backed, not anon).
> 
> In the test I use (my patch management scripts, which is basically bash
> forking its brains out) objrmap reclaims only 30-50% of the rmap CPU
> overhead.
> 
> Maybe you had a very high sharing level.

Not especially, I was running "make -j 32" for that one, which seems like
a fairly small sharing load (though maybe a bit lighter than yours still).
Going to high numbers of tasks will show even more impressive improvements.
"make -j 256" actually looked reasonably similar.

M.



* Re: objrmap and vmtruncate
  2003-04-05  2:44             ` Andrea Arcangeli
  2003-04-05  3:24               ` Andrew Morton
@ 2003-04-05  3:45               ` Martin J. Bligh
  2003-04-05  3:59               ` Rik van Riel
  2003-04-05  4:52               ` Martin J. Bligh
  3 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05  3:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mingo, hugh, dmccr, linux-kernel, linux-mm

>> I'm not convinced that we can't do something with nonlinear mappings for
>> this ... we just need to keep a list of linear areas within the nonlinear
>> vmas, and use that to do the objrmap stuff with. Dave and I talked about
>> this yesterday ... we both had different terminology, but I think the
>> same underlying fundamental concept ... I was calling them "sub-vmas"
>> for each linear region within the nonlinear space. 
> 
> that's wasted memory IMHO, if you need nonlinear, you don't want to
> waste further metadata, you only want to pin pages in the pagetables,
> the 'window' over the pagecache (incidentally shm)
> 
> the vm shouldn't know about it.

OK, but this is only for the case when the things aren't memlocked anyway,
which in Oracle's case is never. Seems like we're thrashing a lot of time
and effort over sys_remap_file_pages considering it's never actually
desirable to scan the chains for pageout anyway.

And does anyone *really* start off with a linear vma and then convert
it to a nonlinear one after using it? Can't we just fail that call? It
would result in orders of magnitude of reduction in complexity if we
could just narrow the scope of this beastie a bit.

If you have a *non-memlocked* VMA that you've *previously used* as linear,
then the sys_remap_file_pages stuff would fail with an error code. Is 
that too painful? Maybe you can't "un-memlock" a non-linear VMA once
it's memlocked either. I'm quite possibly missing something but if someone
could point out what that is ... ?

>> The fundamental problem I came to (and I think Dave had the same problem) 
>> is that I couldn't see what problem remap_file_pages was trying to solve,
> 
> Oh that's clear, it's only the avoidance of the mmap calls that walk
> the rbtree with many vmas allocated. Which is another reason for not
> having any kind of metadata associated with the pages attached to the
> nonlinear vma. Tracking linearity inside the non-linearity doesn't
> sound worthwhile.

Well, it's an order of magnitude less expensive than mem_map still.
No, not perfect, but the world's a compromise ;-) And we don't need
this at all for memlocked ones.
 
>> eh? Not sure what you mean by that. It helped massively ...
>> diffprofile from kernbench showed:
> 
> Indeed. objrmap is the only way to avoid the big rmap waste. In fact I'm
> not even convinced about the hybrid approach, rmap should be avoided even
> for the anon pages. 

Right, but they seem to be mostly singletons (non-shared), so the
pte_direct stuff takes care of those mostly anyway. Would be nice
eventually, but I think taking one step at a time is maybe helpful ;-)

M


* Re: objrmap and vmtruncate
  2003-04-04 23:07     ` Andrew Morton
  2003-04-05  0:03       ` Andrea Arcangeli
@ 2003-04-05  3:53       ` Rik van Riel
  1 sibling, 0 replies; 105+ messages in thread
From: Rik van Riel @ 2003-04-05  3:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Ingo Molnar, hugh, dmccr, linux-kernel, linux-mm

On Fri, 4 Apr 2003, Andrew Morton wrote:

> I think that's right - the system call is very specialised and is
> targeted at solving problems which have been encountered in a small
> number of applications, but important ones.
> 
> Right now, I do not feel that we are going to be able to come up with an
> acceptably simple VM which has both nonlinear mappings and objrmap.

This is ok if we make nonlinear VMAs automatically mlocked,
meaning they don't need reverse mapping at all.

If you need the space saving from nonlinear VMAs, you also
need to save the space of any kind of reverse mapping scheme,
even a mythical nonlinear object one (just think about the
minimum amount of data you need to store).

IMHO it'd be fair to limit nonlinear VMAs to the set of very
specialised applications that need them (Oracle, DB2, anything
else?) and impose some limitations on the functionality so
the main part of the VM stays sane.

Rik




* Re: objrmap and vmtruncate
  2003-04-05  2:44             ` Andrea Arcangeli
  2003-04-05  3:24               ` Andrew Morton
  2003-04-05  3:45               ` objrmap and vmtruncate Martin J. Bligh
@ 2003-04-05  3:59               ` Rik van Riel
  2003-04-05  4:10                 ` William Lee Irwin III
  2003-04-05  4:52               ` Martin J. Bligh
  3 siblings, 1 reply; 105+ messages in thread
From: Rik van Riel @ 2003-04-05  3:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Andrew Morton, mingo, hugh, dmccr, linux-kernel,
	linux-mm

On Sat, 5 Apr 2003, Andrea Arcangeli wrote:

> that's wasted memory IMHO, if you need nonlinear, you don't want to
> waste further metadata, you only want to pin pages in the pagetables,
> the 'window' over the pagecache (incidentally shm)

Agreed.

> > > - pte_chain setup and teardown CPU cost.
> > > 
> > >   objrmap does not seem to help.  Page clustering might, but is unlikely to
> > >   be enabled on the machines which actually care about the overhead.
> > 
> > eh? Not sure what you mean by that. It helped massively ...
> > diffprofile from kernbench showed:
> 
> Indeed. objrmap is the only way to avoid the big rmap waste. In fact I'm
> not even convinced about the hybrid approach, rmap should be avoided even
> for the anon pages. And the swap cpu doesn't matter, as long as we can
> reach pagetables in linear time that's fine, it doesn't matter how many
> fixed cycles it takes. Only the complexity factor matters, and objrmap
> takes care of it just fine.

The only issues with objrmap seem to be mremap, which Hugh
seems to have taken care of, and the case of a large number
of processes mapping different parts of the same file multiple
times (1000 processes each mapping 1000 parts of the same file),
which would grow the complexity of the VMA search from linear
to quadratic.

That last case is also fixable, though, probably best done using
k-d trees.

Except for nonlinear VMAs I don't think there are any big obstacles
left that would keep us from switching to object rmap.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  3:59               ` Rik van Riel
@ 2003-04-05  4:10                 ` William Lee Irwin III
  2003-04-05  4:49                   ` Martin J. Bligh
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-05  4:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, mingo, hugh,
	dmccr, linux-kernel, linux-mm

On Fri, Apr 04, 2003 at 10:59:59PM -0500, Rik van Riel wrote:
> The only issues with objrmap seem to be mremap, which Hugh
> seems to have taken care of, and the case of a large number
> of processes mapping different parts of the same file multiple
> times (1000 processes each mapping 1000 parts of the same file),
> which would grow the complexity of the VMA search from linear
> to quadratic.
> That last case is also fixable, though, probably best done using
> k-d trees.
> Except for nonlinear VMAs I don't think there are any big obstacles
> left that would keep us from switching to object rmap.

The k-d trees only solve the "external" interference case, that is,
they thin the search space by eliminating vma's the page must
necessarily be outside of.

They don't solve the "internal" interference case, where the page does
fall into all of the vma's, but only a few out of those have actually
faulted the page into the pagetables. This is likely only "fixable" by
pointwise methods, which seem to come with notable maintenance expense.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  4:10                 ` William Lee Irwin III
@ 2003-04-05  4:49                   ` Martin J. Bligh
  2003-04-05 13:31                     ` Rik van Riel
  0 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05  4:49 UTC (permalink / raw)
  To: William Lee Irwin III, Rik van Riel
  Cc: Andrea Arcangeli, Andrew Morton, mingo, hugh, dmccr,
	linux-kernel, linux-mm

>> The only issues with objrmap seem to be mremap, which Hugh
>> seems to have taken care of, and the case of a large number
>> of processes mapping different parts of the same file multiple
>> times (1000 processes each mapping 1000 parts of the same file),
>> which would grow the complexity of the VMA search from linear
>> to quadratic.
>> That last case is also fixable, though, probably best done using
>> k-d trees.
>> Except for nonlinear VMAs I don't think there are any big obstacles
>> left that would keep us from switching to object rmap.
> 
> The k-d trees only solve the "external" interference case, that is,
> they thin the search space by eliminating vma's the page must
> necessarily be outside of.
> 
> They don't solve the "internal" interference case, where the page does
> fall into all of the vma's, but only a few out of those have actually
> faulted the page into the pagetables. This is likely only "fixable" by
> pointwise methods, which seem to come with notable maintenance expense.

I don't think we have an app that has 1000 processes mapping the whole
file 1000 times per process. If we do, shooting the author seems like 
the best course of action to me.

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  2:44             ` Andrea Arcangeli
                                 ` (2 preceding siblings ...)
  2003-04-05  3:59               ` Rik van Riel
@ 2003-04-05  4:52               ` Martin J. Bligh
  3 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05  4:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mingo, hugh, dmccr, linux-kernel, linux-mm

>> I'm not convinced that we can't do something with nonlinear mappings for
>> this ... we just need to keep a list of linear areas within the nonlinear
>> vmas, and use that to do the objrmap stuff with. Dave and I talked about
>> this yesterday ... we both had different terminology, but I think the
>> same underlying fundamental concept ... I was calling them "sub-vmas"
>> for each linear region within the nonlinear space. 
> 
> that's wasted memory IMHO, if you need nonlinear, you don't want to
> waste further metadata, you only want to pin pages in the pagetables,
> the 'window' over the pagecache (incidentally shm)

Hold on a minute ... don't the rmap chains (which this would be replacing)
waste rather more space than this anyway? I'd rather have it per linear
area than per-page ... think of it as "shared rmap pte chains with offsets"
if you like ;-)
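
Just to make it concrete, something like one of these per linear run
(a purely hypothetical struct, nobody has posted code for it):

	struct sub_vma {
		struct list_head list;	/* chained off the nonlinear vma */
		unsigned long start;	/* first virtual address of the run */
		unsigned long end;	/* one past the last address */
		unsigned long pgoff;	/* file offset of 'start', in pages */
	};

One of these per linear window is far cheaper than one pte_chain entry
per page, and objrmap can do the usual offset arithmetic on it.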

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  2:22               ` Andrea Arcangeli
@ 2003-04-05 10:01                 ` Jamie Lokier
  2003-04-05 10:11                   ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Jamie Lokier @ 2003-04-05 10:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin LaHaise, Andrew Morton, mingo, hugh, dmccr,
	linux-kernel, linux-mm

Andrea Arcangeli wrote:
> > It is still useful for things outside of the pure databases on 32 bits 
> > realm.  Consider a fast bochs running 32 bit apps on a 64 bit machine -- 
> > should it have to deal with the overhead of zillions of vmas for emulating 
> > page tables?
> 
> I can't understand this very well so it may be my fault, but it doesn't
> make any sense to me. I don't know how bochs works, but for certain you
> won't get any help from the API of remap_file_pages implemented in
> 2.5.66 on a 64bit arch.
> 
> If you think you can get any benefit, then I tell you, rather than using
> remap_file_pages, just go ahead and mmap the whole file for me, as large
> as it is; likely you're dealing with a 32bit address space so it will be
> a mere 4G. I doubt you're dealing with 1 terabyte files with bochs, which
> is by definition a 32bit thing.

1. You missed the "fast" in "fast bochs".

The idea is to have a file representing the simulated RAM (anything up
to 64G in size), and to map that the same way as the simulated page tables.

Then the virtual machine can address the memory directly, which is very fast.
Doing it your way, the virtual machine would have to do a virtual TLB
lookup for every memory access, which slows down the simulation considerably.
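
For concreteness, the guest-RAM window would be set up roughly like this
(userspace sketch only; it assumes the remap_file_pages() wrapper
prototype, on 2.5.66 you would go through syscall() directly, and the
backing file has to be large enough for the offsets used):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open(argv[1], O_RDWR);	/* file backing the simulated RAM */
	char *win;

	if (fd < 0) { perror("open"); exit(1); }

	/* one vma covering the whole guest-visible window */
	win = mmap(NULL, 1024 * psz, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (win == MAP_FAILED) { perror("mmap"); exit(1); }

	/* rewire guest page 5 to file page 1000, no new vma needed;
	 * the prot argument is ignored and passed as 0 */
	if (remap_file_pages(win + 5 * psz, psz, 0, 1000, 0))
		perror("remap_file_pages");

	win[5 * psz] = 1;	/* really touches file page 1000 */
	return 0;
}

The point is that the mapping follows the simulated page tables with one
vma total, instead of one vma per guest mapping.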

2. Another use of non-linear mappings is when you want different per
page memory protections.  In this case you don't need different
pg_offset per page, you just want to write protect and unprotect
individual pages.  This comes up in the context of some garbage
collector algorithms.

Again, the idea is that the mapping (and in this case SIGSEGV
handling) costs a little, but it is less than the cost of checking
each memory access in the main code.
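
A minimal sketch of that trick (helper names are invented, and the
async-signal-safety details a real collector would worry about are
ignored here):

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>

static long page_size;

static void record_dirty(void *page)
{
	/* hypothetical: add 'page' to the collector's remembered set */
}

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1);

	record_dirty((void *)addr);
	/* unprotect so the faulting store restarts and succeeds */
	mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE);
}

void write_barrier_init(void)
{
	struct sigaction sa;

	page_size = sysconf(_SC_PAGESIZE);
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);
}

void write_protect(void *page)	/* page must be page-aligned */
{
	mprotect(page, page_size, PROT_READ);
}

The win is the same as in the bochs case: a few vmas, page-granularity
control, and no per-access check in the main code's fast path.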


Both of these use many thousands of VMAs when done using mmap().  I
don't think either of these uses of non-linear mappings is covered by
your suggestion to use a 64 bit address space.

-- Jamie

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 10:01                 ` Jamie Lokier
@ 2003-04-05 10:11                   ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-05 10:11 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrea Arcangeli, Benjamin LaHaise, Andrew Morton, mingo, hugh,
	dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 11:01:13AM +0100, Jamie Lokier wrote:
> 1. You missed the "fast" in "fast bochs".
> The idea is to have a file representing the simulated RAM (anything up
> to 64G in size), and to map that the same way as the simulated page tables.
> Then the virtual machine can address the memory directly, which is very fast.
> Doing it your way, the virtual machine would have to do a virtual TLB
> lookup for every memory access, which slows down the simulation considerably.

This is probably the only feasible method of getting general access to
hardware of that kind.


On Sat, Apr 05, 2003 at 11:01:13AM +0100, Jamie Lokier wrote:
> 2. Another use of non-linear mappings is when you want different per
> page memory protections.  In this case you don't need different
> pg_offset per page, you just want to write protect and unprotect
> individual pages.  This comes up in the context of some garbage
> collector algorithms.
> Again, the idea is that the mapping (and in this case SIGSEGV
> handling) costs a little, but it is less than the cost of checking
> each memory access in the main code.

This is a novel use for it that may require an extension to support.
It's worthwhile but I'm terrified enough of the invariants broken by
the code as it stands, even disregarding further extension.


On Sat, Apr 05, 2003 at 11:01:13AM +0100, Jamie Lokier wrote:
> Both of these use many thousands of VMAs when done using mmap().  I
> don't think either of these uses of non-linear mappings is covered by
> your suggestion to use a 64 bit address space.

The 64GB simulation on 64-bit sort of is, but isn't really, since the
additional minor faults would be taken with the ordinary mmap()
approach, incurring one or two additional round trips through the
kernel and actually blocking at the time of access instead of being
pre-populated (prefaulted). The access is also likely to be sparse,
which raises the spectre of pagetable fragmentation yet again...

Thanks for the good insights.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  3:24               ` Andrew Morton
@ 2003-04-05 12:06                 ` Andrew Morton
  2003-04-05 15:11                   ` Martin J. Bligh
                                     ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-05 12:06 UTC (permalink / raw)
  To: andrea, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrew Morton <akpm@digeo.com> wrote:
>
> Nobody has written an "exploit" for this yet, but it's there.

Here we go.  The test app is called `rmap-test'.  It is in ext3 CVS.  See

	http://www.zip.com.au/~akpm/linux/ext3/

It sets up N MAP_SHARED VMA's and N tasks touching them in various access
patterns.

vmm:/usr/src/ext3/tools> ./rmap-test 
Usage: ./rmap-test [-hlrvV] [-iN] [-nN] [-sN] [-tN] filename
     -h:          Pattern: half of memory is busy
     -l:          Pattern: linear
     -r:          Pattern: random
     -iN:         Number of iterations
     -nN:         Number of VMAs
     -sN:         VMA size (pages)
     -tN:         Run N tasks
     -VN:         Number of VMAs to process
     -v:          Verbose

The kernels which were compared were 2.5.66-mm4, 2.5.66-mm4+all objrmap
patches and 2.4.21-pre5aa2.  The machine has 256MB of memory, 2.7G P4,
uniprocessor, IDE disk.




The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
modify their 100 vma's in a linear walk.  Total working set is 240MB
(slightly more than is available).

	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo

2.5.66-mm4:
	15.76s user 86.91s system 33% cpu 5:05.07 total
2.5.66-mm4+objrmap:
	23.07s user 1143.26s system 87% cpu 22:09.81 total
2.4.21-pre5aa2:
	14.91s user 75.30s system 24% cpu 6:15.84 total





In the second test we again have 100 tasks, each with 100 vma's but the
access pattern is random:

	./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo

2.5.66-mm4:
	0.12s user 6.05s system 2% cpu 3:59.68 total
2.5.66-mm4+objrmap:
	0.12s user 2.10s system 0% cpu 4:01.15 total
2.4.21-pre5aa2:
	0.07s user 2.03s system 0% cpu 4:12.69 total


The -aa VM failed in this test.

	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
	VM: killing process rmap-test

I'd have to call this a bug - the machine was full of reclaimable memory.

I also saw the 2.4 kernel do 705,000 context switches in a single second,
which was odd.  It only happened once.





In the third test a single task owns 10000 VMA's and walks across them in a
linear pattern:

	./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo

2.5.66-mm4:
	0.25s user 3.75s system 1% cpu 4:38.44 total
2.5.66-mm4+objrmap:
	0.28s user 146.45s system 16% cpu 15:14.59 total
2.4.21-pre5aa2:
	0.32s user 4.83s system 0% cpu 18:25.90 total




These are not ridiculous workloads, especially the third one.  And 10k VMA's
is by no means inconceivable.

The objrmap code will be show-stoppingly expensive at 100k vmas per file.

And as expected, the full rmap implementation gives the most stable,
predictable and highest performance result under heavy load.  That's why
we're using it.

When it comes to the VM, there is a lot of value in sturdiness under unusual
and heavy loads.

Tomorrow I'll change the test app to do nonlinear mappings too.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05  4:49                   ` Martin J. Bligh
@ 2003-04-05 13:31                     ` Rik van Riel
  0 siblings, 0 replies; 105+ messages in thread
From: Rik van Riel @ 2003-04-05 13:31 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Andrea Arcangeli, Andrew Morton, mingo,
	hugh, dmccr, linux-kernel, linux-mm

On Fri, 4 Apr 2003, Martin J. Bligh wrote:

> I don't think we have an app that has 1000 processes mapping the whole
> file 1000 times per process. If we do, shooting the author seems like
> the best course of action to me.

Please, don't shoot akpm ;)

Rik
-- 
Engineers don't grow up, they grow sideways.
http://www.surriel.com/		http://kernelnewbies.org/

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 12:06                 ` Andrew Morton
@ 2003-04-05 15:11                   ` Martin J. Bligh
       [not found]                     ` <20030405161758.1ee19bfa.akpm@digeo.com>
  2003-04-05 16:30                   ` Andrea Arcangeli
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-05 15:11 UTC (permalink / raw)
  To: Andrew Morton, andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

> It sets up N MAP_SHARED VMA's and N tasks touching them in various access
> patterns.

Can you clarify ... are these VMAs all mapping the same address space,
or different ones? If the same, are you mapping the whole thing each time?

>> I don't think we have an app that has 1000 processes mapping the whole
>> file 1000 times per process. If we do, shooting the author seems like
>> the best course of action to me.
> 
> Rik:
>
> Please, don't shoot akpm ;)

If mapping the *whole* address space hundreds of times, why would anyone 
ever actually want to do that? It kills some important optimisations that 
Dave has made, and seems to be an unrealistic test case. I don't understand
what you're trying to simulate with that (if that's what you are doing).
Mapping 1000 subsegments I can understand, but not the whole thing.

It's excellent to actually have a tool to do real testing on this stuff
with though ... thanks ;-)

> 2.5.66-mm4:
> 2.5.66-mm4+objrmap:

So mm4 has what? No partial objrmap at all (you dropped it?)? 
Or partial, but without the anon part?

> These are not ridiculous workloads, especially the third one.  And 10k VMA's
> is by no means inconceivable.
> 
> The objrmap code will be show-stoppingly expensive at 100k vmas per file.
> 
> And as expected, the full rmap implementation gives the most stable,
> predictable and highest performance result under heavy load.  That's why
> we're using it.

Well, it also consumes the most space. How about adding a test that has
1000s of processes mapping one large (2GB say) VMA, and seeing what that 
does? That's the workload of lots of database type things.

> When it comes to the VM, there is a lot of value in sturdiness under 
> unusual and heavy loads.

Indeed. Which includes not locking up the whole box in a solid hang
from ZONE_NORMAL consumption though ...
 
> Tomorrow I'll change the test app to do nonlinear mappings too.

Cool, thanks ... I'll try to have a play with your tools later this evening.

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 12:06                 ` Andrew Morton
  2003-04-05 15:11                   ` Martin J. Bligh
@ 2003-04-05 16:30                   ` Andrea Arcangeli
  2003-04-05 19:01                     ` Andrea Arcangeli
                                       ` (3 more replies)
  2003-04-05 23:25                   ` William Lee Irwin III
  2003-04-06  2:23                   ` Martin J. Bligh
  3 siblings, 4 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05 16:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> Andrew Morton <akpm@digeo.com> wrote:
> >
> > Nobody has written an "exploit" for this yet, but it's there.
> 
> Here we go.  The test app is called `rmap-test'.  It is in ext3 CVS.  See
> 
> 	http://www.zip.com.au/~akpm/linux/ext3/
> 
> It sets up N MAP_SHARED VMA's and N tasks touching them in various access
> patterns.

I'm not questioning that during paging rmap is more efficient than
objrmap, but your argument that rmap has lower complexity than objrmap
and that rmap is needed is wrong. The fact is that with your 100 mappings
in each of the 100 tasks, both algorithms work in O(N) where N is
the number of pagetables mapping the page. No difference in
complexity.  I don't care how many cycles you spend to reach the 100x100
pagetables, those are fixed cycles, the fact is that there are 100x100
pagetables, rmap won't change the complexity of the algorithm at all,
that's mandated by the hardware and by your application, we can't do
better than O(N) with N the number of pagetables to unmap a single page.
Even rmap has the O(N) complexity, it won't be allowed to reach only 100
pagetables instead of 100000 pagetables. Swapping isn't the important
path, so it is extremely worthwhile to keep all the other normal
workloads and the really important realistic benchmarks fast, and spend
more cpu in the swapping as long as it can be done with non-exponential
complexity.  And IMHO your rmap-test should be renamed to
rmap-very-best-and-not-interesting-case, since that's the only
workload where rmap pays off; objrmap is more than enough to satisfy
my complexity needs for multigigabyte swapping, and objrmap obviously
can't hurt the fast paths, because we always had objrmap even in 2.4.
And objrmap can't be avoided, it's needed for the truncate semantics
against mmap.

Check all the other important benchmarks not testing the paging load, like
page faults, kernel compile from Martin, fork, AIM etc... Those are IMHO
an order of magnitude more interesting than your rmap-test paging load
with some hundred thousand vmas. The paging just needs to run with
linear complexity, and it's useful anyway to have objrmap to be able to
defragment ram, or to possibly do process migration of threads with
anonymous memory in a more friendly manner.

> vmm:/usr/src/ext3/tools> ./rmap-test 
> Usage: ./rmap-test [-hlrvV] [-iN] [-nN] [-sN] [-tN] filename
>      -h:          Pattern: half of memory is busy
>      -l:          Pattern: linear
>      -r:          Pattern: random
>      -iN:         Number of iterations
>      -nN:         Number of VMAs
>      -sN:         VMA size (pages)
>      -tN:         Run N tasks
>      -VN:         Number of VMAs to process
>      -v:          Verbose
> 
> The kernels which were compared were 2.5.66-mm4, 2.5.66-mm4+all objrmap
> patches and 2.4.21-pre5aa2.  The machine has 256MB of memory, 2.7G P4,
> uniprocessor, IDE disk.
> 
> 
> 
> 
> The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
> modify their 100 vma's in a linear walk.  Total working set is 240MB
> (slightly more than is available).
> 
> 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
> 
> 2.5.66-mm4:
> 	15.76s user 86.91s system 33% cpu 5:05.07 total
> 2.5.66-mm4+objrmap:
> 	23.07s user 1143.26s system 87% cpu 22:09.81 total
> 2.4.21-pre5aa2:
> 	14.91s user 75.30s system 24% cpu 6:15.84 total
> 
> 
> 
> 
> 
> In the second test we again have 100 tasks, each with 100 vma's but the
> access pattern is random:
> 
> 	./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
> 
> 2.5.66-mm4:
> 	0.12s user 6.05s system 2% cpu 3:59.68 total
> 2.5.66-mm4+objrmap:
> 	0.12s user 2.10s system 0% cpu 4:01.15 total
> 2.4.21-pre5aa2:
> 	0.07s user 2.03s system 0% cpu 4:12.69 total
> 
> 
> The -aa VM failed in this test.
> 
> 	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> 	VM: killing process rmap-test

I'll work on it. Many thanks. I wonder if it could be related to the
mixture of the access bit with the overcomplexity of the algorithm that
makes the passes over so many vmas useless. Certainly this workload
isn't common. I guess what I will try to do first is to simply ignore
the accessed bitflag after half of the passes failed. What do you think?

> I'd have to call this a bug - the machine was full of reclaimable
> memory.

If it's full of reclaimable memory it's definitely a bug and I need to
fix it ASAP ;).

> I also saw the 2.4 kernel do 705,000 context switches in a single second,
> which was odd.  It only happened once.

that could be the yielding paths in getblk or ext3. Usually when you
see that, it's the not-very-good yielding paths in the kernel. Probably
it's related to the oom anyway. Those paths won't fail, they'll just
loop forever until they succeed, generating the overscheduling. Fixing
the oom should take care of them too. I don't worry about them at the
moment, especially since you said it only happened once, which confirms my
theory that everything is fine, nothing is running out of control, and it's
only a side effect of this workload fooling the accessed bit and
driving the box to a fake oom.

> These are not ridiculous workloads, especially the third one.  And 10k VMA's

the point is that you must be paging this stuff hard. This is what makes
it not very realistic. Sure, 10k vmas are fine, but I prefer to run that
much faster in the fast paths than to be able to swap them much faster.

> And as expected, the full rmap implementation gives the most stable,
> predictable and highest performance result under heavy load.  That's why
> we're using it.

and that's why the much more important fast paths suffer. I won't trade
some speedup in paging for the fast paths. 99% of users care about the
fast paths, or about paging in a laptop not under your special
best-case rmap-test, which means pure swap bandwidth with a normal
number of vmas, no matter the cpu usage. Especially very high end smp
and embedded PDA usage should definitely avoid rmap and use objrmap IMHO
(those won't even need the anonymous and shm information since they may
not have any swap at all, which is definitely the case for a PDA, so
objrmap is perfect for that and CONFIG_SWAP should do exactly that:
provide only objrmap for file mappings and leave anon ram pinned and
unknown to the vm).

Maybe I'm wrong but those are my current opinions on the matter ;)

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 16:30                   ` Andrea Arcangeli
@ 2003-04-05 19:01                     ` Andrea Arcangeli
  2003-04-05 20:14                       ` Andrew Morton
  2003-04-05 21:24                     ` Andrew Morton
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05 19:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 06:30:03PM +0200, Andrea Arcangeli wrote:
> On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > The -aa VM failed in this test.
> > 
> > 	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> > 	VM: killing process rmap-test
> 
> I'll work on it. Many thanks. I wonder if it could be related to the
> mixture of the access bit with the overcomplexity of the algorithm that
> makes the passes over so many vmas useless. Certainly this workload
> isn't common. I guess what I will try to do first is to simply ignore
> the accessed bitflag after half of the passes failed. What do you think?

unfortunately I can't reproduce. Booted with mem=256m on a 4-way xeon 2.5ghz:

jupiter:~ # ./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
[..]
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
vma 1/100 done
0/1
jupiter:~ # free
             total       used       free     shared    buffers     cached
Mem:        245804     236272       9532          0        688     216620
-/+ buffers/cache:      18964     226840
Swap:       265032       3732     261300
jupiter:~ # 

maybe it's a timing issue because I have extremely fast storage? Or
maybe it's ext3 related: you're flushing on the filesystem and you need
to journal the inode updates at least. So maybe it's an ext3 bug, not a
vm bug. I can't say it's an obvious vm bug at least, since I can't
reproduce it in any way with that command line and that amount of ram (and
you can see I've got almost no swap and it's not even swapping heavily, it's a
server box without the 100mbyte of GUI).

could you try to run it on ext2?  I'm running it on top of ext2 at the
moment and it works flawlessly so far in this 256m 4-way configuration
(more cpus should not make a difference but I can try again with 1 cpu if
you can reproduce it on ext2 too).

Not a single failure; I've now started an infinite loop.

btw, I'm not running exactly 2.4.21pre5aa2, but there is not a single vm
difference between the kernel I'm testing on and 2.4.21pre5aa2, so it
shouldn't really matter.

I will try later with ext3, but for now I'll leave it running for a while
with ext2 to make sure it never happens with ext2 on my hardware at
least.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 19:01                     ` Andrea Arcangeli
@ 2003-04-05 20:14                       ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-05 20:14 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Sat, Apr 05, 2003 at 06:30:03PM +0200, Andrea Arcangeli wrote:
> > On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > > The -aa VM failed in this test.
> > > 
> > > 	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> > > 	VM: killing process rmap-test
> > 
> > I'll work on it. Many thanks. I wonder if it could be related to the
> > mixture of the access bit with the overcomplexity of the algorithm that
> > makes the passes over so many vmas useless. Certainly this workload
> > isn't common. I guess what I will try to do first is to simply ignore
> > the accessed bitflag after half of the passes failed. What do you think?

Yes, I agree.  If we're getting close to OOM, who cares about accuracy of
page replacement decisions?

> unfortunately I can't reproduce. Booted with mem=256m on a 4-way xeon 2.5ghz:

I only saw it the once.  I'd hit ^C on the test and noticed the message on
the console some 5-10 seconds later.  It may have been from before the ^C
though.  So it _might_ be related to the exit path tearing down pagetables
and setting tons of dirty bits.

> Or maybe it's ext3 related

Conceivably.  It wouldn't be the first one.  But all the pages were mapped to
disk, so the writepage path is really the same as ext2 in that case.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 16:30                   ` Andrea Arcangeli
  2003-04-05 19:01                     ` Andrea Arcangeli
@ 2003-04-05 21:24                     ` Andrew Morton
  2003-04-05 22:06                       ` Andrea Arcangeli
  2003-04-05 21:34                     ` Rik van Riel
  2003-04-06  9:29                     ` Benjamin LaHaise
  3 siblings, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-05 21:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > Andrew Morton <akpm@digeo.com> wrote:
> > >
> > > Nobody has written an "exploit" for this yet, but it's there.
> > 
> > Here we go.  The test app is called `rmap-test'.  It is in ext3 CVS.  See
> > 
> > 	http://www.zip.com.au/~akpm/linux/ext3/
> > 
> > It sets up N MAP_SHARED VMA's and N tasks touching them in various access
> > patterns.
> 
> I'm not questioning that during paging rmap is more efficient than
> objrmap, but your argument that rmap has lower complexity than objrmap
> and that rmap is needed is wrong. The fact is that with your 100 mappings
> in each of the 100 tasks, both algorithms work in O(N) where N is
> the number of pagetables mapping the page.

Nope.  To unmap a page, full rmap has to scan 100 pte_chain slots, which is 3
cachelines worth.  objrmap has to scan 10,000 vma's, 9,900 of which do not map
that page at all.

(Actually, there's a recent optimisation in objrmap which will on average
halve these figures).
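
In other words, each page_referenced()/try_to_unmap() ends up doing
something like this test for every vma on the inode's i_mmap lists (a
simplified sketch, not the actual patch code), and ~9,900 of the 10,000
tests fail before a single pagetable is touched:

	static int vma_covers_page(struct vm_area_struct *vma, struct page *page)
	{
		unsigned long pgoff = page->index;	/* file offset, in pages */
		unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

		return pgoff >= vma->vm_pgoff && pgoff - vma->vm_pgoff < pages;
	}

Full rmap just walks the page's pte_chain and never sees the other
9,900 vma's at all.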

> And objrmap can't be avoided, it's needed for the truncate semantics
> against mmap.

What do you mean by this?  vmtruncate continues to use the 2.4 algorithm for
that.

> Check all the other important benchmarks not testing the paging load, like
> page faults, kernel compile from Martin, fork, AIM etc... Those are IMHO
> an order of magnitude more interesting than your rmap-test paging load
> with some hundred thousand vmas.

Andrea, I whine about rmap as much as anyone ;) I'm the guy who halved both
its speed and space overhead shortly after it was merged.

But the fact is that it is not completely useless overhead.  It provides a
very robust VM which is stable and predictable under extreme and unusual
loads.  That is valuable.

Yes, rmap adds a few% speed overhead - up to 10% for things which are
admittedly already very inefficient.

objrmap will reclaim a lot of that common-case overhead.  But the cost of
that is apparently unviability for certain workloads on certain machines. 
Once you hit 100k VMA's it's time to find a new operating system.

Maybe that is a tradeoff we want to make.  I'm adding some balance here.

The space consumption of rmap is a much more serious problem than the speed
overhead.  It makes some workloads on huge ia32 machines unviable.


Me, I have never seen any evidence that we need any of it.  I have never seen
a demonstration of the alleged failure modes of 2.4's virtual scan.  But then
I haven't tried very hard.

The extreme stability and scalability of full rmap is good.  The space
consumption on highmem is bad.  The CPU cost is much less important than
these things.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 16:30                   ` Andrea Arcangeli
  2003-04-05 19:01                     ` Andrea Arcangeli
  2003-04-05 21:24                     ` Andrew Morton
@ 2003-04-05 21:34                     ` Rik van Riel
  2003-04-06  9:29                     ` Benjamin LaHaise
  3 siblings, 0 replies; 105+ messages in thread
From: Rik van Riel @ 2003-04-05 21:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, 5 Apr 2003, Andrea Arcangeli wrote:

> I'm not questioning that during paging rmap is more efficient than
> objrmap, but your argument that rmap has lower complexity than objrmap
> and that rmap is needed is wrong. The fact is that with your 100 mappings
> in each of the 100 tasks, both algorithms work in O(N) where N is
> the number of pagetables mapping the page. No difference in
> complexity.  I don't care how many cycles you spend to reach the 100x100
> pagetables, those are fixed cycles, the fact is that there are 100x100
> pagetables,

Umm no.  The fact that a VMA is "mapping" the page doesn't
mean the page is resident in any page tables.   For example,
think about the MAP_PRIVATE mapping of the relocation tables
from libc.so ... every process will have its own, modified,
copy of that data.  The original page might not be mapped by
the page tables of any process.

> rmap won't change the complexity of the algorithm at all,

It will for some cases (as shown above), but I agree that for
most common situations objrmap and pte rmap should have very
similar algorithmic complexity in the pageout path.

> that's mandated by the hardware and by your application, we can't do
> better than O(N) with N the number of pagetables to unmap a single page.
> Even rmap has the O(N) complexity, it won't be allowed to reach only 100
> pagetables instead of 100000 pagetables.

There is one common situation where objrmap is O(N^2) while
pte rmap is only O(N).  However, this case isn't interesting
because this workload tends to run mlocked anyway.

This is, of course, Oracle on 32 bit systems with gazillions
of windows into the larger-than-virtual-memory shared memory
area.

This aspect of Oracle can be special-cased with remap_file_pages
and the reverse mapping can be skipped altogether since Oracle's
shared memory area should (IMHO) be mlocked anyway.

In short, I agree with you that we probably want object rmap for
all the common cases.

cheers,

Rik
-- 
Engineers don't grow up, they grow sideways.
http://www.surriel.com/		http://kernelnewbies.org/

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 21:24                     ` Andrew Morton
@ 2003-04-05 22:06                       ` Andrea Arcangeli
  2003-04-05 22:31                         ` Andrew Morton
  0 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05 22:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 01:24:06PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > > Andrew Morton <akpm@digeo.com> wrote:
> > > >
> > > > Nobody has written an "exploit" for this yet, but it's there.
> > > 
> > > Here we go.  The test app is called `rmap-test'.  It is in ext3 CVS.  See
> > > 
> > > 	http://www.zip.com.au/~akpm/linux/ext3/
> > > 
> > > It sets up N MAP_SHARED VMA's and N tasks touching them in various access
> > > patterns.
> > 
> > I'm not questioning that during paging rmap is more efficient than
> > objrmap, but your argument that rmap has lower complexity than objrmap
> > and that rmap is needed is wrong. The fact is that with your 100 mappings
> > in each of the 100 tasks, both algorithms work in O(N) where N is
> > the number of pagetables mapping the page.
> 
> Nope.  To unmap a page, full rmap has to scan 100 pte_chain slots, which is 3
> cachelines worth.  objrmap has to scan 10,000 vma's, 9,900 of which do not map
> that page at all.

I see what you mean, you're right. That's because all the 10,000 vmas
belong to the same inode.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 22:06                       ` Andrea Arcangeli
@ 2003-04-05 22:31                         ` Andrew Morton
  2003-04-05 23:10                           ` Andrea Arcangeli
                                             ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-05 22:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> I see what you mean, you're right. That's because all the 10,000 vma
> belongs to the same inode.

I see two problems with objrmap - this search, and the complexity of the
interworking with nonlinear mappings.

There is talk going around about implementing some more sophisticated search
structure than a linear list.

And treating the nonlinear mappings as being mlocked is a great
simplification - I'd be interested in Ingo's views on that.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 22:31                         ` Andrew Morton
@ 2003-04-05 23:10                           ` Andrea Arcangeli
  2003-04-06  1:58                             ` Andrew Morton
  2003-04-06  7:38                             ` William Lee Irwin III
  2003-04-06 12:37                           ` Jamie Lokier
  2003-04-22 11:00                           ` Ingo Molnar
  2 siblings, 2 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-05 23:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 02:31:38PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I see what you mean, you're right. That's because all the 10,000 vma
> > belongs to the same inode.
> 
> I see two problems with objrmap - this search, and the complexity of the
> interworking with nonlinear mappings.

I still think we shouldn't associate any metadata with the nonlinear.
Nonlinear should be enabled via a sysctl and run at true full
speed; it's a bypass for the VM so you can mangle the pagetables from
userspace.

As soon as you start associating metadata with nonlinear, it's not the
"raw fast" thing anymore and it increases the complexity.

Running bochs after echoing 1 into a sysctl should be fine, just as
uml should echo 1 into a sysctl to get revirtualized vsyscalls
(unless we make it a prctl, but that'll be more complex and slower).

When bochs starts and runs the mmap(VM_NONLINEAR) it will get -EPERM and
it will fall back to the mmap mode (for 2.4 anyway). Or they can just as
well require the echoing so they won't need to maintain two modes.
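
(Purely as an illustration of what I mean, with an invented sysctl name,
the gate could sit right at the top of the syscall:

	int sysctl_nonlinear_enabled;	/* say, /proc/sys/vm/nonlinear-enabled */

	/* first check in sys_remap_file_pages(): */
	if (!sysctl_nonlinear_enabled)
		return -EPERM;

and nothing else in the VM would need to know whether the feature is in
use.)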

the nonlinear should work only in a separate special vma; its current
api is very unclean since it can mix with the original linear stuff in the
same linear vma, and it doesn't allow more than one file in the same
nonlinear vma. I still recommend all the points I posted yesterday
to change the API to something much more appropriate.

there is a reason we have the vma. I mean, if we can do a lighter thing
inside the nonlinear vmas that has the same powerful functionality as
the linear vmas, then why don't we replace the vma with this lighter thing
in the first place?

> There is talk going around about implementing some more sophisticated search
> structure than a linear list.
> 
> And treating the nonlinear mappings as being mlocked is a great
> simplification - I'd be interested in Ingo's views on that.

it's the right way IMHO; remap_file_pages is such a hack that it can for
sure live under a sysctl. Think vmware, it even requires kernel
modules. A sysctl is nothing compared to that. I wouldn't like to see
applications start using it.  Especially since those sigbus in the current api
would be more expensive than the regular paging internal to the VM, and
besides the signal it would generate a flood of syscalls and a kind of
duplication of memory management in userspace. And the
databases just live under the sysctl for the largepages in 2.4
anyway.

About rmap's lower complexity vs objrmap, that's interesting now that
I understand what your case is doing exactly, and well, you have a good
argument against objrmap, but given the performance difference I
still definitely give the priority to the fast paths.  That said, to
be 100% fair the benchmark comparisons between 2.4.21preXaaX and 2.5
should be done with a glibc that uses the syscall instruction in 2.5,
but I doubt it can all be explained by that, especially given the latest
speedups in that area. In short I personally don't care about running
the rmap-test that much faster.

about the oom problem, I tried to ^C a few times but nothing bad
happened here. I'm not sure if it's worth it for me to change anything on the
2.4 side; I would like to reproduce it at least once. I know I can't
guarantee 100% reliable allocations, that's a given: I don't know how many
pages are freeable, and if I loop forever I could deadlock.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 12:06                 ` Andrew Morton
  2003-04-05 15:11                   ` Martin J. Bligh
  2003-04-05 16:30                   ` Andrea Arcangeli
@ 2003-04-05 23:25                   ` William Lee Irwin III
  2003-04-05 23:57                     ` Andrew Morton
  2003-04-06  9:26                     ` Benjamin LaHaise
  2003-04-06  2:23                   ` Martin J. Bligh
  3 siblings, 2 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-05 23:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrew Morton <akpm@digeo.com> wrote:
>> Nobody has written an "exploit" for this yet, but it's there.

On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> Here we go.  The test app is called `rmap-test'.  It is in ext3 CVS.  See
> 	http://www.zip.com.au/~akpm/linux/ext3/
> It sets up N MAP_SHARED VMA's and N tasks touching them in various access
> patterns.

I apparently erred when I claimed this kind of test would not provide
useful figures of merit for page replacement algorithms. There appears
to be more to life than picking the right pages.


On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
> modify their 100 vma's in a linear walk.  Total working set is 240MB
> (slightly more than is available).
> 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
> 2.5.66-mm4:
> 	15.76s user 86.91s system 33% cpu 5:05.07 total
> 2.5.66-mm4+objrmap:
> 	23.07s user 1143.26s system 87% cpu 22:09.81 total
> 2.4.21-pre5aa2:
> 	14.91s user 75.30s system 24% cpu 6:15.84 total

This seems ominous; I hope that methods of reducing "external
interference" as I called it are able to salvage the space conservation
benefits. IMHO this is the most important of the tests posted.


On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> In the second test we again have 100 tasks, each with 100 vma's but the
> access pattern is random:
> 	./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
> 2.5.66-mm4:
> 	0.12s user 6.05s system 2% cpu 3:59.68 total
> 2.5.66-mm4+objrmap:
> 	0.12s user 2.10s system 0% cpu 4:01.15 total
> 2.4.21-pre5aa2:
> 	0.07s user 2.03s system 0% cpu 4:12.69 total
> The -aa VM failed in this test.
> 	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> 	VM: killing process rmap-test
> I'd have to call this a bug - the machine was full of reclaimable memory.
> I also saw the 2.4 kernel do 705,000 context switches in a single second,
> which was odd.  It only happened once.

I'm actually somewhat surprised that any (much less all) of the three
behaved so well with a random access pattern. AIUI workloads without
locality of reference are not really very well served by LRU replacement;
perhaps this understanding should be revised.


On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> In the third test a single task owns 10000 VMA's and walks across them in a
> linear pattern:
> 	./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo
> 2.5.66-mm4:
> 	0.25s user 3.75s system 1% cpu 4:38.44 total
> 2.5.66-mm4+objrmap:
> 	0.28s user 146.45s system 16% cpu 15:14.59 total
> 2.4.21-pre5aa2:
> 	0.32s user 4.83s system 0% cpu 18:25.90 total

This doesn't appear to be the kind of issue that would be addressed by
the more advanced search structure to replace ->i_mmap and ->i_mmap_shared.
I'm somewhat surprised the virtualscan does so poorly; from an a priori
POV with low sharing and linear access there's no obvious reason in my
mind why it would do as poorly as or worse than the objrmap here.

I'm not sure why objrmap is chewing so much cpu here. There doesn't
appear to be any sharing happening. It sounds implausible at first
but it could be checking ->i_mmap and ->i_mmap_shared then walking
all 3 levels of the pagetables once over for each pte to edit. This
would also seem to reflect the case for file-backed pages, where large
amounts of file data are accessed and page replacement is needed, but
actual sharing levels are low, as MAP_SHARED implies the creation of
a different anonymous file object and different memory for each vma.


On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> These are not ridiculous workloads, especially the third one.  And 10k VMA's
> is by no means inconceivable.
> The objrmap code will be show-stoppingly expensive at 100k vmas per file.
> And as expected, the full rmap implementation gives the most stable,
> predictable and highest performance result under heavy load.  That's why
> we're using it.
> When it comes to the VM, there is a lot of value in sturdiness under unusual
> and heavy loads.
> Tomorrow I'll change the test app to do nonlinear mappings too.

Hmm, thing is, it can be fixed for 100K mostly disjoint vma's per file
by going to an O(lg(vmas)) ->i_mmap/->i_mmap_shared structure, but it
can't be fixed that way for 10K vma's all pointing to different files
or 100K vma's covering identical ranges. It might be possible to do a
one-off that uses ->pte.direct (or something) for when pages aren't
really shared to avoid walking up and down pagetables, but that's
defeated by a small-but-reasonable constellation of sharers, which is
probably the real typical usage case for large vma counts.

But I don't see a lack of directions to move in; it may well be
possible to mix some pagetable walking in a "related" area to amortize
the pointwise top-down pagetable walking cost, and other things I
haven't thought of yet may also be possible. IMHO it's worthwhile to
pursue the space conservation benefits even at the price of some
complexity, but of course what you want to merge is (unfortunately for
some, possibly even me) another matter.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:25                   ` William Lee Irwin III
@ 2003-04-05 23:57                     ` Andrew Morton
  2003-04-06  0:14                       ` Andrea Arcangeli
  2003-04-06  2:13                       ` William Lee Irwin III
  2003-04-06  9:26                     ` Benjamin LaHaise
  1 sibling, 2 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-05 23:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: andrea, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
> > modify their 100 vma's in a linear walk.  Total working set is 240MB
> > (slightly more than is available).
> > 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
> > 2.5.66-mm4:
> > 	15.76s user 86.91s system 33% cpu 5:05.07 total
> > 2.5.66-mm4+objrmap:
> > 	23.07s user 1143.26s system 87% cpu 22:09.81 total
> > 2.4.21-pre5aa2:
> > 	14.91s user 75.30s system 24% cpu 6:15.84 total
> 
> This seems ominous; I hope that methods of reducing "external
> interference" as I called it are able to salvage the space conservation
> benefits. IMHO this is the most important of the tests posted.

Well the third test (one task, 10k windows into a large file) does not seem
like an unreasonable design for an application.

> 
> On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > In the second test we again have 100 tasks, each with 100 vma's but the
> > access pattern is random:
> > 	./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
> > 2.5.66-mm4:
> > 	0.12s user 6.05s system 2% cpu 3:59.68 total
> > 2.5.66-mm4+objrmap:
> > 	0.12s user 2.10s system 0% cpu 4:01.15 total
> > 2.4.21-pre5aa2:
> > 	0.07s user 2.03s system 0% cpu 4:12.69 total
> > The -aa VM failed in this test.
> > 	__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> > 	VM: killing process rmap-test
> > I'd have to call this a bug - the machine was full of reclaimable memory.
> > I also saw the 2.4 kernel do 705,000 context switches in a single second,
> > which was odd.  It only happened once.
> 
> I'm actually somewhat surprised that any (much less all) of the three
> behaved so well with a random access pattern. AIUI workloads without
> locality of reference are not really very well served by LRU replacement;
> perhaps this understanding should be revised.

Note the "-i 1".  That's one iteration, as opposed to ten.

Given that we managed to achieve 100 milliseconds of user CPU time in four
minutes, this isn't really interesting.  It is completely IO-bound and the
machine is underprovisioned for the load which it is running.

> 
> On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > In the third test a single task owns 10000 VMA's and walks across them in a
> > linear pattern:
> > 	./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo
> > 2.5.66-mm4:
> > 	0.25s user 3.75s system 1% cpu 4:38.44 total
> > 2.5.66-mm4+objrmap:
> > 	0.28s user 146.45s system 16% cpu 15:14.59 total
> > 2.4.21-pre5aa2:
> > 	0.32s user 4.83s system 0% cpu 18:25.90 total
> 
> This doesn't appear to be the kind of issue that would be addressed by
> the more advanced search structure to replace ->i_mmap and ->i_mmap_shared.

We have 10000 disjoint VMA's and we want to find the one which maps this
page.  If we cannot solve this then we have a problem.

> I'm somewhat surprised the virtualscan does so poorly; from an a priori
> POV with low sharing and linear access there's no obvious reason in my
> mind why it would do as poorly as or worse than the objrmap here.

The virtual scan did well in all tests I _think_.  What happened in this test
is that the IO scheduling was crap - the disk sounded like a dentist's drill.

Could be that this is due to the elevator changes which Andrea has made, or
perhaps fault-time readaround is broken or something else.  The file layout
was effectively identical in both kernels.  I don't know, but I think it's
unrelated to the scanning design.

> I'm not sure why objrmap is chewing so much cpu here. There doesn't
> appear to be any sharing happening.

The file has 10k vma's attached to it.  The VM has to scan 50% of those for
each page_referenced() and try_to_unmap() attempt against each page.

It shouldn't be too hard to locate the first VMA which covers file offset N
with a tree or whatever.
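
Even something as dumb as keeping them sorted by vm_pgoff would do for
the disjoint-windows case.  Sketch only, with an array standing in for
whatever tree we would really use, and overlapping vma's need more care:

	static struct vm_area_struct *first_covering(struct vm_area_struct **sorted,
						      int nr, unsigned long pgoff)
	{
		int lo = 0, hi = nr;

		while (lo < hi) {
			int mid = (lo + hi) / 2;
			struct vm_area_struct *vma = sorted[mid];
			unsigned long end = vma->vm_pgoff +
				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);

			if (end <= pgoff)
				lo = mid + 1;	/* vma ends before this file page */
			else
				hi = mid;
		}
		return lo < nr ? sorted[lo] : NULL;	/* may still start after pgoff */
	}

That turns the 10,000-entry walk into ~14 comparisons per page.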

> Hmm, thing is, it can be fixed for 100K mostly disjoint vma's per file
> by going to an O(lg(vmas)) ->i_mmap/->i_mmap_shared structure, but it
> can't be fixed that way for 10K vma's all pointing to different files

We don't have to solve the "all pointing to different files" problem do we?

> or 100K vma's covering identical ranges.

100k vma's covering identical ranges is not too bad, because we know that
they all cover the page we're interested in (assuming they're all the same
length..).

Yes, there is the secondary inefficiency that not all of the vma's pagetables
are necessarily actively mapping the page, but at least the main problem of
searching completely irrelevant vma's isn't there.

> IMHO it's worthwhile to
> pursue the space conservation benefits even at the price of some
> complexity, but of course what you want to merge is (unfortunately for
> some, possibly even me) another matter.

Well I haven't seen anything yet, alas.

Is the pte_chain space saving which page clustering gives not sufficient?



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:57                     ` Andrew Morton
@ 2003-04-06  0:14                       ` Andrea Arcangeli
  2003-04-06  1:39                         ` Andrew Morton
  2003-04-06  2:13                       ` William Lee Irwin III
  1 sibling, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-06  0:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: William Lee Irwin III, mbligh, mingo, hugh, dmccr, linux-kernel,
	linux-mm

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> > On Sat, Apr 05, 2003 at 04:06:14AM -0800, Andrew Morton wrote:
> > > In the third test a single task owns 10000 VMA's and walks across them in a
> > > linear pattern:
> > > 	./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo
> > > 2.5.66-mm4:
> > > 	0.25s user 3.75s system 1% cpu 4:38.44 total
> > > 2.5.66-mm4+objrmap:
> > > 	0.28s user 146.45s system 16% cpu 15:14.59 total
> > > 2.4.21-pre5aa2:
> > > 	0.32s user 4.83s system 0% cpu 18:25.90 total
> > 
> > This doesn't appear to be the kind of issue that would be addressed by
> > the more advanced search structure to replace ->i_mmap and ->i_mmap_shared.
> 
> We have 10000 disjoint VMA's and we want to find the one which maps this
> page.  If we cannot solve this then we have a problem.
> 
> > I'm somewhat surprised the virtualscan does so poorly; from an a priori
> > POV with low sharing and linear access there's no obvious reason in my
> > mind why it would do as poorly as or worse than the objrmap here.
> 
> The virtual scan did well in all tests I _think_.  What happened in this test
> is that the IO scheduling was crap - the disk sounded like a dentist's drill.
> 
> Could be that this is due to the elevator changes which Andrea has made, or

2.4-aa is outperforming 2.5 in almost all tiobench results, so I doubt
the elevator is so bad that it could explain such a drop in performance. 

I suspect it must be something along the lines of the filesystem doing
synchronous I/O for some reason inside writepage, like doing a
wait_on_buffer for every writepage, generating the above fake results.
Note the 0% cpu time. You're not benchmarking the vm here. In fact I
would be interested to see the above repeated on ext2.

It's not true that ext3 shares the same writepage as ext2, as you
said in an earlier email; the ext3 writepage starts like this:

static int ext3_writepage(struct page *page)
{
	struct inode *inode = page->mapping->host;
	struct buffer_head *page_buffers;
	handle_t *handle = NULL;
	int ret = 0, err;
	int needed;
	int order_data;

	J_ASSERT(PageLocked(page));
	
	/*
	 * We give up here if we're reentered, because it might be
	 * for a different filesystem.  One *could* look for a
	 * nested transaction opportunity.
	 */
	lock_kernel();
	if (ext3_journal_current_handle())
		goto out_fail;

	needed = ext3_writepage_trans_blocks(inode);
	if (current->flags & PF_MEMALLOC)
		handle = ext3_journal_try_start(inode, needed);
	else
		handle = ext3_journal_start(inode, needed);

and even the ext2 writepage can be synchronous if it has to call
get_block. In fact I would recommend filling the "foo" file with zeros
rather than leaving holes in it, just to avoid additional synchronous fs
overhead, so that the only synchronous part is the inode map lookup.
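
e.g. something like (just an example command, sized to cover the test's
working set):

	dd if=/dev/zero of=foo bs=1M count=300

so every block is already allocated before the run starts.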

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  0:14                       ` Andrea Arcangeli
@ 2003-04-06  1:39                         ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2003-04-06  1:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: wli, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> 2.4-aa is outperforming 2.5 in almost all tiobench results, so I doubt
> the elevator is bad enough to explain such a drop in performance.

Well.  tiobench doesn't measure concurrent reads and writes.

A quick test shows the anticipatory scheduler runs `tiobench --threads 16'
1.5x faster on reads and 1.15x faster on writes.  But that's a damn good
result for a 2.4 elevator.

It has a starvation problem though.

Running this:

while true
do
	dd if=/dev/zero of=x bs=1M count=300 conv=notrunc
done

in parallel with five reads from five 200M files shows writes getting stalled
for 20 second periods.


 0  8      0   3980   1800 223056    0    0 11688 10168  450   351  0  1 99  0
 0  7      0   3516   1792 223548    0    0 20036  4136  491   476  0  5 95  0
 2  5      0   3864   1792 223200    0    0 23444     0  469   727  1  2 97  0
 0  7      0   3384   1792 223684    0    0 21952     0  456   639  0  6 94  0
 1  6      0   3468   1792 223596    0    0 23436     0  475   680  0  2 98  0
 0  7      0   4172   1792 222896    0    0 22824     0  469   597  0  2 98  0
 0  7      0   3472   1792 223592    0    0 24376     0  493   599  0  5 95  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  4      0   3352   1792 223712    0    0 24680     0  496   574  1 11 88  0
 0  7      0   3920   1792 223144    0    0 24120     0  482   708  1  5 94  0
 0  7      0   3920   1792 223144    0    0 23536     0  474   556  1  4 95  0
 0  7      0   3912   1792 223152    0    0 22524     0  468   502  0  5 95  0
 0  7      0   3564   1792 223500    0    0 23120     0  471   510  0  4 96  0
 0  7      0   3324   1792 223740    0    0 21732     0  449   657  0  4 96  0
 0  7      0   3656   1792 223408    0    0 24236     0  484   554  1  3 96  0
 0  7      0   4256   1792 222808    0    0 23076     0  474   561  0  8 92  0
 0  7      0   3436   1792 223628    0    0 22312     0  455   501  0  1 99  0
 0  7      0   3384   1792 223680    0    0 23588     0  476   611  1  1 98  0
 0  8      0   3408   1792 223656    0    0 21464  1312  474   615  0  7 93  0
 0  8      0   3328   1792 223736    0    0 13772 10300  478   467  0  1 99  0
 0  8      0   3492   1656 223712    0    0 11988 12612  497   409  0  1 99  0
 0  8      0   3976    796 224088    0    0 14748  5952  432   367  0  2 98  0
 0  8      0   3812    728 224324    0    0 15636  8064  476   449  1  2 97  0
 0  8      0   3768    732 224364    0    0 12328 10880  469   361  0  1 99  0
 0  8      0   3504    752 224608    0    0 12548  8452  435   354  0  2 98  0
 0  8      0   4180    760 223924    0    0 14676  9920  492   419  0  4 96  0
 0  8      0   3616    776 224472    0    0 12976  9660  462   367  0  3 97  0
 0  7      0   3784    792 224288    0    0 15312  8864  483   401  0  4 96  0
 0  7      0   3328    808 224728    0    0 21060     0  443   468  0  2 98  0
 3  4      0   3324    832 224708    0    0 21752     0  449   470  0  4 96  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  7      0   3496    852 224516    0    0 21584     0  447   427  0  5 95  0
 0  7      0   3760    876 224228    0    0 22772     0  458   526  1  3 96  0
 0  7      0   4240    896 223728    0    0 22172     0  460   448  0  4 96  0
 0  7      0   3308    920 224636    0    0 22428     0  461   471  0  5 95  0
 0  7      0   3292    944 224628    0    0 22660     0  463   493  0  5 95  0
 5  2      0   3592    960 224312    0    0 21328     0  445   459  0  1 99  0
 1  6      0   4120    980 223764    0    0 22508     0  453   435  1  1 98  0
 1  6      0   3568   1008 224288    0    0 23332     0  475   516  1  5 94  0
 0  7      0   3484   1024 224356    0    0 21968     0  457   476  1  4 95  0
 1  6      0   3928   1048 223888    0    0 20284     0  427   494  0  3 97  0
 0  7      0   3996   1072 223796    0    0 22584     0  461   538  0  3 97  0
 0  7      0   3892   1088 223884    0    0 21728     0  455   470  1  5 94  0
 0  7      0   3916    728 224220    0    0 22884     0  463   503  1  7 92  0
 0  7      0   4340    752 223772    0    0 23000     0  473   502  0  6 94  0
 0  8      0   3600    768 224496    0    0 20692  1124  447   519  0  3 97  0


> I suspect it must be something along the lines of the filesystem doing
> synchronous I/O for some reason inside writepage, like doing a
> wait_on_buffer for every writepage, generating the above fake results.
> Note the 0% cpu time. You're not benchmarking the vm here. In fact I
> would be interested to see the above repeated on ext2.
> 
> It's not true that ext3 shares the same writepage as ext2, as you
> said in an earlier email; the ext3 writepage starts like this:

No, that code's all just fluff.  These pages get a disk mapping real early
(I've just added an msync to make sure though).  So ext3_writepage() is
really nothing in this test except for block_write_full_page().  The
journalling system does not get involved much at all with overwrites to
blocks on an ordered-data filesystem.

> and even the ext2 writepage can be synchronous if it has to call
> get_block. In fact I would recommend filling the "foo" file with zeros
> rather than leaving holes in it, just to avoid additional synchronous fs
> overhead and to be synchronous only in the inode map lookup.

Yup, I did that.  It doesn't make any difference.

But you're right, the problem does not occur on 2.4.21-pre5aa2+ext2.  Nor
does it occur on 2.5+ext3, nor on 2.4.21-pre5+ext3.  It is
something specific to aa+ext3.

I don't know what's gone wrong.  It's just stuck in filemap_nopage->lock_page
all the time, seeking all over the disk.  It smells like a VM/VFS problem.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1   6380   4064   5836 232120    0    0  2940     0  214   231  0  2 98  0
 1  0   6380   3196   5836 233032    0    0  3164     0  217   239  0  0 100  0
 0  1   6380   3516   5836 232752    0    0  3136     0  216   234  0  0 100  0
 0  1   6380   3764   5836 232612    0    0  3080     0  214   231  0  2 98  0
 0  1   6380   4028   5836 232432    0    0  3080     0  214   231  0  1 99  0
 1  0   6380   3224   5836 233292    0    0  3108     0  215   231  0  1 99  0
 1  0   6380   3396   5836 233176    0    0  3164     0  216   238  0  0 100  0
 0  1   6380   3600   5836 233048    0    0  3248     0  220   243  0  3 97  0
 0  1   6380   3732   5836 232968    0    0  3192     0  218   239  0  1 99  0

It should look like:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  2   8656   3200    588 241892    0    0  6440 11928  548   545  0  1 99  0
 1  0   8656   2272    588 242896    0    0 13804  7172  817  1173  0  3 97  0
 0  1   8656   2240    588 243036   20   60 16344 14572  924  1236  0  4 96  0
 0  1   8656   2380    528 243188    0    0 16060 16044  965  1183  0  9 91  0
 0  2   8656   3192    544 242068  312    4 13104 13316  801  1038  0  5 95  0
 0  2   8656   2208    564 243204    0    0 14676 17920  929  1086  0  5 95  0
 0  1   8656   2264    576 243160   20    0 16160 11836  892  1231  0  3 97  0
 1  0   8656   4444    584 240956    0    0 10200 15640  740   829  0  3 97  0

File layout is OK, same as ext2:

79895-79895: 0-0 (1)
79896-79902: 260848-260854 (7)
79903-79903: 0-0 (1)
79904-79910: 32607-32613 (7)
79911-79911: 0-0 (1)
79912-79918: 260904-260910 (7)
79919-79919: 0-0 (1)
79920-79926: 32614-32620 (7)
79927-79927: 0-0 (1)
79928-79934: 260960-260966 (7)
79935-79935: 0-0 (1)
79936-79942: 32621-32627 (7)
79943-79943: 0-0 (1)
79944-79950: 261016-261022 (7)
79951-79951: 0-0 (1)

I applied the -aa ext3 patches to 2.4.21-pre5 and that ran OK.

It's almost like the VM is refusing to call ext3_writepage() for some reason,
or is only reclaiming clean pagecache, or the filemap_nopage() readaround
isn't working.  Very odd.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:10                           ` Andrea Arcangeli
@ 2003-04-06  1:58                             ` Andrew Morton
  2003-04-06 14:47                               ` Andrea Arcangeli
  2003-04-06  7:38                             ` William Lee Irwin III
  1 sibling, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-06  1:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>
> Especially those sigbus in the current api
> would be more expensive than the regular paging internal to the VM and
> besides the signal it would generate a flood of syscalls and a kind of
> duplication of memory management inside userspace.

That went away.  We now encode the file offset in the unmapped ptes, so the
kernel's fault handler can transparently reestablish the page.
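
Roughly, the shape of that encoding (illustrative names and bit layout
only, not the actual per-arch code) is to reuse the bits of a not-present
pte to hold the page's file offset plus a software bit saying "this pte
encodes a pgoff":

/*
 * Illustration only: the real encoding lives in the per-arch pgtable
 * headers and uses different flag bits; this just shows the idea.
 */
#define PTE_FILE_SKETCH		0x040UL	/* software bit: pte holds a pgoff */
#define PTE_FLAG_BITS_SKETCH	7	/* low bits reserved for flags     */

static inline unsigned long pgoff_to_pte_sketch(unsigned long pgoff)
{
	/* shift the file offset above the flag bits, mark it as a file pte */
	return (pgoff << PTE_FLAG_BITS_SKETCH) | PTE_FILE_SKETCH;
}

static inline unsigned long pte_to_pgoff_sketch(unsigned long pte)
{
	return pte >> PTE_FLAG_BITS_SKETCH;
}

At fault time a pte which is not present but has that bit set tells the
fault handler which page of the file to bring back, without consulting
vma->vm_pgoff, so the nonlinear window is re-established transparently.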


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:57                     ` Andrew Morton
  2003-04-06  0:14                       ` Andrea Arcangeli
@ 2003-04-06  2:13                       ` William Lee Irwin III
  1 sibling, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  2:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

William Lee Irwin III <wli@holomorphy.com> wrote:
>> This seems ominous; I hope that methods of reducing "external
>> interference" as I called it are able to salvage the space conservation
>> benefits. IMHO this is the most important of the tests posted.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> Well the third test (one task, 10k windows into a large file) does not seem
> like an unreasonable design for an application.

They're all useful, but I thought this was the "most typical" usage case.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm actually somewhat surprised that any (much less all) of the three
>> behaved so well with a random access pattern. AIUI workloads without
>> locality of reference are not really very well served by LRU replacement;
>> perhaps this understanding should be revised.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> Note the "-i 1".  That's one iteration, as opposed to ten.
> Given that we managed to achieve 100 milliseconds of user CPU time in four
> minutes, this isn't really interesting.  It is completely IO-bound and the
> machine is underprovisioned for the load which it is running.

Point taken; a large buffering arena is needed for this kind of load
to coalesce IO effectively.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> This doesn't appear to be the kind of issue that would be addressed by
>> the more advanced search structure to replace ->i_mmap and ->i_mmap_shared.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> We have 10000 disjoint VMA's and we want to find the one which maps this
> page.  If we cannot solve this then we have a problem.

It's a matter of programming stuff described in various textbooks and
papers, so aside from lacking an implementation and/or being behind
schedule implementing it, I think we're fine.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm somewhat surprised the virtualscan does so poorly; from an a priori
>> POV with low sharing and linear access there's no obvious reason in my
>> mind why it would do as poorly as or worse than the objrmap here.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> The virtual scan did well in all tests I _think_.  What happened in this test
> is that the IO scheduling was crap - the disk sounded like a dentist's drill.
> Could be that this is due to the elevator changes which Andrea has made, or
> perhaps fault-time readaround is broken or something else.  The file layout
> was effectively identical in both kernels.  I don't know, but I think it's
> unrelated to the scanning design.

Bad seeking behavior could easily explain it; I was at a loss to
explain what on earth things could be blocked on, as I've gotten
perhaps too used to large buffering arenas being readily available.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm not sure why objrmap is chewing so much cpu here. There doesn't
>> appear to be any sharing happening.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> The file has 10k vma's attached to it.  The VM has to scan 50% of those for
> each page_referenced() and try_to_unmap() attempt against each page.
> It shouldn't be too hard to locate the first VMA which covers file offset N
> with a tree or whatever.

Sorry, I misunderstood this part of the test. I thought each vma had a
different file when I wrote the post.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> Hmm, thing is, it can be fixed for 100K mostly disjoint vma's per file
>> by going to an O(lg(vmas)) ->i_mmap/->i_mmap_shared structure, but it
>> can't be fixed that way for 10K vma's all pointing to different files

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> We don't have to solve the "all pointing to different files" problem do we?

I sure hope not, because I don't see a way to fix it at all.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> or 100K vma's covering identical ranges.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> 100k vma's covering identical ranges is not too bad.  Because we know
> that they all cover the page we're interested in.  (assuming they're
> all the same length..)
> Yes, there is the secondary inefficiency that not all of the vma's
> pagetables are necessarily actively mapping the page, but at least
> the main problem of searching completely irrelevant vma's isn't there.

From your test it does seem like thinning the search space by
filtering out irrelevant vma's is more important than avoiding the
top-down pagetable walk.

That secondary inefficiency is what I called "internal interference";
I suspect its importance was overestimated.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> IMHO it's worthwhile to
>> pursue the space conservation benefits even at the price of some
>> complexity, but of course what you want to merge is (unfortunately for
>> some, possibly even me) another matter.

On Sat, Apr 05, 2003 at 03:57:40PM -0800, Andrew Morton wrote:
> Well I haven't seen anything yet, alas.
> Is the pte_chain space saving which page clustering gives not sufficient?

Well, there are caveats. Don't look at the code yet, the fault handler
bits are really disgusting at the moment and in general it isn't doing
all of what it's supposed to. I really had to sort of burst out with
the announcement in advance of my 2.5 code being ready to display for
reasons of timing of hardware availability. Some time in the future
arrangements will be made for infrequent but regular testing of it.

First, the MMUPAGE_SIZE-sized pieces of a truly anonymous (not
MAP_SHARED) page are intentionally scattered, so no single PTE is truly
representative of how the page is mapped. So only truly file-backed
pages can have the linear space reduction applied, which pulls the
space savings below a factor of PAGE_MMUCOUNT. This also means the
optimization requires special-casing of file-backed pages vs. anonymous
pages while creating and scanning pte_chains (a codesize hit). It also
means a pte_chain allocation is required for a fully utilized anonymous
page, as several PTE's would map it, whereas before PG_direct was
almost universally usable for truly anonymous pages.

Second, it's only a linear reduction in the number of PTE's per process
address space that need to be chained; the number of sharers still
causes the same space consumption pattern, albeit with the growth rate
decreased by a factor of (at most) something below PAGE_MMUCOUNT.

Third, this doesn't account for pagecache fragmentation: partial
mappings of a page still require a pte_chain, and the net reduction of
PAGE_MMUCOUNT is cut further by the aggregate proportion of
utilization of file-backed pages.

The sum of these three effects is that I expect the space reduction
page clustering can achieve for pte_chains is nondeterministic and
something less than the "natural" factor would suggest. It may actually
solve it for everyone, but I'm very uncertain of whether it would.

There is another reason I favor the object-based methods, which is
that the pointwise methods are (codewise) more complex when combined
with it. It might really be too early to evaluate this, as not only am
I lacking required functionality and bugfixes that might otherwise make
the interactions more apparent, but I'm also missing a demonstration of
how page clustering and the object-based methods act in combination.
Better motives for page clustering would be larger fs blocksize support
and mem_map[] size reduction, as those effects are known a priori.

(Whether it's desirable for everyone to use larger fs blocksizes is an
open question, as you mentioned earlier in a private message.)


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 12:06                 ` Andrew Morton
                                     ` (2 preceding siblings ...)
  2003-04-05 23:25                   ` William Lee Irwin III
@ 2003-04-06  2:23                   ` Martin J. Bligh
  2003-04-06  3:55                     ` Andrew Morton
  2003-04-06 14:49                     ` Alan Cox
  3 siblings, 2 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06  2:23 UTC (permalink / raw)
  To: Andrew Morton, andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

> The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
> modify their 100 vma's in a linear walk.  Total working set is 240MB
> (slightly more than is available).
> 
> 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
> 
> 2.5.66-mm4:
> 	15.76s user 86.91s system 33% cpu 5:05.07 total
> 2.5.66-mm4+objrmap:
> 	23.07s user 1143.26s system 87% cpu 22:09.81 total
> 2.4.21-pre5aa2:
> 	14.91s user 75.30s system 24% cpu 6:15.84 total

Isn't the intent to use sys_remap_file_pages for these sorts of workloads
anyway? In which case partial objrmap = rmap for these tests, so we're
still OK?

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  3:55                     ` Andrew Morton
@ 2003-04-06  3:08                       ` Martin J. Bligh
  2003-04-06  7:42                         ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06  3:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

>> > The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
>> > modify their 100 vma's in a linear walk.  Total working set is 240MB
>> > (slightly more than is available).
>> > 
>> > 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
>> > 
>> > 2.5.66-mm4:
>> > 	15.76s user 86.91s system 33% cpu 5:05.07 total
>> > 2.5.66-mm4+objrmap:
>> > 	23.07s user 1143.26s system 87% cpu 22:09.81 total
>> > 2.4.21-pre5aa2:
>> > 	14.91s user 75.30s system 24% cpu 6:15.84 total
>> 
>> Isn't the intent to use sys_remap_file_pages for these sorts of workloads
>> anyway? In which case partial objrmap = rmap for these tests, so we're
>> still OK?
>> 
> 
> remap_file_pages() would work OK for this, yes.  Bit sad that an application
> which runs OK on 2.4 would need recoding to work acceptably under 2.5 though.

You mean like trying to run Oracle Apps or something? ;-)

5000 tasks, 2Gb shmem segment = 10Gb of PTE chains (I think).
In ZONE_NORMAL.
Somehow.
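
(A rough back-of-envelope check of that figure, assuming 4K pages and
roughly one 4-byte pte_chain pointer per mapped pte, which is an
approximation rather than the exact 2.5 pte_chain layout:

#include <stdio.h>

int main(void)
{
	unsigned long long tasks = 5000;
	unsigned long long shmem = 2ULL << 30;		/* 2GB segment */
	unsigned long long ptes  = tasks * (shmem / 4096);

	/* ~4 bytes of pte_chain per mapped pte, as assumed above */
	printf("%llu mapped ptes -> ~%llu GB of pte_chains\n",
	       ptes, (ptes * 4) >> 30);
	return 0;
}

which lands in the same ballpark.)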

Both regular rmap and partial objrmap have their corner cases. Both are
fairly easily avoidable. Both may require some app tweaks. I'd argue
that partial objrmap's bad cases are actually more obscure. And it does
better in the common case ... I think that's important.

regular rmap + shared pagetables is more workable than without.

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  2:23                   ` Martin J. Bligh
@ 2003-04-06  3:55                     ` Andrew Morton
  2003-04-06  3:08                       ` Martin J. Bligh
  2003-04-06 14:49                     ` Alan Cox
  1 sibling, 1 reply; 105+ messages in thread
From: Andrew Morton @ 2003-04-06  3:55 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> > The first test has 100 tasks, each of which has 100 vma's.  The 100 processes
> > modify their 100 vma's in a linear walk.  Total working set is 240MB
> > (slightly more than is available).
> > 
> > 	./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
> > 
> > 2.5.66-mm4:
> > 	15.76s user 86.91s system 33% cpu 5:05.07 total
> > 2.5.66-mm4+objrmap:
> > 	23.07s user 1143.26s system 87% cpu 22:09.81 total
> > 2.4.21-pre5aa2:
> > 	14.91s user 75.30s system 24% cpu 6:15.84 total
> 
> Isn't the intent to use sys_remap_file_pages for these sorts of workloads
> anyway? In which case partial objrmap = rmap for these tests, so we're
> still OK?
> 

remap_file_pages() would work OK for this, yes.  Bit sad that an application
which runs OK on 2.4 would need recoding to work acceptably under 2.5 though.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
       [not found]                     ` <20030405161758.1ee19bfa.akpm@digeo.com>
@ 2003-04-06  7:07                       ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  7:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin J. Bligh, andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 04:17:58PM -0800, Andrew Morton wrote:
> There are perhaps a few things we can do about that.
> It's only a problem on the kooky highmem boxes, and they need page clustering
> anyway.
> And this is just another instance of "lowmem pinned by highmem pages" which
> could be solved by unmapping (and not necessarily reclaiming) the highmem
> pages.  But that's a pretty lame thing to do.

I've actually liked this approach, despite it not being terribly
performant, on the grounds it is relatively non-invasive, and that once
it's in place, various stronger (and more invasive) space reduction
techniques become optimizations instead of workload feasibility patches.
The fact it generalizes to other (non-highmem) situations is also good.

I'm not terribly attached to it, but since there is some mention of it,
thought it worth mentioning that there is _some_ middle ground that I
(as one of the "big highmem box bad guys") find acceptable.

I'm largely anticipating out-of-tree patches will be needed to run
these machines anyway and am prepared (in nontrivial senses; significant
amounts of my time in the future are allocated to maintaining and
implementing the things needed for it) to take on some of the
maintenance load to keep workloads running and running performantly on
these boxen (specifically pgcl; other patches [e.g. shpte] have other
maintainers).  It's probably not the best thing to say in the face of
the possibility of a truly immense maintenance load, but it is true.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:10                           ` Andrea Arcangeli
  2003-04-06  1:58                             ` Andrew Morton
@ 2003-04-06  7:38                             ` William Lee Irwin III
  2003-04-06 14:51                               ` Andrea Arcangeli
  1 sibling, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  7:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> I still think we shouldn't associate any metadata with the nonlinear.
> nonlinear should be enabled via a sysctl and have it run at true full
> speed, it's a bypass for the VM so you can mangle the pagetables from
> userspace.
> As soon as you start associating metadata with nonlinear, it's not the
> "raw fast" thing anymore and it increases the complexity.

One of the big reasons why it's desirable is to reduce the metadata,
so I agree here.


On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> running bochs after echoing 1 into a sysctl should be fine, and uml
> should likewise echo 1 into a sysctl to get revirtualized vsyscalls
> (unless we make it a prctl but that'll be more complex and slower).
> When bochs starts and runs the mmap(VM_NONLINEAR) it will get -EPERM and
> it will fall back to the mmap mode (for 2.4 anyways). Or they can as well
> require the echoing so they won't need to maintain two modes.
> the nonlinear should work only in a separate special vma, its current
> api is very unclean since it can mix with original linear stuff in the
> same linear vma, and it doesn't allow more than one file in the same
> nonlinear vma. I still recommend all my points that I posted yesterday
> to change the API to something much more appropriate.

This is an unusual idea; I'd expect capable(CAP_IPC_LOCK) to suffice
to provide the privilege checks for direct mlocking as well as other
operations that lock memory (please don't look at hugetlbfs for this...).


On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> there is a reason we have the vma. I mean, if we can do a lighter thing
> inside the nonlinear vmas that has the same powerful functionality as
> the linear vmas, then why not replace the vma with this lighter thing
> in the first place?

Well, the only information that the sub-vma's would need is the address
range and the data structure linkage; the rest could be divined from
the master vma. It's vaguely plausible, though I've no idea how
effective it would be.


At some point in the past, akpm wrote:
>> There is talk going around about implementing some more sophisticated search
>> structure thatn a linear list.
>> And treating the nonlinear mappings as being mlocked is a great
>> simplification - I'd be interested in Ingo's views on that.

On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> it's the right way IMHO, remap_file_pages is such an hack that can for
> sure live under a sysctl. Think vmware, it even requires the kernel
> modules. A sysctl is nothing compared to that. I wouldn't like to see
> applications start using it.  Esepcially those sigbus in the current api
> would be more expensive than the regular paging internal to the VM and
> besides the signal it would generate flood of syscalls and kind of
> duplication of memory management inside the userspace. And for the
> database they just live under the sysctl for the largepages in 2.4
> anyways.

I don't know of any SIGBUS in the current API; AFAIK it prefaults the
PTE at remap_file_pages()-time if possible (there is a "blocking"
flag IIRC) and if not, allows minor faults by retrieving the file
offset from the (invalid!) PTE, with otherwise normal fault servicing.

I'm not convinced it's truly a hack. Inherently complex mappings need
a more efficient mapping mechanism on both 32-bit and 64-bit machines:
on 64-bit for reasons of pagetable utilization and space efficiency,
and on 32-bit on the grounds of lowmem consumption by vma's.


On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> About the rmap lower complexity vs objrmap, that's interesting now that
> I understood what your case is doing exactly, and well you have a good
> argument against objrmap, but given the performance difference I
> still definitely give the priority to the fast paths.  That said, to
> be 100% fair the benchmark comparisons between 2.4.21preXaaX and 2.5
> should be done with a glibc that uses the syscall instruction on 2.5,
> but I doubt it can only be explained by that, especially given the latest
> speedups in that area. In short I personally don't care about running
> the rmap-test that much faster.

Predictability and graceful degradation under load are very important.
fork() and exec() are actually slow paths, despite being related to the
response times of forking servers and throughput of shell scripts;
they're very heavyweight operations and the "hit" seems to be within
reach of recovery with cocktails of pending or otherwise available but
not currently very well-interacting patches. Shell scripts and forking
servers are grossly inefficient anyway; performant methods generally
minimize process counts and spawning and/or turnover.

IMHO the real answer to these problems is refining the physical to
virtual address resolution methods. Physical scanning has very real
graceful degradation advantages in the presence of large-scale sharing
and complex mappings, but the space overhead of pointwise methods is
a catastrophic problem for large 32-bit machines and they have process
creation overheads that have raised eyebrows across the board.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  3:08                       ` Martin J. Bligh
@ 2003-04-06  7:42                         ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  7:42 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, andrea, mingo, hugh, dmccr, linux-kernel, linux-mm

At some point in the past, akpm wrote:
>> remap_file_pages() would work OK for this, yes.  Bit sad that an application
>> which runs OK on 2.4 would need recoding to work acceptably under 2.5 though.

On Sat, Apr 05, 2003 at 08:08:25PM -0700, Martin J. Bligh wrote:
> You mean like trying to run Oracle Apps or something? ;-)
> 5000 tasks, 2Gb shmem segment = 10Gb of PTE chains (I think).
> In ZONE_NORMAL.
> Somehow.
> Both regular rmap and partial objrmap have their corner cases. Both are
> fairly easily avoidable. Both may require some app tweaks. I'd argue
> that partial objrmap's bad cases are actually more obscure. And it does
> better in the common case ... I think that's important.
> regular rmap + shared pagetables is more workable than without.

There were 10GB of PTE's on 2.4.x too, which was not acceptable. A
constant factor more or less doesn't make or break the algorithm. The
pagetable space consumption alone, even if highmem, is beyond reason.
The fact the additional load comes from lowmem is fatal so we are of
course in a quandary, but none of this is remotely new.

I'd like to get it addressed but have serious doubts that dragging it
into this discussion is productive, for AIUI we began with the premise
that both objrmap (fobjrmap?) and sys_remap_file_pages() are desirable
but raise serious implementation difficulties when combined.

If I can reiterate the topic of the thread, objrmap needs to address
its cpu performance corner cases and sys_remap_file_pages() needs bits
of bugfixing around the VM for correctness. I think we need code to
address these things as opposed to a further protracted discussion,
as akpm and hugh have partially identified the areas in need of work to
fix sys_remap_file_pages()' issues and akpm has written testcases to
measure the performance of objrmap in its corner cases.

I think the only thing "up in the air" is that we would take a
significant code complexity hit (presumably maintenance load and
maintainer preference being the main factors) in fixing these issues
while retaining both.  People are at least weighing the code
simplification benefits of dropping one or the other, or of restricting
the usage of sys_remap_file_pages(), to avoid that code complexity hit,
and are reexamining things to determine which (if either) to drop or change.

In the spirit of fixing bugs, are there testcases to determine whether
the vmtruncate() vs. sys_remap_file_pages() bugs are fixed or to
otherwise try to trigger them?


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 23:25                   ` William Lee Irwin III
  2003-04-05 23:57                     ` Andrew Morton
@ 2003-04-06  9:26                     ` Benjamin LaHaise
  2003-04-06  9:41                       ` William Lee Irwin III
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin LaHaise @ 2003-04-06  9:26 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, andrea, mbligh, mingo,
	hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 03:25:24PM -0800, William Lee Irwin III wrote:
> I apparently erred when I claimed this kind of test would not provide
> useful figures of merit for page replacement algorithms. There appears
> to be more to life than picking the right pages.

This is precisely the conclusion which davem and I came to, and
explained at the beginning of this whole ordeal.  It all boils down to 
the complexity of the algorithm, and the fact that the number of cache 
misses scales with that.

Can we get on with merging pgcl to mitigate some of the rmap costs now?  ;-)

		-ben

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 16:30                   ` Andrea Arcangeli
                                       ` (2 preceding siblings ...)
  2003-04-05 21:34                     ` Rik van Riel
@ 2003-04-06  9:29                     ` Benjamin LaHaise
  3 siblings, 0 replies; 105+ messages in thread
From: Benjamin LaHaise @ 2003-04-06  9:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 06:30:03PM +0200, Andrea Arcangeli wrote:
> 
> I'm not questioning during paging rmap is more efficient than objrmap,
> but your argument about rmap having lower complexity of objrmap and that
> rmap is needed is wrong. The fact is that with your 100 mappings per
> each of the 100 tasks case, both algorithms works in O(N) where N is
> the number of the pagetables mapping the page. No difference in

Small mistake on your part: there are two different parameters to that:
objrmap is O(N) where N is the number of vmas, and regular rmap is O(M) 
where M is the number of currently mapped ptes.  M <= N and is frequently 
less for sparsely resident pages (ie in things like executables).

		-ben

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  9:26                     ` Benjamin LaHaise
@ 2003-04-06  9:41                       ` William Lee Irwin III
  2003-04-06  9:54                         ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  9:41 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, andrea, mbligh, mingo, hugh, dmccr, linux-kernel,
	linux-mm

On Sat, Apr 05, 2003 at 03:25:24PM -0800, William Lee Irwin III wrote:
>> I apparently erred when I claimed this kind of test would not provide
>> useful figures of merit for page replacement algorithms. There appears
>> to be more to life than picking the right pages.

On Sun, Apr 06, 2003 at 05:26:03AM -0400, Benjamin LaHaise wrote:
> This is precisely the conclusion which davem and myself came to, and 
> explained at the beginning of this whole ordeal.  It all boils down to 
> the complexity of the algorithm, and the fact that the number of cache 
> misses scales with that.
> Can we get on with merging pgcl to mitigate some of the rmap costs now?  ;-)

No!

You do _not_ want me to merge it now unless you're speaking of drivers/
and arch code from the standpoint of "advance notice", "version skew",
or "compatibility API's". I am not done and the core impacts of the
current source are unacceptable (even to me, who wrote the 2.5 stuff)
and furthermore the antifragmentation bits are so grossly incomplete
it's not worthy of being called a full implementation. I am willing to
accept "help" but would prefer it not be given; yes I am being slow,
but unless it's a true emergency I would much prefer to do my own
homework as it were, if only for my own edification.

This _may_ sound like it's "anti-open", but it isn't really. The fact
is this is important, so even intermediate steps need discussion, but
at the same time, I am working hard, learning, and absolutely must not
be "bailed out" except if this becomes truly critical, for otherwise I
won't learn enough to maintain it myself and as I've taken on this
myself, I need to be able to take up the burden of maintaining it even
after it's merged -- that is, by learning the hard way, not bailouts.

Even beyond this general statement, the supposed pte_chain space
reductions owing to page clustering are an open question with respect
to effectiveness. I myself would need better a priori evidence and/or
empirical evidence of this to add it to my list of claimed benefits.

All this said, the drivers and the arch code bits are actually largely
trivial substitutions. If the discussion is truly limited to that, I'm
okay with sending in pieces; still it makes me uneasy to do anything
while the code I have now is so far from working as it truly should.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  9:41                       ` William Lee Irwin III
@ 2003-04-06  9:54                         ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06  9:54 UTC (permalink / raw)
  To: Benjamin LaHaise, Andrew Morton, andrea, mbligh, mingo, hugh,
	dmccr, linux-kernel, linux-mm

On Sun, Apr 06, 2003 at 01:41:28AM -0800, William Lee Irwin III wrote:
> All this said, the drivers and the arch code bits are actually largely
> trivial substitutions. If the discussion is truly limited to that, I'm
> okay with sending in pieces; still it makes me uneasy to do anything
> while the code I have now is so far from working as it truly should.

Also, part of this is prior agreement with Hugh (the originator of the
2.4.x version of the stuff) and akpm to withhold merging until full
functionality is achieved.

IMHO this full functionality has not been achieved, even though some
demonstrations of broadened hardware support benefits are feasible.

To both respect this agreement and remain within the bounds of my own
coding ethics, I should refuse to merge until both a greater degree of
completeness of implementation and a far greater degree of code
cleanliness are achieved.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 22:31                         ` Andrew Morton
  2003-04-05 23:10                           ` Andrea Arcangeli
@ 2003-04-06 12:37                           ` Jamie Lokier
  2003-04-06 13:12                             ` William Lee Irwin III
  2003-04-22 11:00                           ` Ingo Molnar
  2 siblings, 1 reply; 105+ messages in thread
From: Jamie Lokier @ 2003-04-06 12:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrew Morton wrote:
> And treating the nonlinear mappings as being mlocked is a great
> simplification - I'd be interested in Ingo's views on that.

More generally, how about automatically discarding VMAs and rmap
chains when pages become mlocked, and not creating those structures in
the first place when mapping with MAP_LOCKED?

The idea is that adjacent locked regions would be mergable into a
single VMA, looking a lot like the present non-linear mapping, and
with no need for rmap chains.

Because mlock is reversible, you'd need the capability to reconsitute
individual VMAs from ptes when unlocking a region.

-- Jamie

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06 12:37                           ` Jamie Lokier
@ 2003-04-06 13:12                             ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06 13:12 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	linux-kernel, linux-mm

Andrew Morton wrote:
>> And treating the nonlinear mappings as being mlocked is a great
>> simplification - I'd be interested in Ingo's views on that.

On Sun, Apr 06, 2003 at 01:37:53PM +0100, Jamie Lokier wrote:
> More generally, how about automatically discarding VMAs and rmap
> chains when pages become mlocked, and not creating those structures in
> the first place when mapping with MAP_LOCKED?

There is some complexity there, as multiple allocations are involved.


On Sun, Apr 06, 2003 at 01:37:53PM +0100, Jamie Lokier wrote:
> The idea is that adjacent locked regions would be mergable into a
> single VMA, looking a lot like the present non-linear mapping, and
> with no need for rmap chains.

This is an even harder issue than we've considered. We're going to have
to redefine things before this is certain.


On Sun, Apr 06, 2003 at 01:37:53PM +0100, Jamie Lokier wrote:
> Because mlock is reversible, you'd need the capability to reconstitute
> individual VMAs from ptes when unlocking a region.

This is a horror from which we've not yet recovered (that I know of).
I'd like to see a proper answer for this.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  1:58                             ` Andrew Morton
@ 2003-04-06 14:47                               ` Andrea Arcangeli
  2003-04-06 21:35                                 ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-06 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 05:58:24PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > Especially those sigbus in the current api
> > would be more expensive than the regular paging internal to the VM and
> > besides the signal it would generate a flood of syscalls and a kind of
> > duplication of memory management inside userspace.
> 
> That went away.  We now encode the file offset in the unmapped ptes, so the
> kernel's fault handler can transparently reestablish the page.

if you put the file offset in the pte, you will limit the max file
offset that you can map; that at least should be recoded with a cookie,
like we do with the swap space
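
To put a number on it (back-of-envelope only; the 7 flag bits below are
an assumed figure, not the real i386 layout):

#include <stdio.h>

int main(void)
{
	unsigned long pte_bits = 32, flag_bits = 7, page_shift = 12;

	/* offset bits left over once the flag bits are reserved */
	unsigned long long max_bytes =
		(1ULL << (pte_bits - flag_bits)) << page_shift;

	printf("largest encodable file offset: %llu GB\n", max_bytes >> 30);
	return 0;
}

i.e. 2^25 pages * 4KB = 128GB with those numbers; a cookie or indirection
table, as with swap entries, trades a lookup for lifting that limit.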

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  2:23                   ` Martin J. Bligh
  2003-04-06  3:55                     ` Andrew Morton
@ 2003-04-06 14:49                     ` Alan Cox
  2003-04-06 16:13                       ` Martin J. Bligh
  1 sibling, 1 reply; 105+ messages in thread
From: Alan Cox @ 2003-04-06 14:49 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm

On Sul, 2003-04-06 at 03:23, Martin J. Bligh wrote:
> > 	14.91s user 75.30s system 24% cpu 6:15.84 total
> 
> Isn't the intent to use sys_remap_file_pages for these sorts of workloads
> anyway? In which case partial objrmap = rmap for these tests, so we're
> still OK?

What matters is the worst case, not the best case. Users will do
non-optimal things on a regular basis.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06  7:38                             ` William Lee Irwin III
@ 2003-04-06 14:51                               ` Andrea Arcangeli
  0 siblings, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-06 14:51 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, mbligh, mingo, hugh, dmccr,
	linux-kernel, linux-mm

On Sat, Apr 05, 2003 at 11:38:36PM -0800, William Lee Irwin III wrote:
> On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> > I still think we shouldn't associate any metadata with the nonlinear.
> > nonlinear should be enabled via a sysctl and have it run at true full
> > speed, it's a bypass for the VM so you can mangle the pagetables from
> > userspace.
> > As soon as you start associating metadata with nonlinear, it's not the
> > "raw fast" thing anymore and it increases the complexity.
> 
> One of the big reasons why it's desirable is to reduce the metadata,
> so I agree here.
> 
> 
> On Sun, Apr 06, 2003 at 01:10:08AM +0200, Andrea Arcangeli wrote:
> > running bochs after echoing 1 into a sysctl should be fine, and uml
> > should likewise echo 1 into a sysctl to get revirtualized vsyscalls
> > (unless we make it a prctl but that'll be more complex and slower).
> > When bochs starts and runs the mmap(VM_NONLINEAR) it will get -EPERM and
> > it will fall back to the mmap mode (for 2.4 anyways). Or they can as well
> > require the echoing so they won't need to maintain two modes.
> > the nonlinear should work only in a separate special vma, its current
> > api is very unclean since it can mix with original linear stuff in the
> > same linear vma, and it doesn't allow more than one file in the same
> > nonlinear vma. I still recommend all my points that I posted yesterday
> > to change the API to something much more appropriate.
> 
> This is an unusual idea; I'd expect capable(CAP_IPC_LOCK) to suffice
> to provide the privilege checks for direct mlocking as well as other
> operations that lock memory (please don't look at hugetlbfs for this...).

that would be enough if you could expect those apps to have any
capability. Still, you could override the sysctl check and allow
mmap(VM_NONLINEAR) to work even w/ the sysctl, iff CAP_IPC_LOCK is set;
that's certainly safe, I don't mind about it.

so it could be an additional way to gain access to such functionality,
but it doesn't obviate the need for the sysctl IMHO.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06 14:49                     ` Alan Cox
@ 2003-04-06 16:13                       ` Martin J. Bligh
  2003-04-06 21:34                         ` subobj-rmap Martin J. Bligh
  0 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 16:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm

>> > 	14.91s user 75.30s system 24% cpu 6:15.84 total
>> 
>> Isn't the intent to use sys_remap_file_pages for these sorts of workloads
>> anyway? In which case partial objrmap = rmap for these tests, so we're
>> still OK?
> 
> What matters is the worst case not the best case. Users will do non
> optimal things on a regular basis. 

Humpf. Well I have a fairly simple plan to fix it now. I'll either publish
some code or the plan later today, once I've thought about it a bit more.

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* subobj-rmap
  2003-04-06 16:13                       ` Martin J. Bligh
@ 2003-04-06 21:34                         ` Martin J. Bligh
  2003-04-06 21:42                           ` subobj-rmap Rik van Riel
  0 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 21:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin, Rik van Riel

> Humpf. Well I have a fairly simple plan to fix it now. I'll either publish
> some code or the plan later today, once I've thought about it a bit more.

I'm not sure we need a full 2-d tree to solve this, because the 2 dimensions
aren't independent. What we have is a list of virtual ranges of the
address_space, which might (but probably don't) overlap. If they never
overlapped, this would be easy, we'd just keep a sorted structure (list or
tree) of regions, and find the region we lie in. In fact, Dave already did
that (sort by start addr) ... but we have to walk the rest of the chains
as well to find other regions.

Supposing we keep a list of areas (hung from the address_space) that 
describes independent linear ranges of memory that have the same set
of vma's mapping them (call those subobjects). Each subobject has a
chain of vma's from it that are mapping that subobject.


address_space ---> subobject ---> subobject ---> subobject ---> subobject
                       |              |              |              |
                       v              v              v              v
                      vma            vma            vma            vma
                       |                             |              |
                       v                             v              v
                      vma                           vma            vma
                       |                             |        
                       v                             v        
                      vma                           vma       

Now we can just find the first element in that sorted list that maps
the address we're looking for, and it has a chain of vma's that we
need to worry about. This should solve the 100x100 case. To solve the 
1x10000 case efficiently, we should be able to just turn the subobject 
sorted list into an rbtree.

When we map a new VMA, we need to look for overlaps with existing
subobjects. I suspect (with no real proof, save intuition) that most
of the time we'll either map a new space (create a new subobject),
or an existing space completely (just tack yourself onto the vma chain
from the subobject). If we do get a partial overlap, we'll split the
subobject in twain, and add ourselves to the overlapping part. Note that
this now starts to look very like the process's tree of vma's, so there's
lots of potential for code-reuse. If the overlaps don't happen a lot
(and I suspect they won't), it should be dirt cheap to do.

This is a bit more expensive on the maintenance side than objrmap, but
cheaper than pte_chains, since it's per-vma, not per-page. It should be
much cheaper than objrmap in the corner cases we've been discussing though.
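
As a strawman, the address_space side might look something like this
(struct and field names here are invented for illustration, not taken
from any posted patch):

#include <linux/list.h>
#include <linux/mm.h>

/*
 * One subobject per independent linear range of the file.  A vma that
 * spans several subobjects sits on several chains via per-link nodes.
 */
struct subobj_link {
	struct list_head	chain;		/* entry in subobject->vmas    */
	struct vm_area_struct	*vma;
};

struct subobject {
	unsigned long		pgoff_start;	/* first file page covered     */
	unsigned long		pgoff_end;	/* last file page, inclusive   */
	struct list_head	ranges;		/* sorted list off the mapping */
	struct list_head	vmas;		/* subobj_links of mapping vmas */
};

/*
 * page_referenced()/try_to_unmap() would find the one covering subobject
 * and walk only its vma chain.
 */
static struct subobject *subobject_lookup(struct list_head *head,
					  unsigned long pgoff)
{
	struct subobject *so;

	list_for_each_entry(so, head, ranges)
		if (pgoff >= so->pgoff_start && pgoff <= so->pgoff_end)
			return so;
	return NULL;	/* the sorted list could later become an rbtree */
}

The split-on-partial-overlap rule keeps the ranges disjoint, so the
lookup finds at most one subobject per page.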

Thoughts / flames?

Part 2
------

Moreover, this can be used for sys_remap_file_pages (and indeed
my thought process is partly based on some discussions with Dave last week
about how to solve that). However, if people think this is too heavy,
we can still use pte-chains for this, so don't discard the above if you
hate the following bit. We just keep a subobject for each linear region
within the non-linear VMA - it might need a little more info in the
subobject to work. Yes, it's more expensive at remap time, but we don't
have to do the per-page stuff (and it's lighter than vmas). I suspect
that's a good tradeoff (unless some crazy person is worried about mapping
lots of windows and never using them). However, it would need to be
benchmarked, and it's independent of the above.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-06 14:47                               ` Andrea Arcangeli
@ 2003-04-06 21:35                                 ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06 21:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, mbligh, mingo, hugh, dmccr, linux-kernel, linux-mm

Andrea Arcangeli <andrea@suse.de> wrote:
>>> Especially those sigbus in the current api
>>> would be more expensive than the regular paging internal to the VM and
>>> besides the signal it would generate a flood of syscalls and a kind of
>>> duplication of memory management inside userspace.

On Sat, Apr 05, 2003 at 05:58:24PM -0800, Andrew Morton wrote:
>> That went away.  We now encode the file offset in the unmapped ptes, so the
>> kernel's fault handler can transparently reestablish the page.

On Sun, Apr 06, 2003 at 04:47:34PM +0200, Andrea Arcangeli wrote:
> if you put the file offset in the pte, you will break the max file
> offset that you can map, that at least should be recoded with a cookie
> like we do with the swap space

IIRC we just restricted the size of the file that can use the things to
avoid having to code quite so much up.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 21:34                         ` subobj-rmap Martin J. Bligh
@ 2003-04-06 21:42                           ` Rik van Riel
  2003-04-06 21:52                             ` subobj-rmap Davide Libenzi
                                               ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Rik van Riel @ 2003-04-06 21:42 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Alan Cox, Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

On Sun, 6 Apr 2003, Martin J. Bligh wrote:

> Supposing we keep a list of areas (hung from the address_space) that 
> describes independent linear ranges of memory that have the same set
> of vma's mapping them (call those subobjects). Each subobject has a
> chain of vma's from it that are mapping that subobject.
> 
> address_space ---> subobject ---> subobject ---> subobject ---> subobject
>                        |              |              |              |
>                        v              v              v              v
>                       vma            vma            vma            vma
>                        |                             |              |
>                        v                             v              v
>                       vma                           vma            vma
>                        |                             |        
>                        v                             v        
>                       vma                           vma       

OK, let's say we have a file of 1000 pages, or
offsets 0 to 999, with the following mappings:

VMA A:   0-999
VMA B:   0-200
VMA C: 150-400
VMA D: 300-500
VMA E: 300-500
VMA F:   0-999

How would you describe these with independent
regions?

For VMAs D & E and A & F it's a no-brainer,
but for Oracle shared memory you shouldn't
assume that you have any similar mappings.

I don't see how the data structure you describe
would allow us to efficiently select the subset
of VMAs for which:

1) the start address is smaller than the address we want
and
2) the end address is larger than the address we want

Then again, that might just be my lack of imagination.

cheers,

Rik


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 21:42                           ` subobj-rmap Rik van Riel
@ 2003-04-06 21:52                             ` Davide Libenzi
  2003-04-06 21:55                             ` subobj-rmap Jamie Lokier
  2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
  2 siblings, 0 replies; 105+ messages in thread
From: Davide Libenzi @ 2003-04-06 21:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux Kernel Mailing List

On Sun, 6 Apr 2003, Rik van Riel wrote:

> On Sun, 6 Apr 2003, Martin J. Bligh wrote:
>
> > Supposing we keep a list of areas (hung from the address_space) that
> > describes independent linear ranges of memory that have the same set
> > of vma's mapping them (call those subobjects). Each subobject has a
> > chain of vma's from it that are mapping that subobject.
> >
> > address_space ---> subobject ---> subobject ---> subobject ---> subobject
> >                        |              |              |              |
> >                        v              v              v              v
> >                       vma            vma            vma            vma
> >                        |                             |              |
> >                        v                             v              v
> >                       vma                           vma            vma
> >                        |                             |
> >                        v                             v
> >                       vma                           vma
>
> OK, lets say we have a file of 1000 pages, or
> offsets 0 to 999, with the following mappings:
>
> VMA A:   0-999
> VMA B:   0-200
> VMA C: 150-400
> VMA D: 300-500
> VMA E: 300-500
> VMA F:   0-999
>
> How would you describe these with independant
> regions ?

You should decompose each VMA into a set of independent regions. Imagine
piling up the VMAs, with each one's boundaries cutting across every other
VMA's address space:

1) |----------------------|
2)     |-----------|
3) |--------------------------|
4)         |-----------|

R) |---|---|-------|---|--|---|

    1   1   1       1   1  3
    3   2   2       3   3
        3   3       4
            4
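
A small self-contained userspace sketch of that decomposition, using Rik's
six example VMAs from above; the offsets are file page numbers and the whole
thing is illustrative only, not kernel code:

#include <stdio.h>
#include <stdlib.h>

/* One mapping of the file, as an inclusive range of file page offsets. */
struct vma { const char *name; int start, end; };

static int cmp_int(const void *a, const void *b)
{
        return *(const int *)a - *(const int *)b;
}

int main(void)
{
        struct vma vmas[] = {
                { "A", 0, 999 }, { "B", 0, 200 }, { "C", 150, 400 },
                { "D", 300, 500 }, { "E", 300, 500 }, { "F", 0, 999 },
        };
        int n = sizeof(vmas) / sizeof(vmas[0]);
        int bounds[32], nb = 0, i, j;

        /* Every start and end+1 is a cut point; between two consecutive
         * cut points the set of covering VMAs cannot change. */
        for (i = 0; i < n; i++) {
                bounds[nb++] = vmas[i].start;
                bounds[nb++] = vmas[i].end + 1;
        }
        qsort(bounds, nb, sizeof(int), cmp_int);

        /* Walk the independent regions and list who maps each one. */
        for (i = 0; i + 1 < nb; i++) {
                if (bounds[i] == bounds[i + 1])
                        continue;               /* duplicate cut point */
                printf("%3d-%3d:", bounds[i], bounds[i + 1] - 1);
                for (j = 0; j < n; j++)
                        if (vmas[j].start <= bounds[i] &&
                            bounds[i + 1] - 1 <= vmas[j].end)
                                printf(" %s", vmas[j].name);
                printf("\n");
        }
        return 0;
}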



- Davide


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 21:42                           ` subobj-rmap Rik van Riel
  2003-04-06 21:52                             ` subobj-rmap Davide Libenzi
@ 2003-04-06 21:55                             ` Jamie Lokier
  2003-04-06 22:39                               ` subobj-rmap William Lee Irwin III
  2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
  2 siblings, 1 reply; 105+ messages in thread
From: Jamie Lokier @ 2003-04-06 21:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Martin J. Bligh, Alan Cox, Andrew Morton, andrea, mingo, hugh,
	dmccr, Linux Kernel Mailing List, linux-mm, Bill Irwin

Rik van Riel wrote:
> I don't see how the data structure you describe
> would allow us to efficiently select the subset
> of VMAs for which:
> 
> 1) the start address is smaller than the address we want
> and
> 2) the end address is larger than the address we want

Think about the data structures some text editors use to describe
special regions of the text.  A common operation is to search for all
the special regions covering a particular cursor position.

Several data structures are available.  I'm not aware of any that have
perfect behaviour in all corner cases.

It might be worth noting that these data structures are good at
determining the set of regions covering position X+1 having recently
calculated the set for position X.  Perhaps that has relevance for
speeding up page scanning?

-- Jamie

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 21:42                           ` subobj-rmap Rik van Riel
  2003-04-06 21:52                             ` subobj-rmap Davide Libenzi
  2003-04-06 21:55                             ` subobj-rmap Jamie Lokier
@ 2003-04-06 22:03                             ` Martin J. Bligh
  2003-04-06 22:06                               ` subobj-rmap Martin J. Bligh
                                                 ` (2 more replies)
  2 siblings, 3 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 22:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

>> Supposing we keep a list of areas (hung from the address_space) that 
>> describes independent linear ranges of memory that have the same set
>> of vma's mapping them (call those subobjects). Each subobject has a
>> chain of vma's from it that are mapping that subobject.
>> 
>> address_space ---> subobject ---> subobject ---> subobject ---> subobject
>>                        |              |              |              | 
>>                        v              v              v              v
>>                       vma            vma            vma            vma
>>                        |                             |              | 
>>                        v                             v              v
>>                       vma                           vma            vma
>>                        |                             |        
>>                        v                             v        
>>                       vma                           vma       
> 
> OK, let's say we have a file of 1000 pages, or
> offsets 0 to 999, with the following mappings:
> 
> VMA A:   0-999
> VMA B:   0-200
> VMA C: 150-400
> VMA D: 300-500
> VMA E: 300-500
> VMA F:   0-999
> 
> How would you describe these with independent regions?

Good question to illustrate with.
Extra spacing added just for ease of reading:

0-150 -> 150-200 -> 200-300 -> 300-400 -> 400-500 -> 500-999
 A          A          A          A          A          A
 B          B
            C          C          C 
                                  D          D          
                                  E          E          
 F          F          F          F          F          F

> For VMAs D & E and A & F it's a no-brainer,
> but for Oracle shared memory you shouldn't
> assume that you have any similar mappings

We can always leave the sys_remap_file_pages stuff using pte_chains,
and should certainly do that at first. But doing it for normal stuff
should be less controversial, I think.

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
@ 2003-04-06 22:06                               ` Martin J. Bligh
  2003-04-06 22:15                               ` subobj-rmap Andrea Arcangeli
  2003-04-06 23:06                               ` subobj-rmap Jamie Lokier
  2 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 22:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Andrew Morton, andrea, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

>> OK, let's say we have a file of 1000 pages, or
>> offsets 0 to 999, with the following mappings:
>> 
>> VMA A:   0-999
>> VMA B:   0-200
>> VMA C: 150-400
>> VMA D: 300-500
>> VMA E: 300-500
>> VMA F:   0-999
>> 
>> How would you describe these with independent regions?
> 
> Good question to illustrate with.
> Extra spacing added just for ease of reading:
> 
> 0-150 -> 150-200 -> 200-300 -> 300-400 -> 400-500 -> 500-999
>  A          A          A          A          A          A
>  B          B
>             C          C          C 
>                                   D          D          
>                                   E          E          
>  F          F          F          F          F          F

Bah, offsets are slightly wrong, but the point is obviously the same.

0-150 -> 151-200 -> 201-300 -> 301-400 -> 401-500 -> 501-999
 A          A          A          A          A          A
 B          B
            C          C          C 
                                  D          D          
                                  E          E          
 F          F          F          F          F          F

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
  2003-04-06 22:06                               ` subobj-rmap Martin J. Bligh
@ 2003-04-06 22:15                               ` Andrea Arcangeli
  2003-04-06 22:25                                 ` subobj-rmap Martin J. Bligh
  2003-04-06 23:06                               ` subobj-rmap Jamie Lokier
  2 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-06 22:15 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Alan Cox, Andrew Morton, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

On Sun, Apr 06, 2003 at 03:03:03PM -0700, Martin J. Bligh wrote:
> We can always leave the sys_remap_file_pages stuff using pte_chains,

not sure why you still want the vm to know about the
mmap(VM_NONLINEAR) hack at all.

that's a vm bypass. I can bet the people who want to use it for running
faster on the 32bit archs will definitely prefer zero overhead and
full hardware speed with only the pagetable and tlb flushing trash, and
zero additional kernel internal overhead. that's just a vm bypass that
could otherwise sit in a kernel module, not a real kernel API.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 22:15                               ` subobj-rmap Andrea Arcangeli
@ 2003-04-06 22:25                                 ` Martin J. Bligh
  2003-04-07 21:25                                   ` subobj-rmap Andrea Arcangeli
  0 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 22:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Alan Cox, Andrew Morton, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

>> We can always leave the sys_remap_file_pages stuff using pte_chains,
> 
> not sure why you still want the vm to know about the
> mmap(VM_NONLINEAR) hack at all.
> 
> that's a vm bypass. I can bet the people who want to use it for running
> faster on the 32bit archs will definitely prefer zero overhead and
> full hardware speed with only the pagetable and tlb flushing trash, and
> zero additional kernel internal overhead. that's just a vm bypass that
> could otherwise sit in a kernel module, not a real kernel API.

Well, you don't get zero overhead whatever you do. You either pay the
cost at remap time of manipulating sub-objects, or the cost at page-touch
time of the pte_chains stuff. I suspect sub-objects are cheaper if we
read/write the 32K chunks, not if people mostly just touch one page
per remap though.

What do you think about using this for the linear stuff though?

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 21:55                             ` subobj-rmap Jamie Lokier
@ 2003-04-06 22:39                               ` William Lee Irwin III
  0 siblings, 0 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-06 22:39 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Rik van Riel, Martin J. Bligh, Alan Cox, Andrew Morton, andrea,
	mingo, hugh, dmccr, Linux Kernel Mailing List, linux-mm

Rik van Riel wrote:
>> I don't see how the data structure you describe
>> would allow us to efficiently select the subset
>> of VMAs for which:
>> 1) the start address is smaller than the address we want
>> and
>> 2) the end address is larger than the address we want

On Sun, Apr 06, 2003 at 10:55:30PM +0100, Jamie Lokier wrote:
> Think about the data structures some text editors use to describe
> special regions of the text.  A common operation is to search for all
> the special regions covering a particular cursor position.
> Several data structures are available.  I'm not aware of any that have
> perfect behaviour in all corner cases.
> It might be worth noting that these data structures are good at
> determining the set of regions covering position X+1 having recently
> calculated the set for position X.  Perhaps that has relevance for
> speeding up page scanning?

Multidimensional search trees are routine and decades old last I
checked; why do none of them suffice and why would they be good at
sequential queries?


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
  2003-04-06 22:06                               ` subobj-rmap Martin J. Bligh
  2003-04-06 22:15                               ` subobj-rmap Andrea Arcangeli
@ 2003-04-06 23:06                               ` Jamie Lokier
  2003-04-06 23:26                                 ` subobj-rmap Martin J. Bligh
  2 siblings, 1 reply; 105+ messages in thread
From: Jamie Lokier @ 2003-04-06 23:06 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Alan Cox, Andrew Morton, andrea, mingo, hugh,
	dmccr, Linux Kernel Mailing List, linux-mm, Bill Irwin

Martin J. Bligh wrote:
> 0-150 -> 150-200 -> 200-300 -> 300-400 -> 400-500 -> 500-999
>  A          A          A          A          A          A
>  B          B
>             C          C          C 
>                                   D          D          
>                                   E          E          
>  F          F          F          F          F          F

I thought of that but decided it is too simple :)

A downside with it is that from time to time you need to split or
merge subobjects, and that means splitting or merging the list nodes
linking "rows" in the table above - potentially quite a lot of memory
allocation and traversal for a single mmap().

> > For VMAs D & E and A & F it's a no-brainer,
> > but for Oracle shared memory you shouldn't
> > assume that you have any similar mappings
> 
> We can always leave the sys_remap_file_pages stuff using pte_chains,
> and should certainly do that at first. But doing it for normal stuff
> should be less controversial, I think.

If you implement the 2d data structure that you illustrated, you have
a list node for each point in the table.

By the time your subobject regions are 1 page wide, you have a data
structure that is order-equivalent to pte rmap chains, although the
exact number of words is likely to be higher.

To me this suggests that the 2d data structure could be designed
carefully, so that in the extreme case it gracefully _becomes_ rmap
chains.  For memory efficiency you'd need to pack together multiple
list nodes into a single cache line - the same tricks used to minimise
rmap memory consumption.

I'm not convinced this is the best data structure, but it does seem to
suggest the possibility of a hybrid which gives the best of both
objrmap and rmap data structures.

-- Jamie

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 23:06                               ` subobj-rmap Jamie Lokier
@ 2003-04-06 23:26                                 ` Martin J. Bligh
  0 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-06 23:26 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Rik van Riel, Alan Cox, Andrew Morton, andrea, mingo, hugh,
	dmccr, Linux Kernel Mailing List, linux-mm, Bill Irwin

>> 0-150 -> 150-200 -> 200-300 -> 300-400 -> 400-500 -> 500-999
>>  A          A          A          A          A          A
>>  B          B
>>             C          C          C 
>>                                   D          D          
>>                                   E          E          
>>  F          F          F          F          F          F
> 
> I thought of that but decided it is too simple :)
> 
> A downside with it is that from time to time you need to split or
> merge subobjects, and that means splitting or merging the list nodes
> linking "rows" in the table above - potentially quite a lot of memory
> allocation and traversal for a single mmap().

The amount of work to be done is still fairly small ... and we already
do (as far as I can see) *exactly* this for the existing
rb tree. Yes, mmap has a little bit more overhead, but you lose all
the per-page stuff, which seems much more efficient to me.
 
>> We can always leave the sys_remap_file_pages stuff using pte_chains,
>> and should certainly do that at first. But doing it for normal stuff
>> should be less controversial, I think.
> 
> If you implement the 2d data structure that you illustrated, you have
> a list node for each point in the table.
> 
> By the time your subobject regions are 1 page wide, you have a data
> structure that is order-equivalent to pte rmap chains, although the
> exact number of words is likely to be higher.

Well, yes. Except I hope nobody would want to do that on a per-page
basis. If you want that level of granularity, we should just do this 
for linear objects, and fall back to pte_chains for nonlinear.

Life would be a whole lot simpler if people were willing to specify
non-linear VMAs at create time - I don't see that as a big burden,
personally. That'd get rid of all the conversion stuff.

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: subobj-rmap
  2003-04-06 22:25                                 ` subobj-rmap Martin J. Bligh
@ 2003-04-07 21:25                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-07 21:25 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rik van Riel, Alan Cox, Andrew Morton, mingo, hugh, dmccr,
	Linux Kernel Mailing List, linux-mm, Bill Irwin

On Sun, Apr 06, 2003 at 03:25:08PM -0700, Martin J. Bligh wrote:
> >> We can always leave the sys_remap_file_pages stuff using pte_chains,
> > 
> > not sure why you still want the vm to know about the
> > mmap(VM_NONLINEAR) hack at all.
> > 
> > that's a vm bypass. I can bet the people who want to use it for running
> > faster on the 32bit archs will definitely prefer zero overhead and
> > full hardware speed with only the pagetable and tlb flushing trash, and
> > zero additional kernel internal overhead. that's just a vm bypass that
> > could otherwise sit in a kernel module, not a real kernel API.
> 
> Well, you don't get zero overhead whatever you do. You either pay the
> cost at remap time of manipulating sub-objects, or the cost at page-touch
> time of the pte_chains stuff. I suspect sub-objects are cheaper if we
> read /write the 32K chunks, not if people mostly just touch one page
> per remap though.
> 
> What do you think about using this for the linear stuff though?

I think of this only for the linear stuff. It would solve Andrew's
exploit against objrmap: for each page we would walk only the vmas
whose pagetables actually map the page. However those sub-objects
have a cost, 8 bytes per fragment. The slowest part
should be the split of the subobject when a new mapping happens and the
possible flood of list_add/list_del. I'm unsure it's worth it.

However it would be nice to see how the current 2.4 pte walking clock
algorithm does compared to objrmap and rmap when ext2 is used, because
ext3 generated I/O-bound behaviour, at least for my tree, which made
any vm-side comparison invalid.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-05 22:31                         ` Andrew Morton
  2003-04-05 23:10                           ` Andrea Arcangeli
  2003-04-06 12:37                           ` Jamie Lokier
@ 2003-04-22 11:00                           ` Ingo Molnar
  2003-04-22 11:54                             ` William Lee Irwin III
                                               ` (3 more replies)
  2 siblings, 4 replies; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, mbligh, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm


On Sat, 5 Apr 2003, Andrew Morton wrote:

> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I see what you mean, you're right. That's because all the 10,000 vma
> > belongs to the same inode.
> 
> I see two problems with objrmap - this search, and the complexity of the
> interworking with nonlinear mappings.
> 
> There is talk going around about implementing some more sophisticated
> search structure than a linear list.
> 
> And treating the nonlinear mappings as being mlocked is a great
> simplification - I'd be interested in Ingo's views on that.

i believe the right direction is the one that is currently happening: to
make nonlinear mappings more generic. sys_remap_file_pages() started off
as a special hack mostly usable for locked down pages. Now it's directly
encoded in the pte and thus swappable, and uses up a fraction of the vma
cost for finegrained mappings.

(i believe the next step should be to encode permission bits into the pte
as well, and thus enable eg. mprotect() to work without splitting up vmas.  
On 32-bit ptes this is not realistic due to the file size limit imposed,
but once 64-bit ptes become commonplace it's a step worth taking i
believe.)
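
A purely illustrative sketch of that kind of encoding: a non-present pte
carrying the file page offset plus a few software protection bits. The bit
positions and helper names below are invented for the example and are not
the kernel's actual file-pte layout:

#include <stdio.h>
#include <stdint.h>

#define PTE_PRESENT     (1ULL << 0)     /* clear: hardware ignores the rest */
#define PTE_PROT_READ   (1ULL << 1)     /* software-defined protection bits */
#define PTE_PROT_WRITE  (1ULL << 2)
#define PTE_PROT_EXEC   (1ULL << 3)
#define PTE_PROT_MASK   (PTE_PROT_READ | PTE_PROT_WRITE | PTE_PROT_EXEC)
#define PTE_PGOFF_SHIFT 4               /* remaining bits hold the file page */

static uint64_t encode_file_pte(uint64_t pgoff, uint64_t prot)
{
        /* Present bit stays clear, so the CPU faults and the kernel can
         * recover both the offset and the intended protections. */
        return (pgoff << PTE_PGOFF_SHIFT) | (prot & PTE_PROT_MASK);
}

static uint64_t file_pte_pgoff(uint64_t pte)
{
        return pte >> PTE_PGOFF_SHIFT;
}

int main(void)
{
        uint64_t pte = encode_file_pte(12345, PTE_PROT_READ | PTE_PROT_WRITE);

        printf("pgoff=%llu write=%d present=%d\n",
               (unsigned long long)file_pte_pgoff(pte),
               !!(pte & PTE_PROT_WRITE), !!(pte & PTE_PRESENT));
        return 0;
}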

the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
serious design problem i believe. 100 mappings in 100 contexts on the same
inode is not uncommon at all - still it totally DoS-es the VM's scanning
code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
users of that mapping.

If the O(N^2) can be optimized away then i'm all for it. If not, then i
don't really understand how the same people who call sys_remap_file_pages()
a 'hack' [i believe they are not understanding the current state of the
API] can argue for objrmap in the same paragraph.

i believe the main problem wrt. rmap is the pte_chain lowmem overhead on
32-bit systems. (it also causes some fork() runtime overhead, but i doubt
anyone these days should argue that fork() latency is a commanding
parameter to optimize the VM for. We have vfork() and good threading, and
any fork()-sensitive app uses preforking anyway.)

to solve this problem i believe the pte chains should be made
double-linked lists, and should be organized in a completely different
(and much simpler) way: in a 'companion page' to the actual pte page. The
companion page stores the pte-chain links, corresponding directly to the
pte in the pagetable. Ie. if we have pte #100 in the pagetable, then we
look at entry #100 in the companion page. [the size of the page is
platform-dependent, eg. on PAE x86 it's a single page, on 64-bit platforms
it's two pages most of the time.] That entry then points to the 'next' and
'previous' pte in the pte chain. [the pte pagetable page itself has
pointers towards the companion page(s) in the struct page itself, existing
fields can be reused for this.]
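
A toy model of that layout, with invented names, just to make the indexing
and the O(1) unlink concrete; in the real proposal the chain entries would
of course link pte slots in different pagetable pages that map the same
physical page:

#include <stddef.h>
#include <stdio.h>

#define PTRS_PER_PTE 512                /* e.g. one PAE pagetable page */

struct pte_link {                       /* one entry in the companion page */
        struct pte_link *next;
        struct pte_link *prev;
};

struct pte_page {
        unsigned long long pte[PTRS_PER_PTE];   /* the pagetable page itself */
        struct pte_link    link[PTRS_PER_PTE];  /* companion entries, one per pte */
};

/* Chain pte slot @idx of @pt onto the list headed at @head. */
static void chain_add(struct pte_link **head, struct pte_page *pt, int idx)
{
        struct pte_link *l = &pt->link[idx];    /* pte #idx -> entry #idx */

        l->next = *head;
        l->prev = NULL;
        if (*head)
                (*head)->prev = l;
        *head = l;
}

/* O(1) unlink at unmap time: no list walking, thanks to the back pointer. */
static void chain_del(struct pte_link **head, struct pte_page *pt, int idx)
{
        struct pte_link *l = &pt->link[idx];

        if (l->prev)
                l->prev->next = l->next;
        else
                *head = l->next;
        if (l->next)
                l->next->prev = l->prev;
}

int main(void)
{
        static struct pte_page pt;
        struct pte_link *chain = NULL;

        chain_add(&chain, &pt, 100);    /* pte #100 uses companion entry #100 */
        chain_add(&chain, &pt, 7);
        chain_del(&chain, &pt, 100);
        printf("head is slot 7: %d\n", chain == &pt.link[7]);
        return 0;
}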

This simpler pte chain construct also makes it easy to high-map the pte
chains: whenever we high-map the pte page, we can high-map the pte chain
page(s) as well. No more lowmem overhead for pte chains.

It also makes it easy to calculate the overhead of the pte chains: twice
the amount of pagetable overhead. Ie. with 32-bit pte's it's +8 bytes
overhead, or +0.2% of RAM overhead per mapped page, using a 4K page. With
64-bit ptes on 32-bit platforms (PAE), the overhead is still 8 bytes. On
64-bit platforms using 8K pages the overhead is still +0.2% of RAM, in
addition to the 0.1% of RAM overhead for the pte itself. The worst-case
is 64-bit platforms with a 4K pagesize, there the overhead is +0.4% of
RAM, in addition to the 0.2% overhead caused by the pte itself.

(as a comparison, for finegrained mappings, if a single page is mapped by
a single vma, the 64-byte overhead of the vma causes a +1.5% overhead.)

so i think it's doable, and it solves many of the hairy allocation
deadlock issues wrt. pte-chains - the 'companion pages' hosting the pte
chain back and forward pointers can be allocated at the same time a
pagetable page is allocated. I believe this approach also greatly reduces
the complexity of pte chains, plus it makes unmap-time O(1) unlinking of
pte chains possible. If we can live with the RAM overhead.  (which would
scale linearly with the already existing pagetable overhead.)

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 11:00                           ` Ingo Molnar
@ 2003-04-22 11:54                             ` William Lee Irwin III
  2003-04-22 14:31                               ` Ingo Molnar
  2003-04-22 12:37                             ` Andrea Arcangeli
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 11:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Sat, 5 Apr 2003, Andrew Morton wrote:
>> And treating the nonlinear mappings as being mlocked is a great
>> simplification - I'd be interested in Ingo's views on that.

On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> i believe the right direction is the one that is currently happening: to
> make nonlinear mappings more generic. sys_remap_file_pages() started off
> as a special hack mostly usable for locked down pages. Now it's directly
> encoded in the pte and thus swappable, and uses up a fraction of the vma
> cost for finegrained mappings.
> (i believe the next step should be to encode permission bits into the pte
> as well, and thus enable eg. mprotect() to work without splitting up vmas.  
> On 32-bit ptes this is not realistic due to the file size limit imposed,
> but once 64-bit ptes become commonplace it's a step worth taking i
> believe.)

Are the reserved bits in PAE kernel-usable at all or do they raise
exceptions when set? This may be cpu-revision-dependent, but if things
are usable in some majority of models it could be interesting.


On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
> serious design problem i believe. 100 mappings in 100 contexts on the same
> inode is not uncommon at all - still it totally DoS-es the VM's scanning
> code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
> users of that mapping.
> If the O(N^2) can be optimized away then i'm all for it. If not, then i
> don't really understand how the same people who call sys_remap_file_pages()
> a 'hack' [i believe they are not understanding the current state of the
> API] can argue for objrmap in the same paragraph.
> i believe the main problem wrt. rmap is the pte_chain lowmem overhead on
> 32-bit systems. (it also causes some fork() runtime overhead, but i doubt
> anyone these days should argue that fork() latency is a commanding
> parameter to optimize the VM for. We have vfork() and good threading, and
> any fork()-sensitive app uses preforking anyway.)

pte_chain lowmem overhead is relatively serious. It seems to be the
main motivator of objrmap. OTOH I tend to fall on the other side of the
fence from the "pagetables are sacred relics" or whatever camp and
would prefer to keep things less pte-based, but am not terribly
religious about it.


On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> to solve this problem i believe the pte chains should be made
> double-linked lists, and should be organized in a completely different
> (and much simpler) way: in a 'companion page' to the actual pte page. The
> companion page stores the pte-chain links, corresponding directly to the
> pte in the pagetable. Ie. if we have pte #100 in the pagetable, then we
> look at entry #100 in the companion page. [the size of the page is
> platform-dependent, eg. on PAE x86 it's a single page, on 64-bit platforms
> it's two pages most of the time.] That entry then points to the 'next' and
> 'previous' pte in the pte chain. [the pte pagetable page itself has
> pointers towards the companion page(s) in the struct page itself, existing
> fields can be reused for this.]
> This simpler pte chain construct also makes it easy to high-map the pte
> chains: whenever we high-map the pte page, we can high-map the pte chain
> page(s) as well. No more lowmem overhead for pte chains.

Getting the things out of lowmem sounds very interesting, although I
vaguely continue to wonder about the total RAM overhead. ISTR an old
2.4 benchmark run on PAE x86 where 90+% of physical RAM was consumed by
pagetables _after_ pte_highmem (where before the kernel dropped dead).

I've thought about just reaping pagetables (and hence pte_chains) many
times but haven't carried it through. It sounds mostly orthogonal to
everything else, and after it, all the "workload feasibility patches"
are just optimizations we can think about merging whenever we're ready.
I like it in no small part b/c the PAE-specific damage is entirely nil.
I wonder if that might be a better in-tree solution and if various
other PAE-specific lowmem consumption optimizations are really necessary
for a mainline tree, or if they could sit out-of-tree for 5-10 years
until ppc64 or ia64 (anything but that opcode prefix hack) takes over.
OTOH if everyone uses it, it begs the question of "why not merge it?"

Also, my general measurements of PTE utilization on i386 are somewhere
around 20%, which is an absurd amount of waste.

But anyway, companion pages are doable. The real metric is what the
code looks like and how it performs and what workloads it supports.


On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> It also makes it easy to calculate the overhead of the pte chains: twice
> the amount of pagetable overhead. Ie. with 32-bit pte's it's +8 bytes
> overhead, or +0.2% of RAM overhead per mapped page, using a 4K page. With
> 64-bit ptes on 32-bit platforms (PAE), the overhead is still 8 bytes. On
> 64-bit platforms using 8K pages the overhead is still +0.2% of RAM, in
> addition to the 0.1% of RAM overhead for the pte itself. The worst-case
> is 64-bit platforms with a 4K pagesize, there the overhead is +0.4% of
> RAM, in addition to the 0.2% overhead caused by the pte itself.

I would not say 0.4% of RAM. I would say 0.4% of aggregate virtualspace.
So someone needs to factor virtual:physical ratio for the important
workloads into that analysis.


On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> (as a comparison, for finegrained mappings, if a single page is mapped by
> a single vma, the 64-byte overhead of the vma causes a +1.5% overhead.)
> so i think it's doable, and it solves many of the hairy allocation
> deadlock issues wrt. pte-chains - the 'companion pages' hosting the pte
> chain back and forward pointers can be allocated at the same time a
> pagetable page is allocated. I believe this approach also greatly reduces
> the complexity of pte chains, plus it makes unmap-time O(1) unlinking of
> pte chains possible. If we can live with the RAM overhead.  (which would
> scale linearly with the already existing pagetable overhead.)

Well, the already-existing pagetable overhead is not insignificant.
It's somewhere around 3MB on lightly-loaded 768MB x86-32 UP, which is
very close to beginning to swap.


-- wli

$ uname -a
Linux megeira 2.5.68 #1 SMP Mon Apr 21 22:01:35 PDT 2003 i686 unknown unknown GNU/Linux
$ cat /proc/meminfo
MemTotal:     65949952 kB
MemFree:      65840448 kB
Buffers:          5472 kB
Cached:          15328 kB
SwapCached:          0 kB
Active:          37536 kB
Inactive:        12864 kB
HighTotal:    65198080 kB
HighFree:     65131968 kB
LowTotal:       751872 kB
LowFree:        708480 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:          24320 kB
Slab:            13216 kB
Committed_AS:     7164 kB
PageTables:       2304 kB
VmallocTotal:   131080 kB
VmallocUsed:      4552 kB
VmallocChunk:   126528 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
$ 

(yep, that's pgcl-2.5.68-1A)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 11:00                           ` Ingo Molnar
  2003-04-22 11:54                             ` William Lee Irwin III
@ 2003-04-22 12:37                             ` Andrea Arcangeli
  2003-04-22 13:20                               ` William Lee Irwin III
  2003-04-22 14:29                             ` Martin J. Bligh
  2003-04-22 14:32                             ` Martin J. Bligh
  3 siblings, 1 reply; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-22 12:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, mbligh, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> 
> On Sat, 5 Apr 2003, Andrew Morton wrote:
> 
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > I see what you mean, you're right. That's because all the 10,000 vma
> > > belongs to the same inode.
> > 
> > I see two problems with objrmap - this search, and the complexity of the
> > interworking with nonlinear mappings.
> > 
> > There is talk going around about implementing some more sophisticated
> > search structure than a linear list.
> > 
> > And treating the nonlinear mappings as being mlocked is a great
> > simplification - I'd be interested in Ingo's views on that.
> 
> i believe the right direction is the one that is currently happening: to
> make nonlinear mappings more generic. sys_remap_file_pages() started off
> as a special hack mostly usable for locked down pages. Now it's directly
> encoded in the pte and thus swappable, and uses up a fraction of the vma
> cost for finegrained mappings.
> 
> (i believe the next step should be to encode permission bits into the pte
> as well, and thus enable eg. mprotect() to work without splitting up vmas.  
> On 32-bit ptes this is not realistic due to the file size limit imposed,
> but once 64-bit ptes become commonplace it's a step worth taking i
> believe.)
> 
> the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
> serious design problem i believe. 100 mappings in 100 contexts on the same
> inode is not uncommon at all - still it totally DoS-es the VM's scanning
> code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
> users of that mapping.
> 
> If the O(N^2) can be optimized away then i'm all for it. If not, then i
> don't really understand how the same people who call sys_remap_file_pages()
> a 'hack' [i believe they are not understanding the current state of the

it's a hack primarily because you're mixing linear with non linear;
incidentally that also breaks truncate. In the current state truncate
is malfunctioning. To make truncate work in the current state you
would need to check page->index for every page pointed to by the
pagetables belonging to each vma linked in the objrmap.

I don't think anybody wants to slow down truncate like that (I mean, with
partial truncates and huge vmas).
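
A toy illustration of that asymmetry, with made-up structures standing in
for vmas and ptes: for a linear vma the affected range follows from
vm_pgoff alone, while for a nonlinear vma every slot's page->index has to
be inspected:

#include <stdio.h>

#define NPAGES 8

/* Toy vma: NPAGES virtual slots; index[i] is the file page (page->index)
 * currently mapped at slot i, or -1 if nothing is mapped there. */
struct toy_vma {
        unsigned long   vm_pgoff;       /* file page of slot 0 (linear case) */
        long            index[NPAGES];
        int             nonlinear;
};

/* Unmap every slot whose file page lies beyond new_size (in pages). */
static int truncate_vma(struct toy_vma *v, long new_size)
{
        int unmapped = 0;
        long i;

        if (!v->nonlinear) {
                /* Linear: the affected slots form one contiguous range
                 * computable from vm_pgoff, no page lookups needed. */
                long first = new_size - (long)v->vm_pgoff;

                if (first < 0)
                        first = 0;
                for (i = first; i < NPAGES; i++) {
                        if (v->index[i] >= 0)
                                unmapped++;
                        v->index[i] = -1;
                }
        } else {
                /* Nonlinear: vm_pgoff says nothing, so each slot's
                 * page->index must be checked individually. */
                for (i = 0; i < NPAGES; i++) {
                        if (v->index[i] >= new_size) {
                                v->index[i] = -1;
                                unmapped++;
                        }
                }
        }
        return unmapped;
}

int main(void)
{
        struct toy_vma lin = { .vm_pgoff = 4, .nonlinear = 0 };
        struct toy_vma non = { .vm_pgoff = 0, .nonlinear = 1 };
        long i;

        for (i = 0; i < NPAGES; i++) {
                lin.index[i] = 4 + i;           /* plain linear mapping */
                non.index[i] = (i * 5) % 16;    /* scrambled by remap_file_pages */
        }
        printf("truncate to 8 pages: linear unmaps %d slots, nonlinear unmaps %d\n",
               truncate_vma(&lin, 8), truncate_vma(&non, 8));
        return 0;
}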

Fixing it so truncate still works at the current speed (when you don't
use sys_remap_file_pages) means changing the API to be sane and at the
very least to stop mixing linear with nonlinear vmas.

And I find it very unclean anyway that you can mangle a linear vma and
have it partly linear and partly nonlinear. Nonlinear vmas are
special; if they were not special we would not break anything with
the nonlinear behaviour inside a linear vma.

At the very least you need a mmap(VM_NONLINEAR) to allocate the
nonlinear virtual space, and to have sys_remap_file_pages work only
inside this space.

This was one of my first requirements for considering sys_remap_file_pages
something that can stay in the kernel as a sane API. The other points are
lower prio actually.

As for the other points I still think the whole purpose of
sys_remap_file_pages is to bypass the VM entirely, so it should have the
least possible hardware cost associated with it. It is meant only to
mangle pagetables from userspace. And sys_remap_file_pages has nothing
to do with rmap or objrmap btw (that is an issue for everything, not
just this). But since the whole purpose of sys_remap_file_pages is to
bypass the VM entirely and to make it as fast as possible, we should as
well turn off the paging to allow people to get the biggest advantage
out of sys_remap_file_pages, and allow passing the file descriptor to
sys_remap_file_pages as well, so that you can map multiple files in the
same vma. I think allowing multiple files makes perfect sense and the
lack of this additional important feature is a concern to me.

Also sys_remap_file_pages should try to use largepages to map
the pagecache, as far as the alignment and the largepage pool allow it.
That makes perfect sense.

As for bochs it will have no problem in enabling a system wide sysctl
before running, that's much cleaner than loading two kernel modules.

Overall, trying to make nonlinear a usable-by-default generic API looks
wrong to me; sys_remap_file_pages has to be a VM bypass or it has to go.
If you want it to stay as a possibly default generic API then drop the
vma entirely and have mmap() and mprotect and mlock not generate any
vma overhead, but have them generate nonlinear stuff inside a single
whole vma for the whole address space. If you can do everything
generically (as you seem to want) with sys_remap_file_pages,
then do it with the current API w/o generating a new non-standard API.
It's a matter of functionality inside the kernel: if you can do
everything w/o vmas, then drop the vma from mmap, that's all.
sys_remap_file_pages is equivalent to a mmap(MAP_FIXED) anyway.

I'm not against making mmap faster or whatever, but sys_remap_file_pages
makes sense to me only as a VM bypass, something that will always be
faster than the regular mmap or whatever by bypassing the VM. If you
don't bypass the VM you should make mmap run as fast as
sys_remap_file_pages instead IMHO.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 12:37                             ` Andrea Arcangeli
@ 2003-04-22 13:20                               ` William Lee Irwin III
  2003-04-22 14:38                                 ` Martin J. Bligh
  2003-04-22 14:52                                 ` Andrea Arcangeli
  0 siblings, 2 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 13:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
>> If the O(N^2) can be optimized away then i'm all for it. If not, then i
>> dont really understand how the same people who call sys_remap_file_pages()
>> a 'hack' [i believe they are not understanding the current state of the

On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> it's a hack primarily because you're mixing linear with non linear;
> incidentally that also breaks truncate. In the current state truncate
> is malfunctioning. To make truncate work in the current state you
> would need to check page->index for every page pointed to by the
> pagetables belonging to each vma linked in the objrmap.
> I don't think anybody wants to slow down truncate like that (I mean, with
> partial truncates and huge vmas).
> Fixing it so truncate still works at the current speed (when you don't
> use sys_remap_file_pages) means changing the API to be sane and at the
> very least to stop mixing linear with nonlinear vmas.

The truncate() issues are a relatively major outstanding issue in -mm,
and IIRC hugh was the first to raise them.


On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> And I find it very unclean anyway that you can mangle a linear vma and
> have it partly linear and partly nonlinear. Nonlinear vmas are
> special; if they were not special we would not break anything with
> the nonlinear behaviour inside a linear vma.
> At the very least you need a mmap(VM_NONLINEAR) to allocate the
> nonlinear virtual space, and to have sys_remap_file_pages work only
> inside this space.
> This was one of my first requirements for considering sys_remap_file_pages
> something that can stay in the kernel as a sane API. The other points are
> lower prio actually.

I don't know that it's unclean; AFAICT tagging at any level should
suffice. The arrangement as it stands is (of course) oopsable.


On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> As for the other points I still think the whole purpose of
> sys_remap_file_pages is to bypass the VM entirely, so it should have the
> least possible hardware cost associated with it. It is meant only to
> mangle pagetables from userspace. And sys_remap_file_pages has nothing
> to do with rmap or objrmap btw (that is an issue for everything, not
> just this). But since the whole purpose of sys_remap_file_pages is to
> bypass the VM entirely and to make it as fast as possible, we should as
> well turn off the paging to allow people to get the biggest advantage
> out of sys_remap_file_pages, and allow passing the file descriptor to
> sys_remap_file_pages as well, so that you can map multiple files in the
> same vma. I think allowing multiple files makes perfect sense and the
> lack of this additional important feature is a concern to me.
> Also sys_remap_file_pages should try to use largepages to map
> the pagecache, as far as the alignment and the largepage pool allow it.
> That makes perfect sense.

I've already been tagged to implement sys_remap_file_pages() for
hugetlbfs. For implicit API's (which are arguably superior) there are
fewer issues than meet the eye so long as memory remains locked. Making
things aware of large pages at various levels of the VM, VFS, and block
io subsystem looks very attractive from a number of POV's, but more
research is needed to understand its effectiveness.


On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> As for bochs it will have no problem in enabling a system wide sysctl
> before running, that's much cleaner than loading two kernel modules.
> Overall, trying to make nonlinear a usable-by-default generic API looks
> wrong to me; sys_remap_file_pages has to be a VM bypass or it has to go.
> If you want it to stay as a possibly default generic API then drop the
> vma entirely and have mmap() and mprotect and mlock not generate any
> vma overhead, but have them generate nonlinear stuff inside a single
> whole vma for the whole address space. If you can do everything
> generically (as you seem to want) with sys_remap_file_pages,
> then do it with the current API w/o generating a new non-standard API.
> It's a matter of functionality inside the kernel: if you can do
> everything w/o vmas, then drop the vma from mmap, that's all.
> sys_remap_file_pages is equivalent to a mmap(MAP_FIXED) anyway.

I'm hard-pressed to comment on API's. Someone with more understanding of
userspaces' needs will have to deliver a more adequate response.


On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> I'm not against making mmap faster or whatever, but sys_remap_file_pages
> makes sense to me only as a VM bypass, something that will always be
> faster than the regular mmap or whatever by bypassing the VM. If you
> don't bypass the VM you should make mmap run as fast as
> sys_remap_file_pages instead IMHO.

Well, AFAICT the question wrt. sys_remap_file_pages() is not speed, but
space. Speeding up mmap() is of course worthy of merging given the
usual mergeability criteria.

On this point I must make a concession: k-d trees as formulated by
Bentley et al have space consumption issues that may well render them
inappropriate for kernel usage. I still believe it's worth an empirical
investigation once descriptions of on-line algorithms for their
maintenance are recovered, as well as other 2D+ spatial algorithms, esp.
those with better space behavior.

Specifically, k-d trees require internal nodes to partition spaces that
are not related to leaf nodes (i.e. data points), and not all
rebalancing policies are guaranteed to recover space.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 11:00                           ` Ingo Molnar
  2003-04-22 11:54                             ` William Lee Irwin III
  2003-04-22 12:37                             ` Andrea Arcangeli
@ 2003-04-22 14:29                             ` Martin J. Bligh
  2003-04-22 15:07                               ` Ingo Molnar
                                                 ` (2 more replies)
  2003-04-22 14:32                             ` Martin J. Bligh
  3 siblings, 3 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 14:29 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Andrea Arcangeli, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm

>> > I see what you mean, you're right. That's because all the 10,000 vma
>> > belongs to the same inode.
>> 
>> I see two problems with objrmap - this search, and the complexity of the
>> interworking with nonlinear mappings.
>> 
>> There is talk going around about implementing some more sophisticated
>> search structure than a linear list.
>> 
>> And treating the nonlinear mappings as being mlocked is a great
>> simplification - I'd be interested in Ingo's views on that.
> 
> i believe the right direction is the one that is currently happening: to
> make nonlinear mappings more generic. sys_remap_file_pages() started off
> as a special hack mostly usable for locked down pages. Now it's directly
> encoded in the pte and thus swappable, and uses up a fraction of the vma
> cost for finegrained mappings.
> 
> (i believe the next step should be to encode permission bits into the pte
> as well, and thus enable eg. mprotect() to work without splitting up
> vmas.   On 32-bit ptes this is not realistic due to the file size limit
> imposed, but once 64-bit ptes become commonplace it's a step worth taking
> i believe.)

I don't think you commented on Andrew's actual question, as far as I can
see ... can we mlock the nonlinear mappings? I think for the originally
designed usage (Oracle) that's fine, as far as I know.

The other massive problem we seem to have is the fact that we don't know at
create time whether the mapping is non-linear or not. Knowing that would
allow us to do what we're actually trying to do now, which is to keep
pte-chains for these mappings, and use objrmap for linear ones. The pain is
not dealing with them, it's converting from one to the other.

Either that, or we keep a list of the nonlinear regions for each VMA so we
can do the objrmap for the non-linear regions as well. Yes, it's a little
bit of overhead for sys_remap_file_pages, but you lose the overhead per
page on the pte-chain manipulation front.

So if we can do any of those 3 things, I think we're fine. I find it hard
to believe that none of them is acceptable. Particularly as we can probably
combine the first with one of the others fairly easily.

> the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
> serious design problem i believe. 100 mappings in 100 contexts on the same
> inode is not uncommon at all - still it totally DoS-es the VM's scanning
> code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
> users of that mapping.
> 
> If the O(N^2) can be optimized away then i'm all for it. If not, then i
> don't really understand how the same people who call sys_remap_file_pages()
> a 'hack' [i believe they are not understanding the current state of the
> API] can argue for objrmap in the same paragraph.

Well, we can easily fix the O(N^2) property by using a data structure other
than a simple non-sorted linked list. However ... that has significant
overhead itself. I think we're optimising for the wrong case here - isn't
the 100x100 mappings case exactly what we have sys_remap_file_pages for?
Which keeps pte_chains (for the hybrid case of partial objrmap), so it's
just as fast as before (assuming we can resolve the issue above).

We can make the O(?) stuff look as fancy as we like. However, in reality,
that makes the constants suck, and I'm not at all sure it's a good plan.

> i believe the main problem wrt. rmap is the pte_chain lowmem overhead on
> 32-bit systems. (it also causes some fork() runtime overhead, but i doubt
> anyone these days should argue that fork() latency is a commanding
> parameter to optimize the VM for. We have vfork() and good threading, and
> any fork()-sensitive app uses preforking anyway.)

For workloads like server consolidation, it's not as easy as that. You have
myriad little applications, and rewriting all of them because
fork just became a lot slower is not really practical.

I think you're seriously underestimating the performance impact vs space
problems as well ... IIRC my simple kernel compile test used about 25%
less system time with partial objrmap.

> to solve this problem i believe the pte chains should be made
> double-linked lists, and should be organized in a completely different

It seems ironic that the solution to space consumption is to double the
amount of space taken ;-) I see what you're trying to do (shove things up
into highmem), but it seems like a much better plan to me to just kill the
bloat altogether. 

> This simpler pte chain construct also makes it easy to high-map the pte
> chains: whenever we high-map the pte page, we can high-map the pte chain
> page(s) as well. No more lowmem overhead for pte chains.

Well, it trades lowmem space overhead for > 2x highmem overhead
(sparseness). But the killer is even more time overhead than before. Now
you have to kmap the hell out of everything to do manipulations. The
overhead for pte-highmem is horrible as it is (something like 10% systime
for kernel compile IIRC). And this is worse - you have to manipulate the
prev and next elements in the list. I know how to fix pte_highmem kmap
overhead already (via UKVA), but not rmap pages. Not only is it kmap, but
you have double the number of cachelines touched.

I think the holes in objrmap are quite small - and are already addressed by
your sys_remap_file_pages mechanism.

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 11:54                             ` William Lee Irwin III
@ 2003-04-22 14:31                               ` Ingo Molnar
  2003-04-22 14:56                                 ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 14:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, William Lee Irwin III wrote:

> Are the reserved bits in PAE kernel-usable at all or do they raise
> exceptions when set? This may be cpu-revision-dependent, but if things
> are usable in some majority of models it could be interesting.

if the present bit is clear then the remaining 63 bits are documented by
Intel as being software-available, so this all works just fine.

> Getting the things out of lowmem sounds very interesting, although I
> vaguely continue to wonder about the total RAM overhead. ISTR an old 2.4
> benchmark run on PAE x86 where 90+% of physical RAM was consumed by
> pagetables _after_ pte_highmem (where before the kernel dropped dead).

just create a sparse enough memory layout (one page mapped every 2MB) and
pagetable overhead will dominate. Is it a problem in practice? I doubt it,
and you get what you asked for, and you can always offset it with RAM.

> But anyway, companion pages are doable. The real metric is what the code
> looks like and how it performs and what workloads it supports.

> I would not say 0.4% of RAM. I would say 0.4% of aggregate virtualspace.
> So someone needs to factor virtual:physical ratio for the important
> workloads into that analysis.

yes.

> Well, the already-existing pagetable overhead is not insignificant. It's
> somewhere around 3MB on lightly-loaded 768MB x86-32 UP, which is very
> close to beginning to swap.

3MB might sound like a lot. Companion pagetables will make that 9MB on non-PAE.
(current pte chains should make that roughly 6MB on average) 9MB is 1.1%
of all RAM. 4K granular mem_map[] is 1.5% cost, and even there it's not
mainly the RAM overhead that hurts us, but the lowmem overhead.

(btw., the size of companion pagetables is likely reduced via pgcl as well
- they need to track the VM units of pages, not the MMU units of pages.)

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 11:00                           ` Ingo Molnar
                                               ` (2 preceding siblings ...)
  2003-04-22 14:29                             ` Martin J. Bligh
@ 2003-04-22 14:32                             ` Martin J. Bligh
  2003-04-22 15:09                               ` Ingo Molnar
  3 siblings, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 14:32 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Andrea Arcangeli, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm

> It also makes it easy to calculate the overhead of the pte chains: twice
> the amount of pagetable overhead. Ie. with 32-bit pte's it's +8 bytes
> overhead, or +0.2% of RAM overhead per mapped page, using a 4K page. With
> 64-bit ptes on 32-bit platforms (PAE), the overhead is still 8 bytes. On
> 64-bit platforms using 8K pages the overhead is still +0.2% of RAM, in
> addition to the 0.1% of RAM overhead for the pte itself. The worst-case
> is 64-bit platforms with a 4K pagesize, there the overhead is +0.4% of
> RAM, in addition to the 0.2% overhead caused by the pte itself.

Oh, BTW. You're assuming no sharing of any pages in the above. Look what
happens if 1000 processes share the same page ... 

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 13:20                               ` William Lee Irwin III
@ 2003-04-22 14:38                                 ` Martin J. Bligh
  2003-04-22 15:10                                   ` William Lee Irwin III
  2003-04-22 14:52                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 14:38 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm

> Well, AFAICT the question wrt. sys_remap_file_pages() is not speed, but
> space. Speeding up mmap() is of course worthy of merging given the
> usual mergeability criteria.
> 
> On this point I must make a concession: k-d trees as formulated by
> Bentley et al have space consumption issues that may well render them
> inappropriate for kernel usage. I still believe it's worth an empirical
> investigation once descriptions of on-line algorithms for their
> maintenance are recovered, as well as other 2D+ spatial algorithms, esp.
> those with better space behavior.
> 
> Specifically, k-d trees require internal nodes to partition spaces that
> are not related to leaf nodes (i.e. data points), and not all
> rebalancing policies are guaranteed to recover space.

We can still do the simple sorted list of lists thing (I have preliminary
non-functional code). But I don't see that it's really worth the overhead
in the common case to fix a corner case that has already been fixed in a
different way.

/*
 * s = address_space, r = address_range, v = vma
 *
 * s - r - r - r - r - r
 *     |   |   |   |   |
 *     v   v   v   v   v
 *     |   |           |
 *     v   v           v
 *         |
 *         v
 */

struct address_range {
       unsigned long           start;
       unsigned long           end;
       struct list_head        ranges;
       struct list_head        vmas;
};

where the list of address_ranges is sorted by start address. This is
intended to make use of the real-world case that many things (like shared
libs) map the same exact address ranges over and over again (ie something
like 3 ranges, but hundreds or thousands of mappings).
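
A simplified userspace sketch of how a reverse-map walk could use that
sorted list: scan the ranges in start order, stop as soon as they begin
past the page's offset, and walk the vma chain only of ranges that actually
cover it. Plain pointers stand in for struct list_head to keep the example
self-contained; apart from the field names above, everything is invented:

#include <stdio.h>

struct vma_node {
        const char      *name;
        struct vma_node *next;
};

struct address_range {
        unsigned long           start;  /* file offset of the range, in pages */
        unsigned long           end;
        struct address_range    *next;  /* sorted by start address */
        struct vma_node         *vmas;  /* every vma mapping this exact range */
};

static void visit_vmas_for_page(struct address_range *ranges, unsigned long pgoff)
{
        struct address_range *r;
        struct vma_node *v;

        for (r = ranges; r; r = r->next) {
                if (r->start > pgoff)
                        break;          /* sorted: nothing later can cover pgoff */
                if (pgoff > r->end)
                        continue;
                for (v = r->vmas; v; v = v->next)
                        printf("page %lu: range %lu-%lu, vma %s\n",
                               pgoff, r->start, r->end, v->name);
        }
}

int main(void)
{
        /* The first two ranges of the earlier example: 0-150 mapped by
         * A, B, F and 151-200 mapped by A, B, C, F. */
        struct vma_node f1 = { "F", NULL }, b1 = { "B", &f1 }, a1 = { "A", &b1 };
        struct vma_node f2 = { "F", NULL }, c2 = { "C", &f2 };
        struct vma_node b2 = { "B", &c2 }, a2 = { "A", &b2 };
        struct address_range r2 = { 151, 200, NULL, &a2 };
        struct address_range r1 = { 0, 150, &r2, &a1 };

        visit_vmas_for_page(&r1, 170);  /* walks only A, B, C, F */
        return 0;
}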

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 13:20                               ` William Lee Irwin III
  2003-04-22 14:38                                 ` Martin J. Bligh
@ 2003-04-22 14:52                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-22 14:52 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar, Andrew Morton, mbligh, mingo,
	hugh, dmccr, Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 06:20:13AM -0700, William Lee Irwin III wrote:
> On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
> >> If the O(N^2) can be optimized away then i'm all for it. If not, then i
> >> don't really understand how the same people who call sys_remap_file_pages()
> >> a 'hack' [i believe they are not understanding the current state of the
> 
> On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> > it's a hack primarily because you're mixing linear with non linear;
> > incidentally that also breaks truncate. In the current state truncate
> > is malfunctioning. To make truncate work in the current state you
> > would need to check page->index for every page pointed to by the
> > pagetables belonging to each vma linked in the objrmap.
> > I don't think anybody wants to slow down truncate like that (I mean, with
> > partial truncates and huge vmas).
> > Fixing it so truncate still works at the current speed (when you don't
> > use sys_remap_file_pages) means changing the API to be sane and at the
> > very least to stop mixing linear with nonlinear vmas.
> 
> The truncate() issues are a relatively major outstanding issue in -mm,
> and IIRC hugh was the first to raise them.

yes.

> On Tue, Apr 22, 2003 at 02:37:19PM +0200, Andrea Arcangeli wrote:
> > I'm not against making mmap faster or whatever, but sys_remap_file_pages
> > makes sense to me only as a VM bypass, something that will always be
> > faster than the regular mmap or whatever by bypassing the VM. If you
> > don't bypass the VM you should make mmap run as fast as
> > sys_remap_file_pages instead IMHO.
> 
> Well, AFAICT the question wrt. sys_remap_file_pages() is not speed, but
> space. Speeding up mmap() is of course worthy of merging given the
> usual mergeability criteria.

I completely agree. The major cost is the mangling of the pagetables,
which makes the ~32G case worthwhile only with largepages (better to drop
some ram and use largepages than to add more ram w/o largepages). I just
fixed my tree to allow up to 64G in largepages in a single file because
16G was too low a limit now. So sys_remap_file_pages should provide few
performance advantages, and its main benefit is in avoiding all the ram
waste with the vmas (and the rmap waste if rmap is used in the VM), which
contradicts the "generic" usage with VM knowledge that may add overhead
(I mean the partial object over the remap-file-pages).

If we can get sys_remap_file_pages running at zero space overhead with
vm knowledge that's ok, but for my part I much prefer being able to mix
multiple files in the same mmap(VM_NONLINEAR) to having vm knowledge on
the VM bypass.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:31                               ` Ingo Molnar
@ 2003-04-22 14:56                                 ` William Lee Irwin III
  2003-04-22 15:26                                   ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 14:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Tue, 22 Apr 2003, William Lee Irwin III wrote:
>> Are the reserved bits in PAE kernel-usable at all or do they raise
>> exceptions when set? This may be cpu-revision-dependent, but if things
>> are usable in some majority of models it could be interesting.

On Tue, Apr 22, 2003 at 10:31:49AM -0400, Ingo Molnar wrote:
> if the present bit is clear then the remaining 63 bits are documented by
> Intel as being software-available, so this all works just fine.

Sorry; I should have caught that from the standard docs; I was over-
anticipating something involving valid PTE's.


On Tue, 22 Apr 2003, William Lee Irwin III wrote:
>> Getting the things out of lowmem sounds very interesting, although I
>> vaguely continue to wonder about the total RAM overhead. ISTR an old 2.4
>> benchmark run on PAE x86 where 90+% of physical RAM was consumed by
>> pagetables _after_ pte_highmem (where before the kernel dropped dead).

On Tue, Apr 22, 2003 at 10:31:49AM -0400, Ingo Molnar wrote:
> just create a sparse enough memory layout (one page mapped every 2MB) and
> pagetable overhead will dominate. Is it a problem in practice? I doubt it,
> and you get what you asked for, and you can always offset it with RAM.

Actually it wasn't from sparse memory, it was from massive sharing.
Basically 10000 processes whose virtualspace was dominated by shmem
shared across all of them.

On some reflection I suspect a variety of techniques are needed here.


On Tue, 22 Apr 2003, William Lee Irwin III wrote:
>> Well, the already-existing pagetable overhead is not insignificant. It's
>> somewhere around 3MB on lightly-loaded 768MB x86-32 UP, which is very
>> close to beginning to swap.

On Tue, Apr 22, 2003 at 10:31:49AM -0400, Ingo Molnar wrote:
> 3MB might sound like a lot. Companion pagetables will make that 9MB on non-PAE.
> (current pte chains should make that roughly 6MB on average) 9MB is 1.1%
> of all RAM. 4K granular mem_map[] is 1.5% cost, and even there it's not
> mainly the RAM overhead that hurts us, but the lowmem overhead.
> (btw., the size of companion pagetables is likely reduced via pgcl as well
> - they need to track the VM units of pages, not the MMU units of pages.)

Well, the thing is pte_chains are O(utilized_ptes) so it ends up being
around 3MB + 3/5MB == 3.6MB. I've gone and applied the objrmap and
anobjrmap patches (despite the worst case behavior) and the space
savings are very noticeable and very beneficial, though still not as
good as with shpte which cut the sum of the two to well under 2MB.

Also, I'd be _very_ careful when claiming pgcl can offer space
reductions here. Page clustering involves the core VM understanding
that pieces of pages can be scattered about, with anonymous pages in
particularly arbitrary scatter/gather relationships. This breaks the
very assumption people are making when assuming page clustering can
save pte_chain space: that physical contiguity within an anonymous
software page can be exploited to infer small scanning regions to
recover multiple pte's from a single pointer. This is not the case.

For anonymous memory alone (which has been the sole usage of pte_chains
in -mm kernels for some time) the pte_chain space is 5MB except under
the heaviest of memory pressure, where things are temporarily reaped
down to under 100KB. The arbitrary relationship of anonymous pages to
virtual offsets in the presence of page clustering means that most (not
all, but with high unpredictability) pte_chains must be retained for them.

At the very least I'd like to have the public opinion on the impact of
page clustering on pte_chain space downgraded from "improvement" to
"no effect whatsoever". My own experience shows:

HighTotal:    65198080 kB
HighFree:     60631808 kB
LowTotal:       751872 kB
LowFree:         25152 kB

dentry_cache               295889K        390350K      75.80%   
ext2_inode_cache           260442K        268617K      96.96%   
buffer_head                   799K          1868K      42.79%   
size-8192                    1760K          1856K      94.83%   
size-1024                    1737K          1767K      98.30%   
pae_pmd                       604K          1152K      52.43%   
biovec-BIO_MAX_PAGES          768K           780K      98.46%   
size-2048                     612K           690K      88.70%   
size-512                      503K           535K      93.93%   
size-64                       270K           450K      59.93%   
task_struct                   310K           428K      72.50%   
biovec-128                    384K           409K      93.77%   
blkdev_requests               396K           404K      98.02%   
inode_cache                   352K           375K      93.96%   
size-4096                     168K           256K      65.62%   
mm_struct                      19K           250K       7.95%   
pte_chain                      75K           227K      33.25%   
proc_inode_cache               72K           220K      32.65%   
biovec-64                     192K           220K      87.07%   
size-256                      162K           218K      74.40%   
radix_tree_node               121K           218K      55.63%   
sighand_cache                 160K           215K      74.40%   
filp                           38K           186K      20.60%   

under light load. I don't trust this to mean much.

Basically, I have _very_ good reasons to believe that even after the
discussed potential pte_chain space optimizations that page clustering
enables are implemented, they will not be highly effective and won't
provide a generally applicable solution to the pte_chain space problem.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:29                             ` Martin J. Bligh
@ 2003-04-22 15:07                               ` Ingo Molnar
  2003-04-22 15:42                                 ` William Lee Irwin III
  2003-04-22 15:16                               ` Andrea Arcangeli
  2003-04-22 15:49                               ` Ingo Molnar
  2 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 15:07 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, Martin J. Bligh wrote:

> [...] I think we're optimising for the wrong case here - isn't the
> 100x100 mappings case exactly what we have sys_remap_file_pages for?

i'm inherently uncomfortable about adding any non-limited component to the
VM scanning code that deteriorates this badly at such a relatively low
level of sharing. Even with just 10 processes sharing the same inode 10
times (easily possible) you get an iteration of 100 steps for every page,
100+ cachelines touched. This brings us back to the Linux 2.0 days. It
also sends the wrong message: 'the more you share, the more we will punish
you'.

Also, the overhead pops up at the wrong place, not in the application
itself: in a central algorithm that otherwise _needs_ timely operation
just for the sake of generic system health. I might be wrong, but i very
much believe that first-class support for 'sharing as much stuff as
possible' should be a central design thing in the VM.

also, it's an inherent DoS thing. Any application that has the 'privilege'
of creating 1000 mappings in 1000 processes to the same inode (this is
already being done, and it will be commonplace in a few years) will
immediately DoS the whole VM. I might be repeating myself, but quadratic
algorithms do get linearly _worse_ as the hw evolves. The pidhash
quadratic behavior triggering the NMI watchdog on the biggest boxes is
just one example of this.

all the VM work in 2.5 has proven that the path to a good and reliable VM
is no-nonsense simplicity and basic robustness, both in algorithms and in
data structures. Queueing stuff as much as possible, avoiding extra
scanning as much as possible. And for God's sake, do not reintroduce any
quadratic algorithm, anywhere.

all this loss in quality and predictability just to avoid some easily
calculable RAM overhead? [which RAM overhead can be avoided by smart
applications if they want.]

where does sys_remap_file_pages() stand in this picture?

sys_remap_file_pages() could be fully substituted with mmap(): if the same
file, in the same vma, with the same permissions, is mapped at a nonlinear
offset, then mmap() could decide to use the techniques of
sys_remap_file_pages() to create nonlinear ptes in that range. It's a
vma-overhead optimization for highly granular mappings.

so in theory we could do the following: if the sharing factor is less than
... 4-5 or so, then use objrmap, otherwise use nonlinear mappings. There
are a couple of problems with this hybrid approach: there is cost
associated with a 'flipover' from objrmap to nonlinear (the vmas have to
be merged, and all non-present ptes have to be fixed up to their pre-merge
offset), but it could probably be reduced and delegated to the app doing
the mapping, which would remove this cost component from the generic
scanning code.

doing the 'sharing factor of 5' flipover would address all my complaints:
nonlinear mappings can automatically solve the quadratic-algorithm
problems, and objrmap can be used whenever the sharing factor is low
enough.
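
As a sketch only (every helper here -- mapping_sharing_factor(),
attach_linear_vma(), convert_vmas_to_nonlinear() -- is a made-up name,
and the expensive part is precisely the flipover work just described),
the decision at mmap() time might look like:

#include <linux/fs.h>
#include <linux/mm.h>

#define OBJRMAP_SHARING_LIMIT	5	/* the "sharing factor of 5" above */

static int map_shared_file(struct file *file, struct vm_area_struct *vma)
{
	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;

	if (mapping_sharing_factor(mapping) <= OBJRMAP_SHARING_LIMIT)
		/* low sharing: plain linear vma, found via objrmap */
		return attach_linear_vma(mapping, vma);

	/*
	 * high sharing: flip the mappings of this inode over to nonlinear
	 * ptes (pte chains), merging vmas and fixing up non-present ptes
	 * to their pre-merge offsets, so the scanning code never walks
	 * hundreds of vmas per page.
	 */
	return convert_vmas_to_nonlinear(mapping, vma);
}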

ie. an app creating many mappings in many processes to the same inode
would 'magically' be presented with nonlinear mappings - without it having
to care.

can anyone see any problem with this approach?

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:32                             ` Martin J. Bligh
@ 2003-04-22 15:09                               ` Ingo Molnar
  0 siblings, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 15:09 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, Martin J. Bligh wrote:

> Oh, BTW. You're assuming no sharing of any pages in the above. Look what
> happens if 1000 processes share the same page ...

i'm not assuming anything - this is the per-process overhead.

processes have well-known RAM overhead associated to the size (and
fragmentation) of their virtual memory space, primarily caused by
pagetables. My suggestion triples this cost [where pte chains double the
costs], but leaves the scaling factor and generic characteristics the
same.

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:38                                 ` Martin J. Bligh
@ 2003-04-22 15:10                                   ` William Lee Irwin III
  2003-04-22 15:53                                     ` Martin J. Bligh
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 15:10 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 07:38:37AM -0700, Martin J. Bligh wrote:
> where the list of address_ranges is sorted by start address. This is
> intended to make use of the real-world case that many things (like shared
> libs) map the same exact address ranges over and over again (ie something
> like 3 ranges, but hundreds or thousands of mappings).

I'd have to see an empirical demonstration or some previously published
analysis (or previously published empirical demonstration) to believe
this does as it should.

Not to slight the originator, but it is a technique without an a priori
time (or possibly space either) guarantee, so the trials are warranted.

I'm overstating the argument because it's hard to make it sound slight;
it's very plausible something like this could resolve the time issue.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:29                             ` Martin J. Bligh
  2003-04-22 15:07                               ` Ingo Molnar
@ 2003-04-22 15:16                               ` Andrea Arcangeli
  2003-04-22 15:49                               ` Ingo Molnar
  2 siblings, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-22 15:16 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Ingo Molnar, Andrew Morton, mingo, hugh, dmccr, Linus Torvalds,
	linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 07:29:02AM -0700, Martin J. Bligh wrote:
> overhead itself. I think we're optimising for the wrong case here - isn't
> the 100x100 mappings case exactly what we have sys_remap_file_pages for?

yes IMHO.

> We can make the O(?) stuff look as fancy as we like. However, in reality,
> that makes the constants suck, and I'm not at all sure it's a good plan.

correct, it depends on what we care to run fast.

> It seems ironic that the solution to space consumption is do double the
> amount of space taken ;-) I see what you're trying to do (shove things up

Agreed.

> I think the holes in objrmap are quite small - and are already addressed by
> your sys_remap_file_pages mechanism.

Yep.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:56                                 ` William Lee Irwin III
@ 2003-04-22 15:26                                   ` Ingo Molnar
  2003-04-22 16:20                                     ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 15:26 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, William Lee Irwin III wrote:

> > just create a sparse enough memory layout (one page mapped every 2MB) and
> > pagetable overhead will dominate. Is it a problem in practice? I doubt it,
> > and you get what you asked for, and you can always offset it with RAM.
> 
> Actually it wasn't from sparse memory, it was from massive sharing.
> Basically 10000 processes whose virtualspace was dominated by shmem
> shared across all of them.
> 
> On some reflection I suspect a variety of techniques are needed here.

there are two main techniques to reduce per-context pagetable-alike
overhead: 1) the use of pagetable sharing via CLONE_VM 2) the use of
bigger MMU units with a much smaller pagetable hw cost [hugetlbs].

all of this is true, and still remains valid. None of this changes the
fact that objrmap, as proposed, introduces a quadratic component to a
central piece of code. If so, then we should simply abort any mmap() attempt
that increases the sharing factor above a certain level, or something like
that.

using nonlinear mappings adds the overhead of pte chains, which roughly
doubles the pagetable overhead. (or companion pagetables, which triple the
pagetable overhead) Purely RAM-wise the break-even point is at around 8
pages, 8 pte chain entries make up for 64 bytes of vma overhead.

the biggest problem i can see is that we (well, the kernel) has to make a
judgement of RAM footprint vs. algorithmic overhead, which is apples to
oranges. Nonlinear vmas [or just linear vmas with pte chains installed],
while being only O(N), double/triple the pagetable overhead. objrmap
linear vmas, while having only the pagetable overhead, are O(N^2). [well,
it's O(N*M)]

RAM-footprint wise the boundary is clear: above 8 pages of granularity,
vmas with objrmap cost less RAM than nonlinear mappings.

CPU-time-wise the nonlinear mappings with pte chains always beat objrmap.
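
Spelling the break-even out with the approximate per-mapping sizes used
above (roughly 64 bytes per vma and, by implication, about 8 bytes per
pte chain entry):

/*
 * per extra mapping of the same object:
 *
 *   objrmap:    one vma                       ~ 64 bytes, independent of size
 *   nonlinear:  one pte chain entry per page  ~  8 bytes * nr_pages
 *
 *   8 bytes * 8 pages = 64 bytes, so ~8 mapped pages is the break-even:
 *   smaller mappings are cheaper as nonlinear ptes, larger ones as vmas.
 */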

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:07                               ` Ingo Molnar
@ 2003-04-22 15:42                                 ` William Lee Irwin III
  2003-04-22 15:55                                   ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, mingo, hugh,
	dmccr, Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 11:07:32AM -0400, Ingo Molnar wrote:
> also, it's an inherent DoS thing. Any application that has the 'privilege'
> of creating 1000 mappings in 1000 processes to the same inode (this is
> already being done, and it will be commonplace in a few years) will
> immediately DoS the whole VM. I might be repeating myself, but quadratic
> algorithms do get linearly _worse_ as the hw evolves. The pidhash
> quadratic behavior triggering the NMI watchdog on the biggest boxes is
> just one example of this.

I have to apologize for my misstatements of the problem here. You
yourself pointed out to me the hold time was, in fact, linear. Despite
the linearity of the algorithm, the failure mode persists. I've
postponed further investigation until later, when more invasive
techniques are admissible; /proc/ alone will not suffice if linear
algorithms under tasklist_lock can trigger this failure mode.

I believe further work is needed but I can't think of a 2.5.x mergeable
method to address it. I've attempted to devolve the work to others in
the hopes that future solutions might be devised. It's unfortunate but
general algorithmic scalability for scenarios like this has a real cost
for the low-end and it's a problem I don't feel comfortable trying to
fix in the middle of 2.5.x stabilization for more general systems.

Unless a refinement of either manfred's or your patches can be made to
pass the test (apologies again; I don't recall the results, my time on
the whole system is very limited and it was a while ago) I suspect very
little can be done for 2.5.x here. IMHO a series of patches to
eliminate all remaining linear scans under tasklist_lock alongside a
fair locking construct will be eventually required, though, of course,
only a solution is required, not my expectation.

-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 14:29                             ` Martin J. Bligh
  2003-04-22 15:07                               ` Ingo Molnar
  2003-04-22 15:16                               ` Andrea Arcangeli
@ 2003-04-22 15:49                               ` Ingo Molnar
  2003-04-22 16:16                                 ` Martin J. Bligh
  2 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 15:49 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, Martin J. Bligh wrote:

> > to solve this problem i believe the pte chains should be made
> > double-linked lists, and should be organized in a completely different
> 
> It seems ironic that the solution to space consumption is do double the
> amount of space taken ;-) I see what you're trying to do (shove things
> up into highmem), but it seems like a much better plan to me to just
> kill the bloat altogether.

at the expense of introducing a quadratic component?

then at least we should have a cutoff point at which point some smarter
algorithm kicks in.

am i willing to trade 1.2% of RAM overhead vs. 0.4% of RAM overhead in
exchange for a predictable VM? Sure. I do believe that 0.8% of RAM will
make almost zero noticeable difference on a 768 MB system - i have a 768
MB system. Whether 1MB of extra RAM on a 128 MB system will make more of a
difference than a predictable VM - i don't know, it probably depends on the
app, but i'd go for more RAM. But it will make a _hell_ of a difference on
a 1 TB RAM 64-bit system where the sharing factor explodes. And that's
where Linux usage will be by the time 2.6-based systems go into production.

this is why i think we should have both options in the VM - even if this
is quite an amount of complexity. And we cannot get rid of pte chains
anyway, they are a must for anonymous mappings. The quadratic property of
objrmap should show up very nicely for anonymous mappings: they are in
theory just nonlinear mappings of one very large inode [swap space],
mapped in by zillions of tasks. Has anyone ever attempted to extend the
objrmap concept to anonymous mappings?

> [...] The overhead for pte-highmem is horrible as it is (something like
> 10% systime for kernel compile IIRC). [...]

i've got some very simple speedups for atomic kmaps that should mitigate
much of the mappings overhead. And i take issue with the 10% system-time
increase - are you sure about that? How much of total [wall-clock]
overhead was that? I never knew this was a problem - it's easy to solve.

> [...] And this is worse - you have to manipulate the prev and next
> elements in the list. I know how to fix pte_highmem kmap overhead
> already (via UKVA), but not rmap pages. Not only is it kmap, but you
> have double the number of cachelines touched.

well, highmem pte chains are a pain arguably, but doable.

also consider that currently rmap has the habit of not shortening the pte
chains upon unmap time - with a double linked list that becomes possible.  
Has anyone ever profiled the length of pte chains during a kernel compile?

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:10                                   ` William Lee Irwin III
@ 2003-04-22 15:53                                     ` Martin J. Bligh
  0 siblings, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 15:53 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

>> where the list of address_ranges is sorted by start address. This is
>> intended to make use of the real-world case that many things (like shared
>> libs) map the same exact address ranges over and over again (ie something
>> like 3 ranges, but hundreds or thousands of mappings).
> 
> I'd have to see an empirical demonstration or some previously published
> analysis (or previously published empirical demonstration) to believe
> this does as it should.
> 
> Not to slight the originator, but it is a technique without an a priori
> time (or possibly space either) guarantee, so the trials are warranted.
> 
> I'm overstating the argument because it's hard to make it sound slight;
> it's very plausible something like this could resolve the time issue.

I got sidetracked by the slowdown seen from massive contention on
i_shared_sem just from sorting the list. We need to fix that before this is
feasible to do ... (though maybe the list will be sufficiently shorter now
that it's less of a problem .... hmmm). Maybe I'll just finish off the code.

M.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:42                                 ` William Lee Irwin III
@ 2003-04-22 15:55                                   ` Ingo Molnar
  2003-04-22 16:58                                     ` William Lee Irwin III
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 15:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, mingo, hugh,
	dmccr, Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, William Lee Irwin III wrote:

> I have to apologize for my misstatements of the problem here. You
> yourself pointed out to me the hold time was, in fact, linear. Despite
> the linearity of the algorithm, the failure mode persists. I've
> postponed further investigation until later, when more invasive
> techniques are admissible; /proc/ alone will not suffice if linear
> algorithms under tasklist_lock can trigger this failure mode.

well, i have myself reproduced 30+ secs worth of pid-alloc related lockups
on my box, so it was definitely not a fata morgana, and the
pid-allocation code was definitely quadratic near the PID-space saturation
point.

There might be something else still biting your system, i'd really be
interested in hearing more about it. What workload are you using to
trigger it?

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:49                               ` Ingo Molnar
@ 2003-04-22 16:16                                 ` Martin J. Bligh
  2003-04-22 17:24                                   ` Ingo Molnar
  2003-04-22 17:45                                   ` John Bradford
  0 siblings, 2 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 16:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

>> > to solve this problem i believe the pte chains should be made
>> > double-linked lists, and should be organized in a completely different
>> 
>> It seems ironic that the solution to space consumption is do double the
>> amount of space taken ;-) I see what you're trying to do (shove things
>> up into highmem), but it seems like a much better plan to me to just
>> kill the bloat altogether.
> 
> at the expense of introducing a quadratic component?
> 
> then at least we should have a cutoff point at which point some smarter
> algorithm kicks in.

That's a very interesting point ... a cutover like we do for AVL stuff in
other places might well work. We can make it non-quadratic for the
real-world cases fairly easily I think though (see other emails).

> am i willing to trade 1.2% of RAM overhead vs. 0.4% of RAM overhead in
> exchange for a predictable VM? Sure. I do believe that 0.8% of RAM will

If that was the only tradeoff, I'd be happy to make it too. But it's not
0.4% / 1.2% under any kind of heavy sharing (eg shared libs), it can be
something like 25% vs 75% ... the difference between the system living or
dying. If we had shared pagetables, and shlibs aligned on 2Mb boundaries so
they could be used, I'd be much less stressed about it, I guess.

The worst part of the tradeoff is not the space, it's the overhead. And
you're pushing many workloads that didn't have to use highpte before into
having to use it (or at least highchains) - that's a significant
performance hit.

> make almost zero noticeable difference on a 768 MB system - i have a 768
> MB system. Whether 1MB of extra RAM on a 128 MB system will make more of a
> difference than a predictable VM - i don't know, it probably depends on the
> app, but i'd go for more RAM. But it will make a _hell_ of a difference on
> a 1 TB RAM 64-bit system where the sharing factor explodes. And that's
> where Linux usage will be by the time 2.6-based systems go into production.

You obviously have a somewhat different timeline in mind for 2.6 than the
rest of us ;-) However, if you're looking at the heavy sharing factor, the
current rmap stuff explodes in terms of performance. Your rmap pages idea sounds
like it would reduce the list walking for delete by giving us a free index
into the list (hopping across from the pagetables), but without an
implementation that can be measured, it's really impossible to tell if
you'll hit other problems.
 
> this is why i think we should have both options in the VM - even if this
> is quite an amount of complexity. And we cannot get rid of pte chains
> anyway, they are a must for anonymous mappings. The quadratic property of
> objrmap should show up very nicely for anonymous mappings: they are in
> theory just nonlinear mappings of one very large inode [swap space],
> mapped in by zillions of tasks. Has anyone ever attempted to extend the
> objrmap concept to anonymous mappings?

Yes, Hugh did that code. But my original plan was not to do that -
because there isn't enough sharing for the payback - it was no faster on my
tests at least (it made fork/exec a little faster, but there was payback in
other places). I think the partial usage is less controversial.

>> [...] The overhead for pte-highmem is horrible as it is (something like
>> 10% systime for kernel compile IIRC). [...]
> 
> i've got some very simple speedups for atomic kmaps that should mitigate
> much of the mappings overhead. And i take issue with the 10% system-time
> increase - are you sure about that? How much of total [wall-clock]
> overhead was that? I never knew this was a problem - it's easy to solve.

Dunno. I published the results a while back ... I can try to dig them out
again. I tend to look at the system time for kernel compiles (except for
scheduler stuff) ... when running real mixed workloads, the rest of the CPU
time will be more heavily used. 

Kernbench: (make -j N vmlinux, where N = 2 x num_cpus)
                              Elapsed      System        User         CPU
              2.5.59-mjb5       45.63      110.69      564.75     1480.00
      2.5.59-mjb5-highpte       46.37      118.34      565.33     1473.75

Kernbench: (make -j N vmlinux, where N = 16 x num_cpus)
                              Elapsed      System        User         CPU
              2.5.59-mjb5       46.69      133.99      568.99     1505.50
      2.5.59-mjb5-highpte       47.56      142.35      569.46     1496.00

SDET 64  (see disclaimer)
                           Throughput    Std. Dev
              2.5.59-mjb5       100.0%         0.1%
      2.5.59-mjb5-highpte        97.8%         0.2%

OK, so it's more like 8% on system time, and a couple of % off wall time.
I'd love to test the atomic kmap speedups if you have them .... it's
heavily used now for all sorts of things - we ought to speed up that
mechanism as much as possible.

> well, highmem pte chains are a pain arguably, but doable.
> 
> also consider that currently rmap has the habit of not shortening the pte
> chains upon unmap time - with a double linked list that becomes possible.
>  Has anyone ever profiled the length of pte chains during a kernel
> compile?

Yes, I had some crude thing to draw a histogram (was actually the
page->count). Pretty much all the anon pages were singletons. There was
lots of sharing of shlib pages on a "number of tasks" basis ... I can't
find the figures or the patch right now, but will try to dig that out.

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:26                                   ` Ingo Molnar
@ 2003-04-22 16:20                                     ` William Lee Irwin III
  2003-04-22 16:57                                       ` Andrea Arcangeli
  2003-04-22 16:58                                       ` Martin J. Bligh
  0 siblings, 2 replies; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 16:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Tue, 22 Apr 2003, William Lee Irwin III wrote:
>> Actually it wasn't from sparse memory, it was from massive sharing.
>> Basically 10000 processes whose virtualspace was dominated by shmem
>> shared across all of them.
>> On some reflection I suspect a variety of techniques are needed here.

On Tue, Apr 22, 2003 at 11:26:21AM -0400, Ingo Molnar wrote:
> there are two main techniques to reduce per-context pagetable-alike
> overhead: 1) the use of pagetable sharing via CLONE_VM 2) the use of
> bigger MMU units with a much smaller pagetable hw cost [hugetlbs].

Sharing pagetables across process contexts seems to be relatively
effective. Reclaiming them also curtails various worst-case scenarios
and simultaneously renders all other techniques optimizations as
opposed to workload feasibility patches.

I'm having a tough time getting too interested in these today. I'll
just add to the list for now (if you will).


On Tue, Apr 22, 2003 at 11:26:21AM -0400, Ingo Molnar wrote:
> all of this is true, and still remains valid. None of this changes the
> fact that objrmap, as proposed, introduces a quadratic component to a
> central piece of code. If then we should simply abort any mmap() attempt
> that increases the sharing factor above a certain level, or something like
> that.

It does do poorly there according to benchmarks. I don't have anything
specific to say for or against it. It's sensible as a general idea and
has its benefits but has theoretical and practical drawbacks too. I'm
going to have to let those involved with it address things.


On Tue, Apr 22, 2003 at 11:26:21AM -0400, Ingo Molnar wrote:
> using nonlinear mappings adds the overhead of pte chains, which roughly
> doubles the pagetable overhead. (or companion pagetables, which triple the
> pagetable overhead) Purely RAM-wise the break-even point is at around 8
> pages, 8 pte chain entries make up for 64 bytes of vma overhead.
> the biggest problem i can see is that we (well, the kernel) has to make a
> judgement of RAM footprint vs. algorithmic overhead, which is apples to
> oranges. Nonlinear vmas [or just linear vmas with pte chains installed],
> while being only O(N), double/triple the pagetable overhead. objrmap
> linear vmas, while having only the pagetable overhead, are O(N^2). [well,
> it's O(N*M)]
> RAM-footprint wise the boundary is clear: above 8 pages of granularity,
> vmas with objrmap cost less RAM than nonlinear mappings.
> CPU-time-wise the nonlinear mappings with pte chains always beat objrmap.

There's definitely an argument brewing here. Large 32-bit is very space
conscious; the rest of the world is largely oblivious to these specific
forms of space consumption aside from those tight on space in general.
I don't know that there can be a general answer for all systems.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:20                                     ` William Lee Irwin III
@ 2003-04-22 16:57                                       ` Andrea Arcangeli
  2003-04-22 17:21                                         ` William Lee Irwin III
  2003-04-22 17:34                                         ` Ingo Molnar
  2003-04-22 16:58                                       ` Martin J. Bligh
  1 sibling, 2 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-22 16:57 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar, Andrew Morton, mbligh, mingo,
	hugh, dmccr, Linus Torvalds, linux-kernel, linux-mm

could we focus and solve the remap_file_pages current breakage first?

I proposed my fix that IMHO is optimal and simple (I recall Hugh also
proposed something on these lines):

1) allow it only inside mmap(VM_NONLINEAR) vmas
2) have the VM skip over VM_NONLINEAR vmas entirely
3) set vma->vm_file to NULL for those vmas and forbid paging and allow
   multiple files to be mapped in the same nonlinear vma (add an fd
   parameter to the syscall; see the sketch after this list)
4) enable it as non-root (w/o IPC_LOCK capability) only with a sysctl
   enabled
5) avoid any overhead connected with the potential paging of the
   nonlinear vmas
6) populate it with pmd on hugetlbfs
7) if a truncate happens leave the page pinned outside the pagecache
   but still mapped into userspace, we don't care about it and it will
   be freed during the munmap of the nonlinear vma
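
Purely to illustrate point 3 (a hypothetical prototype, not existing code --
the current 2.5 syscall is sys_remap_file_pages(start, size, prot, pgoff,
flags) and always operates on the file backing the vma):

/*
 * Hypothetical extension sketched from point 3 above: an explicit fd
 * lets several files be placed inside one nonlinear window.
 */
asmlinkage long sys_remap_file_pages_fd(unsigned long start,
					unsigned long size,
					unsigned long prot,
					unsigned int fd,
					unsigned long pgoff,
					unsigned long flags);

Userspace would mmap() one big nonlinear window once and then wire
arbitrary (fd, pgoff) chunks into it; per point 2 the VM would skip the
whole vma.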

Note: in the longer run if you want, you can as well change the kernel
internals to make this area pageable and then you won't need a sysctl
anymore.

The mmap and remap_file_pages kind of overlap; remap_file_pages is the
"hack" that should be quick and simple IMHO. Everything that is not just
interesting as a pte-mangling VM bypass should happen in the mmap layer
IMHO.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:20                                     ` William Lee Irwin III
  2003-04-22 16:57                                       ` Andrea Arcangeli
@ 2003-04-22 16:58                                       ` Martin J. Bligh
  1 sibling, 0 replies; 105+ messages in thread
From: Martin J. Bligh @ 2003-04-22 16:58 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

>> using nonlinear mappings adds the overhead of pte chains, which roughly
>> doubles the pagetable overhead. (or companion pagetables, which triple
>> the pagetable overhead) Purely RAM-wise the break-even point is at
>> around 8 pages, 8 pte chain entries make up for 64 bytes of vma overhead.
>> the biggest problem i can see is that we (well, the kernel) has to make a
>> judgement of RAM footprint vs. algorithmic overhead, which is apples to
>> oranges. Nonlinear vmas [or just linear vmas with pte chains installed],
>> while being only O(N), double/triple the pagetable overhead. objrmap
>> linear vmas, while having only the pagetable overhead, are O(N^2). [well,
>> it's O(N*M)]
>> RAM-footprint wise the boundary is clear: above 8 pages of granularity,
>> vmas with objrmap cost less RAM than nonlinear mappings.
>> CPU-time-wise the nonlinear mappings with pte chains always beat objrmap.
> 
> There's definitely an argument brewing here. Large 32-bit is very space
> conscious; the rest of the world is largely oblivious to these specific
> forms of space consumption aside from those tight on space in general.

However, the time consumption affects everybody. The overhead of pte-chains
is very significant ... people seem to be conveniently forgetting that for
some reason. Ingo's rmap_pages thing solves the lowmem space problem, but
the time problem is still there, if not worse.

Please don't create the impression that rmap methodologies are only an
issue for large 32 bit machines - that's not true at all.

People seem to be focused on one corner case of performance for objrmap ...
If you want a countercase for pte-chain based rmap, try creating 1000
processes in a machine with a decent amount of RAM. Make them share
libraries (libc, etc), and then fork and exit in a FIFO rolling fashion.
Just forking off a bunch of stuff (regular programs / shell scripts) that
do similar amounts of work will presumably approximate this. Kernel
compiles see large benefits here, for instance. Things that were less
dominated by userspace calculations would see even bigger changes.

I've not seen anything but a focused microbenchmark deliberately written
for the job do better on pte-chain based rmap than partial objrmap yet. If
we had something more realistic, it would become rather more interesting.

M.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 15:55                                   ` Ingo Molnar
@ 2003-04-22 16:58                                     ` William Lee Irwin III
  2003-04-22 17:07                                       ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 16:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, mingo, hugh,
	dmccr, Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 11:55:00AM -0400, Ingo Molnar wrote:
> well, i have myself reproduced 30+ secs worth of pid-alloc related lockups
> on my box, so it's was definitely not a fata morgana, and the
> pid-allocation code was definitely quadratic near the PID-space saturation
> point.
> There might be something else still biting your system, i'd really be
> interested in hearing more about it. What workload are you using to
> trigger it?

ISTR it being something on the order of running 32 instances of top(1),
one per cpu, and then trying to fork().

I think this is one of those that needs num_cpus_online() >= 32, and
possibly in combination with strong NUMA effects. I'm willing to accept
large delays with respect to addressing this unless my employer/funding
source makes equipment more readily available.

Seriously -- if those who could need and/or fund the fix don't see it
as a large enough problem to invest in a fix for, I see no need to
impose on the Linux kernel community to do so.

Otherwise, given sufficient hardware access, I'd be more than willing
to run regular tests on whatever patches you care to send me. As of now
I'm not even able to do so, regardless of willingness.

(e.g. access to 64GB hw, even while in-house, has been extremely limited)


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:58                                     ` William Lee Irwin III
@ 2003-04-22 17:07                                       ` Ingo Molnar
  0 siblings, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 17:07 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, mingo, hugh,
	dmccr, Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, William Lee Irwin III wrote:

> ISTR it being something on the order of running 32 instances of top(1),
> one per cpu, and then trying to fork().

oh, have you run any of the /proc fixes floating around? It still has some
pretty bad (quadratic) stuff left in, and done under tasklist_lock
read-held - if any write_lock_irq() of the tasklist lock hits this code
then you get an NMI assert. Please try either Manfred's or mine.

	Ingo



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:57                                       ` Andrea Arcangeli
@ 2003-04-22 17:21                                         ` William Lee Irwin III
  2003-04-22 18:08                                           ` Andrea Arcangeli
  2003-04-22 17:34                                         ` Ingo Molnar
  1 sibling, 1 reply; 105+ messages in thread
From: William Lee Irwin III @ 2003-04-22 17:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 06:57:46PM +0200, Andrea Arcangeli wrote:
> could we focus and solve the remap_file_pages current breakage first?
> I proposed my fix that IMHO is optimal and simple (I recall Hugh also
> proposed something on these lines):
> 1) allow it only inside mmap(VM_NONLINEAR) vmas
> 2) have the VM skip over VM_NONLINEAR vmas entirely
> 3) set vma->vm_file to NULL for those vmas and forbid paging and allow
>    multiple files to be mapped in the same nonlinear vma (add an fd
>    parameter to the syscall)
> 4) enable it as non-root (w/o IPC_LOCK capability) only with a sysctl
>    enabled
> 5) avoid any overhead connected with the potential paging of the
>    nonlinear vmas

Some of these are controversial; it _can_ be fixed in other ways, but
the question is whether the sacrifice in functionality is worth the
code simplification, or vice versa when evaluating the performance impact.

I'm trying to be objective here, though my own bias is in favor of full
retention of functionality (i.e. best of both worlds, maximal code
complexity).


On Tue, Apr 22, 2003 at 06:57:46PM +0200, Andrea Arcangeli wrote:
> 6) populate it with pmd on hugetlbfs
> 7) if a truncate happens leave the page pinned outside the pagecache
>    but still mapped into userspace, we don't care about it and it will
>    be freed during the munmap of the nonlinear vma

I'll implement the hugetlbfs part; it should fit nicely into the
infrastructure introduced with the rest of the virtwin patch, all it
really needs is some additional error checking. hugetlbfs is 100%
CAP_IPC_LOCK -- there should be no issue under either scheme.

There are also some unnameable databases clamoring for this
functionality but not quite ready to utilize it. They'll be quite happy
when closure on the issue is achieved even if not immediately able to
take advantage of it in userspace.


On Tue, Apr 22, 2003 at 06:57:46PM +0200, Andrea Arcangeli wrote:
> Note: in the longer run if you want, you can as well change the kernel
> internals to make this area pageable and then you won't need a sysctl
> anymore.
> The mmap and remap_file_pages kind of overlap; remap_file_pages is the
> "hack" that should be quick and simple IMHO. Everything that is not just
> interesting as a pte-mangling VM bypass should happen in the mmap layer
> IMHO.

As it stands now it's supposed to be pageable. I think hugh and akpm
will have to intercede here. The true core of this discussion was around
a fresh page state that would have to be introduced to correctly
implement this, which brought certain complexities and some semantics
that negated most of the validation checks in mm/rmap.c. (AIUI --
please let them clarify, my understanding of this subtlety is limited
and at the risk of disseminating incorrect information I hoped to shed
_some_ light on it).

I think I know a bit more but would like to curtail the risk of
disseminating more incorrect information than necessary.


-- wli

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:16                                 ` Martin J. Bligh
@ 2003-04-22 17:24                                   ` Ingo Molnar
  2003-04-22 17:45                                   ` John Bradford
  1 sibling, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 17:24 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, Martin J. Bligh wrote:

> If that was the only tradeoff, I'd be happy to make it too. But it's not
> 0.4% / 1.2% under any kind of heavy sharing (eg shared libs), it can be
> something like 25% vs 75% ... the difference between the system living
> or dying. If we had shared pagetables, and shlibs aligned on 2Mb
> boundaries so they could be used, I'd be much less stressed about it, I
> guess.

sorry, but this is just games with numbers. _Sure_, you can find a workload
as a demonstration against _any_ resource increase, by allocating that
resource enough times so that lots of stuff is allocated. "Look, we
increased the kernel stack size from 4K to 8K, and now this makes this
[add random heavily threaded workload] thing go from 40% RAM utilization
to 60% RAM utilization"

fact is that 'typical' pagetable usage is in the <1% range on typical
systems. Sure, you can increase it - like you can increase RAM allocation
for just about any resource if you want. The answer: if you want to do
that then add more RAM or don't do it.

so the real question is whether the size increase justifies the
advantages. It's a border case i agree.

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:57                                       ` Andrea Arcangeli
  2003-04-22 17:21                                         ` William Lee Irwin III
@ 2003-04-22 17:34                                         ` Ingo Molnar
  2003-04-22 18:04                                           ` Benjamin LaHaise
  1 sibling, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2003-04-22 17:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Andrew Morton, mbligh, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm


On Tue, 22 Apr 2003, Andrea Arcangeli wrote:

> could we focus and solve the remap_file_pages current breakage first?

truncate always used to be such a PITA in the VM. And so little code depends
on it doing the right thing to vmas. Which i claim to not be the right
thing at all.

is anything forcing us to fix up mappings during a truncate? What we
need is just for the FS to recognize that pages behind end-of-inode may
still exist after truncation, if those areas were mapped before the
truncation. Apps that do not keep up to date with truncators can get
out-of-date data via read()/write() anyway. Are there good
arguments to be this strict across truncate()? We sure could make it safe
even though it's not safe currently.

	Ingo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 16:16                                 ` Martin J. Bligh
  2003-04-22 17:24                                   ` Ingo Molnar
@ 2003-04-22 17:45                                   ` John Bradford
  1 sibling, 0 replies; 105+ messages in thread
From: John Bradford @ 2003-04-22 17:45 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, mingo, hugh, dmccr,
	Linus Torvalds, linux-kernel, linux-mm

> > make almost zero noticeable difference on a 768 MB system - i have a 768
> > MB system. Whether 1MB of extra RAM on a 128 MB system will make more of a
> > difference than a predictable VM - i don't know, it probably depends on the
> > app, but i'd go for more RAM. But it will make a _hell_ of a difference on
> > a 1 TB RAM 64-bit system where the sharing factor explodes. And that's
> > where Linux usage will be by the time 2.6-based systems go into production.

> You obviously have a somewhat different timeline in mind for 2.6 than the
> rest of us ;-)

It's certainly where Linux usage will be before 2.8 is ready.

(and anyway, I'm sure there's a subsystem that we haven't _yet_
re-written during the feature freeze...  :-) )


John.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 17:34                                         ` Ingo Molnar
@ 2003-04-22 18:04                                           ` Benjamin LaHaise
  0 siblings, 0 replies; 105+ messages in thread
From: Benjamin LaHaise @ 2003-04-22 18:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 01:34:46PM -0400, Ingo Molnar wrote:
> is anything forcing us to fix up mappings during a truncate? What we
> need is just for the FS to recognize that pages behind end-of-inode may
> still exist after truncation, if those areas were mapped before the
> truncation. Apps that do not keep up to date with truncators can get
> out-of-date data via read()/write() anyway. Are there good
> arguments to be this strict across truncate()? We sure could make it safe
> even though it's not safe currently.

Yes: access beyond EOF is required to SIGBUS according to various 
standards.  But keep in mind that this is a slow path and doesn't have to 
be anywhere near optimal, unlike page reclaim.
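
A minimal userspace illustration of that SIGBUS rule (not from this
thread; it assumes a writable scratch file and that the second page of
the mapping lies wholly beyond EOF after the truncate):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/sigbus-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, 2 * pg) < 0)
		exit(1);
	p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	p[0] = 'x';			/* within the file: fine */
	ftruncate(fd, pg);		/* shrink the file to one page */
	printf("touching a page now wholly beyond EOF...\n");
	p[pg] = 'x';			/* beyond EOF: the kernel raises SIGBUS */
	return 0;
}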

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: objrmap and vmtruncate
  2003-04-22 17:21                                         ` William Lee Irwin III
@ 2003-04-22 18:08                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 105+ messages in thread
From: Andrea Arcangeli @ 2003-04-22 18:08 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar, Andrew Morton, mbligh, mingo,
	hugh, dmccr, Linus Torvalds, linux-kernel, linux-mm

On Tue, Apr 22, 2003 at 10:21:10AM -0700, William Lee Irwin III wrote:
> On Tue, Apr 22, 2003 at 06:57:46PM +0200, Andrea Arcangeli wrote:
> > could we focus and solve the remap_file_pages current breakage first?
> > I proposed my fix that IMHO is optimal and simple (I recall Hugh also
> > proposed something on these lines):
> > 1) allow it only inside mmap(VM_NONLINEAR) vmas
> > 2) have the VM skip over VM_NONLINEAR vmas entirely
> > 3) set vma->vm_file to NULL for those vmas and forbid paging and allow
> >    multiple files to be mapped in the same nonlinear vma (add an fd
> >    parameter to the syscall)
> > 4) enable it as non-root (w/o IPC_LOCK capability) only with a sysctl
> >    enabled
> > 5) avoid any overhead connected with the potential paging of the
> >    nonlinear vmas
> 
> Some of these are controversial; it _can_ be fixed in other ways, but

which ones, could you elaborate? I don't see anything controversial in
the above points.

> I'm trying to be objective here, though my own bias is in favor of full
> retention of functionality (i.e. best of both worlds, maximal code
> complexity).

I'm not at all against code complexity when it is worthwhile, but given
the potential ram and cpu overhead of the complex code and considering
this stuff should be ready in a few months I would prefer to keep things
simple.  Especially because if you really want, as I said, you could make
things complex (and slower ;) behind this new API w/o userspace
noticing.  Then you can drop the sysctl (or make it insignificant).

> On Tue, Apr 22, 2003 at 06:57:46PM +0200, Andrea Arcangeli wrote:
> > 6) populate it with pmd on hugetlbfs
> > 7) if a truncate happens leave the page pinned outside the pagecache
> >    but still mapped into userspace, we don't care about it and it will
> >    be freed during the munmap of the nonlinear vma
> 
> I'll implement the hugetlbfs part; it should fit nicely into the
> infrastructure introduced with the rest of the virtwin patch, all it
> really needs is some additional error checking. hugetlbfs is 100%
> CAP_IPC_LOCK -- there should be no issue under either scheme.

yes.

Andrea

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2003-04-22 17:57 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-04 14:34 objrmap and vmtruncate Hugh Dickins
2003-04-04 16:14 ` William Lee Irwin III
2003-04-04 16:29   ` Hugh Dickins
2003-04-04 18:54 ` Andrew Morton
2003-04-04 21:43   ` Hugh Dickins
2003-04-04 21:45   ` Andrea Arcangeli
2003-04-04 21:58     ` Benjamin LaHaise
2003-04-04 23:07     ` Andrew Morton
2003-04-05  0:03       ` Andrea Arcangeli
2003-04-05  0:31         ` Andrew Morton
2003-04-05  1:31           ` Andrea Arcangeli
2003-04-05  1:52             ` Benjamin LaHaise
2003-04-05  2:22               ` Andrea Arcangeli
2003-04-05 10:01                 ` Jamie Lokier
2003-04-05 10:11                   ` William Lee Irwin III
2003-04-05  2:06             ` Andrew Morton
2003-04-05  2:24               ` Andrea Arcangeli
2003-04-05  2:13           ` Martin J. Bligh
2003-04-05  2:44             ` Andrea Arcangeli
2003-04-05  3:24               ` Andrew Morton
2003-04-05 12:06                 ` Andrew Morton
2003-04-05 15:11                   ` Martin J. Bligh
     [not found]                     ` <20030405161758.1ee19bfa.akpm@digeo.com>
2003-04-06  7:07                       ` William Lee Irwin III
2003-04-05 16:30                   ` Andrea Arcangeli
2003-04-05 19:01                     ` Andrea Arcangeli
2003-04-05 20:14                       ` Andrew Morton
2003-04-05 21:24                     ` Andrew Morton
2003-04-05 22:06                       ` Andrea Arcangeli
2003-04-05 22:31                         ` Andrew Morton
2003-04-05 23:10                           ` Andrea Arcangeli
2003-04-06  1:58                             ` Andrew Morton
2003-04-06 14:47                               ` Andrea Arcangeli
2003-04-06 21:35                                 ` William Lee Irwin III
2003-04-06  7:38                             ` William Lee Irwin III
2003-04-06 14:51                               ` Andrea Arcangeli
2003-04-06 12:37                           ` Jamie Lokier
2003-04-06 13:12                             ` William Lee Irwin III
2003-04-22 11:00                           ` Ingo Molnar
2003-04-22 11:54                             ` William Lee Irwin III
2003-04-22 14:31                               ` Ingo Molnar
2003-04-22 14:56                                 ` William Lee Irwin III
2003-04-22 15:26                                   ` Ingo Molnar
2003-04-22 16:20                                     ` William Lee Irwin III
2003-04-22 16:57                                       ` Andrea Arcangeli
2003-04-22 17:21                                         ` William Lee Irwin III
2003-04-22 18:08                                           ` Andrea Arcangeli
2003-04-22 17:34                                         ` Ingo Molnar
2003-04-22 18:04                                           ` Benjamin LaHaise
2003-04-22 16:58                                       ` Martin J. Bligh
2003-04-22 12:37                             ` Andrea Arcangeli
2003-04-22 13:20                               ` William Lee Irwin III
2003-04-22 14:38                                 ` Martin J. Bligh
2003-04-22 15:10                                   ` William Lee Irwin III
2003-04-22 15:53                                     ` Martin J. Bligh
2003-04-22 14:52                                 ` Andrea Arcangeli
2003-04-22 14:29                             ` Martin J. Bligh
2003-04-22 15:07                               ` Ingo Molnar
2003-04-22 15:42                                 ` William Lee Irwin III
2003-04-22 15:55                                   ` Ingo Molnar
2003-04-22 16:58                                     ` William Lee Irwin III
2003-04-22 17:07                                       ` Ingo Molnar
2003-04-22 15:16                               ` Andrea Arcangeli
2003-04-22 15:49                               ` Ingo Molnar
2003-04-22 16:16                                 ` Martin J. Bligh
2003-04-22 17:24                                   ` Ingo Molnar
2003-04-22 17:45                                   ` John Bradford
2003-04-22 14:32                             ` Martin J. Bligh
2003-04-22 15:09                               ` Ingo Molnar
2003-04-05 21:34                     ` Rik van Riel
2003-04-06  9:29                     ` Benjamin LaHaise
2003-04-05 23:25                   ` William Lee Irwin III
2003-04-05 23:57                     ` Andrew Morton
2003-04-06  0:14                       ` Andrea Arcangeli
2003-04-06  1:39                         ` Andrew Morton
2003-04-06  2:13                       ` William Lee Irwin III
2003-04-06  9:26                     ` Benjamin LaHaise
2003-04-06  9:41                       ` William Lee Irwin III
2003-04-06  9:54                         ` William Lee Irwin III
2003-04-06  2:23                   ` Martin J. Bligh
2003-04-06  3:55                     ` Andrew Morton
2003-04-06  3:08                       ` Martin J. Bligh
2003-04-06  7:42                         ` William Lee Irwin III
2003-04-06 14:49                     ` Alan Cox
2003-04-06 16:13                       ` Martin J. Bligh
2003-04-06 21:34                         ` subobj-rmap Martin J. Bligh
2003-04-06 21:42                           ` subobj-rmap Rik van Riel
2003-04-06 21:52                             ` subobj-rmap Davide Libenzi
2003-04-06 21:55                             ` subobj-rmap Jamie Lokier
2003-04-06 22:39                               ` subobj-rmap William Lee Irwin III
2003-04-06 22:03                             ` subobj-rmap Martin J. Bligh
2003-04-06 22:06                               ` subobj-rmap Martin J. Bligh
2003-04-06 22:15                               ` subobj-rmap Andrea Arcangeli
2003-04-06 22:25                                 ` subobj-rmap Martin J. Bligh
2003-04-07 21:25                                   ` subobj-rmap Andrea Arcangeli
2003-04-06 23:06                               ` subobj-rmap Jamie Lokier
2003-04-06 23:26                                 ` subobj-rmap Martin J. Bligh
2003-04-05  3:45               ` objrmap and vmtruncate Martin J. Bligh
2003-04-05  3:59               ` Rik van Riel
2003-04-05  4:10                 ` William Lee Irwin III
2003-04-05  4:49                   ` Martin J. Bligh
2003-04-05 13:31                     ` Rik van Riel
2003-04-05  4:52               ` Martin J. Bligh
2003-04-05  3:22             ` Andrew Morton
2003-04-05  3:35               ` Martin J. Bligh
2003-04-05  3:53       ` Rik van Riel
