* copy_page_range()
@ 2004-08-07  7:05 David S. Miller
  2004-08-07  8:07 ` copy_page_range() William Lee Irwin III
  2004-08-09  9:01 ` copy_page_range() David Mosberger
  0 siblings, 2 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-07  7:05 UTC (permalink / raw)
  To: torvalds; +Cc: linux-arch


Every couple months I look at this thing.

The main issue is that it's very cache unfriendly,
especially with how sparsely populated the page tables
are for 64-bit processes.

As a simple example, it's at the top of the kernel
profile for 64-bit lat_proc {fork,exec,shell} on
sparc64.

And it's in fact the pmd array scans that take all
of the cache misses, and thus most of the run time.

An idea I've always been entertaining is to associate
a bitmask with each pmd table.  For example, a possible
current implementation could be to abuse page_struct->index
for this bitmask, and use virt_to_page(pmdp)->index to get
at it.

This divides the pmd table into BITS_PER_LONG sections.
If the bit is set in ->index then we populated at least
one of the pmd entries in that section.  We never clear
bits, except at pmd table allocation time.

Then the pmd scan iterates over ->index, and only actually
dereferences the pmd entries iff it finds a set bit, and
it only dereferences the section of pmd entries represented
by that bit.
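An illustrative user-space sketch of the bitmask idea (the struct, sizes, and helper names here are invented for the toy; the real thing would key the bitmask off virt_to_page(pmdp)->index as described above):

```c
#include <assert.h>
#include <string.h>

/* Toy model of one pmd table with a presence bitmask.  PTRS_PER_PMD and
 * the field layout are illustrative; ->index stands in for the abused
 * page->index of the pmd table's backing page. */
#define PTRS_PER_PMD	512
#define BITS_PER_LONG	(8 * (int)sizeof(unsigned long))
#define SECTION_SIZE	(PTRS_PER_PMD / BITS_PER_LONG)	/* entries per bit */

struct pmd_table {
	void *entry[PTRS_PER_PMD];
	unsigned long index;	/* one bit per SECTION_SIZE-entry section */
};

static void pmd_populate(struct pmd_table *t, int i, void *ptep)
{
	t->entry[i] = ptep;
	/* mark the section; bits are never cleared until the table dies */
	t->index |= 1UL << (i / SECTION_SIZE);
}

/* Scan only sections whose bit is set; *touched counts how many pmd
 * slots were actually dereferenced. */
static int pmd_scan(struct pmd_table *t, int *touched)
{
	int found = 0, bit;

	*touched = 0;
	for (bit = 0; bit < BITS_PER_LONG; bit++) {
		int i;

		if (!(t->index & (1UL << bit)))
			continue;	/* whole section known empty */
		for (i = bit * SECTION_SIZE; i < (bit + 1) * SECTION_SIZE; i++) {
			(*touched)++;
			if (t->entry[i])
				found++;
		}
	}
	return found;
}
```

With two populated entries in different sections, only 2 * SECTION_SIZE of the 512 slots get dereferenced; for the sparse tables of a 64-bit process, most sections (and their cache misses) are skipped entirely.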

Another idea I've also considered is to implement the
pgd/pmd levels as a more compact tree, based upon virtual
address, such as a radix tree.

I think all of this could be experimented with if we
abstracted out the pmd/pgd/pte iteration.  So much stuff
in the kernel mm code is of the form:

	for_each_pgd(pgdp)
		for_each_pmd(pgdp, pmdp)
			for_each_pte(pmdp, ptep)
				do_something(ptep)

At 2-levels, as on most of the 32-bit platforms, things
aren't so bad.
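The shape of that nested iteration can be sketched in user space (a toy with invented types and four slots per level, not the kernel's real pgd_t/pmd_t; it only shows how the "skip empty slots" test folds into each iterator):

```c
#include <stddef.h>
#include <assert.h>

/* Toy 3-level table: 4 slots per level. */
#define PTRS 4
typedef struct { int set; } pte_t;
typedef struct { pte_t *pte[PTRS]; } pmd_t;
typedef struct { pmd_t *pmd[PTRS]; } pgd_t;

/* Iterate only populated slots, mirroring the pgd_none()/pmd_none()
 * checks; an arch could override these with a bitmask-guided scan. */
#define for_each_pmd(pgdp, pmdp) \
	for (int _i = 0; _i < PTRS; _i++) \
		if (((pmdp) = (pgdp)->pmd[_i]) != NULL)

#define for_each_pte(pmdp, ptep) \
	for (int _j = 0; _j < PTRS; _j++) \
		if (((ptep) = (pmdp)->pte[_j]) != NULL)

static int count_ptes(pgd_t *pgdp)
{
	pmd_t *pmdp;
	pte_t *ptep;
	int n = 0;

	for_each_pmd(pgdp, pmdp)
		for_each_pte(pmdp, ptep)
			n++;
	return n;
}
```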

Comments?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: copy_page_range()
  2004-08-07  7:05 copy_page_range() David S. Miller
@ 2004-08-07  8:07 ` William Lee Irwin III
  2004-08-11  7:07   ` copy_page_range() David S. Miller
  2004-08-09  9:01 ` copy_page_range() David Mosberger
  1 sibling, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-07  8:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-arch

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> Every couple months I look at this thing.
> The main issue is that it's very cache unfriendly,
> especially with how sparsely populated the page tables
> are for 64-bit processes.
> As a simple example, it's at the top of the kernel
> profile for 64-bit lat_proc {fork,exec,shell} on
> sparc64.
> And it's in fact the pmd array scans that take all
> of the cache misses, and thus most of the run time.
> An idea I've always been entertaining is to associate
> a bitmask with each pmd table.  For example, a possible
> current implementation could be to abuse page_struct->index
> for this bitmask, and use virt_to_page(pmdp)->index to get
> at it.

Sounds generally reasonable.


On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> This divides the pmd table into BITS_PER_LONG sections.
> If the bit is set in ->index then we populated at least
> one of the pmd entries in that section.  We never clear
> bits, except at pmd table allocation time.
> Then the pmd scan iterates over ->index, and only actually
> dereferences the pmd entries iff it finds a set bit, and
> it only dereferences the section of pmd entries represented
> by that bit.
> Another idea I've also considered is to implement the
> pgd/pmd levels as a more compact tree, based upon virtual
> address, such as a radix tree.
> I think all of this could be experimented with if we
> abstracted out the pmd/pgd/pte iteration.  So much stuff
> in the kernel mm code is of the form:
> 	for_each_pgd(pgdp)
> 		for_each_pmd(pgdp, pmdp)
> 			for_each_pte(pmdp, ptep)
> 				do_something(ptep)
> At 2-levels, as on most of the 32-bit platforms, things
> aren't so bad.
> Comments?

The number of levels can be abstracted easily. Something like this
gives an idea of how:

struct pte_walk_state {
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long vaddr;
};

int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
						struct vm_area_struct *vma)
{
	int cow, ret = 0;
	struct pte_walk_state walk_parent, walk_child;

	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	spin_lock(&dst->page_table_lock);
	pte_walk_descend_and_create(dst, &walk_child, vma->vm_start);
	for_each_inuse_pte(src, &walk_parent, vma->vm_start, vma->vm_end) {
		if (pte_walk_move_and_create(&walk_child, walk_parent.vaddr)) {
			ret = -ENOMEM;
			break;
		}
		/*
		 * do stuff to child and parent ptes
		 */
		 ...
	}
	spin_unlock(&dst->page_table_lock);
	return ret;
}

void zap_page_range(struct vm_area_struct *vma, unsigned long start,
				unsigned long len, struct zap_details *details)
{
	struct pte_walk_state walk;

	spin_lock(&vma->vm_mm->page_table_lock);
	for_each_inuse_pte(vma->vm_mm, &walk, vma->vm_start, vma->vm_end) {
		/*
		 * wipe pte and do stuff
		 */
		...
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
}

where #define for_each_inuse_pte(mm, walk, start, end) \
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end); \
						next_inuse_pte(walk))

etc.


-- wli


* Re: copy_page_range()
  2004-08-07  7:05 copy_page_range() David S. Miller
  2004-08-07  8:07 ` copy_page_range() William Lee Irwin III
@ 2004-08-09  9:01 ` David Mosberger
  2004-08-09  9:04   ` copy_page_range() William Lee Irwin III
  2004-08-09 17:45   ` copy_page_range() David S. Miller
  1 sibling, 2 replies; 16+ messages in thread
From: David Mosberger @ 2004-08-09  9:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-arch

>>>>> On Sat, 7 Aug 2004 00:05:29 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> Every couple months I look at this thing.

  DaveM> The main issue is that it's very cache unfriendly, especially
  DaveM> with how sparsely populated the page tables are for 64-bit
  DaveM> processes.

  DaveM> As a simple example, it's at the top of the kernel profile
  DaveM> for 64-bit lat_proc {fork,exec,shell} on sparc64.

I didn't recall copy_page_range() being so high on ia64, but it's been
a while since I looked at this, so I ran it again (this is with a simple
fork() loop; lmbench is trying to be too clever for me so I don't like
profiling it...):

% time      self     cumul     calls self/call  tot/call name
 36.89     10.78     10.78      267k     40.4u     40.4u clear_page_tables
 25.99      7.59     18.37      573k     13.2u     13.2u copy_page
 11.42      3.34     21.70     2.07M     1.61u     1.61u clear_page
  2.26      0.66     22.36     1.71M      385n      428n copy_page_range
  1.64      0.48     22.84      546k      878n      898n finish_task_switch
  1.50      0.44     23.28      314k     1.39u     1.48u unmap_vmas
  1.37      0.40     23.68     4.07M     98.2n     98.2n __copy_user
  1.32      0.39     24.06      302k     1.27u     1.38u release_task
  1.19      0.35     24.41     5.73M     60.6n     60.6n page_remove_rmap
  1.17      0.34     24.75     2.98M      114n      126n buffered_rmqueue
  1.01      0.30     25.05      316k      933n     6.37u copy_process
  0.92      0.27     25.32     2.68M      101n      107n free_hot_cold_page
  0.67      0.20     25.51     6.02M     32.7n     32.7n put_page

I suspect some reasons for the different profile may be:

 - 16KB page-size vs. 4KB page-size
 - My binary was statically linked

The good news is that your proposal should help clear_page_tables()
just as easily as copy_page_range(). ;-)

	--david


* Re: copy_page_range()
  2004-08-09  9:01 ` copy_page_range() David Mosberger
@ 2004-08-09  9:04   ` William Lee Irwin III
  2004-08-09  9:27     ` copy_page_range() David Mosberger
  2004-08-09 17:08     ` copy_page_range() Linus Torvalds
  2004-08-09 17:45   ` copy_page_range() David S. Miller
  1 sibling, 2 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-09  9:04 UTC (permalink / raw)
  To: davidm; +Cc: David S. Miller, torvalds, linux-arch

On Mon, Aug 09, 2004 at 02:01:37AM -0700, David Mosberger wrote:
> I didn't recall copy_page_range() being so high on ia64, but it's been
> a while since I looked at this, so I ran it again (this is with a simple
> fork() loop; lmbench is trying to be too clever for me so I don't like
> profiling it...):
> % time      self     cumul     calls self/call  tot/call name
>  36.89     10.78     10.78      267k     40.4u     40.4u clear_page_tables
>  25.99      7.59     18.37      573k     13.2u     13.2u copy_page
>  11.42      3.34     21.70     2.07M     1.61u     1.61u clear_page
> I suspect some reasons for the different profile may be:
>  - 16KB page-size vs. 4KB page-size
>  - My binary was statically linked
> The good news is that your proposal should help clear_page_tables()
> just as easily as copy_page_range(). ;-)

These results are actually consistent with large-memory ia32.
Instruction-level profiles showed that the largest overhead in
copy_page_range() on such ia32 boxen appeared to be mm->rss++.


-- wli


* Re: copy_page_range()
  2004-08-09  9:04   ` copy_page_range() William Lee Irwin III
@ 2004-08-09  9:27     ` David Mosberger
  2004-08-09  9:29       ` copy_page_range() William Lee Irwin III
  2004-08-09 17:46       ` copy_page_range() David S. Miller
  2004-08-09 17:08     ` copy_page_range() Linus Torvalds
  1 sibling, 2 replies; 16+ messages in thread
From: David Mosberger @ 2004-08-09  9:27 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: davidm, David S. Miller, torvalds, linux-arch

>>>>> On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:

  William> These results are actually consistent with large-memory
  William> ia32.  Instruction-level profiles showed that the largest
  William> overhead in copy_page_range() on such ia32 boxen appeared
  William> to be mm->rss++.

Hmmh, for me, the single biggest stall seems to come from the pmd_none()
check in free_one_pmd().

	--david


* Re: copy_page_range()
  2004-08-09  9:27     ` copy_page_range() David Mosberger
@ 2004-08-09  9:29       ` William Lee Irwin III
  2004-08-09 10:01         ` copy_page_range() David Mosberger
  2004-08-09 17:46       ` copy_page_range() David S. Miller
  1 sibling, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-09  9:29 UTC (permalink / raw)
  To: davidm; +Cc: David S. Miller, torvalds, linux-arch

On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:
William> These results are actually consistent with large-memory
William> ia32.  Instruction-level profiles showed that the largest
William> overhead in copy_page_range() on such ia32 boxen appeared
William> to be mm->rss++.

On Mon, Aug 09, 2004 at 02:27:16AM -0700, David Mosberger wrote:
> Hmmh, for me, the single biggest stall seems to come from the pmd_none()
> check in free_one_pmd().

That was the case in clear_page_tables(); it was copy_page_range() that
saw mm->rss++ take an unusual amount of time.


-- wli


* Re: copy_page_range()
  2004-08-09  9:29       ` copy_page_range() William Lee Irwin III
@ 2004-08-09 10:01         ` David Mosberger
  0 siblings, 0 replies; 16+ messages in thread
From: David Mosberger @ 2004-08-09 10:01 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: davidm, David S. Miller, torvalds, linux-arch

>>>>> On Mon, 9 Aug 2004 02:29:43 -0700, William Lee Irwin III <wli@holomorphy.com> said:

  William> On Mon, Aug 09, 2004 at 02:27:16AM -0700, David Mosberger
  William> wrote:
  >> Hmmh, for me, the single biggest stall seems to come from the
  >> pmd_none() check in free_one_pmd().

  William> That was the case in clear_page_tables(); it was
  William> copy_page_range() that saw mm->rss++ take an unusual amount
  William> of time.

Sorry, I misread your mail.  In my case, the biggest staller in
copy_page_range() seems to be page_dup_rmap() (right after
dst->rss++).  Specifically, the test_and_set_bit() which comes from
page_dup_rmap()->page_map_lock()->bit_spin_lock() is causing the
stalls.

	--david
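The path named above can be sketched as a user-space analog with C11 atomics (an illustration, not the kernel's actual bit_spinlock.h; the bit number and helper shapes are invented). The expensive part is the atomic read-modify-write in test_and_set_bit(), which must take the cacheline exclusive on every fork even when the lock is uncontended:

```c
#include <stdatomic.h>
#include <assert.h>

/* Hypothetical flag-bit number, standing in for the rmap lock bit
 * in page->flags. */
#define PG_maplock 7

static int test_and_set_bit(int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << nr;

	/* Atomic RMW: this is the operation that forces the cacheline
	 * exclusive and causes the stalls under discussion. */
	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static void bit_spin_lock(int nr, _Atomic unsigned long *addr)
{
	while (test_and_set_bit(nr, addr))
		;	/* spin until the current owner clears the bit */
}

static void bit_spin_unlock(int nr, _Atomic unsigned long *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}
```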


* Re: copy_page_range()
  2004-08-09  9:04   ` copy_page_range() William Lee Irwin III
  2004-08-09  9:27     ` copy_page_range() David Mosberger
@ 2004-08-09 17:08     ` Linus Torvalds
  2004-08-09 18:49       ` copy_page_range() William Lee Irwin III
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2004-08-09 17:08 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: davidm, David S. Miller, linux-arch



On Mon, 9 Aug 2004, William Lee Irwin III wrote:
> 
> These results are actually consistent with large-memory ia32.
> Instruction-level profiles showed that the largest overhead in
> copy_page_range() on such ia32 boxen appeared to be mm->rss++.

That sounds unlikely. Most ia32 instruction profiles will give high 
profile counts to instructions _following_ the one that was expensive, and 
in this case I'd strongly suspect that the real expense on x86 is the
"get_page(page)" thing. 

Which is an atomic increment, and thus very expensive.

		Linus


* Re: copy_page_range()
  2004-08-09  9:01 ` copy_page_range() David Mosberger
  2004-08-09  9:04   ` copy_page_range() William Lee Irwin III
@ 2004-08-09 17:45   ` David S. Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-09 17:45 UTC (permalink / raw)
  To: davidm; +Cc: davidm, torvalds, linux-arch

On Mon, 9 Aug 2004 02:01:37 -0700
David Mosberger <davidm@napali.hpl.hp.com> wrote:

> I didn't recall copy_page_range() being so high on ia64, but it's been
> a while since I looked at this, so I ran it again

I really meant clear_page_tables(), sorry. :-)


* Re: copy_page_range()
  2004-08-09  9:27     ` copy_page_range() David Mosberger
  2004-08-09  9:29       ` copy_page_range() William Lee Irwin III
@ 2004-08-09 17:46       ` David S. Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-09 17:46 UTC (permalink / raw)
  To: davidm; +Cc: davidm, wli, torvalds, linux-arch

On Mon, 9 Aug 2004 02:27:16 -0700
David Mosberger <davidm@napali.hpl.hp.com> wrote:

> >>>>> On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:
> 
>   William> These results are actually consistent with large-memory
>   William> ia32.  Instruction-level profiles showed that the largest
>   William> overhead in copy_page_range() on such ia32 boxen appeared
>   William> to be mm->rss++.
> 
> Hmmh, for me, the single biggest stall seems to come from the pmd_none()
> check in free_one_pmd().

Right, that is what gets hit on sparc64 too.

On ia32, the tables are half the size, and thus half as many
memory accesses per table traversal.


* Re: copy_page_range()
  2004-08-09 17:08     ` copy_page_range() Linus Torvalds
@ 2004-08-09 18:49       ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-09 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: davidm, David S. Miller, linux-arch

On Mon, 9 Aug 2004, William Lee Irwin III wrote:
>> These results are actually consistent with large-memory ia32.
>> Instruction-level profiles showed that the largest overhead in
>> copy_page_range() on such ia32 boxen appeared to be mm->rss++.

On Mon, Aug 09, 2004 at 10:08:05AM -0700, Linus Torvalds wrote:
> That sounds unlikely. Most ia32 instruction profiles will give high 
> profile counts to instructions _following_ the one that was expensive, and 
> in this case I'd strongly suspect that the real expense on x86 is the 
> "get_page(page)" thing. 
> Which is an atomic increment, and thus very expensive.

But it was real. The theory is that mm->rss++; was an off-node memory
access, where struct page (due to boot-time remapping voodoo) and pmd's
(thanks to my patchwerk) were node-local, and the 40:1 off-node memory
access latency for a remote cache miss (i.e. ZONE_NORMAL) killed it all.

Thankfully Oracle has me parked on 64-bit machines with cache
directories and vaguely speedy interconnects for this kind of work.


-- wli


* Re: copy_page_range()
  2004-08-07  8:07 ` copy_page_range() William Lee Irwin III
@ 2004-08-11  7:07   ` David S. Miller
  2004-08-11  7:35     ` copy_page_range() William Lee Irwin III
  2004-08-11 16:13     ` copy_page_range() Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-11  7:07 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: torvalds, linux-arch

On Sat, 7 Aug 2004 01:07:51 -0700
William Lee Irwin III <wli@holomorphy.com> wrote:

> The number of levels can be abstracted easily. Something to give an
> idea of how might be something like this:

I hacked up something slightly different today.  I only
have it being used by clear_page_range() but it is extremely
effective.

Things like fork+exit latencies on my 750Mhz sparc64 box went
from ~490 microseconds to ~367 microseconds.  fork+execve
latency went down from ~1595 microseconds to ~1351 microseconds.

Two issues:

1) I'm not terribly satisfied with the interface.  I think
   with some improvements it can be applied to the two other
   routines this thing really makes sense for, namely copy_page_range
   and unmap_page_range.

2) I don't think it will collapse well for 2-level page tables,
   someone take a look?

It's easy to toy with the sparc64 optimization on other platforms:
just add the necessary hacks to pmd_set and pgd_set and to the
allocation of pmd and pgd tables, use "PAGE_SHIFT - 5" instead of "PAGE_SHIFT - 6"
on 32-bit platforms, and then copy the asm-sparc64/pgwalk.h bits over
into your platform's asm-${ARCH}/pgwalk.h.

I also just got reminded that we walk these damn pagetables completely
twice every exit: once to unmap the VMAs' pte mappings, and once again
to zap the page tables.  It might be fruitful to explore combining
those two steps, perhaps not.

Anyways, comments and improvement suggestions welcome.  Particularly
interesting would be if this thing helps a lot on other platforms
too, such as x86_64, ia64, alpha and ppc64.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/08/10 23:44:24-07:00 davem@nuts.davemloft.net 
#   [MM]: Add arch-overridable page table walking machinery.
#   
#   Currently very rudimentary but is used fully for
#   clear_page_range().  An optimized implementation
#   is there for sparc64 and it is extremely effective
#   particularly for 64-bit processes.
#   
#   For things like lat_fork and friends, clear_page_tables()
#   used to be 2nd or 3rd in the kernel profile; now it has
#   dropped to the 20th or so entry.
#   
#   Signed-off-by: David S. Miller <davem@redhat.com>
# 
# mm/memory.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -26
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgtable.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +28 -4
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgalloc.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -2
#   [MM]: Add arch-overridable page table walking machinery.
# 
# arch/sparc64/mm/init.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +2 -2
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-x86_64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-v850/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-um/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +114 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sh64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sh/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-s390/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ppc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ppc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-parisc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-mips/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-m68knommu/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-m68k/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ia64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-i386/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-h8300/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-generic/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +96 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-cris/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-arm26/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-arm/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-x86_64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-x86_64/pgwalk.h
# 
# include/asm-v850/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-v850/pgwalk.h
# 
# include/asm-um/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-um/pgwalk.h
# 
# include/asm-sparc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc64/pgwalk.h
# 
# include/asm-sparc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc/pgwalk.h
# 
# include/asm-sh64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh64/pgwalk.h
# 
# include/asm-sh/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh/pgwalk.h
# 
# include/asm-s390/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-s390/pgwalk.h
# 
# include/asm-ppc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc64/pgwalk.h
# 
# include/asm-ppc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc/pgwalk.h
# 
# include/asm-parisc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-parisc/pgwalk.h
# 
# include/asm-mips/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-mips/pgwalk.h
# 
# include/asm-m68knommu/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68knommu/pgwalk.h
# 
# include/asm-m68k/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68k/pgwalk.h
# 
# include/asm-ia64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ia64/pgwalk.h
# 
# include/asm-i386/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-i386/pgwalk.h
# 
# include/asm-h8300/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-h8300/pgwalk.h
# 
# include/asm-generic/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-generic/pgwalk.h
# 
# include/asm-cris/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-cris/pgwalk.h
# 
# include/asm-arm26/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm26/pgwalk.h
# 
# include/asm-arm/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm/pgwalk.h
# 
# include/asm-alpha/pgwalk.h
#   2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-alpha/pgwalk.h
#   2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-alpha/pgwalk.h
# 
diff -Nru a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
--- a/arch/sparc64/mm/init.c	2004-08-10 23:44:47 -07:00
+++ b/arch/sparc64/mm/init.c	2004-08-10 23:44:47 -07:00
@@ -419,7 +419,7 @@
 					if (ptep == NULL)
 						early_pgtable_allocfail("pte");
 					memset(ptep, 0, BASE_PAGE_SIZE);
-					pmd_set(pmdp, ptep);
+					pmd_set_k(pmdp, ptep);
 				}
 				ptep = (pte_t *)__pmd_page(*pmdp) +
 						((vaddr >> 13) & 0x3ff);
@@ -1455,7 +1455,7 @@
 	memset(swapper_pmd_dir, 0, sizeof(swapper_pmd_dir));
 
 	/* Now can init the kernel/bad page tables. */
-	pgd_set(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t)));
+	pgd_set_k(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t)));
 	
 	sparc64_vpte_patchme1[0] |=
 		(((unsigned long)pgd_val(init_mm.pgd[0])) >> 10);
diff -Nru a/include/asm-alpha/pgwalk.h b/include/asm-alpha/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-alpha/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ALPHA_PGWALK_H
+#define _ALPHA_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ALPHA_PGWALK_H */
diff -Nru a/include/asm-arm/pgwalk.h b/include/asm-arm/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-arm/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ARM_PGWALK_H
+#define _ARM_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ARM_PGWALK_H */
diff -Nru a/include/asm-arm26/pgwalk.h b/include/asm-arm26/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-arm26/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ARM26_PGWALK_H
+#define _ARM26_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ARM26_PGWALK_H */
diff -Nru a/include/asm-cris/pgwalk.h b/include/asm-cris/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-cris/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _CRIS_PGWALK_H
+#define _CRIS_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _CRIS_PGWALK_H */
diff -Nru a/include/asm-generic/pgwalk.h b/include/asm-generic/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-generic/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,96 @@
+#ifndef _GENERIC_PGWALK_H
+#define _GENERIC_PGWALK_H
+
+#include <linux/mm.h>
+
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+struct pte_walk_state;
+typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *);
+typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *);
+typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *);
+
+struct pte_walk_state {
+	void *_client_state;
+	void *first;
+	void *last;
+};
+
+static inline void *pte_walk_client_state(struct pte_walk_state *walk)
+{
+	return walk->_client_state;
+}
+
+static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work)
+{
+	pte_t *ptep = walk->first;
+	pte_t *last = walk->last;
+
+	do {
+		if (pte_none(*ptep))
+			goto next;
+		pte_work(walk, ptep);
+	next:
+		ptep++;
+	} while (ptep < last);
+}
+
+static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work)
+{
+	pmd_t *page_dir = walk->first;
+	pmd_t *last = walk->last;
+
+	do {
+		if (pmd_none(*page_dir))
+			goto next;
+		if (unlikely(pmd_bad(*page_dir))) {
+			pmd_ERROR(*page_dir);
+			pmd_clear(page_dir);
+			goto next;
+		}
+		pmd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last)
+{
+	walk->_client_state = client_state;
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work)
+{
+	pgd_t *page_dir = walk->first;
+	pgd_t *last = walk->last;
+
+	do {
+		if (pgd_none(*page_dir))
+			goto next;
+		if (unlikely(pgd_bad(*page_dir))) {
+			pgd_ERROR(*page_dir);
+			pgd_clear(page_dir);
+			goto next;
+		}
+		pgd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+#endif /* _GENERIC_PGWALK_H */
diff -Nru a/include/asm-h8300/pgwalk.h b/include/asm-h8300/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-h8300/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _H8300_PGWALK_H
+#define _H8300_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _H8300_PGWALK_H */
diff -Nru a/include/asm-i386/pgwalk.h b/include/asm-i386/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _I386_PGWALK_H
+#define _I386_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _I386_PGWALK_H */
diff -Nru a/include/asm-ia64/pgwalk.h b/include/asm-ia64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-ia64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _IA64_PGWALK_H
+#define _IA64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _IA64_PGWALK_H */
diff -Nru a/include/asm-m68k/pgwalk.h b/include/asm-m68k/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-m68k/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _M68K_PGWALK_H
+#define _M68K_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _M68K_PGWALK_H */
diff -Nru a/include/asm-m68knommu/pgwalk.h b/include/asm-m68knommu/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-m68knommu/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _M68KNOMMU_PGWALK_H
+#define _M68KNOMMU_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _M68KNOMMU_PGWALK_H */
diff -Nru a/include/asm-mips/pgwalk.h b/include/asm-mips/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-mips/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _MIPS_PGWALK_H
+#define _MIPS_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _MIPS_PGWALK_H */
diff -Nru a/include/asm-parisc/pgwalk.h b/include/asm-parisc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-parisc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PARISC_PGWALK_H
+#define _PARISC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PARISC_PGWALK_H */
diff -Nru a/include/asm-ppc/pgwalk.h b/include/asm-ppc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-ppc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PPC_PGWALK_H
+#define _PPC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PPC_PGWALK_H */
diff -Nru a/include/asm-ppc64/pgwalk.h b/include/asm-ppc64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-ppc64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PPC64_PGWALK_H
+#define _PPC64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PPC64_PGWALK_H */
diff -Nru a/include/asm-s390/pgwalk.h b/include/asm-s390/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-s390/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _S390_PGWALK_H
+#define _S390_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _S390_PGWALK_H */
diff -Nru a/include/asm-sh/pgwalk.h b/include/asm-sh/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sh/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SH_PGWALK_H
+#define _SH_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SH_PGWALK_H */
diff -Nru a/include/asm-sh64/pgwalk.h b/include/asm-sh64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sh64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SH64_PGWALK_H
+#define _SH64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SH64_PGWALK_H */
diff -Nru a/include/asm-sparc/pgwalk.h b/include/asm-sparc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sparc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SPARC_PGWALK_H
+#define _SPARC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SPARC_PGWALK_H */
diff -Nru a/include/asm-sparc64/pgalloc.h b/include/asm-sparc64/pgalloc.h
--- a/include/asm-sparc64/pgalloc.h	2004-08-10 23:44:47 -07:00
+++ b/include/asm-sparc64/pgalloc.h	2004-08-10 23:44:47 -07:00
@@ -93,6 +93,8 @@
 
 static __inline__ void free_pgd_fast(pgd_t *pgd)
 {
+	virt_to_page(pgd)->index = 0UL;
+
 	preempt_disable();
 	*(unsigned long *)pgd = (unsigned long) pgd_quicklist;
 	pgd_quicklist = (unsigned long *) pgd;
@@ -113,8 +115,10 @@
 	} else {
 		preempt_enable();
 		ret = (unsigned long *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-		if(ret)
+		if (ret) {
 			memset(ret, 0, PAGE_SIZE);
+			virt_to_page(ret)->index = 0UL;
+		}
 	}
 	return (pgd_t *)ret;
 }
@@ -162,8 +166,10 @@
 	pmd = pmd_alloc_one_fast(mm, address);
 	if (!pmd) {
 		pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-		if (pmd)
+		if (pmd) {
 			memset(pmd, 0, PAGE_SIZE);
+			virt_to_page(pmd)->index = 0UL;
+		}
 	}
 	return pmd;
 }
@@ -171,6 +177,8 @@
 static __inline__ void free_pmd_fast(pmd_t *pmd)
 {
 	unsigned long color = DCACHE_COLOR((unsigned long)pmd);
+
+	virt_to_page(pmd)->index = 0UL;
 
 	preempt_disable();
 	*(unsigned long *)pmd = (unsigned long) pte_quicklist[color];
diff -Nru a/include/asm-sparc64/pgtable.h b/include/asm-sparc64/pgtable.h
--- a/include/asm-sparc64/pgtable.h	2004-08-10 23:44:47 -07:00
+++ b/include/asm-sparc64/pgtable.h	2004-08-10 23:44:47 -07:00
@@ -259,10 +259,34 @@
 
 	return __pte;
 }
-#define pmd_set(pmdp, ptep)	\
-	(pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL))
-#define pgd_set(pgdp, pmdp)	\
-	(pgd_val(*(pgdp)) = (__pa((unsigned long) (pmdp)) >> 11UL))
+
+#define PGTABLE_BIT_SHIFT	(PAGE_SHIFT - 6)
+#define PGTABLE_BIT_MASK	((1UL << PGTABLE_BIT_SHIFT) - 1)
+#define PGTABLE_BIT_REGION	(1UL << PGTABLE_BIT_SHIFT)
+#define PGTABLE_BIT(ptr) \
+	(1UL << (((unsigned long)(ptr) & ~PAGE_MASK) >> PGTABLE_BIT_SHIFT))
+#define __PGTABLE_REGION_NEXT(ptr,type) \
+	((type *)(((unsigned long)(ptr) + PGTABLE_BIT_REGION) & \
+		  ~PGTABLE_BIT_MASK))
+#define PMD_REGION_NEXT(pmdp) __PGTABLE_REGION_NEXT(pmdp,pmd_t)
+#define PGD_REGION_NEXT(pgdp) __PGTABLE_REGION_NEXT(pgdp,pgd_t)
+
+#define pmd_set(pmdp, ptep) \
+do { \
+	virt_to_page(pmdp)->index |= PGTABLE_BIT(pmdp); \
+	pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL; \
+} while (0)
+#define pmd_set_k(pmdp, ptep) \
+	(pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL)
+
+#define pgd_set(pgdp, pmdp) \
+do { \
+	virt_to_page(pgdp)->index |= PGTABLE_BIT(pgdp); \
+	pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL; \
+} while (0)
+#define pgd_set_k(pgdp, pmdp) \
+	(pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL)
+
 #define __pmd_page(pmd)		\
 	((unsigned long) __va((((unsigned long)pmd_val(pmd))<<11UL)))
 #define pmd_page(pmd) 			virt_to_page((void *)__pmd_page(pmd))
diff -Nru a/include/asm-sparc64/pgwalk.h b/include/asm-sparc64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sparc64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,114 @@
+/* pgwalk.h: UltraSPARC fast page table traversal.
+ *
+ * Copyright 2004 David S. Miller <davem@redhat.com>
+ */
+
+#ifndef _SPARC64_PGWALK_H
+#define _SPARC64_PGWALK_H
+
+#include <linux/mm.h>
+
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+struct pte_walk_state;
+typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *);
+typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *);
+typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *);
+
+struct pte_walk_state {
+	void *_client_state;
+	void *first;
+	void *last;
+};
+
+static inline void *pte_walk_client_state(struct pte_walk_state *walk)
+{
+	return walk->_client_state;
+}
+
+static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work)
+{
+	pte_t *ptep = walk->first;
+	pte_t *last = walk->last;
+
+	do {
+		if (pte_none(*ptep))
+			goto next;
+		pte_work(walk, ptep);
+	next:
+		ptep++;
+	} while (ptep < last);
+}
+
+static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work)
+{
+	pmd_t *page_dir = walk->first;
+	pmd_t *last = walk->last;
+	unsigned long mask;
+
+	mask = virt_to_page(page_dir)->index;
+
+	do {
+		if (likely(!(PGTABLE_BIT(page_dir) & mask))) {
+			page_dir = PMD_REGION_NEXT(page_dir);
+			continue;
+		}
+		if (pmd_none(*page_dir))
+			goto next;
+		if (unlikely(pmd_bad(*page_dir))) {
+			pmd_ERROR(*page_dir);
+			pmd_clear(page_dir);
+			goto next;
+		}
+		pmd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last)
+{
+	walk->_client_state = client_state;
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work)
+{
+	pgd_t *page_dir = walk->first;
+	pgd_t *last = walk->last;
+	unsigned long mask;
+
+	mask = virt_to_page(page_dir)->index;
+
+	do {
+		if (likely(!(PGTABLE_BIT(page_dir) & mask))) {
+			page_dir = PGD_REGION_NEXT(page_dir);
+			continue;
+		}
+		if (pgd_none(*page_dir))
+			goto next;
+		if (unlikely(pgd_bad(*page_dir))) {
+			pgd_ERROR(*page_dir);
+			pgd_clear(page_dir);
+			goto next;
+		}
+		pgd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+#endif /* _SPARC64_PGWALK_H */
diff -Nru a/include/asm-um/pgwalk.h b/include/asm-um/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-um/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _UM_PGWALK_H
+#define _UM_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _UM_PGWALK_H */
diff -Nru a/include/asm-v850/pgwalk.h b/include/asm-v850/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-v850/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _V850_PGWALK_H
+#define _V850_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _V850_PGWALK_H */
diff -Nru a/include/asm-x86_64/pgwalk.h b/include/asm-x86_64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-x86_64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _X86_64_PGWALK_H
+#define _X86_64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _X86_64_PGWALK_H */
diff -Nru a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c	2004-08-10 23:44:47 -07:00
+++ b/mm/memory.c	2004-08-10 23:44:47 -07:00
@@ -52,6 +52,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
+#include <asm/pgwalk.h>
 
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -100,40 +101,25 @@
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void free_one_pmd(struct mmu_gather *tlb, pmd_t * dir)
+static void free_one_pmd(struct pte_walk_state *walk, pmd_t *dir)
 {
 	struct page *page;
 
-	if (pmd_none(*dir))
-		return;
-	if (unlikely(pmd_bad(*dir))) {
-		pmd_ERROR(*dir);
-		pmd_clear(dir);
-		return;
-	}
 	page = pmd_page(*dir);
 	pmd_clear(dir);
 	dec_page_state(nr_page_table_pages);
-	pte_free_tlb(tlb, page);
+	pte_free_tlb(pte_walk_client_state(walk), page);
 }
 
-static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir)
+static void free_one_pgd(struct pte_walk_state *walk, pgd_t *dir)
 {
-	int j;
 	pmd_t * pmd;
 
-	if (pgd_none(*dir))
-		return;
-	if (unlikely(pgd_bad(*dir))) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
-		return;
-	}
 	pmd = pmd_offset(dir, 0);
 	pgd_clear(dir);
-	for (j = 0; j < PTRS_PER_PMD ; j++)
-		free_one_pmd(tlb, pmd+j);
-	pmd_free_tlb(tlb, pmd);
+	pmd_walk_init(walk, pmd, pmd + PTRS_PER_PMD);
+	pmd_walk(walk, free_one_pmd);
+	pmd_free_tlb(pte_walk_client_state(walk), pmd);
 }
 
 /*
@@ -144,13 +130,11 @@
  */
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr)
 {
+	struct pte_walk_state walk;
 	pgd_t * page_dir = tlb->mm->pgd;
 
-	page_dir += first;
-	do {
-		free_one_pgd(tlb, page_dir);
-		page_dir++;
-	} while (--nr);
+	pgd_walk_init(&walk, tlb, page_dir + first, page_dir + first + nr);
+	pgd_walk(&walk, free_one_pgd);
 }
 
 pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: copy_page_range()
  2004-08-11  7:07   ` copy_page_range() David S. Miller
@ 2004-08-11  7:35     ` William Lee Irwin III
  2004-08-11 16:13     ` copy_page_range() Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-11  7:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-arch

On Sat, 7 Aug 2004 01:07:51 -0700 William Lee Irwin III wrote:
>> The number of levels can be abstracted easily. Something to give an
>> idea of how might be something like this:

On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> I hacked up something slightly different today.  I only
> have it being used by clear_page_range() but it is extremely
> effective.
> Things like fork+exit latencies on my 750Mhz sparc64 box went
> from ~490 microseconds to ~367 microseconds.  fork+execve
> latency went down from ~1595 microseconds to ~1351 microseconds.

Nice results!


On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> Two issues:
> 1) I'm not terribly satisfied with the interface.  I think
>    with some improvements it can be applies to the two other
>    with some improvements it can be applied to the two other
>    and unmap_page_range

I think this involves discriminating between walking in tandem
(over instantiated ptes in one and creation in the other),
single walking to instantiate new ptes in an address range, and walking
over instantiated ptes in an address range, possibly a fourth case for
destruction.


On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> 2) I don't think it will collapse well for 2-level page tables,
>    someone take a look?

This is one of the reasons why I wanted to have the struct to put the
handling of levels in the arch bits of the walking. That way, 2-level
pagetables can be done without maintaining the extraneous pointer or
the extra level of calls.


On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> It's easy to toy with the sparc64 optimization on other platforms,
> just add the necessary hacks to pmd_set and pgd_set, allocation
> of pmd and pgd tables, use "PAGE_SHIFT - 5" instead of "PAGE_SHIFT - 6"
> on 32-bit platforms, and then copy the asm-sparc64/pgwalk.h bits over
> into your platform's asm-${ARCH}/pgwalk.h
> I just got also reminded that we walk these damn pagetables completely
> twice every exit, once to unmap the VMAs pte mappings, once again to
> zap the page tables.  It might be fruitful to explore combining
> those two steps, perhaps not.

We really need to be freeing up pagetables during unmapping better,
since they do "leak" a bit. This is causing pain elsewhere (hugetlb).
Once we do that, clear_page_tables() is a nop and all its work is done
while unmapping all the vmas. I vaguely remember some patches
(associated with the shpte efforts) to do something like this having
gone around before, though those were specifically directed at exit()
and not at general munmap().


On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> Anyways, comments and improvement suggestions welcome.  Particularly
> interesting would be if this thing helps a lot on other platforms
> too, such as x86_64, ia64, alpha and ppc64.

I need to play with it a little to see what I can do.


-- wli


* Re: copy_page_range()
  2004-08-11  7:07   ` copy_page_range() David S. Miller
  2004-08-11  7:35     ` copy_page_range() William Lee Irwin III
@ 2004-08-11 16:13     ` Linus Torvalds
  2004-08-11 20:45       ` copy_page_range() David S. Miller
  2004-08-12  3:53       ` copy_page_range() David S. Miller
  1 sibling, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2004-08-11 16:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: William Lee Irwin III, linux-arch



On Wed, 11 Aug 2004, David S. Miller wrote:
> 
> I hacked up something slightly different today.  I only
> have it being used by clear_page_range() but it is extremely
> effective.

Hmm.. I don't see any of this being arch-dependent, so I wonder why you 
did it that way. 

Also, one comment: the page directory "index" is never zeroed, as far as I
can tell.

> Things like fork+exit latencies on my 750Mhz sparc64 box went
> from ~490 microseconds to ~367 microseconds.  fork+execve
> latency went down from ~1595 microseconds to ~1351 microseconds.

That's definitely fascinating, and implies either a bug (hey, who knows?) 
or that page tables are a lot sparser than I'd have expected them to be.

Ahh.. I see. The "clear_page_tables()" interface was really designed for a 
two-level page table. And even there I was lazy in exit_mmap(). 

Yeah, I think clear_page_tables() is broken, and it's increasingly broken 
on three or four-level setups. Your patch really works around the fact 
that we're extremely lazy about tearing down page tables.

Ho humm. Maybe it's the right way to go, but I have to say that I would
_really_ prefer to make this generic. There's absolutely nothing
architecture-specific anywhere there except for the place where you hide
the bitmap ("page->index" depends on a pgd/pmd being one page).

I hate "asm-generic" if it's just hiding the fact that it really _is_ 
generic, but people wanted macros.

David, could you look at instead of doing this <asm/page-walk.c> thing, 
just do a few _trivial_ macros in the asm page table headers:

	/* We use a bitmap in the pmd page to mark things busy,
	 * where we reduce the pmd index into 64 bits
	 */
	#define PMD_BITMAP_SHIFT (PAGE_SHIFT-6)
	#define pmd_usage_bitmap(pmd)	(virt_to_page(pmd)->index)

	#define PGD_BITMAP_SHIFT (PAGE_SHIFT-6)
	#define pgd_usage_bitmap(pgd)	(virt_to_page(pgd)->index)

and then the two-level folding can be done in the generic code by not 
defining the PGD_BITMAP_SHIFT at all or something like that, ie the 
generic code would have exactly _one_ #ifdef:

	clear_pgd_tables(...)
	{
		do {
	#ifdef PGD_BITMAP_SHIFT
		if (!(pgd_usage_bitmap(pgd) & mask))
			continue;
	#endif
		...
		} while (pgd < end)
	}

(you get the idea).

What do you think? Actually - make the same #ifdef in the pmd case too,
since that allows architectures that don't fold things but don't have any
good _room_ to hide the bitmap to also just not do this.

Hmm?

		Linus


* Re: copy_page_range()
  2004-08-11 16:13     ` copy_page_range() Linus Torvalds
@ 2004-08-11 20:45       ` David S. Miller
  2004-08-12  3:53       ` copy_page_range() David S. Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-11 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 09:13:36 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> Hmm.. I don't see any of this being arch-dependent, so I wonder why you 
> did it that way. 

I'm trying to achieve two goals.  The first I've demonstrated is
achievable, the second is still not fully grasped yet.

Firstly, I wanted to get clear_page_tables() out of my profiles.
Secondly, I wanted to abstract out completely the page table
traversing the generic kernel does.

I want the latter so I can experiment with different data structures
for page tables, and the current pgd/pmd/pte array assumptions in
the kernel generic vm code disallow any kind of tinkering in that
area.

If we end up with an interface that says: "walk page tables for vaddr
range 'start' to 'end', and do func() for each pte" then anything can
be experimented with.

You're absolutely right, and I've mentioned this earlier in this thread,
that the current page tables are way too sparse.  On 64-bit a simple
hello world program with a 3-level page table looks roughly like:

PGD_BASE:
   ...
	X --> PMD_BASE1
		...
	              Y --> PTE_BASE1
			... some ptes ...
	        ...
	Z --> PMD_BASE2
	        ...
		      A --> PTE_BASE2
			... some ptes ...
	        ...
	B --> PMD_BASE3
		...
		      C --> PTE_BASE3
			... some ptes ...
		...
  ...

The X-->Y branch is for the program text.
The Z-->A branch is for the dynamic mmap() area (shared libraries,
anonymous mmaps, etc.)
The B-->C branch is for the program stack.

We've got maybe 10 to 20 present pte's in this tree.

On sparc64 pgd_t and pmd_t are both 32-bit (this is in order to
encode the most address space possible, we can encode the full
physical address by simply shifting out the page offset bits)
So each pgd_t table holds 2048 entries as does each pmd_t table.
Therefore, in the above example during clear_page_tables() we'd
scan 2048 pgd's, 3 * 2048 pmd's and 3 * 1024 pte's.

That's 7 * 8192 (PAGE_SIZE) bytes worth of pointer derefing.
It's no wonder this shows up in the profiles.  All of that just
for 10 to 20 actual user mappings.  This is broken.

I want to try and use a less sparse data structure on sparc
just for the pgd/pmd level, and use pages of ptes for the pte_t
level as those tend to be well populated.  I also need to retain
the pte_t level as a full page due to the virtual linear page table
stuff I do to speed up TLB miss processing (roughly the same as
what ia64 does).

I can't experiment with all the generic code assuming these things
are arrays.


* Re: copy_page_range()
  2004-08-11 16:13     ` copy_page_range() Linus Torvalds
  2004-08-11 20:45       ` copy_page_range() David S. Miller
@ 2004-08-12  3:53       ` David S. Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David S. Miller @ 2004-08-12  3:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 09:13:36 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> Also, one comment: the page directory "index" is never zeroed, as far as I
> can tell.

It's done in the pmd/pgd freeing methods.

 static __inline__ void free_pgd_fast(pgd_t *pgd)
 {
+	virt_to_page(pgd)->index = 0UL;
+
 ...

 static __inline__ void free_pmd_fast(pmd_t *pmd)
 {
 	unsigned long color = DCACHE_COLOR((unsigned long)pmd);
+
+	virt_to_page(pmd)->index = 0UL;

and also at pmd/pgd allocation time.

> David, could you look at instead of doing this <asm/page-walk.c> thing, 
> just do a few _trivial_ macros in the asm page table headers:

I assume you mean asm/page-walk.h, and sure I'll whip something up.

But please keep in mind what I said in my other email, that
I really want to (in the end) abstract away all page table
walking, so that the only thing the generic VM code really
plays around with are pte's.  All page table traversal goes
through an interface, so platforms can use whatever data structure
(ie. something that isn't a flat out array) they want.


