linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] optimise copy page range
@ 2005-02-17 13:53 Nick Piggin
  2005-02-17 14:03 ` [PATCH 2/2] page table iterators Nick Piggin
  0 siblings, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-17 13:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Andrew Morton, Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 100 bytes --]

Some of you have seen this before. Just resending because
I based my next patch on top of this one.

[-- Attachment #2: mm-opt-cpr.patch --]
[-- Type: text/plain, Size: 1397 bytes --]



Suggested by Linus: optimise a condition in the clear_p?d_range functions.
This results in one less conditional branch on i386 with gcc-3.4.4.
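
To see the transformation in isolation (the mask below is only an
example value, not i386's real PMD layout): OR-ing the two addresses
folds both alignment checks into a single test, so the compiler can
emit one conditional branch instead of two.

#define EXAMPLE_MASK	(~((1UL << 21) - 1))	/* illustrative mask only */

/* before: two tests, typically two conditional branches */
static inline int range_is_full_before(unsigned long start, unsigned long end)
{
	return !(start & ~EXAMPLE_MASK) && !(end & ~EXAMPLE_MASK);
}

/* after: one combined test */
static inline int range_is_full_after(unsigned long start, unsigned long end)
{
	return !((start | end) & ~EXAMPLE_MASK);
}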

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/mm/memory.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -puN mm/memory.c~mm-opt-cpr mm/memory.c
--- linux-2.6/mm/memory.c~mm-opt-cpr	2005-02-16 13:45:11.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2005-02-16 13:45:11.000000000 +1100
@@ -98,7 +98,8 @@ static inline void clear_pmd_range(struc
 		pmd_clear(pmd);
 		return;
 	}
-	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK)) {
+	if (!((start | end) & ~PMD_MASK)) {
+		/* Only clear full, aligned ranges */
 		page = pmd_page(*pmd);
 		pmd_clear(pmd);
 		dec_page_state(nr_page_table_pages);
@@ -131,7 +132,8 @@ static inline void clear_pud_range(struc
 		addr = next;
 	} while (addr && (addr < end));
 
-	if (!(start & ~PUD_MASK) && !(end & ~PUD_MASK)) {
+	if (!((start | end) & ~PUD_MASK)) {
+		/* Only clear full, aligned ranges */
 		pud_clear(pud);
 		pmd_free_tlb(tlb, __pmd);
 	}
@@ -162,7 +164,8 @@ static inline void clear_pgd_range(struc
 		addr = next;
 	} while (addr && (addr < end));
 
-	if (!(start & ~PGDIR_MASK) && !(end & ~PGDIR_MASK)) {
+	if (!((start | end) & ~PGDIR_MASK)) {
+		/* Only clear full, aligned ranges */
 		pgd_clear(pgd);
 		pud_free_tlb(tlb, __pud);
 	}

_


* [PATCH 2/2] page table iterators
  2005-02-17 13:53 [PATCH 1/2] optimise copy page range Nick Piggin
@ 2005-02-17 14:03 ` Nick Piggin
  2005-02-17 15:56   ` Linus Torvalds
  2005-02-17 19:43   ` Andi Kleen
  0 siblings, 2 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-17 14:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Andrew Morton, Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1476 bytes --]

I am pretty surprised myself that I was able to consolidate
all "page table range" functions into a single type of iterator
(well, there are a couple of variations, but it's not too bad).

I thought at least the functions which allocate new page tables
would have to be separate from those which don't... it turns
out that if you slightly change the implementation of the
allocating functions, they start walking, quacking, etc. like
the other type.
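
Roughly, the trick is to split "allocate and map" into an allocation
step that just guarantees the lower-level table exists, so the walk
itself can use the same iterator as the non-allocating paths. As a
rough sketch only (ensure_pte_table() is a made-up name standing in
for the pte_alloc() helper the patch adds; this is not code from the
patch):

static int zero_fill_pmd(struct mm_struct *mm, pmd_t *pmd,
			 unsigned long start, unsigned long end,
			 pgprot_t prot)
{
	unsigned long addr;
	pte_t *pte;

	/* allocation step: make sure the pte page exists */
	if (!ensure_pte_table(mm, pmd, start))
		return -ENOMEM;

	/* ...after which the walk looks like every other walker */
	for_each_pte_map(pmd, start, end, pte, addr) {
		set_pte(pte, pte_wrprotect(mk_pte(ZERO_PAGE(addr), prot)));
	} for_each_pte_map_end;

	return 0;
}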

This may also open the way for things like:

#define for_each_pud(pgd, start, end, pud, pud_start, pud_end) \
if (pud = (pud_t *)pgd, pud_start = start, pud_end = end, 1)

for 2- and 3-level page tables (don't laugh if I've messed up the
above completely - you get the idea).
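
As a sketch of the intent only (the name and the run-once loop below
are made up for illustration, they are not in this patch), a folded
definition could keep normal loop semantics while letting the compiler
drop the level entirely:

/* folded-level sketch: treat the pgd entry itself as the single pud,
 * covering the whole [start, end) range in one pass */
#define for_each_pud_folded(pgd, start, end, pud, pud_start, pud_end)	\
	for (pud = (pud_t *)(pgd), pud_start = (start), pud_end = (end);\
	     pud != NULL;						\
	     pud = NULL)

A caller like clear_pgd_range then stays level-agnostic: continue still
skips the (single) pud, and the optimiser can discard the loop.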

Then the last bit of the performance puzzle I think will just come
from inlining things. Now Andi doesn't want to do that (probably
rightly so)... what about compiling mm/ with -funit-at-a-time? It
doesn't do much harm to the stack, and it shaves off quite a few
KB...
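
For what it's worth, the mm/ change would be just a one-line Makefile
tweak, something along these lines (the exact kbuild idiom may differ;
this is only a sketch):

# mm/Makefile
EXTRA_CFLAGS += -funit-at-a-time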

Anyway, this iterator patch shaves off a bit itself. I haven't
looked at generated code, but I hope it is alright.

npiggin@didi:~/usr/src/linux-2.6$ size mm/memory.o.before
    text    data     bss     dec     hex filename
   11349       4     120   11473    2cd1 mm/memory.o
npiggin@didi:~/usr/src/linux-2.6$ size mm/memory.o.after
    text    data     bss     dec     hex filename
   11221       4     120   11345    2c51 mm/memory.o

Suggestions, help, etc. welcome.

Nick

[-- Attachment #2: vm-pgt-walkers.patch --]
[-- Type: text/plain, Size: 50084 bytes --]




---

 drivers/char/mem.c                              |    0 
 linux-2.6-npiggin/arch/i386/mm/ioremap.c        |   78 +--
 linux-2.6-npiggin/include/asm-generic/pgtable.h |  128 +++++
 linux-2.6-npiggin/mm/memory.c                   |  591 ++++++++++--------------
 linux-2.6-npiggin/mm/mprotect.c                 |  118 +---
 linux-2.6-npiggin/mm/msync.c                    |  159 ++----
 linux-2.6-npiggin/mm/vmalloc.c                  |  222 +++------
 7 files changed, 600 insertions(+), 696 deletions(-)

diff -puN include/asm-generic/pgtable.h~vm-pgt-walkers include/asm-generic/pgtable.h
--- linux-2.6/include/asm-generic/pgtable.h~vm-pgt-walkers	2005-02-17 23:59:16.000000000 +1100
+++ linux-2.6-npiggin/include/asm-generic/pgtable.h	2005-02-18 00:50:19.000000000 +1100
@@ -134,4 +134,132 @@ static inline void ptep_mkdirty(pte_t *p
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif
 
+/*
+ * for_each_pgd - iterates through pgd entries in a given mm struct
+ *
+ * @mm: the mm to use
+ * @start: the first address, inclusive
+ * @end: the last address, exclusive
+ * @pgd: the pgd iterator
+ * @pgd_start: the first address within the current 'pgd'
+ * @pgd_end: the last address within the current 'pgd'
+ *
+ * mm, start, end are all unchanged
+ * pgd, pgd_start, pgd_end may all be changed
+ */
+#define for_each_pgd(mm, start, end, pgd, pgd_start, pgd_end)		\
+	for (	pgd = pgd_offset(mm, start),				\
+		  pgd_start = start;					\
+		pgd_end = (pgd_start + PGDIR_SIZE) & PGDIR_MASK,	\
+		  pgd_end = ((pgd_end && pgd_end <= end) ? pgd_end : end), \
+		  pgd <= pgd_offset(mm, end-1);				\
+		pgd_start = pgd_end,					\
+		  pgd++ )
+
+/*
+ * for_each_pgd_k - iterates through pgd entries in the kernel mapping
+ *
+ * see for_each_pgd
+ */
+#define for_each_pgd_k(start, end, pgd, pgd_start, pgd_end)		\
+	for (	pgd = pgd_offset_k(start),				\
+		  pgd_start = start;					\
+		pgd_end = (pgd_start + PGDIR_SIZE) & PGDIR_MASK,	\
+		  pgd_end = ((pgd_end && pgd_end <= end) ? pgd_end : end), \
+		  pgd <= pgd_offset_k(end-1);				\
+		pgd_start = pgd_end,					\
+		  pgd++ )
+
+/*
+ * for_each_pud - iterate through pud entries in a given pgd
+ *
+ * see for_each_pgd
+ */
+#define for_each_pud(pgd, start, end, pud, pud_start, pud_end)		\
+	for (	pud = pud_offset(pgd, start),				\
+		  pud_start = start;					\
+		pud_end = (pud_start + PUD_SIZE) & PUD_MASK,		\
+		  pud_end = ((pud_end && pud_end <= end) ? pud_end : end), \
+		  pud <= pud_offset(pgd, end-1);			\
+		pud_start = pud_end,					\
+		  pud++ )
+
+/*
+ * for_each_pmd - iterate through pmd entries in a given pud
+ *
+ * see for_each_pgd
+ */
+#define for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end)		\
+	for (	pmd = pmd_offset(pud, start),				\
+		  pmd_start = start;					\
+		pmd_end = (pmd_start + PMD_SIZE) & PMD_MASK,		\
+		  pmd_end = ((pmd_end && pmd_end <= end) ? pmd_end : end), \
+		  pmd <= pmd_offset(pud, end-1);			\
+		pmd_start = pmd_end,					\
+		  pmd++ )
+
+/*
+ * for_each_pte_map - iterate through pte entries in a given pmd
+ *
+ * @pmd: the pmd to use
+ * @start: the first address, inclusive
+ * @end: the last address, exclusive
+ * @pte: the pte iterator
+ * @addr: the address of the current 'pte'
+ *
+ * for_each_pte_map maps the ptes which it iterates over.
+ *
+ * Usage:
+ * for_each_pte_map(pmd, start, end, pte, addr) {
+ * 	// do something with pte and/or addr
+ * } for_each_pte_map_end;
+ */
+#define for_each_pte_map(pmd, start, end, pte, addr) 			\
+do {									\
+	int ___i = (end - start) >> PAGE_SHIFT;				\
+	pte_t *___p = pte_offset_map(pmd, start);			\
+	pte = ___p;							\
+	for (	addr = start;						\
+		___i--;							\
+		addr += PAGE_SIZE, pte++)
+
+#define for_each_pte_map_end			 			\
+	pte_unmap(___p);						\
+} while (0)
+
+/*
+ * for_each_pte_map_nested
+ *
+ * See for_each_pte_map. Does a nested mapping of the pte.
+ */
+#define for_each_pte_map_nested(pmd, start, end, pte, addr) 		\
+do {									\
+	int ___i = (end - start) >> PAGE_SHIFT;				\
+	pte_t *___p = pte_offset_map_nested(pmd, start);		\
+	pte = ___p;							\
+	for (	addr = start;						\
+		___i--;							\
+		addr += PAGE_SIZE, pte++)
+
+#define for_each_pte_map_nested_end		 			\
+	pte_unmap_nested(___p);						\
+} while (0)
+
+/*
+ * for_each_pte_kernel
+ *
+ * See for_each_pte_map. Iterates over kernel ptes.
+ */
+#define for_each_pte_kernel(pmd, start, end, pte, addr) 		\
+do {									\
+	int ___i = (end - start) >> PAGE_SHIFT;				\
+	pte_t *___p = pte_offset_kernel(pmd, start);			\
+	pte = ___p;							\
+	for (	addr = start;						\
+		___i--;							\
+		addr += PAGE_SIZE, pte++)
+
+#define for_each_pte_kernel_end			 			\
+} while (0)
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff -puN mm/memory.c~vm-pgt-walkers mm/memory.c
--- linux-2.6/mm/memory.c~vm-pgt-walkers	2005-02-17 23:59:16.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2005-02-18 00:27:52.000000000 +1100
@@ -89,18 +89,9 @@ EXPORT_SYMBOL(vmalloc_earlyreserve);
  */
 static inline void clear_pmd_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long start, unsigned long end)
 {
-	struct page *page;
-
-	if (pmd_none(*pmd))
-		return;
-	if (unlikely(pmd_bad(*pmd))) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
 	if (!((start | end) & ~PMD_MASK)) {
 		/* Only clear full, aligned ranges */
-		page = pmd_page(*pmd);
+		struct page *page = pmd_page(*pmd);
 		pmd_clear(pmd);
 		dec_page_state(nr_page_table_pages);
 		tlb->mm->nr_ptes--;
@@ -110,64 +101,50 @@ static inline void clear_pmd_range(struc
 
 static inline void clear_pud_range(struct mmu_gather *tlb, pud_t *pud, unsigned long start, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pmd_t *pmd, *__pmd;
+	unsigned long pmd_start, pmd_end;
+	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return;
-	if (unlikely(pud_bad(*pud))) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (pmd_none(*pmd))
+			continue;
+		if (unlikely(pmd_bad(*pmd))) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
 
-	pmd = __pmd = pmd_offset(pud, start);
-	do {
-		next = (addr + PMD_SIZE) & PMD_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
-		clear_pmd_range(tlb, pmd, addr, next);
-		pmd++;
-		addr = next;
-	} while (addr && (addr < end));
+		clear_pmd_range(tlb, pmd, pmd_start, pmd_end);
+	}
 
 	if (!((start | end) & ~PUD_MASK)) {
 		/* Only clear full, aligned ranges */
 		pud_clear(pud);
-		pmd_free_tlb(tlb, __pmd);
+		pmd_free_tlb(tlb, pmd_offset(pud, start));
 	}
 }
 
 
 static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pud_t *pud, *__pud;
+	unsigned long pud_start, pud_end;
+	pud_t *pud;
 
-	if (pgd_none(*pgd))
-		return;
-	if (unlikely(pgd_bad(*pgd))) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (pud_none(*pud))
+			continue;
+		if (unlikely(pud_bad(*pud))) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
 
-	pud = __pud = pud_offset(pgd, start);
-	do {
-		next = (addr + PUD_SIZE) & PUD_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
-		clear_pud_range(tlb, pud, addr, next);
-		pud++;
-		addr = next;
-	} while (addr && (addr < end));
+		clear_pud_range(tlb, pud, pud_start, pud_end);
+	}
 
 	if (!((start | end) & ~PGDIR_MASK)) {
 		/* Only clear full, aligned ranges */
 		pgd_clear(pgd);
-		pud_free_tlb(tlb, __pud);
+		pud_free_tlb(tlb, pud_offset(pgd, start));
 	}
 }
 
@@ -178,45 +155,54 @@ static inline void clear_pgd_range(struc
  */
 void clear_page_range(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pgd_t * pgd = pgd_offset(tlb->mm, start);
-	unsigned long i;
-
-	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
-		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
-		clear_pgd_range(tlb, pgd, addr, next);
-		pgd++;
-		addr = next;
+	unsigned long pgd_start, pgd_end;
+	pgd_t * pgd;
+
+	for_each_pgd(tlb->mm, start, end, pgd, pgd_start, pgd_end) {
+		if (pgd_none(*pgd))
+			continue;
+		if (unlikely(pgd_bad(*pgd))) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+
+		clear_pgd_range(tlb, pgd, pgd_start, pgd_end);
 	}
 }
 
-pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+static int pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
-	if (!pmd_present(*pmd)) {
-		struct page *new;
+	struct page *new;
 
-		spin_unlock(&mm->page_table_lock);
-		new = pte_alloc_one(mm, address);
-		spin_lock(&mm->page_table_lock);
-		if (!new)
-			return NULL;
-		/*
-		 * Because we dropped the lock, we should re-check the
-		 * entry, as somebody else could have populated it..
-		 */
-		if (pmd_present(*pmd)) {
-			pte_free(new);
-			goto out;
-		}
-		mm->nr_ptes++;
-		inc_page_state(nr_page_table_pages);
-		pmd_populate(mm, pmd, new);
+	if (pmd_present(*pmd))
+		return 1;
+
+	spin_unlock(&mm->page_table_lock);
+	new = pte_alloc_one(mm, address);
+	spin_lock(&mm->page_table_lock);
+	if (!new)
+		return 0;
+	/*
+	 * Because we dropped the lock, we should re-check the
+	 * entry, as somebody else could have populated it..
+	 */
+	if (pmd_present(*pmd)) {
+		pte_free(new);
+		return 1;
 	}
-out:
-	return pte_offset_map(pmd, address);
+	mm->nr_ptes++;
+	inc_page_state(nr_page_table_pages);
+	pmd_populate(mm, pmd, new);
+
+	return 1;
+}
+
+pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+{
+	if (pte_alloc(mm, pmd, address))
+		return pte_offset_map(pmd, address);
+	return NULL;
 }
 
 pte_t fastcall * pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
@@ -322,90 +308,91 @@ copy_one_pte(struct mm_struct *dst_mm,  
 
 static int copy_pte_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
 		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long start, unsigned long end)
 {
+	unsigned long address;
 	pte_t *src_pte, *dst_pte;
-	pte_t *s, *d;
+	pte_t *d;
 	unsigned long vm_flags = vma->vm_flags;
 
-	d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
+	d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, start);
 	if (!dst_pte)
 		return -ENOMEM;
 
 	spin_lock(&src_mm->page_table_lock);
-	s = src_pte = pte_offset_map_nested(src_pmd, addr);
-	for (; addr < end; addr += PAGE_SIZE, s++, d++) {
-		if (pte_none(*s))
-			continue;
-		copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
-	}
-	pte_unmap_nested(src_pte);
-	pte_unmap(dst_pte);
+	for_each_pte_map_nested(src_pmd, start, end, src_pte, address) {
+		if (pte_none(*src_pte))
+			goto next_pte;
+		copy_one_pte(dst_mm, src_mm, d, src_pte, vm_flags, address);
+
+next_pte:
+		d++;
+	} for_each_pte_map_nested_end;
 	spin_unlock(&src_mm->page_table_lock);
+
+	pte_unmap(dst_pte);
 	cond_resched_lock(&dst_mm->page_table_lock);
 	return 0;
 }
 
 static int copy_pmd_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
 		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long start, unsigned long end)
 {
+	unsigned long pmd_start, pmd_end;
 	pmd_t *src_pmd, *dst_pmd;
 	int err = 0;
-	unsigned long next;
 
-	src_pmd = pmd_offset(src_pud, addr);
-	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, start);
 	if (!dst_pmd)
 		return -ENOMEM;
 
-	for (; addr < end; addr = next, src_pmd++, dst_pmd++) {
-		next = (addr + PMD_SIZE) & PMD_MASK;
-		if (next > end)
-			next = end;
+	for_each_pmd(src_pud, start, end, src_pmd, pmd_start, pmd_end) {
 		if (pmd_none(*src_pmd))
-			continue;
+			goto next_pmd;
 		if (pmd_bad(*src_pmd)) {
 			pmd_ERROR(*src_pmd);
 			pmd_clear(src_pmd);
-			continue;
+			goto next_pmd;
 		}
-		err = copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-							vma, addr, next);
-		if (err)
+		err = copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, vma,
+						pmd_start, pmd_end);
+		if (unlikely(err))
 			break;
+
+next_pmd:
+		dst_pmd++;
 	}
 	return err;
 }
 
 static int copy_pud_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
 		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long start, unsigned long end)
 {
+	unsigned long pud_start, pud_end;
 	pud_t *src_pud, *dst_pud;
 	int err = 0;
-	unsigned long next;
 
-	src_pud = pud_offset(src_pgd, addr);
-	dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+	dst_pud = pud_alloc(dst_mm, dst_pgd, start);
 	if (!dst_pud)
 		return -ENOMEM;
 
-	for (; addr < end; addr = next, src_pud++, dst_pud++) {
-		next = (addr + PUD_SIZE) & PUD_MASK;
-		if (next > end)
-			next = end;
+	for_each_pud(src_pgd, start, end, src_pud, pud_start, pud_end) {
 		if (pud_none(*src_pud))
-			continue;
+			goto next_pud;
 		if (pud_bad(*src_pud)) {
 			pud_ERROR(*src_pud);
 			pud_clear(src_pud);
-			continue;
+			goto next_pud;
 		}
-		err = copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-							vma, addr, next);
-		if (err)
+		err = copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, vma,
+						pud_start, pud_end);
+		if (unlikely(err))
 			break;
+
+next_pud:
+		dst_pud++;
 	}
 	return err;
 }
@@ -413,23 +400,19 @@ static int copy_pud_range(struct mm_stru
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 		struct vm_area_struct *vma)
 {
+	unsigned long pgd_start, pgd_end;
+	unsigned long start, end;
 	pgd_t *src_pgd, *dst_pgd;
-	unsigned long addr, start, end, next;
 	int err = 0;
 
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst, src, vma);
 
 	start = vma->vm_start;
-	src_pgd = pgd_offset(src, start);
+	end = vma->vm_end;
 	dst_pgd = pgd_offset(dst, start);
 
-	end = vma->vm_end;
-	addr = start;
-	while (addr && (addr < end-1)) {
-		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= addr)
-			next = end;
+	for_each_pgd(src, start, end, src_pgd, pgd_start, pgd_end) {
 		if (pgd_none(*src_pgd))
 			goto next_pgd;
 		if (pgd_bad(*src_pgd)) {
@@ -437,42 +420,27 @@ int copy_page_range(struct mm_struct *ds
 			pgd_clear(src_pgd);
 			goto next_pgd;
 		}
-		err = copy_pud_range(dst, src, dst_pgd, src_pgd,
-							vma, addr, next);
-		if (err)
+
+		err = copy_pud_range(dst, src, dst_pgd, src_pgd, vma,
+						pgd_start, pgd_end);
+		if (unlikely(err))
 			break;
 
 next_pgd:
-		src_pgd++;
 		dst_pgd++;
-		addr = next;
 	}
 
 	return err;
 }
 
 static void zap_pte_range(struct mmu_gather *tlb,
-		pmd_t *pmd, unsigned long address,
-		unsigned long size, struct zap_details *details)
+		pmd_t *pmd, unsigned long start,
+		unsigned long end, struct zap_details *details)
 {
-	unsigned long offset;
+	unsigned long address;
 	pte_t *ptep;
 
-	if (pmd_none(*pmd))
-		return;
-	if (unlikely(pmd_bad(*pmd))) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-	ptep = pte_offset_map(pmd, address);
-	offset = address & ~PMD_MASK;
-	if (offset + size > PMD_SIZE)
-		size = PMD_SIZE - offset;
-	size &= PAGE_MASK;
-	if (details && !details->check_mapping && !details->nonlinear_vma)
-		details = NULL;
-	for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
+	for_each_pte_map(pmd, start, end, ptep, address) {
 		pte_t pte = *ptep;
 		if (pte_none(pte))
 			continue;
@@ -503,12 +471,12 @@ static void zap_pte_range(struct mmu_gat
 					continue;
 			}
 			pte = ptep_get_and_clear(ptep);
-			tlb_remove_tlb_entry(tlb, ptep, address+offset);
+			tlb_remove_tlb_entry(tlb, ptep, address);
 			if (unlikely(!page))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
-					address+offset) != page->index)
+					address) != page->index)
 				set_pte(ptep, pgoff_to_pte(page->index));
 			if (pte_dirty(pte))
 				set_page_dirty(page);
@@ -530,74 +498,71 @@ static void zap_pte_range(struct mmu_gat
 		if (!pte_file(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear(ptep);
-	}
-	pte_unmap(ptep-1);
+	} for_each_pte_map_end;
 }
 
 static void zap_pmd_range(struct mmu_gather *tlb,
-		pud_t *pud, unsigned long address,
-		unsigned long size, struct zap_details *details)
+		pud_t *pud, unsigned long start,
+		unsigned long end, struct zap_details *details)
 {
+	unsigned long pmd_start, pmd_end;
 	pmd_t * pmd;
-	unsigned long end;
 
-	if (pud_none(*pud))
-		return;
-	if (unlikely(pud_bad(*pud))) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (pmd_none(*pmd))
+			continue;
+		if (unlikely(pmd_bad(*pmd))) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+
+		zap_pte_range(tlb, pmd, pmd_start, pmd_end, details);
 	}
-	pmd = pmd_offset(pud, address);
-	end = address + size;
-	if (end > ((address + PUD_SIZE) & PUD_MASK))
-		end = ((address + PUD_SIZE) & PUD_MASK);
-	do {
-		zap_pte_range(tlb, pmd, address, end - address, details);
-		address = (address + PMD_SIZE) & PMD_MASK; 
-		pmd++;
-	} while (address && (address < end));
 }
 
 static void zap_pud_range(struct mmu_gather *tlb,
-		pgd_t * pgd, unsigned long address,
+		pgd_t * pgd, unsigned long start,
 		unsigned long end, struct zap_details *details)
 {
+	unsigned long pud_start, pud_end;
 	pud_t * pud;
 
-	if (pgd_none(*pgd))
-		return;
-	if (unlikely(pgd_bad(*pgd))) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (pud_none(*pud))
+			continue;
+		if (unlikely(pud_bad(*pud))) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+
+		zap_pmd_range(tlb, pud, pud_start, pud_end, details);
 	}
-	pud = pud_offset(pgd, address);
-	do {
-		zap_pmd_range(tlb, pud, address, end - address, details);
-		address = (address + PUD_SIZE) & PUD_MASK; 
-		pud++;
-	} while (address && (address < end));
 }
 
 static void unmap_page_range(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, unsigned long address,
+		struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, struct zap_details *details)
 {
-	unsigned long next;
+	unsigned long pgd_start, pgd_end;
 	pgd_t *pgd;
-	int i;
 
-	BUG_ON(address >= end);
-	pgd = pgd_offset(vma->vm_mm, address);
+	BUG_ON(start >= end);
+	if (details && !details->check_mapping && !details->nonlinear_vma)
+		details = NULL;
+
 	tlb_start_vma(tlb, vma);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		zap_pud_range(tlb, pgd, address, next, details);
-		address = next;
-		pgd++;
+	for_each_pgd(vma->vm_mm, start, end, pgd, pgd_start, pgd_end) {
+		if (pgd_none(*pgd))
+			continue;
+		if (unlikely(pgd_bad(*pgd))) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+
+		zap_pud_range(tlb, pgd, pgd_start, pgd_end, details);
 	}
 	tlb_end_vma(tlb, vma);
 }
@@ -987,108 +952,78 @@ out:
 
 EXPORT_SYMBOL(get_user_pages);
 
-static void zeromap_pte_range(pte_t * pte, unsigned long address,
-                                     unsigned long size, pgprot_t prot)
+static void zeromap_pte_range(pmd_t * pmd, unsigned long start,
+                                     unsigned long end, pgprot_t prot)
 {
-	unsigned long end;
+	unsigned long addr;
+	pte_t *pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-	do {
-		pte_t zero_pte = pte_wrprotect(mk_pte(ZERO_PAGE(address), prot));
+	for_each_pte_map(pmd, start, end, pte, addr) {
+		pte_t zero_pte = pte_wrprotect(mk_pte(ZERO_PAGE(addr), prot));
 		BUG_ON(!pte_none(*pte));
 		set_pte(pte, zero_pte);
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
+	} for_each_pte_map_end;
 }
 
-static inline int zeromap_pmd_range(struct mm_struct *mm, pmd_t * pmd,
-		unsigned long address, unsigned long size, pgprot_t prot)
+static inline int zeromap_pmd_range(struct mm_struct *mm, pud_t * pud,
+			unsigned long start, unsigned long end, pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long pmd_start, pmd_end;
+	pmd_t * pmd;
 
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-	do {
-		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
-		if (!pte)
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (!pte_alloc(mm, pmd, pmd_start))
 			return -ENOMEM;
-		zeromap_pte_range(pte, base + address, end - address, prot);
-		pte_unmap(pte);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		zeromap_pte_range(pmd, start, end, prot);
+	}
 	return 0;
 }
 
-static inline int zeromap_pud_range(struct mm_struct *mm, pud_t * pud,
-				    unsigned long address,
-                                    unsigned long size, pgprot_t prot)
+static inline int zeromap_pud_range(struct mm_struct *mm, pgd_t * pgd,
+					unsigned long start, unsigned long end,
+					pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long pud_start, pud_end;
+	pud_t * pud;
 	int error = 0;
 
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	do {
-		pmd_t * pmd = pmd_alloc(mm, pud, base + address);
-		error = -ENOMEM;
-		if (!pmd)
-			break;
-		error = zeromap_pmd_range(mm, pmd, address, end - address, prot);
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (unlikely(!pmd_alloc(mm, pud, pud_start)))
+			return -ENOMEM;
+		error = zeromap_pmd_range(mm, pud, pud_start, pud_end, prot);
 		if (error)
 			break;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
-	return 0;
+	}
+	return error;
 }
-
-int zeromap_page_range(struct vm_area_struct *vma, unsigned long address,
+int zeromap_page_range(struct vm_area_struct *vma, unsigned long start,
 					unsigned long size, pgprot_t prot)
 {
-	int i;
-	int error = 0;
-	pgd_t * pgd;
-	unsigned long beg = address;
-	unsigned long end = address + size;
-	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long end = start + size;
+	unsigned long pgd_start, pgd_end;
+	pgd_t * pgd;
+	int error = 0;
 
-	pgd = pgd_offset(mm, address);
-	flush_cache_range(vma, beg, end);
-	BUG_ON(address >= end);
+	BUG_ON(start >= end);
 	BUG_ON(end > vma->vm_end);
 
+	pgd = pgd_offset(mm, start);
+	flush_cache_range(vma, start, end);
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(mm, pgd, address);
-		error = -ENOMEM;
-		if (!pud)
+	for_each_pgd(mm, start, end, pgd, pgd_start, pgd_end) {
+		if (unlikely(!pud_alloc(mm, pgd, pgd_start))) {
+			error = -ENOMEM;
 			break;
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= beg || next > end)
-			next = end;
-		error = zeromap_pud_range(mm, pud, address,
-						next - address, prot);
+		}
+		error = zeromap_pud_range(mm, pgd, pgd_start, pgd_end, prot);
 		if (error)
 			break;
-		address = next;
-		pgd++;
 	}
 	/*
 	 * Why flush? zeromap_pte_range has a BUG_ON for !pte_none()
 	 */
-	flush_tlb_range(vma, beg, end);
+	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
 	return error;
 }
@@ -1099,94 +1034,71 @@ int zeromap_page_range(struct vm_area_st
  * in null mappings (currently treated as "copy-on-access")
  */
 static inline void
-remap_pte_range(pte_t * pte, unsigned long address, unsigned long size,
+remap_pte_range(pmd_t * pmd, unsigned long start, unsigned long end,
 		unsigned long pfn, pgprot_t prot)
 {
-	unsigned long end;
+	unsigned long address;
+	pte_t * pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-	do {
+	for_each_pte_map(pmd, start, end, pte, address) {
 		BUG_ON(!pte_none(*pte));
 		if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
  			set_pte(pte, pfn_pte(pfn, prot));
-		address += PAGE_SIZE;
 		pfn++;
-		pte++;
-	} while (address && (address < end));
+	} for_each_pte_map_end;
 }
 
 static inline int
-remap_pmd_range(struct mm_struct *mm, pmd_t * pmd, unsigned long address,
-		unsigned long size, unsigned long pfn, pgprot_t prot)
+remap_pmd_range(struct mm_struct *mm, pud_t * pud, unsigned long start,
+		unsigned long end, unsigned long pfn, pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long pmd_start, pmd_end;
+	pmd_t * pmd;
 
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-	pfn -= (address >> PAGE_SHIFT);
-	do {
-		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
-		if (!pte)
+	pfn -= start >> PAGE_SHIFT;
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (!pte_alloc(mm, pmd, pmd_start))
 			return -ENOMEM;
-		remap_pte_range(pte, base + address, end - address,
-				(address >> PAGE_SHIFT) + pfn, prot);
-		pte_unmap(pte);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		remap_pte_range(pmd, pmd_start, pmd_end,
+				(pmd_start >> PAGE_SHIFT) + pfn, prot);
+	}
 	return 0;
 }
 
-static inline int remap_pud_range(struct mm_struct *mm, pud_t * pud,
-				  unsigned long address, unsigned long size,
-				  unsigned long pfn, pgprot_t prot)
-{
-	unsigned long base, end;
-	int error;
-
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	pfn -= address >> PAGE_SHIFT;
-	do {
-		pmd_t *pmd = pmd_alloc(mm, pud, base+address);
-		error = -ENOMEM;
-		if (!pmd)
-			break;
-		error = remap_pmd_range(mm, pmd, base + address, end - address,
-				(address >> PAGE_SHIFT) + pfn, prot);
+static inline int remap_pud_range(struct mm_struct *mm, pgd_t * pgd,
+				unsigned long start, unsigned long end,
+				unsigned long pfn, pgprot_t prot)
+{
+	unsigned long pud_start, pud_end;
+	pud_t * pud;
+	int error = 0;
+
+	pfn -= start >> PAGE_SHIFT;
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (!pmd_alloc(mm, pud, pud_start))
+			return -ENOMEM;
+		error = remap_pmd_range(mm, pud, pud_start, pud_end,
+				(pud_start >> PAGE_SHIFT) + pfn, prot);
 		if (error)
 			break;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
+	}
 	return error;
 }
 
 /*  Note: this is only safe if the mm semaphore is held when called. */
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long start,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-	int error = 0;
-	pgd_t *pgd;
-	unsigned long beg = from;
-	unsigned long end = from + size;
-	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
-	int i;
+	unsigned long pgd_start, pgd_end;
+	unsigned long end = start + size;
+	pgd_t *pgd;
+	int error = 0;
 
-	pfn -= from >> PAGE_SHIFT;
-	pgd = pgd_offset(mm, from);
-	flush_cache_range(vma, beg, end);
-	BUG_ON(from >= end);
+	BUG_ON(start >= end);
+
+	pfn -= start >> PAGE_SHIFT;
+	flush_cache_range(vma, start, end);
 
 	/*
 	 * Physically remapped pages are special. Tell the
@@ -1199,25 +1111,20 @@ int remap_pfn_range(struct vm_area_struc
 	vma->vm_flags |= VM_IO | VM_RESERVED;
 
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(beg); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(mm, pgd, from);
-		error = -ENOMEM;
-		if (!pud)
+	for_each_pgd(mm, start, end, pgd, pgd_start, pgd_end) {
+		if (!pud_alloc(mm, pgd, pgd_start)) {
+			error = -ENOMEM;
 			break;
-		next = (from + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= from)
-			next = end;
-		error = remap_pud_range(mm, pud, from, end - from,
-					pfn + (from >> PAGE_SHIFT), prot);
+		}
+		error = remap_pud_range(mm, pgd, pgd_start, pgd_end,
+					pfn + (pgd_start >> PAGE_SHIFT), prot);
 		if (error)
 			break;
-		from = next;
-		pgd++;
 	}
 	/*
 	 * Why flush? remap_pte_range has a BUG_ON for !pte_none()
 	 */
-	flush_tlb_range(vma, beg, end);
+	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
 
 	return error;
diff -puN mm/msync.c~vm-pgt-walkers mm/msync.c
--- linux-2.6/mm/msync.c~vm-pgt-walkers	2005-02-17 23:59:16.000000000 +1100
+++ linux-2.6-npiggin/mm/msync.c	2005-02-17 23:59:16.000000000 +1100
@@ -21,7 +21,7 @@
  * Called with mm->page_table_lock held to protect against other
  * threads/the swapper from ripping pte's out from under us.
  */
-static int filemap_sync_pte(pte_t *ptep, struct vm_area_struct *vma,
+static void filemap_sync_pte(pte_t *ptep, struct vm_area_struct *vma,
 	unsigned long address, unsigned int flags)
 {
 	pte_t pte = *ptep;
@@ -35,106 +35,74 @@ static int filemap_sync_pte(pte_t *ptep,
 		     page_test_and_clear_dirty(page)))
 			set_page_dirty(page);
 	}
-	return 0;
 }
 
-static int filemap_sync_pte_range(pmd_t * pmd,
-	unsigned long address, unsigned long end, 
+static void filemap_sync_pte_range(pmd_t * pmd,
+	unsigned long start, unsigned long end, 
 	struct vm_area_struct *vma, unsigned int flags)
 {
+	unsigned long address;
 	pte_t *pte;
-	int error;
 
-	if (pmd_none(*pmd))
-		return 0;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return 0;
-	}
-	pte = pte_offset_map(pmd, address);
-	if ((address & PMD_MASK) != (end & PMD_MASK))
-		end = (address & PMD_MASK) + PMD_SIZE;
-	error = 0;
-	do {
-		error |= filemap_sync_pte(pte, vma, address, flags);
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
-
-	pte_unmap(pte - 1);
-
-	return error;
+	for_each_pte_map(pmd, start, end, pte, address) {
+		filemap_sync_pte(pte, vma, address, flags);
+	} for_each_pte_map_end;
 }
 
-static inline int filemap_sync_pmd_range(pud_t * pud,
-	unsigned long address, unsigned long end, 
+static void filemap_sync_pmd_range(pud_t * pud,
+	unsigned long start, unsigned long end, 
 	struct vm_area_struct *vma, unsigned int flags)
 {
+	unsigned long pmd_start, pmd_end;
 	pmd_t * pmd;
-	int error;
 
-	if (pud_none(*pud))
-		return 0;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return 0;
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+
+		filemap_sync_pte_range(pmd, pmd_start, pmd_end, vma, flags);
 	}
-	pmd = pmd_offset(pud, address);
-	if ((address & PUD_MASK) != (end & PUD_MASK))
-		end = (address & PUD_MASK) + PUD_SIZE;
-	error = 0;
-	do {
-		error |= filemap_sync_pte_range(pmd, address, end, vma, flags);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
-	return error;
 }
 
-static inline int filemap_sync_pud_range(pgd_t *pgd,
-	unsigned long address, unsigned long end,
+static void filemap_sync_pud_range(pgd_t *pgd,
+	unsigned long start, unsigned long end,
 	struct vm_area_struct *vma, unsigned int flags)
 {
+	unsigned long pud_start, pud_end;
 	pud_t *pud;
-	int error;
 
-	if (pgd_none(*pgd))
-		return 0;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return 0;
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+
+		filemap_sync_pmd_range(pud, pud_start, pud_end, vma, flags);
 	}
-	pud = pud_offset(pgd, address);
-	if ((address & PGDIR_MASK) != (end & PGDIR_MASK))
-		end = (address & PGDIR_MASK) + PGDIR_SIZE;
-	error = 0;
-	do {
-		error |= filemap_sync_pmd_range(pud, address, end, vma, flags);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
-	return error;
 }
 
-static int __filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void __filemap_sync(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned int flags)
 {
+	unsigned long pgd_start, pgd_end;
 	pgd_t *pgd;
-	unsigned long end = address + size;
-	unsigned long next;
-	int i;
-	int error = 0;
+
+	BUG_ON(start >= end);
 
 	/* Aquire the lock early; it may be possible to avoid dropping
 	 * and reaquiring it repeatedly.
 	 */
 	spin_lock(&vma->vm_mm->page_table_lock);
 
-	pgd = pgd_offset(vma->vm_mm, address);
-	flush_cache_range(vma, address, end);
+	flush_cache_range(vma, start, end);
 
 	/* For hugepages we can't go walking the page table normally,
 	 * but that's ok, hugetlbfs is memory based, so we don't need
@@ -142,49 +110,46 @@ static int __filemap_sync(struct vm_area
 	if (is_vm_hugetlb_page(vma))
 		goto out;
 
-	if (address >= end)
-		BUG();
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		error |= filemap_sync_pud_range(pgd, address, next, vma, flags);
-		address = next;
-		pgd++;
+	for_each_pgd(vma->vm_mm, start, end, pgd, pgd_start, pgd_end) {
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+
+		filemap_sync_pud_range(pgd, pgd_start, pgd_end, vma, flags);
 	}
+
 	/*
 	 * Why flush ? filemap_sync_pte already flushed the tlbs with the
 	 * dirty bits.
 	 */
-	flush_tlb_range(vma, end - size, end);
+	flush_tlb_range(vma, start, end);
  out:
 	spin_unlock(&vma->vm_mm->page_table_lock);
-
-	return error;
 }
 
 #ifdef CONFIG_PREEMPT
-static int filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void filemap_sync(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned int flags)
 {
 	const size_t chunk = 64 * 1024;	/* bytes */
-	int error = 0;
 
-	while (size) {
-		size_t sz = min(size, chunk);
+	while (start < end) {
+		size_t sz = min((size_t)(end-start), chunk);
 
-		error |= __filemap_sync(vma, address, sz, flags);
+		__filemap_sync(vma, start, start+sz, flags);
+		start += sz;
 		cond_resched();
-		address += sz;
-		size -= sz;
 	}
-	return error;
 }
 #else
-static int filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void filemap_sync(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned int flags)
 {
-	return __filemap_sync(vma, address, size, flags);
+	__filemap_sync(vma, start, end, flags);
 }
 #endif
 
@@ -209,9 +174,9 @@ static int msync_interval(struct vm_area
 		return -EBUSY;
 
 	if (file && (vma->vm_flags & VM_SHARED)) {
-		ret = filemap_sync(vma, start, end-start, flags);
+		filemap_sync(vma, start, end, flags);
 
-		if (!ret && (flags & MS_SYNC)) {
+		if (flags & MS_SYNC) {
 			struct address_space *mapping = file->f_mapping;
 			int err;
 
diff -puN mm/mprotect.c~vm-pgt-walkers mm/mprotect.c
--- linux-2.6/mm/mprotect.c~vm-pgt-walkers	2005-02-17 23:59:16.000000000 +1100
+++ linux-2.6-npiggin/mm/mprotect.c	2005-02-17 23:59:16.000000000 +1100
@@ -26,25 +26,13 @@
 #include <asm/tlbflush.h>
 
 static inline void
-change_pte_range(pmd_t *pmd, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+change_pte_range(pmd_t *pmd, unsigned long start,
+		unsigned long end, pgprot_t newprot)
 {
+	unsigned long address;
 	pte_t * pte;
-	unsigned long end;
 
-	if (pmd_none(*pmd))
-		return;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-	pte = pte_offset_map(pmd, address);
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-	do {
+	for_each_pte_map(pmd, start, end, pte, address) {
 		if (pte_present(*pte)) {
 			pte_t entry;
 
@@ -55,62 +43,47 @@ change_pte_range(pmd_t *pmd, unsigned lo
 			entry = ptep_get_and_clear(pte);
 			set_pte(pte, pte_modify(entry, newprot));
 		}
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
-	pte_unmap(pte - 1);
+	} for_each_pte_map_end;
 }
 
 static inline void
-change_pmd_range(pud_t *pud, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+change_pmd_range(pud_t *pud, unsigned long start,
+		unsigned long end, pgprot_t newprot)
 {
+	unsigned long pmd_start, pmd_end;
 	pmd_t * pmd;
-	unsigned long end;
 
-	if (pud_none(*pud))
-		return;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
-	pmd = pmd_offset(pud, address);
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-	do {
-		change_pte_range(pmd, address, end - address, newprot);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+
+		change_pte_range(pmd, pmd_start, pmd_end, newprot);
+	}
 }
 
 static inline void
-change_pud_range(pgd_t *pgd, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+change_pud_range(pgd_t *pgd, unsigned long start,
+		unsigned long end, pgprot_t newprot)
 {
+	unsigned long pud_start, pud_end;
 	pud_t * pud;
-	unsigned long end;
 
-	if (pgd_none(*pgd))
-		return;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
-	pud = pud_offset(pgd, address);
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	do {
-		change_pmd_range(pud, address, end - address, newprot);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+
+		change_pmd_range(pud, pud_start, pud_end, newprot);
+	}
 }
 
 static void
@@ -118,23 +91,24 @@ change_protection(struct vm_area_struct 
 		unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = current->mm;
+	unsigned long pgd_start, pgd_end;
 	pgd_t *pgd;
-	unsigned long beg = start, next;
-	int i;
 
-	pgd = pgd_offset(mm, start);
-	flush_cache_range(vma, beg, end);
 	BUG_ON(start >= end);
+	flush_cache_range(vma, start, end);
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
-		next = (start + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= start || next > end)
-			next = end;
-		change_pud_range(pgd, start, next - start, newprot);
-		start = next;
-		pgd++;
+	for_each_pgd(mm, start, end, pgd, pgd_start, pgd_end) {
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+
+		change_pud_range(pgd, pgd_start, pgd_end, newprot);
 	}
-	flush_tlb_range(vma, beg, end);
+	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
 }
 
diff -puN mm/vmalloc.c~vm-pgt-walkers mm/vmalloc.c
--- linux-2.6/mm/vmalloc.c~vm-pgt-walkers	2005-02-17 23:59:16.000000000 +1100
+++ linux-2.6-npiggin/mm/vmalloc.c	2005-02-17 23:59:16.000000000 +1100
@@ -23,212 +23,156 @@
 DEFINE_RWLOCK(vmlist_lock);
 struct vm_struct *vmlist;
 
-static void unmap_area_pte(pmd_t *pmd, unsigned long address,
-				  unsigned long size)
+static void unmap_area_pte(pmd_t *pmd, unsigned long start, unsigned long end)
 {
-	unsigned long end;
+	unsigned long address;
 	pte_t *pte;
 
-	if (pmd_none(*pmd))
-		return;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
+	for_each_pte_kernel(pmd, start, end, pte, address) {
+		pte_t page = ptep_get_and_clear(pte);
 
-	pte = pte_offset_kernel(pmd, address);
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-
-	do {
-		pte_t page;
-		page = ptep_get_and_clear(pte);
-		address += PAGE_SIZE;
-		pte++;
-		if (pte_none(page))
-			continue;
-		if (pte_present(page))
-			continue;
-		printk(KERN_CRIT "Whee.. Swapped out page in kernel page table\n");
-	} while (address < end);
+		if (unlikely(!pte_none(page) && !pte_present(page))) {
+			printk(KERN_CRIT "ERROR: swapped out kernel page\n");
+			dump_stack();
+		}
+	} for_each_pte_kernel_end;
 }
 
-static void unmap_area_pmd(pud_t *pud, unsigned long address,
-				  unsigned long size)
+static void unmap_area_pmd(pud_t *pud, unsigned long start, unsigned long end)
 {
-	unsigned long end;
+	unsigned long pmd_start, pmd_end;
 	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
 
-	pmd = pmd_offset(pud, address);
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-
-	do {
-		unmap_area_pte(pmd, address, end - address);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address < end);
+		unmap_area_pte(pmd, pmd_start, pmd_end);
+	}
 }
 
-static void unmap_area_pud(pgd_t *pgd, unsigned long address,
-			   unsigned long size)
+static void unmap_area_pud(pgd_t *pgd, unsigned long start, unsigned long end)
 {
+	unsigned long pud_start, pud_end;
 	pud_t *pud;
-	unsigned long end;
 
-	if (pgd_none(*pgd))
-		return;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+
+		unmap_area_pmd(pud, pud_start, pud_end);
 	}
+}
+
+void unmap_vm_area(struct vm_struct *area)
+{
+	unsigned long start = (unsigned long) area->addr;
+	unsigned long end = (start + area->size);
+	unsigned long pgd_start, pgd_end;
+	pgd_t *pgd;
 
-	pud = pud_offset(pgd, address);
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-
-	do {
-		unmap_area_pmd(pud, address, end - address);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
+	flush_cache_vunmap(start, end);
+	for_each_pgd_k(start, end, pgd, pgd_start, pgd_end) {
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+
+		unmap_area_pud(pgd, pgd_start, pgd_end);
+	}
+	flush_tlb_kernel_range((unsigned long) area->addr, end);
 }
 
-static int map_area_pte(pte_t *pte, unsigned long address,
-			       unsigned long size, pgprot_t prot,
+static int map_area_pte(pmd_t *pmd, unsigned long start,
+			       unsigned long end, pgprot_t prot,
 			       struct page ***pages)
 {
-	unsigned long end;
-
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
+	unsigned long address;
+	pte_t * pte;
 
-	do {
+	for_each_pte_kernel(pmd, start, end, pte, address) {
 		struct page *page = **pages;
 		WARN_ON(!pte_none(*pte));
 		if (!page)
 			return -ENOMEM;
 
 		set_pte(pte, mk_pte(page, prot));
-		address += PAGE_SIZE;
-		pte++;
 		(*pages)++;
-	} while (address < end);
+	} for_each_pte_kernel_end;
 	return 0;
 }
 
-static int map_area_pmd(pmd_t *pmd, unsigned long address,
-			       unsigned long size, pgprot_t prot,
+static int map_area_pmd(pud_t *pud, unsigned long start,
+			       unsigned long end, pgprot_t prot,
 			       struct page ***pages)
 {
-	unsigned long base, end;
+	unsigned long pmd_start, pmd_end;
+	pmd_t * pmd;
 
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-
-	do {
-		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, base + address);
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, pmd_start);
 		if (!pte)
 			return -ENOMEM;
-		if (map_area_pte(pte, address, end - address, prot, pages))
+		if (map_area_pte(pmd, pmd_start, pmd_end, prot, pages))
 			return -ENOMEM;
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address < end);
+	}
 
 	return 0;
 }
 
-static int map_area_pud(pud_t *pud, unsigned long address,
+static int map_area_pud(pgd_t *pgd, unsigned long start,
 			       unsigned long end, pgprot_t prot,
 			       struct page ***pages)
 {
-	do {
-		pmd_t *pmd = pmd_alloc(&init_mm, pud, address);
+	unsigned long pud_start, pud_end;
+	pud_t * pud;
+
+	for_each_pud(pgd, start, end, pud, pud_start, pud_end) {
+		pmd_t *pmd = pmd_alloc(&init_mm, pud, pud_start);
 		if (!pmd)
 			return -ENOMEM;
-		if (map_area_pmd(pmd, address, end - address, prot, pages))
+		if (map_area_pmd(pud, pud_start, pud_end, prot, pages))
 			return -ENOMEM;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && address < end);
+	}
 
 	return 0;
 }
 
-void unmap_vm_area(struct vm_struct *area)
-{
-	unsigned long address = (unsigned long) area->addr;
-	unsigned long end = (address + area->size);
-	unsigned long next;
-	pgd_t *pgd;
-	int i;
-
-	pgd = pgd_offset_k(address);
-	flush_cache_vunmap(address, end);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		unmap_area_pud(pgd, address, next - address);
-		address = next;
-	        pgd++;
-	}
-	flush_tlb_kernel_range((unsigned long) area->addr, end);
-}
-
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
 {
-	unsigned long address = (unsigned long) area->addr;
-	unsigned long end = address + (area->size-PAGE_SIZE);
-	unsigned long next;
+	unsigned long start = (unsigned long) area->addr;
+	unsigned long end = start + (area->size-PAGE_SIZE);
+	unsigned long pgd_start, pgd_end;
 	pgd_t *pgd;
 	int err = 0;
-	int i;
 
-	pgd = pgd_offset_k(address);
 	spin_lock(&init_mm.page_table_lock);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(&init_mm, pgd, address);
+	for_each_pgd_k(start, end, pgd, pgd_start, pgd_end) {
+		pud_t *pud = pud_alloc(&init_mm, pgd, pgd_start);
 		if (!pud) {
 			err = -ENOMEM;
 			break;
 		}
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next < address || next > end)
-			next = end;
-		if (map_area_pud(pud, address, next, prot, pages)) {
+		if (map_area_pud(pgd, pgd_start, pgd_end, prot, pages)) {
 			err = -ENOMEM;
 			break;
 		}
-
-		address = next;
-		pgd++;
 	}
-
 	spin_unlock(&init_mm.page_table_lock);
-	flush_cache_vmap((unsigned long) area->addr, end);
+	flush_cache_vmap(start, end);
 	return err;
 }
 
diff -puN arch/i386/mm/ioremap.c~vm-pgt-walkers arch/i386/mm/ioremap.c
--- linux-2.6/arch/i386/mm/ioremap.c~vm-pgt-walkers	2005-02-17 23:59:58.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/ioremap.c	2005-02-18 00:29:58.000000000 +1100
@@ -17,86 +17,72 @@
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
 
-static inline void remap_area_pte(pte_t * pte, unsigned long address, unsigned long size,
-	unsigned long phys_addr, unsigned long flags)
+static inline void remap_area_pte(pmd_t *pmd, unsigned long start,
+		unsigned long end, unsigned long phys_addr, unsigned long flags)
 {
-	unsigned long end;
+	unsigned long address;
 	unsigned long pfn;
+	pte_t * pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-	if (address >= end)
-		BUG();
 	pfn = phys_addr >> PAGE_SHIFT;
-	do {
+	for_each_pte_kernel(pmd, start, end, pte, address) {
 		if (!pte_none(*pte)) {
 			printk("remap_area_pte: page already exists\n");
 			BUG();
 		}
 		set_pte(pte, pfn_pte(pfn, __pgprot(_PAGE_PRESENT | _PAGE_RW | 
 					_PAGE_DIRTY | _PAGE_ACCESSED | flags)));
-		address += PAGE_SIZE;
 		pfn++;
-		pte++;
-	} while (address && (address < end));
+	} for_each_pte_kernel_end;
 }
 
-static inline int remap_area_pmd(pmd_t * pmd, unsigned long address, unsigned long size,
-	unsigned long phys_addr, unsigned long flags)
+static inline int remap_area_pmd(pud_t * pud, unsigned long start,
+		unsigned long end, unsigned long phys_addr, unsigned long flags)
 {
-	unsigned long end;
+	unsigned long pmd_start, pmd_end;
+	pmd_t * pmd;
 
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	phys_addr -= address;
-	if (address >= end)
-		BUG();
-	do {
-		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
-		if (!pte)
+	phys_addr -= start;
+
+	for_each_pmd(pud, start, end, pmd, pmd_start, pmd_end) {
+		if (!pte_alloc_kernel(&init_mm, pmd, pmd_start))
 			return -ENOMEM;
-		remap_area_pte(pte, address, end - address, address + phys_addr, flags);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		remap_area_pte(pmd, pmd_start, pmd_end, pmd_start + phys_addr, flags);
+	}
 	return 0;
 }
 
-static int remap_area_pages(unsigned long address, unsigned long phys_addr,
+static int remap_area_pages(unsigned long start, unsigned long phys_addr,
 				 unsigned long size, unsigned long flags)
 {
-	int error;
-	pgd_t * dir;
-	unsigned long end = address + size;
+	unsigned long pgd_start, pgd_end;
+	unsigned long end = start + size;
+	pgd_t * pgd;
+	int error = 0;
+
+	BUG_ON(start >= end);
 
-	phys_addr -= address;
-	dir = pgd_offset(&init_mm, address);
 	flush_cache_all();
-	if (address >= end)
-		BUG();
+	phys_addr -= start;
 	spin_lock(&init_mm.page_table_lock);
-	do {
+	for_each_pgd(&init_mm, start, end, pgd, pgd_start, pgd_end) {
 		pud_t *pud;
 		pmd_t *pmd;
 		
+		/* We can get away with this because i386 has no
+		 * more than 3-level page tables */
 		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
+		pud = pud_alloc(&init_mm, pgd, pgd_start);
 		if (!pud)
 			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, pgd_start);
 		if (!pmd)
 			break;
-		if (remap_area_pmd(pmd, address, end - address,
-					 phys_addr + address, flags))
+		if (remap_area_pmd(pud, pgd_start, pgd_end,
+					phys_addr + pgd_start, flags))
 			break;
 		error = 0;
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+	}
 	spin_unlock(&init_mm.page_table_lock);
 	flush_tlb_all();
 	return error;
diff -puN include/linux/mm.h~vm-pgt-walkers include/linux/mm.h
diff -puN drivers/char/mem.c~vm-pgt-walkers drivers/char/mem.c

_


* Re: [PATCH 2/2] page table iterators
  2005-02-17 14:03 ` [PATCH 2/2] page table iterators Nick Piggin
@ 2005-02-17 15:56   ` Linus Torvalds
  2005-02-17 16:13     ` Nick Piggin
  2005-02-17 19:43   ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-02-17 15:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, Andrew Morton, Andi Kleen, linux-kernel



On Fri, 18 Feb 2005, Nick Piggin wrote:
>
> I am pretty surprised myself that I was able to consolidate
> all "page table range" functions into a single type of iterator
> (well, there are a couple of variations, but it's not too bad).

Ok, this is post-2.6.11 material, so please remind me.

		Linus


* Re: [PATCH 2/2] page table iterators
  2005-02-17 15:56   ` Linus Torvalds
@ 2005-02-17 16:13     ` Nick Piggin
  0 siblings, 0 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-17 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Andi Kleen, linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 18 Feb 2005, Nick Piggin wrote:
> 
>>I am pretty surprised myself that I was able to consolidate
>>all "page table range" functions into a single type of iterator
>>(well, there are a couple of variations, but it's not too bad).
> 
> 
> Ok, this is post-2.6.11 material, so please remind me.
> 

Sure... it will probably be best to go through -mm, but either
way I'll package the patches up nicely and rediff them against
2.6.11 when it comes out.



* Re: [PATCH 2/2] page table iterators
  2005-02-17 14:03 ` [PATCH 2/2] page table iterators Nick Piggin
  2005-02-17 15:56   ` Linus Torvalds
@ 2005-02-17 19:43   ` Andi Kleen
  2005-02-17 22:49     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2005-02-17 19:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, Linus Torvalds, Andrew Morton,
	Andi Kleen, linux-kernel

On Fri, Feb 18, 2005 at 01:03:35AM +1100, Nick Piggin wrote:
> I am pretty surprised myself that I was able to consolidate
> all "page table range" functions into a single type of iterator
> (well, there are a couple of variations, but it's not too bad).

I started a similar project - but it uses the existing loops,
just using {pte,pmd,pud,pgd}_next. The idea is to optimize
page table walking by keeping some state in the struct page
of the page table page that says whether an entry is set 
or not. To make this work I switched everything to indexes
instead of pointers.
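
Roughly the shape of it, if that helps - this is only an untested
sketch with made-up names (pmd_present_map() does not exist
anywhere), not my actual code:

static void walk_pmd_range_lazy(pud_t *pud, unsigned long addr,
				unsigned long end)
{
	pmd_t *base = pmd_offset(pud, 0);
	unsigned long *map = pmd_present_map(pud);	/* hypothetical bitmap */
	unsigned int i = pmd_index(addr);
	unsigned int last = pmd_index(end - 1);

	for (;;) {
		/* jump straight to the next populated slot by index */
		i = find_next_bit(map, PTRS_PER_PMD, i);
		if (i >= PTRS_PER_PMD || i > last)
			break;
		/* ... per-pmd work on base + i ... */
		i++;
	}
}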

The main problem is some nasty include loops.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 19:43   ` Andi Kleen
@ 2005-02-17 22:49     ` Benjamin Herrenschmidt
  2005-02-17 23:03       ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-17 22:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel

On Thu, 2005-02-17 at 20:43 +0100, Andi Kleen wrote:
> On Fri, Feb 18, 2005 at 01:03:35AM +1100, Nick Piggin wrote:
> > I am pretty surprised myself that I was able to consolidate
> > all "page table range" functions into a single type of iterator
> > (well, there are a couple of variations, but it's not too bad).
> 
> I started a similar project - but it uses the existing loops,
> just using {pte,pmd,pud,pgd}_next. The idea is to optimize
> page table walking by keeping some state in the struct page
> of the page table page that says whether an entry is set 
> or not. To make this work I switched everything to indexes
> instead of pointers.
> 
> Main problem are some nasty include loops. 

I thought about both ways yesterday, and in the end, I prefer Nick's stuff,
at least for now. It also gives us more flexibility to change gory
implementation details in the future. I still have to run it through a
bit of torture testing though.

Ben.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 22:49     ` Benjamin Herrenschmidt
@ 2005-02-17 23:03       ` Andi Kleen
  2005-02-17 23:21         ` Benjamin Herrenschmidt
  2005-02-17 23:30         ` David S. Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2005-02-17 23:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andi Kleen, Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel

> I though about both ways yesterday, and in the end, I prefer Nick stuff,
> at least for now. It gives us also more flexibility to change gory
> implementation details in the future. I still have to run it through a
> bit of torture testing though.

They're really solving different problems. My code is just aimed
at getting x86-64 fork/exec/etc. as fast as they were before 4level
(currently they are significantly slower because they have to walk
a lot more page tables).

The problem is that the index based approach (I think you have to use
indexes for this, pointers get very messy) probably does not 
fit very well into Nick's complex macros.  

Nick's macros are essentially just code transformations with
some micro optimizations. 

That's not bad, but it won't give you the big speedups 
the lazy walking approach will give.

And to be honest we only have about 6 or 7 of these walkers
in the whole kernel, and 90% of them are in memory.c.
While doing 4level I think I changed all of them around several
times and it wasn't that big an issue.  So it's not that we
have a big pressing problem here...

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 23:03       ` Andi Kleen
@ 2005-02-17 23:21         ` Benjamin Herrenschmidt
  2005-02-17 23:34           ` Andi Kleen
  2005-02-17 23:30         ` David S. Miller
  1 sibling, 1 reply; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-17 23:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel

On Fri, 2005-02-18 at 00:03 +0100, Andi Kleen wrote:

> And to be honest we only have about 6 or 7 of these walkers
> in the whole kernel. And 90% of them are in memory.c
> While doing 4level I think I changed all of them around several
> times and it wasn't that big an issue.  So it's not that we
> have a big pressing problem here... 

We have about 50% of them in memory.c :) But my main problem is more
that every single one of them is implemented slightly differently.

Going Nick's way is a good start. If they are all consolidated to use
the same macro, they will be easier for you to change later on anyway.

Ben.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 23:03       ` Andi Kleen
  2005-02-17 23:21         ` Benjamin Herrenschmidt
@ 2005-02-17 23:30         ` David S. Miller
  2005-02-17 23:57           ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: David S. Miller @ 2005-02-17 23:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: benh, ak, nickpiggin, torvalds, akpm, linux-kernel

On Fri, 18 Feb 2005 00:03:42 +0100
Andi Kleen <ak@suse.de> wrote:

> And to be honest we only have about 6 or 7 of these walkers
> in the whole kernel. And 90% of them are in memory.c
> While doing 4level I think I changed all of them around several
> times and it wasn't that big an issue.  So it's not that we
> have a big pressing problem here... 

It's super error prone.  A regression added by your edit of these
walkers for the 4level changes was only discovered and fixed
yesterday by the ppc folks.

I absolutely support any change which consolidates these things.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 23:21         ` Benjamin Herrenschmidt
@ 2005-02-17 23:34           ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2005-02-17 23:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andi Kleen, Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel

On Fri, Feb 18, 2005 at 10:21:03AM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2005-02-18 at 00:03 +0100, Andi Kleen wrote:
> 
> > And to be honest we only have about 6 or 7 of these walkers
> > in the whole kernel. And 90% of them are in memory.c
> > While doing 4level I think I changed all of them around several
> > times and it wasn't that big an issue.  So it's not that we
> > have a big pressing problem here... 
> 
> We have about 50% of them in memory.c :) But my main problem is more
> that every single of them is implemented slightly differently.

No, much more. But I only count real walkers, not stuff like vmalloc.

The ioremap duplication across architectures is a bit annoying, but
the fix for that would be to factor the code out completely, not
just to improve the walking.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 23:30         ` David S. Miller
@ 2005-02-17 23:57           ` Andi Kleen
  2005-02-20 12:35             ` Nick Piggin
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2005-02-17 23:57 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andi Kleen, benh, nickpiggin, torvalds, akpm, linux-kernel

On Thu, Feb 17, 2005 at 03:30:31PM -0800, David S. Miller wrote:
> On Fri, 18 Feb 2005 00:03:42 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > And to be honest we only have about 6 or 7 of these walkers
> > in the whole kernel. And 90% of them are in memory.c
> > While doing 4level I think I changed all of them around several
> > times and it wasn't that big an issue.  So it's not that we
> > have a big pressing problem here... 
> 
> It's super error prone.  A regression added by your edit of these

Actually it was in Nick's code (PUD layer ;-).  But I won't argue
that my code didn't have bugs too...

> walkers for the 4level changes was only discovered and fixed
> yesterday by the ppc folks.
> 
> I absolutely support any change which consolidates these things.

The problem is just that these walker macros, once they
do all the lazy walking stuff, will be quite complicated.
And I don't really want another uaccess.h-like macro mess.

Yes currently they look simple, but that will change.

Open coding is probably the smaller evil.

And they're really not changed that often.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-17 23:57           ` Andi Kleen
@ 2005-02-20 12:35             ` Nick Piggin
  2005-02-21  6:35               ` Hugh Dickins
  0 siblings, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-20 12:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, benh, torvalds, akpm, linux-kernel

Andi Kleen wrote:
> On Thu, Feb 17, 2005 at 03:30:31PM -0800, David S. Miller wrote:
> 
>>On Fri, 18 Feb 2005 00:03:42 +0100
>>Andi Kleen <ak@suse.de> wrote:
>>
>>
>>>And to be honest we only have about 6 or 7 of these walkers
>>>in the whole kernel. And 90% of them are in memory.c
>>>While doing 4level I think I changed all of them around several
>>>times and it wasn't that big an issue.  So it's not that we
>>>have a big pressing problem here... 
>>
>>It's super error prone.  A regression added by your edit of these
> 
> 
> Actually it was in Nick's code (PUD layer ;-).  But I won't argue
> that my code didn't have bugs too...
> 
> 

I won't look back to see where the error came from :) But
yeah it is equally (if not more) likely to have come from
me. And it probably did happen because all the code is
slightly different and hard to understand.

>>walkers for the 4level changes was only discovered and fixed
>>yesterday by the ppc folks.
>>
>>I absolutely support any change which consolidates these things.
> 
> 
> The problem is just that these walker macros when they
> do all the lazy walking stuff will be quite complicated.
> And I don't really want another uaccess.h-like macro mess.
> 
> Yes currently they look simple, but that will change.
> 

But even in that case, it will still be better to have the
extra complexity once in the macro rather than throughout mm/

> Open coding is probably the smaller evil.
> 
> And they're really not changed that often.
> 

It is not so much a matter of how often they change as of having 10
slightly different implementations.

I think it should be easier to go from the iterators patch to
perhaps more complex iterators, or some open coding, etc.,
rather than to try to put a big complex pt walker on top of these
10 different open-coded implementations.

But perhaps I'm missing something that you're not - I'd need to see
the lazy walking code, I guess.

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-20 12:35             ` Nick Piggin
@ 2005-02-21  6:35               ` Hugh Dickins
  2005-02-21  6:40                 ` Andrew Morton
  2005-02-22  9:54                 ` Nick Piggin
  0 siblings, 2 replies; 33+ messages in thread
From: Hugh Dickins @ 2005-02-21  6:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Sun, 20 Feb 2005, Nick Piggin wrote:
> Andi Kleen wrote:
> > 
> > The problem is just that these walker macros when they
> > do all the lazy walking stuff will be quite complicated.
> > And I don't really want another uaccess.h-like macro mess.
> > 
> > Yes currently they look simple, but that will change.
> 
> But even in that case, it will still be better to have the
> extra complexity once in the macro rather than throughout mm/
> 
> > Open coding is probably the smaller evil.
> > And they're really not changed that often.

My opinion FWIW: I'm all for regularizing the pagetable loops to
work the same way, changing their variables to use the same names,
improving their efficiency; but I do like to see what a loop is up to.

list_for_each and friends are very widely used, they're fine, and I'm
quite glad to have their prefetching hidden away from me; but usually
I groan, grin and bear it, each time someone devises a clever new
for_each macro concealing half the details of some specialist loop.

In a minority?
Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-21  6:35               ` Hugh Dickins
@ 2005-02-21  6:40                 ` Andrew Morton
  2005-02-21  7:09                   ` Benjamin Herrenschmidt
  2005-02-22  9:54                 ` Nick Piggin
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2005-02-21  6:40 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nickpiggin, ak, davem, benh, torvalds, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> My opinion FWIW: I'm all for regularizing the pagetable loops to
>  work the same way, changing their variables to use the same names,
>  improving their efficiency; but I do like to see what a loop is up to.
> 
>  list_for_each and friends are very widely used, they're fine, and I'm
>  quite glad to have their prefetching hidden away from me; but usually
>  I groan, grin and bear it, each time someone devises a clever new
>  for_each macro concealing half the details of some specialist loop.
> 
>  In a minority?

Of two.

Let's see what they look like.  They'd need to be very good, IMO.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-21  6:40                 ` Andrew Morton
@ 2005-02-21  7:09                   ` Benjamin Herrenschmidt
  2005-02-21  8:09                     ` Nick Piggin
  0 siblings, 1 reply; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2005-02-21  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, nickpiggin, Andi Kleen, davem, Linus Torvalds,
	Linux Kernel list

On Sun, 2005-02-20 at 22:40 -0800, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> >
> > My opinion FWIW: I'm all for regularizing the pagetable loops to
> >  work the same way, changing their variables to use the same names,
> >  improving their efficiency; but I do like to see what a loop is up to.
> > 
> >  list_for_each and friends are very widely used, they're fine, and I'm
> >  quite glad to have their prefetching hidden away from me; but usually
> >  I groan, grin and bear it, each time someone devises a clever new
> >  for_each macro concealing half the details of some specialist loop.
> > 
> >  In a minority?
> 
> Of two.

Well, we basically have that bunch of loops that all do the same thing
to iterate the page tables. Only the inner part is different (that is
what is done on each PTE).

All of them are implemented slightly differently: some check overflow,
some don't, some have redundant checking, some aren't even consistent
across the 3/4 loops of a given walk routine set, and we have seen the
tendency to introduce subtle bugs in one of them when they all have to
be changed for some reason.

I'm all for turning them into something more consistent, and I like the
for_each_* idea...

It also allows us to completely remove the code for the unused levels on
2- and 3-level page tables easily, regaining some of the performance lost
by the move to 4 levels.

Now, we also need, in the long run, to improve the performance of walking
the page tables, especially PTEs, for things like tearing down processes or
fork, for example via a bitmap of used PGD entries etc...

With proper iterators, such a thing could be implemented just by
modifying the iterator, and all loops would benefit from it.
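
Just to illustrate what I mean (hand-waving sketch only - the
mm->pgd_used_map field and the helper below are made up, nothing
like this exists today):

static unsigned long pgd_skip_empty(struct mm_struct *mm,
				    unsigned long addr, unsigned long end)
{
	unsigned long next;
	unsigned int i;

	/* find the next pgd slot that has ever been populated */
	i = find_next_bit(mm->pgd_used_map, PTRS_PER_PGD, pgd_index(addr));
	if (i >= PTRS_PER_PGD)
		return end;
	next = (unsigned long)i << PGDIR_SHIFT;
	if (next < addr)
		next = addr;
	return next < end ? next : end;
}

The iterator's "advance to the next entry" step would call something
like that, and every loop written in terms of the iterator would skip
empty pgd ranges for free.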

I think that is enough to justify the move.

Ben.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-21  7:09                   ` Benjamin Herrenschmidt
@ 2005-02-21  8:09                     ` Nick Piggin
  2005-02-21  9:04                       ` Nick Piggin
  0 siblings, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-21  8:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Hugh Dickins, Andi Kleen, davem, Linus Torvalds,
	Linux Kernel list

Benjamin Herrenschmidt wrote:

> All of them are slightly differently implemented, some check overflow,
> some don't, some have redudant checking, some aren't even consistent
> between all 3/4 loops of a given walk routine set, and we have seen the
> tendency to introduce subtle bugs in one of them when they all have to
> be changed for some reason.
> 
> I'm all for turning them into something more consistent, and I like the
> for_each_* idea...
> 
> It also allows to completely remove the code of the unused levels on 2
> and 3 level page tables easily, regaining some of the perfs lost by the
> move to 4 levels.
> 

It appears to do even better on 2 levels (i386, !PAE) than the old
3-level code, not surprisingly. lmbench fork+exit overhead is under
100us on a 3.4GHz xeon now, which is the lowest I've seen.

Haven't yet pulled out a pre-4-level kernel to see how 3-level compares.
I guess I'll do that now.

> Now, we also need, in the long run, to improve perfs of walking the page
> tables, especially PTEs, for things like tearing down processes or fork,
> for example via a bitmap of used PGD entries etc... 
> 
> With proper iterators, such a thing could be implemented just by
> modifying the iterator, and all loops would benefit from it.
> 

After looking at David's bitmap walking code, I'm starting to think
that my current macros only _just_ scrape by because of the uniform
nature of the walkers, and their relative simplicity. Anything much
more complex will start to get ugly.

I'd like to look at a slightly more involved reworking in order to
nicely support optimisations like bitmap walking, without blowing out
the complexity of the macros and without hiding too much of the
workings.

However, my main aim for these macros was to fix the
performance regressions on 2- and 3-level architectures. Ben's
complaints about these loops just served to hurry it along. I think
that these reasons (performance, code consistency) make it a good
idea.

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-21  8:09                     ` Nick Piggin
@ 2005-02-21  9:04                       ` Nick Piggin
  0 siblings, 0 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-21  9:04 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Hugh Dickins, Andi Kleen, davem, Linus Torvalds,
	Linux Kernel list

Nick Piggin wrote:

> Haven't yet pulled out a pre-4-level kernel to see how 3-level compares
> I guess I'll do that now.
> 

Close.
Before 4level: 119.5us, after folded walkers: 132.8us

I think most of this is now coming from clear_page_range, rather
than the actual traversing of the page tables (because they should
be completely folded by now):

before:
   4089 total                                      0.0017
    753 kmap_atomic                                4.7358
    682 do_wp_page                                 0.6713
    680 do_page_fault                              0.4561
    261 zap_pte_range                              0.3625
    176 copy_page_range                            0.2133
    159 pte_alloc_one                              1.5743
    145 clear_page_tables                          0.4866

after:
   4307 total                                      0.0018
    676 kmap_atomic                                4.2516
    665 do_page_fault                              0.4472
    615 do_wp_page                                 0.6225
    550 clear_page_range                           0.9982
    262 zap_pte_range                              0.4870

I think the additional work done by clear_page_range (versus
clear_page_tables) justifies the extra cost, even for 3-level
architectures.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-21  6:35               ` Hugh Dickins
  2005-02-21  6:40                 ` Andrew Morton
@ 2005-02-22  9:54                 ` Nick Piggin
  2005-02-23  2:06                   ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-22  9:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

Hugh Dickins wrote:
> On Sun, 20 Feb 2005, Nick Piggin wrote:

>>
>>>Open coding is probably the smaller evil.
>>>And they're really not changed that often.
> 
> 
> My opinion FWIW: I'm all for regularizing the pagetable loops to
> work the same way, changing their variables to use the same names,
> improving their efficiency; but I do like to see what a loop is up to.
> 
> list_for_each and friends are very widely used, they're fine, and I'm
> quite glad to have their prefetching hidden away from me; but usually
> I groan, grin and bear it, each time someone devises a clever new
> for_each macro concealing half the details of some specialist loop.
> 
> In a minority?

OK, I think Andrew is now sitting on the fence after seeing the
code. So you (and Andi?) are the ones with remaining reservations
about this.

I don't disagree with your stance entirely, Hugh. I think these
macros are close to being too complicated... But I don't think
they are hiding too much detail: we all know that conceptually,
walking a page table page is reasonably simple. There are just a
few tricky bits like wrapping and termination that caused such
a divergent range of implementations - I would argue that hiding
these details is OK, because they are basically inconsequential
to the job at hand. I think that actually makes the high-level
intention of the code clearer, if anything.

If you are reading just the patch, that doesn't quite do it
justice IMO - in that case, have a look at the code after the
patch is applied (I can send you one which applies to current
kernels if you'd like).

Also, the implementation of the macros is not insanely difficult
to understand, so the details are still accessible.

Lastly, they fold to 2 and 3 levels easily, which is something
that couldn't sanely be done with the open-coded implementation.
I think with an infinitely smart compiler, there shouldn't need
to be any folding here. But in practice I see quite a large
speedup, which is something we shouldn't ignore.

I do think that they are probably not ideal candidates for a
more general abstraction that would allow, for example, the
transparent drop-in of Dave's bitmap walking functions (it
would be possible, but would not be pretty AFAIKS). I have some
other ideas to work towards those goals, but before that I
think these macros do help with the deficiencies of the current
situation.

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-22  9:54                 ` Nick Piggin
@ 2005-02-23  2:06                   ` Hugh Dickins
  2005-02-23  4:31                     ` David S. Miller
  2005-02-23 23:52                     ` Nick Piggin
  0 siblings, 2 replies; 33+ messages in thread
From: Hugh Dickins @ 2005-02-23  2:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Tue, 22 Feb 2005, Nick Piggin wrote:
> 
> Also, the implementation of the macros is not insanely difficult
> to understand, so the details are still accessible.

True, they're okay; but I do think they've got one argument too
many, and I'd prefer to avoid that extra evaluation of the limit,
which comes from your move from "do while ()" to "for (;;)".

> Lastly, they fold to 2 and 3 levels easily, which is something
> that couldn't sanely be done with the open-coded implementation.
> I think with an infinitely smart compiler, there shouldn't need
> to be any folding here. But in practice I see quite a large
> speedup, which is something we shouldn't ignore.

Ben made the same point, yes, that is the real value of it.
But I rather think the same can be done without for_each_* macros:
compiling with gcc 3.3.4 suggests my *_limit macros may result in
slightly bigger code, but shorter codepath (no extra eval).

> I do think that they are probably not ideal candidates for a
> more general abstraction that would allow for example the
> transparent drop in of Dave's bitmap walking functions (it
> would be possible, but would not be pretty AFAIKS). I have some
> other ideas to work towards those goals, but before that I
> think these macros do help with the deficiencies of the current
> situation.

As I said before, I am keen on regularizing the implementations
and speeding them up; but not so keen on forcing them into the
straitjacket of for_each_* macros instead of open-coded loops.
I've not seen Dave's bitmap walking functions (for clearing?) -
would they fit in better with my way?

I'm off to bed, but since your appetite for looking at patches
is greater than mine, I'll throw what I'm currently testing over
the wall to you now.  Against 2.6.11-rc4-bk9, but my starting point
was obviously your patches.  Not yet split up, but clearly should be.
Includes mm/swapfile.c which you missed.  I'm inlining pmd and pud
levels, but not pte and pgd levels.  No description yet, sorry.
One point worth making: I rely throughout on "end" never being 0,
whatever the address layout - BUG_ON(addr >= end) assures that.

Hugh

--- 2.6.11-rc4-bk9/arch/i386/mm/ioremap.c	2005-02-21 12:03:54.000000000 +0000
+++ linux/arch/i386/mm/ioremap.c	2005-02-22 23:57:48.000000000 +0000
@@ -17,86 +17,85 @@
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
 
-static inline void remap_area_pte(pte_t * pte, unsigned long address, unsigned long size,
-	unsigned long phys_addr, unsigned long flags)
+static inline void remap_area_pte(pmd_t *pmd, unsigned long addr,
+		unsigned long end, unsigned long phys_addr, unsigned long flags)
 {
-	unsigned long end;
 	unsigned long pfn;
+	pte_t *pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-	if (address >= end)
-		BUG();
 	pfn = phys_addr >> PAGE_SHIFT;
+	pte = pte_offset_kernel(pmd, addr);
 	do {
-		if (!pte_none(*pte)) {
-			printk("remap_area_pte: page already exists\n");
-			BUG();
-		}
+		BUG_ON(!pte_none(*pte));
 		set_pte(pte, pfn_pte(pfn, __pgprot(_PAGE_PRESENT | _PAGE_RW | 
 					_PAGE_DIRTY | _PAGE_ACCESSED | flags)));
-		address += PAGE_SIZE;
 		pfn++;
-		pte++;
-	} while (address && (address < end));
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 }
 
-static inline int remap_area_pmd(pmd_t * pmd, unsigned long address, unsigned long size,
-	unsigned long phys_addr, unsigned long flags)
+static inline int remap_area_pmd(pud_t *pud, unsigned long addr,
+		unsigned long end, unsigned long phys_addr, unsigned long flags)
 {
-	unsigned long end;
+	unsigned long next;
+	pmd_t *pmd;
 
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	phys_addr -= address;
-	if (address >= end)
-		BUG();
+	phys_addr -= addr;
+	pmd = pmd_offset(pud, addr);
 	do {
-		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
-		if (!pte)
+		next = pmd_limit(addr, end);
+		if (!pte_alloc_kernel(&init_mm, pmd, addr))
 			return -ENOMEM;
-		remap_area_pte(pte, address, end - address, address + phys_addr, flags);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		remap_area_pte(pmd, addr, next, phys_addr + addr, flags);
+	} while (pmd++, addr = next, addr < end);
 	return 0;
 }
 
-static int remap_area_pages(unsigned long address, unsigned long phys_addr,
+static inline int remap_area_pud(pgd_t *pgd, unsigned long addr,
+		unsigned long end, unsigned long phys_addr, unsigned long flags)
+{
+	unsigned long next;
+	pud_t *pud;
+	int error;
+
+	phys_addr -= addr;
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_limit(addr, end);
+		if (pmd_alloc(&init_mm, pud, addr))
+			error = remap_area_pmd(pud, addr, next,
+						phys_addr + addr, flags);
+		else
+			error = -ENOMEM;
+		if (error)
+			break;
+	} while (pud++, addr = next, addr < end);
+	return error;
+}
+
+static int remap_area_pages(unsigned long addr, unsigned long phys_addr,
 				 unsigned long size, unsigned long flags)
 {
+	unsigned long end = addr + size;
+	unsigned long next;
+	pgd_t *pgd;
 	int error;
-	pgd_t * dir;
-	unsigned long end = address + size;
 
-	phys_addr -= address;
-	dir = pgd_offset(&init_mm, address);
+	BUG_ON(addr >= end);
+
 	flush_cache_all();
-	if (address >= end)
-		BUG();
+	phys_addr -= addr;
+	pgd = pgd_offset_k(addr);
 	spin_lock(&init_mm.page_table_lock);
 	do {
-		pud_t *pud;
-		pmd_t *pmd;
-		
-		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
-		if (!pud)
-			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
-		if (!pmd)
-			break;
-		if (remap_area_pmd(pmd, address, end - address,
-					 phys_addr + address, flags))
+		next = pgd_limit(addr, end);
+		if (pud_alloc(&init_mm, pgd, addr))
+			error = remap_area_pud(pgd, addr, next,
+						phys_addr + addr, flags);
+		else
+			error = -ENOMEM;
+		if (error)
 			break;
-		error = 0;
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+	} while (pgd++, addr = next, addr < end);
 	spin_unlock(&init_mm.page_table_lock);
 	flush_tlb_all();
 	return error;
--- 2.6.11-rc4-bk9/include/asm-generic/pgtable-nopmd.h	2005-02-21 12:04:08.000000000 +0000
+++ linux/include/asm-generic/pgtable-nopmd.h	2005-02-22 23:26:44.000000000 +0000
@@ -5,6 +5,8 @@
 
 #include <asm-generic/pgtable-nopud.h>
 
+#define __PAGETABLE_PMD_FOLDED
+
 /*
  * Having the pmd type consist of a pud gets the size right, and allows
  * us to conceptually access the pud entry that this pmd is folded into
@@ -55,6 +57,9 @@ static inline pmd_t * pmd_offset(pud_t *
 #define pmd_free(x)				do { } while (0)
 #define __pmd_free_tlb(tlb, x)			do { } while (0)
 
+#undef  pmd_limit
+#define pmd_limit(addr, end)			(end)
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _PGTABLE_NOPMD_H */
--- 2.6.11-rc4-bk9/include/asm-generic/pgtable-nopud.h	2005-02-21 12:04:08.000000000 +0000
+++ linux/include/asm-generic/pgtable-nopud.h	2005-02-22 23:25:42.000000000 +0000
@@ -3,6 +3,8 @@
 
 #ifndef __ASSEMBLY__
 
+#define __PAGETABLE_PUD_FOLDED
+
 /*
  * Having the pud type consist of a pgd gets the size right, and allows
  * us to conceptually access the pgd entry that this pud is folded into
@@ -52,5 +54,8 @@ static inline pud_t * pud_offset(pgd_t *
 #define pud_free(x)				do { } while (0)
 #define __pud_free_tlb(tlb, x)			do { } while (0)
 
+#undef  pud_limit
+#define pud_limit(addr, end)			(end)
+
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
--- 2.6.11-rc4-bk9/include/asm-generic/pgtable.h	2004-10-18 22:56:28.000000000 +0100
+++ linux/include/asm-generic/pgtable.h	2005-02-22 23:42:53.000000000 +0000
@@ -134,4 +134,23 @@ static inline void ptep_mkdirty(pte_t *p
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif
 
+#ifndef pmd_limit
+#define pmd_limit(addr, end)						\
+({	unsigned long __limit = ((addr) + PMD_SIZE) & PMD_MASK;		\
+	(__limit <= (end) && __limit)? __limit: (end);			\
+})
+#endif
+
+#ifndef pud_limit
+#define pud_limit(addr, end)						\
+({	unsigned long __limit = ((addr) + PUD_SIZE) & PUD_MASK;		\
+	(__limit <= (end) && __limit)? __limit: (end);			\
+})
+#endif
+
+#define pgd_limit(addr, end)						\
+({	unsigned long __limit = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
+	(__limit <= (end) && __limit)? __limit: (end);			\
+})
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
--- 2.6.11-rc4-bk9/mm/memory.c	2005-02-21 12:04:51.000000000 +0000
+++ linux/mm/memory.c	2005-02-22 23:19:52.000000000 +0000
@@ -87,20 +87,12 @@ EXPORT_SYMBOL(vmalloc_earlyreserve);
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void clear_pmd_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long start, unsigned long end)
+static inline void clear_pmd_range(struct mmu_gather *tlb,
+		pmd_t *pmd, unsigned long addr, unsigned long end)
 {
-	struct page *page;
-
-	if (pmd_none(*pmd))
-		return;
-	if (unlikely(pmd_bad(*pmd))) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-	if (!((start | end) & ~PMD_MASK)) {
+	if (!((addr | end) & ~PMD_MASK)) {
 		/* Only clear full, aligned ranges */
-		page = pmd_page(*pmd);
+		struct page *page = pmd_page(*pmd);
 		pmd_clear(pmd);
 		dec_page_state(nr_page_table_pages);
 		tlb->mm->nr_ptes--;
@@ -108,66 +100,57 @@ static inline void clear_pmd_range(struc
 	}
 }
 
-static inline void clear_pud_range(struct mmu_gather *tlb, pud_t *pud, unsigned long start, unsigned long end)
+static inline void clear_pud_range(struct mmu_gather *tlb,
+		pud_t *pud, unsigned long addr, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pmd_t *pmd, *__pmd;
-
-	if (pud_none(*pud))
-		return;
-	if (unlikely(pud_bad(*pud))) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
+	unsigned long start = addr;
+	unsigned long next;
+	pmd_t *pmd, *start_pmd;
 
-	pmd = __pmd = pmd_offset(pud, start);
+	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = (addr + PMD_SIZE) & PMD_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (unlikely(pmd_bad(*pmd))) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
 		clear_pmd_range(tlb, pmd, addr, next);
-		pmd++;
-		addr = next;
-	} while (addr && (addr < end));
+	} while (pmd++, addr = next, addr < end);
 
 	if (!((start | end) & ~PUD_MASK)) {
 		/* Only clear full, aligned ranges */
 		pud_clear(pud);
-		pmd_free_tlb(tlb, __pmd);
+		pmd_free_tlb(tlb, start_pmd);
 	}
 }
 
-
-static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
+static inline void clear_pgd_range(struct mmu_gather *tlb,
+		pgd_t *pgd, unsigned long addr, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pud_t *pud, *__pud;
-
-	if (pgd_none(*pgd))
-		return;
-	if (unlikely(pgd_bad(*pgd))) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
+	unsigned long start = addr;
+	unsigned long next;
+	pud_t *pud, *start_pud;
 
-	pud = __pud = pud_offset(pgd, start);
+	start_pud = pud = pud_offset(pgd, addr);
 	do {
-		next = (addr + PUD_SIZE) & PUD_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (unlikely(pud_bad(*pud))) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
 		clear_pud_range(tlb, pud, addr, next);
-		pud++;
-		addr = next;
-	} while (addr && (addr < end));
+	} while (pud++, addr = next, addr < end);
 
 	if (!((start | end) & ~PGDIR_MASK)) {
 		/* Only clear full, aligned ranges */
 		pgd_clear(pgd);
-		pud_free_tlb(tlb, __pud);
+		pud_free_tlb(tlb, start_pud);
 	}
 }
 
@@ -176,47 +159,60 @@ static inline void clear_pgd_range(struc
  *
  * Must be called with pagetable lock held.
  */
-void clear_page_range(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+void clear_page_range(struct mmu_gather *tlb,
+		unsigned long addr, unsigned long end)
 {
-	unsigned long addr = start, next;
-	pgd_t * pgd = pgd_offset(tlb->mm, start);
-	unsigned long i;
-
-	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
-		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= addr)
-			next = end;
-		
+	unsigned long next;
+	pgd_t *pgd;
+
+	BUG_ON(addr >= end);
+
+	pgd = pgd_offset(tlb->mm, addr);
+	do {
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (unlikely(pgd_bad(*pgd))) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
 		clear_pgd_range(tlb, pgd, addr, next);
-		pgd++;
-		addr = next;
-	}
+	} while (pgd++, addr = next, addr < end);
 }
 
-pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+static int pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
-	if (!pmd_present(*pmd)) {
-		struct page *new;
+	struct page *new;
 
-		spin_unlock(&mm->page_table_lock);
-		new = pte_alloc_one(mm, address);
-		spin_lock(&mm->page_table_lock);
-		if (!new)
-			return NULL;
-		/*
-		 * Because we dropped the lock, we should re-check the
-		 * entry, as somebody else could have populated it..
-		 */
-		if (pmd_present(*pmd)) {
-			pte_free(new);
-			goto out;
-		}
-		mm->nr_ptes++;
-		inc_page_state(nr_page_table_pages);
-		pmd_populate(mm, pmd, new);
+	if (pmd_present(*pmd))
+		return 1;
+
+	spin_unlock(&mm->page_table_lock);
+	new = pte_alloc_one(mm, address);
+	spin_lock(&mm->page_table_lock);
+	if (!new)
+		return 0;
+	/*
+	 * Because we dropped the lock, we should re-check the
+	 * entry, as somebody else could have populated it..
+	 */
+	if (pmd_present(*pmd)) {
+		pte_free(new);
+		return 1;
 	}
-out:
-	return pte_offset_map(pmd, address);
+	mm->nr_ptes++;
+	inc_page_state(nr_page_table_pages);
+	pmd_populate(mm, pmd, new);
+
+	return 1;
+}
+
+pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+{
+	if (pte_alloc(mm, pmd, address))
+		return pte_offset_map(pmd, address);
+	return NULL;
 }
 
 pte_t fastcall * pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
@@ -253,23 +249,8 @@ out:
  * but may be dropped within p[mg]d_alloc() and pte_alloc_map().
  */
 
-static inline void
-copy_swap_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t pte)
-{
-	if (pte_file(pte))
-		return;
-	swap_duplicate(pte_to_swp_entry(pte));
-	if (list_empty(&dst_mm->mmlist)) {
-		spin_lock(&mmlist_lock);
-		list_add(&dst_mm->mmlist, &src_mm->mmlist);
-		spin_unlock(&mmlist_lock);
-	}
-}
-
-static inline void
-copy_one_pte(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, unsigned long vm_flags,
-		unsigned long addr)
+static inline int copy_one_pte(pte_t *dst_pte, pte_t *src_pte,
+				unsigned long vm_flags)
 {
 	pte_t pte = *src_pte;
 	struct page *page;
@@ -277,10 +258,13 @@ copy_one_pte(struct mm_struct *dst_mm,  
 
 	/* pte contains position in swap, so copy. */
 	if (!pte_present(pte)) {
-		copy_swap_pte(dst_mm, src_mm, pte);
 		set_pte(dst_pte, pte);
-		return;
+		if (pte_file(pte))
+			return 0;	/* no special action */
+		swap_duplicate(pte_to_swp_entry(pte));
+		return 1;		/* check swapoff's mmlist */
 	}
+
 	pfn = pte_pfn(pte);
 	/* the pte points outside of valid memory, the
 	 * mapping is assumed to be good, meaningful
@@ -293,7 +277,7 @@ copy_one_pte(struct mm_struct *dst_mm,  
 
 	if (!page || PageReserved(page)) {
 		set_pte(dst_pte, pte);
-		return;
+		return 0;		/* no special action */
 	}
 
 	/*
@@ -306,63 +290,71 @@ copy_one_pte(struct mm_struct *dst_mm,  
 	}
 
 	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
+	 * If it's a shared mapping, mark it clean in the child
 	 */
 	if (vm_flags & VM_SHARED)
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 	get_page(page);
-	dst_mm->rss++;
-	if (PageAnon(page))
-		dst_mm->anon_rss++;
 	set_pte(dst_pte, pte);
 	page_dup_rmap(page);
+
+	return 2 + !!PageAnon(page);	/* count rss and anon_rss */
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long vm_flags,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
-	pte_t *s, *d;
-	unsigned long vm_flags = vma->vm_flags;
+	int count[4] = {0, 0, 0, 0};
 
-	d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
+	dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
 	if (!dst_pte)
 		return -ENOMEM;
 
 	spin_lock(&src_mm->page_table_lock);
-	s = src_pte = pte_offset_map_nested(src_pmd, addr);
-	for (; addr < end; addr += PAGE_SIZE, s++, d++) {
-		if (pte_none(*s))
-			continue;
-		copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
-	}
-	pte_unmap_nested(src_pte);
-	pte_unmap(dst_pte);
+	src_pte = pte_offset_map_nested(src_pmd, addr);
+
+	do {
+		if (!pte_none(*src_pte))
+			count[copy_one_pte(dst_pte, src_pte, vm_flags)]++;
+	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr < end);
+
+	pte_unmap_nested(src_pte - 1);
 	spin_unlock(&src_mm->page_table_lock);
+
+	/* Make sure dst_mm is on mmlist if it has any swap */
+	if (count[1] && list_empty(&dst_mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		list_add(&dst_mm->mmlist, &src_mm->mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+	if (count[2] += count[3])
+		dst_mm->rss += count[2];
+	if (count[3])
+		dst_mm->anon_rss += count[3];
+
+	pte_unmap(dst_pte - 1);
 	cond_resched_lock(&dst_mm->page_table_lock);
 	return 0;
 }
 
-static int copy_pmd_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pmd_range(struct mm_struct *dst_mm,
+		struct mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud,
+		unsigned long vm_flags, unsigned long addr, unsigned long end)
 {
+	unsigned long next;
 	pmd_t *src_pmd, *dst_pmd;
 	int err = 0;
-	unsigned long next;
 
-	src_pmd = pmd_offset(src_pud, addr);
 	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
 	if (!dst_pmd)
 		return -ENOMEM;
 
-	for (; addr < end; addr = next, src_pmd++, dst_pmd++) {
-		next = (addr + PMD_SIZE) & PMD_MASK;
-		if (next > end || next <= addr)
-			next = end;
+	src_pmd = pmd_offset(src_pud, addr);
+	do {
+		next = pmd_limit(addr, end);
 		if (pmd_none(*src_pmd))
 			continue;
 		if (pmd_bad(*src_pmd)) {
@@ -371,30 +363,28 @@ static int copy_pmd_range(struct mm_stru
 			continue;
 		}
 		err = copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-							vma, addr, next);
-		if (err)
+						vm_flags, addr, next);
+		if (unlikely(err))
 			break;
-	}
+	} while (dst_pmd++, src_pmd++, addr = next, addr < end);
 	return err;
 }
 
-static int copy_pud_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pud_range(struct mm_struct *dst_mm,
+		struct mm_struct *src_mm, pgd_t *dst_pgd, pgd_t *src_pgd,
+		unsigned long vm_flags, unsigned long addr, unsigned long end)
 {
+	unsigned long next;
 	pud_t *src_pud, *dst_pud;
 	int err = 0;
-	unsigned long next;
 
-	src_pud = pud_offset(src_pgd, addr);
 	dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
 	if (!dst_pud)
 		return -ENOMEM;
 
-	for (; addr < end; addr = next, src_pud++, dst_pud++) {
-		next = (addr + PUD_SIZE) & PUD_MASK;
-		if (next > end || next <= addr)
-			next = end;
+	src_pud = pud_offset(src_pgd, addr);
+	do {
+		next = pud_limit(addr, end);
 		if (pud_none(*src_pud))
 			continue;
 		if (pud_bad(*src_pud)) {
@@ -403,82 +393,58 @@ static int copy_pud_range(struct mm_stru
 			continue;
 		}
 		err = copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-							vma, addr, next);
-		if (err)
+						vm_flags, addr, next);
+		if (unlikely(err))
 			break;
-	}
+	} while (dst_pud++, src_pud++, addr = next, addr < end);
 	return err;
 }
 
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
+int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		struct vm_area_struct *vma)
 {
+	unsigned long addr, end, next;
 	pgd_t *src_pgd, *dst_pgd;
-	unsigned long addr, start, end, next;
 	int err = 0;
 
 	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst, src, vma);
-
-	start = vma->vm_start;
-	src_pgd = pgd_offset(src, start);
-	dst_pgd = pgd_offset(dst, start);
+		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	addr = vma->vm_start;
 	end = vma->vm_end;
-	addr = start;
-	while (addr && (addr < end-1)) {
-		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= addr)
-			next = end;
+	dst_pgd = pgd_offset(dst_mm, addr);
+	src_pgd = pgd_offset(src_mm, addr);
+	do {
+		next = pgd_limit(addr, end);
 		if (pgd_none(*src_pgd))
-			goto next_pgd;
+			continue;
 		if (pgd_bad(*src_pgd)) {
 			pgd_ERROR(*src_pgd);
 			pgd_clear(src_pgd);
-			goto next_pgd;
+			continue;
 		}
-		err = copy_pud_range(dst, src, dst_pgd, src_pgd,
-							vma, addr, next);
-		if (err)
+		err = copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+					vma->vm_flags, addr, next);
+		if (unlikely(err))
 			break;
-
-next_pgd:
-		src_pgd++;
-		dst_pgd++;
-		addr = next;
-	}
-
+	} while (dst_pgd++, src_pgd++, addr = next, addr < end);
 	return err;
 }
 
 static void zap_pte_range(struct mmu_gather *tlb,
-		pmd_t *pmd, unsigned long address,
-		unsigned long size, struct zap_details *details)
+		pmd_t *pmd, unsigned long addr, unsigned long end,
+		struct zap_details *details)
 {
-	unsigned long offset;
-	pte_t *ptep;
+	pte_t *pte;
 
-	if (pmd_none(*pmd))
-		return;
-	if (unlikely(pmd_bad(*pmd))) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-	ptep = pte_offset_map(pmd, address);
-	offset = address & ~PMD_MASK;
-	if (offset + size > PMD_SIZE)
-		size = PMD_SIZE - offset;
-	size &= PAGE_MASK;
-	if (details && !details->check_mapping && !details->nonlinear_vma)
-		details = NULL;
-	for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
-		pte_t pte = *ptep;
-		if (pte_none(pte))
+	pte = pte_offset_map(pmd, addr);
+	do {
+		pte_t entry = *pte;
+		if (pte_none(entry))
 			continue;
-		if (pte_present(pte)) {
+		if (pte_present(entry)) {
 			struct page *page = NULL;
-			unsigned long pfn = pte_pfn(pte);
+			unsigned long pfn = pte_pfn(entry);
 			if (pfn_valid(pfn)) {
 				page = pfn_to_page(pfn);
 				if (PageReserved(page))
@@ -502,19 +468,19 @@ static void zap_pte_range(struct mmu_gat
 				     page->index > details->last_index))
 					continue;
 			}
-			pte = ptep_get_and_clear(ptep);
-			tlb_remove_tlb_entry(tlb, ptep, address+offset);
+			entry = ptep_get_and_clear(pte);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
-					address+offset) != page->index)
-				set_pte(ptep, pgoff_to_pte(page->index));
-			if (pte_dirty(pte))
+					addr) != page->index)
+				set_pte(pte, pgoff_to_pte(page->index));
+			if (pte_dirty(entry))
 				set_page_dirty(page);
 			if (PageAnon(page))
 				tlb->mm->anon_rss--;
-			else if (pte_young(pte))
+			else if (pte_young(entry))
 				mark_page_accessed(page);
 			tlb->freed++;
 			page_remove_rmap(page);
@@ -527,78 +493,79 @@ static void zap_pte_range(struct mmu_gat
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(pte))
-			free_swap_and_cache(pte_to_swp_entry(pte));
-		pte_clear(ptep);
-	}
-	pte_unmap(ptep-1);
+		if (!pte_file(entry))
+			free_swap_and_cache(pte_to_swp_entry(entry));
+		pte_clear(pte);
+	} while (pte++, addr += PAGE_SIZE, addr < end);
+	pte_unmap(pte - 1);
 }
 
-static void zap_pmd_range(struct mmu_gather *tlb,
-		pud_t *pud, unsigned long address,
-		unsigned long size, struct zap_details *details)
+static inline void zap_pmd_range(struct mmu_gather *tlb,
+		pud_t *pud, unsigned long addr, unsigned long end,
+		struct zap_details *details)
 {
-	pmd_t * pmd;
-	unsigned long end;
+	unsigned long next;
+	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return;
-	if (unlikely(pud_bad(*pud))) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
-	pmd = pmd_offset(pud, address);
-	end = address + size;
-	if (end > ((address + PUD_SIZE) & PUD_MASK))
-		end = ((address + PUD_SIZE) & PUD_MASK);
+	pmd = pmd_offset(pud, addr);
 	do {
-		zap_pte_range(tlb, pmd, address, end - address, details);
-		address = (address + PMD_SIZE) & PMD_MASK; 
-		pmd++;
-	} while (address && (address < end));
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (unlikely(pmd_bad(*pmd))) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+		zap_pte_range(tlb, pmd, addr, next, details);
+	} while (pmd++, addr = next, addr < end);
 }
 
-static void zap_pud_range(struct mmu_gather *tlb,
-		pgd_t * pgd, unsigned long address,
-		unsigned long end, struct zap_details *details)
+static inline void zap_pud_range(struct mmu_gather *tlb,
+		pgd_t *pgd, unsigned long addr, unsigned long end,
+		struct zap_details *details)
 {
-	pud_t * pud;
+	unsigned long next;
+	pud_t *pud;
 
-	if (pgd_none(*pgd))
-		return;
-	if (unlikely(pgd_bad(*pgd))) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
-	pud = pud_offset(pgd, address);
+	pud = pud_offset(pgd, addr);
 	do {
-		zap_pmd_range(tlb, pud, address, end - address, details);
-		address = (address + PUD_SIZE) & PUD_MASK; 
-		pud++;
-	} while (address && (address < end));
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (unlikely(pud_bad(*pud))) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+		zap_pmd_range(tlb, pud, addr, next, details);
+	} while (pud++, addr = next, addr < end);
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, unsigned long address,
+static void zap_pgd_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr,
 		unsigned long end, struct zap_details *details)
 {
 	unsigned long next;
 	pgd_t *pgd;
-	int i;
 
-	BUG_ON(address >= end);
-	pgd = pgd_offset(vma->vm_mm, address);
+	BUG_ON(addr >= end);
+	if (details && !details->check_mapping && !details->nonlinear_vma)
+		details = NULL;
+
 	tlb_start_vma(tlb, vma);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		zap_pud_range(tlb, pgd, address, next, details);
-		address = next;
-		pgd++;
-	}
+	pgd = pgd_offset(vma->vm_mm, addr);
+	do {
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (unlikely(pgd_bad(*pgd))) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+		zap_pud_range(tlb, pgd, addr, next, details);
+	} while (pgd++, addr = next, addr < end);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -676,7 +643,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
 				unmap_hugepage_range(vma, start, end);
 			} else {
 				block = min(zap_bytes, end - start);
-				unmap_page_range(*tlbp, vma, start,
+				zap_pgd_range(*tlbp, vma, start,
 						start + block, details);
 			}
 
@@ -987,109 +954,80 @@ out:
 
 EXPORT_SYMBOL(get_user_pages);
 
-static void zeromap_pte_range(pte_t * pte, unsigned long address,
-                                     unsigned long size, pgprot_t prot)
+static void zeromap_pte_range(pmd_t *pmd, unsigned long addr,
+		unsigned long end, pgprot_t prot)
 {
-	unsigned long end;
+	pte_t *pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
+	pte = pte_offset_map(pmd, addr);
 	do {
-		pte_t zero_pte = pte_wrprotect(mk_pte(ZERO_PAGE(address), prot));
+		pte_t zero_pte = pte_wrprotect(mk_pte(ZERO_PAGE(addr), prot));
 		BUG_ON(!pte_none(*pte));
 		set_pte(pte, zero_pte);
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
+	} while (pte++, addr += PAGE_SIZE, addr < end);
+	pte_unmap(pte - 1);
 }
 
-static inline int zeromap_pmd_range(struct mm_struct *mm, pmd_t * pmd,
-		unsigned long address, unsigned long size, pgprot_t prot)
+static inline int zeromap_pmd_range(struct mm_struct *mm, pud_t *pud,
+		unsigned long addr, unsigned long end, pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long next;
+	pmd_t *pmd;
 
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
+	pmd = pmd_offset(pud, addr);
 	do {
-		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
-		if (!pte)
+		next = pmd_limit(addr, end);
+		if (!pte_alloc(mm, pmd, addr))
 			return -ENOMEM;
-		zeromap_pte_range(pte, base + address, end - address, prot);
-		pte_unmap(pte);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		zeromap_pte_range(pmd, addr, next, prot);
+	} while (pmd++, addr = next, addr < end);
 	return 0;
 }
 
-static inline int zeromap_pud_range(struct mm_struct *mm, pud_t * pud,
-				    unsigned long address,
-                                    unsigned long size, pgprot_t prot)
-{
-	unsigned long base, end;
-	int error = 0;
-
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+static inline int zeromap_pud_range(struct mm_struct *mm, pgd_t *pgd,
+		unsigned long addr, unsigned long end, pgprot_t prot)
+{
+	unsigned long next;
+	pud_t *pud;
+	int error;
+
+	pud = pud_offset(pgd, addr);
 	do {
-		pmd_t * pmd = pmd_alloc(mm, pud, base + address);
-		error = -ENOMEM;
-		if (!pmd)
-			break;
-		error = zeromap_pmd_range(mm, pmd, base + address,
-					  end - address, prot);
+		next = pud_limit(addr, end);
+		if (pmd_alloc(mm, pud, addr))
+			error = zeromap_pmd_range(mm, pud, addr, next, prot);
+		else
+			error = -ENOMEM;
 		if (error)
 			break;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
-	return 0;
+	} while (pud++, addr = next, addr < end);
+	return error;
 }
 
-int zeromap_page_range(struct vm_area_struct *vma, unsigned long address,
-					unsigned long size, pgprot_t prot)
+int zeromap_page_range(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long size, pgprot_t prot)
 {
-	int i;
-	int error = 0;
-	pgd_t * pgd;
-	unsigned long beg = address;
-	unsigned long end = address + size;
-	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long end = addr + size;
+	unsigned long next;
+	pgd_t *pgd;
+	int error;
 
-	pgd = pgd_offset(mm, address);
-	flush_cache_range(vma, beg, end);
-	BUG_ON(address >= end);
+	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 
+	pgd = pgd_offset(mm, addr);
+	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(mm, pgd, address);
-		error = -ENOMEM;
-		if (!pud)
-			break;
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= beg || next > end)
-			next = end;
-		error = zeromap_pud_range(mm, pud, address,
-						next - address, prot);
+	do {
+		next = pgd_limit(addr, end);
+		if (pud_alloc(mm, pgd, addr))
+			error = zeromap_pud_range(mm, pgd, addr, next, prot);
+		else
+			error = -ENOMEM;
 		if (error)
 			break;
-		address = next;
-		pgd++;
-	}
-	/*
-	 * Why flush? zeromap_pte_range has a BUG_ON for !pte_none()
-	 */
-	flush_tlb_range(vma, beg, end);
+	} while (pgd++, addr = next, addr < end);
 	spin_unlock(&mm->page_table_lock);
 	return error;
 }
@@ -1099,95 +1037,74 @@ int zeromap_page_range(struct vm_area_st
  * mappings are removed. any references to nonexistent pages results
  * in null mappings (currently treated as "copy-on-access")
  */
-static inline void
-remap_pte_range(pte_t * pte, unsigned long address, unsigned long size,
+static void remap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		unsigned long pfn, pgprot_t prot)
 {
-	unsigned long end;
+	pte_t *pte;
 
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
+	pte = pte_offset_map(pmd, addr);
 	do {
 		BUG_ON(!pte_none(*pte));
 		if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
  			set_pte(pte, pfn_pte(pfn, prot));
-		address += PAGE_SIZE;
 		pfn++;
-		pte++;
-	} while (address && (address < end));
+	} while (pte++, addr += PAGE_SIZE, addr < end);
+	pte_unmap(pte - 1);
 }
 
-static inline int
-remap_pmd_range(struct mm_struct *mm, pmd_t * pmd, unsigned long address,
-		unsigned long size, unsigned long pfn, pgprot_t prot)
+static inline int remap_pmd_range(struct mm_struct *mm,
+		pud_t *pud, unsigned long addr, unsigned long end,
+		unsigned long pfn, pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long next;
+	pmd_t *pmd;
 
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-	pfn -= (address >> PAGE_SHIFT);
+	pfn -= addr >> PAGE_SHIFT;
+	pmd = pmd_offset(pud, addr);
 	do {
-		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
-		if (!pte)
+		next = pmd_limit(addr, end);
+		if (!pte_alloc(mm, pmd, addr))
 			return -ENOMEM;
-		remap_pte_range(pte, base + address, end - address,
-				(address >> PAGE_SHIFT) + pfn, prot);
-		pte_unmap(pte);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		remap_pte_range(pmd, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+	} while (pmd++, addr = next, addr < end);
 	return 0;
 }
 
-static inline int remap_pud_range(struct mm_struct *mm, pud_t * pud,
-				  unsigned long address, unsigned long size,
-				  unsigned long pfn, pgprot_t prot)
+static inline int remap_pud_range(struct mm_struct *mm,
+		pgd_t *pgd, unsigned long addr, unsigned long end,
+		unsigned long pfn, pgprot_t prot)
 {
-	unsigned long base, end;
+	unsigned long next;
+	pud_t *pud;
 	int error;
 
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	pfn -= address >> PAGE_SHIFT;
+	pfn -= addr >> PAGE_SHIFT;
+	pud = pud_offset(pgd, addr);
 	do {
-		pmd_t *pmd = pmd_alloc(mm, pud, base+address);
-		error = -ENOMEM;
-		if (!pmd)
-			break;
-		error = remap_pmd_range(mm, pmd, base + address, end - address,
-				(address >> PAGE_SHIFT) + pfn, prot);
+		next = pud_limit(addr, end);
+		if (pmd_alloc(mm, pud, addr))
+			error = remap_pmd_range(mm, pud, addr, next,
+					pfn + (addr >> PAGE_SHIFT), prot);
+		else
+			error = -ENOMEM;
 		if (error)
 			break;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
+	} while (pud++, addr = next, addr < end);
 	return error;
 }
 
 /*  Note: this is only safe if the mm semaphore is held when called. */
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-	int error = 0;
-	pgd_t *pgd;
-	unsigned long beg = from;
-	unsigned long end = from + size;
-	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
-	int i;
+	unsigned long next;
+	unsigned long end = addr + size;
+	pgd_t *pgd;
+	int error;
 
-	pfn -= from >> PAGE_SHIFT;
-	pgd = pgd_offset(mm, from);
-	flush_cache_range(vma, beg, end);
-	BUG_ON(from >= end);
+	BUG_ON(addr >= end);
 
 	/*
 	 * Physically remapped pages are special. Tell the
@@ -1199,28 +1116,21 @@ int remap_pfn_range(struct vm_area_struc
 	 */
 	vma->vm_flags |= VM_IO | VM_RESERVED;
 
+	pfn -= addr >> PAGE_SHIFT;
+	flush_cache_range(vma, addr, end);
+	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(beg); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(mm, pgd, from);
-		error = -ENOMEM;
-		if (!pud)
-			break;
-		next = (from + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || next <= from)
-			next = end;
-		error = remap_pud_range(mm, pud, from, end - from,
-					pfn + (from >> PAGE_SHIFT), prot);
+	do {
+		next = pgd_limit(addr, end);
+		if (pud_alloc(mm, pgd, addr))
+			error = remap_pud_range(mm, pgd, addr, next,
+					pfn + (addr >> PAGE_SHIFT), prot);
+		else
+			error = -ENOMEM;
 		if (error)
 			break;
-		from = next;
-		pgd++;
-	}
-	/*
-	 * Why flush? remap_pte_range has a BUG_ON for !pte_none()
-	 */
-	flush_tlb_range(vma, beg, end);
+	} while (pgd++, addr = next, addr < end);
 	spin_unlock(&mm->page_table_lock);
-
 	return error;
 }
 
@@ -2100,6 +2010,8 @@ int handle_mm_fault(struct mm_struct *mm
 }
 
 #ifndef __ARCH_HAS_4LEVEL_HACK
+
+#ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
  *
@@ -2131,7 +2043,9 @@ pud_t fastcall *__pud_alloc(struct mm_st
  out:
 	return pud_offset(pgd, address);
 }
+#endif /* __PAGETABLE_PUD_FOLDED */
 
+#ifndef __PAGETABLE_PMD_FOLDED
 /*
  * Allocate page middle directory.
  *
@@ -2163,7 +2077,9 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
  out:
 	return pmd_offset(pud, address);
 }
-#else
+#endif /* __PAGETABLE_PMD_FOLDED */
+
+#else /* __ARCH_HAS_4LEVEL_HACK */
 pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 {
 	pmd_t *new;
--- 2.6.11-rc4-bk9/mm/mprotect.c	2005-02-21 12:04:11.000000000 +0000
+++ linux/mm/mprotect.c	2005-02-22 20:01:08.000000000 +0000
@@ -25,25 +25,12 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-static inline void
-change_pte_range(pmd_t *pmd, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+static void change_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, pgprot_t newprot)
 {
-	pte_t * pte;
-	unsigned long end;
+	pte_t *pte;
 
-	if (pmd_none(*pmd))
-		return;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-	pte = pte_offset_map(pmd, address);
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
+	pte = pte_offset_map(pmd, addr);
 	do {
 		if (pte_present(*pte)) {
 			pte_t entry;
@@ -55,86 +42,73 @@ change_pte_range(pmd_t *pmd, unsigned lo
 			entry = ptep_get_and_clear(pte);
 			set_pte(pte, pte_modify(entry, newprot));
 		}
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 	pte_unmap(pte - 1);
 }
 
-static inline void
-change_pmd_range(pud_t *pud, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+static inline void change_pmd_range(pud_t *pud, unsigned long addr,
+				unsigned long end, pgprot_t newprot)
 {
-	pmd_t * pmd;
-	unsigned long end;
+	unsigned long next;
+	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
-	pmd = pmd_offset(pud, address);
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
+	pmd = pmd_offset(pud, addr);
 	do {
-		change_pte_range(pmd, address, end - address, newprot);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+		change_pte_range(pmd, addr, next, newprot);
+	} while (pmd++, addr = next, addr < end);
 }
 
-static inline void
-change_pud_range(pgd_t *pgd, unsigned long address,
-		unsigned long size, pgprot_t newprot)
+static inline void change_pud_range(pgd_t *pgd, unsigned long addr,
+				unsigned long end, pgprot_t newprot)
 {
-	pud_t * pud;
-	unsigned long end;
+	unsigned long next;
+	pud_t *pud;
 
-	if (pgd_none(*pgd))
-		return;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
-	pud = pud_offset(pgd, address);
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	pud = pud_offset(pgd, addr);
 	do {
-		change_pmd_range(pud, address, end - address, newprot);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+		change_pmd_range(pud, addr, next, newprot);
+	} while (pud++, addr = next, addr < end);
 }
 
-static void
-change_protection(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, pgprot_t newprot)
+static void change_protection(struct vm_area_struct *vma, unsigned long addr,
+				unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = current->mm;
+	unsigned long start = addr;
+	unsigned long next;
 	pgd_t *pgd;
-	unsigned long beg = start, next;
-	int i;
 
-	pgd = pgd_offset(mm, start);
-	flush_cache_range(vma, beg, end);
-	BUG_ON(start >= end);
+	pgd = pgd_offset(mm, addr);
+	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
-	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
-		next = (start + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= start || next > end)
-			next = end;
-		change_pud_range(pgd, start, next - start, newprot);
-		start = next;
-		pgd++;
-	}
-	flush_tlb_range(vma, beg, end);
+	do {
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+		change_pud_range(pgd, addr, next, newprot);
+	} while (pgd++, addr = next, addr < end);
+	flush_tlb_range(vma, start, end);
 	spin_unlock(&mm->page_table_lock);
 }
 
--- 2.6.11-rc4-bk9/mm/msync.c	2005-02-21 12:04:11.000000000 +0000
+++ linux/mm/msync.c	2005-02-22 19:59:51.000000000 +0000
@@ -21,170 +21,125 @@
  * Called with mm->page_table_lock held to protect against other
  * threads/the swapper from ripping pte's out from under us.
  */
-static int filemap_sync_pte(pte_t *ptep, struct vm_area_struct *vma,
-	unsigned long address, unsigned int flags)
-{
-	pte_t pte = *ptep;
-	unsigned long pfn = pte_pfn(pte);
-	struct page *page;
-
-	if (pte_present(pte) && pfn_valid(pfn)) {
-		page = pfn_to_page(pfn);
-		if (!PageReserved(page) &&
-		    (ptep_clear_flush_dirty(vma, address, ptep) ||
-		     page_test_and_clear_dirty(page)))
-			set_page_dirty(page);
-	}
-	return 0;
-}
 
-static int filemap_sync_pte_range(pmd_t * pmd,
-	unsigned long address, unsigned long end, 
-	struct vm_area_struct *vma, unsigned int flags)
+static void sync_pte_range(pmd_t *pmd,
+		unsigned long addr, unsigned long end, 
+		struct vm_area_struct *vma, unsigned int flags)
 {
 	pte_t *pte;
-	int error;
 
-	if (pmd_none(*pmd))
-		return 0;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return 0;
-	}
-	pte = pte_offset_map(pmd, address);
-	if ((address & PMD_MASK) != (end & PMD_MASK))
-		end = (address & PMD_MASK) + PMD_SIZE;
-	error = 0;
+	pte = pte_offset_map(pmd, addr);
 	do {
-		error |= filemap_sync_pte(pte, vma, address, flags);
-		address += PAGE_SIZE;
-		pte++;
-	} while (address && (address < end));
-
+		pte_t entry = *pte;
+		unsigned long pfn;
+		struct page *page;
+
+		if (!pte_present(entry))
+			continue;
+		pfn = pte_pfn(entry);
+		if (!pfn_valid(pfn))
+			continue;
+		page = pfn_to_page(pfn);
+		if (PageReserved(page))
+			continue;
+		if (ptep_clear_flush_dirty(vma, addr, pte) ||
+		    page_test_and_clear_dirty(page))
+			set_page_dirty(page);
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 	pte_unmap(pte - 1);
-
-	return error;
 }
 
-static inline int filemap_sync_pmd_range(pud_t * pud,
-	unsigned long address, unsigned long end, 
-	struct vm_area_struct *vma, unsigned int flags)
+static inline void sync_pmd_range(pud_t *pud,
+		unsigned long addr, unsigned long end, 
+		struct vm_area_struct *vma, unsigned int flags)
 {
-	pmd_t * pmd;
-	int error;
+	unsigned long next;
+	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return 0;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return 0;
-	}
-	pmd = pmd_offset(pud, address);
-	if ((address & PUD_MASK) != (end & PUD_MASK))
-		end = (address & PUD_MASK) + PUD_SIZE;
-	error = 0;
+	pmd = pmd_offset(pud, addr);
 	do {
-		error |= filemap_sync_pte_range(pmd, address, end, vma, flags);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address && (address < end));
-	return error;
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+		sync_pte_range(pmd, addr, next, vma, flags);
+	} while (pmd++, addr = next, addr < end);
 }
 
-static inline int filemap_sync_pud_range(pgd_t *pgd,
-	unsigned long address, unsigned long end,
-	struct vm_area_struct *vma, unsigned int flags)
+static inline void sync_pud_range(pgd_t *pgd,
+		unsigned long addr, unsigned long end,
+		struct vm_area_struct *vma, unsigned int flags)
 {
+	unsigned long next;
 	pud_t *pud;
-	int error;
 
-	if (pgd_none(*pgd))
-		return 0;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return 0;
-	}
-	pud = pud_offset(pgd, address);
-	if ((address & PGDIR_MASK) != (end & PGDIR_MASK))
-		end = (address & PGDIR_MASK) + PGDIR_SIZE;
-	error = 0;
+	pud = pud_offset(pgd, addr);
 	do {
-		error |= filemap_sync_pmd_range(pud, address, end, vma, flags);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
-	return error;
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+		sync_pmd_range(pud, addr, next, vma, flags);
+	} while (pud++, addr = next, addr < end);
 }
 
-static int __filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void sync_pgd_range(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end, unsigned int flags)
 {
-	pgd_t *pgd;
-	unsigned long end = address + size;
+	struct mm_struct *mm = vma->vm_mm;
 	unsigned long next;
-	int i;
-	int error = 0;
-
-	/* Aquire the lock early; it may be possible to avoid dropping
-	 * and reaquiring it repeatedly.
-	 */
-	spin_lock(&vma->vm_mm->page_table_lock);
-
-	pgd = pgd_offset(vma->vm_mm, address);
-	flush_cache_range(vma, address, end);
+	pgd_t *pgd;
 
 	/* For hugepages we can't go walking the page table normally,
 	 * but that's ok, hugetlbfs is memory based, so we don't need
 	 * to do anything more on an msync() */
 	if (is_vm_hugetlb_page(vma))
-		goto out;
-
-	if (address >= end)
-		BUG();
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		error |= filemap_sync_pud_range(pgd, address, next, vma, flags);
-		address = next;
-		pgd++;
-	}
-	/*
-	 * Why flush ? filemap_sync_pte already flushed the tlbs with the
-	 * dirty bits.
-	 */
-	flush_tlb_range(vma, end - size, end);
- out:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+		return;
 
-	return error;
+	pgd = pgd_offset(mm, addr);
+	flush_cache_range(vma, addr, end);
+	spin_lock(&mm->page_table_lock);
+	do {
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+		sync_pud_range(pgd, addr, next, vma, flags);
+	} while (pgd++, addr = next, addr < end);
+	spin_unlock(&mm->page_table_lock);
 }
 
 #ifdef CONFIG_PREEMPT
-static int filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void filemap_sync(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned int flags)
 {
 	const size_t chunk = 64 * 1024;	/* bytes */
-	int error = 0;
 
-	while (size) {
-		size_t sz = min(size, chunk);
+	while (start < end) {
+		size_t sz = min((size_t)(end-start), chunk);
 
-		error |= __filemap_sync(vma, address, sz, flags);
+		sync_pgd_range(vma, start, start+sz, flags);
+		start += sz;
 		cond_resched();
-		address += sz;
-		size -= sz;
 	}
-	return error;
 }
 #else
-static int filemap_sync(struct vm_area_struct *vma, unsigned long address,
-			size_t size, unsigned int flags)
+static void filemap_sync(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned int flags)
 {
-	return __filemap_sync(vma, address, size, flags);
+	sync_pgd_range(vma, start, end, flags);
 }
 #endif
 
@@ -209,9 +164,9 @@ static int msync_interval(struct vm_area
 		return -EBUSY;
 
 	if (file && (vma->vm_flags & VM_SHARED)) {
-		ret = filemap_sync(vma, start, end-start, flags);
+		filemap_sync(vma, start, end, flags);
 
-		if (!ret && (flags & MS_SYNC)) {
+		if (flags & MS_SYNC) {
 			struct address_space *mapping = file->f_mapping;
 			int err;
 
--- 2.6.11-rc4-bk9/mm/swapfile.c	2005-02-21 12:04:11.000000000 +0000
+++ linux/mm/swapfile.c	2005-02-22 20:00:26.000000000 +0000
@@ -427,162 +427,124 @@ void free_swap_and_cache(swp_entry_t ent
  * share this swap entry, so be cautious and let do_wp_page work out
  * what to do if a write is requested later.
  */
-/* vma->vm_mm->page_table_lock is held */
-static void
-unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
-	swp_entry_t entry, struct page *page)
+static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+		unsigned long addr, swp_entry_t entry, struct page *page)
 {
 	vma->vm_mm->rss++;
 	get_page(page);
-	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	page_add_anon_rmap(page, vma, address);
+	set_pte(pte, pte_mkold(mk_pte(page, vma->vm_page_prot)));
+	page_add_anon_rmap(page, vma, addr);
 	swap_free(entry);
 	acct_update_integrals();
 	update_mem_hiwater();
 }
 
-/* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pmd(struct vm_area_struct *vma, pmd_t *dir,
-	unsigned long address, unsigned long end,
-	swp_entry_t entry, struct page *page)
+static int unuse_pte_range(struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long addr, unsigned long end,
+		swp_entry_t entry, struct page *page)
 {
-	pte_t *pte;
 	pte_t swp_pte = swp_entry_to_pte(entry);
+	pte_t *pte;
 
-	if (pmd_none(*dir))
-		return 0;
-	if (pmd_bad(*dir)) {
-		pmd_ERROR(*dir);
-		pmd_clear(dir);
-		return 0;
-	}
-	pte = pte_offset_map(dir, address);
+	pte = pte_offset_map(pmd, addr);
 	do {
 		/*
 		 * swapoff spends a _lot_ of time in this loop!
 		 * Test inline before going to call unuse_pte.
 		 */
 		if (unlikely(pte_same(*pte, swp_pte))) {
-			unuse_pte(vma, address, pte, entry, page);
+			unuse_pte(vma, pte, addr, entry, page);
 			pte_unmap(pte);
-
-			/*
-			 * Move the page to the active list so it is not
-			 * immediately swapped out again after swapon.
-			 */
-			activate_page(page);
-
-			/* add 1 since address may be 0 */
-			return 1 + address;
+			return 1;
 		}
-		address += PAGE_SIZE;
-		pte++;
-	} while (address < end);
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 	pte_unmap(pte - 1);
 	return 0;
 }
 
-/* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pud(struct vm_area_struct *vma, pud_t *pud,
-        unsigned long address, unsigned long end,
-	swp_entry_t entry, struct page *page)
+static inline int unuse_pmd_range(struct vm_area_struct *vma,
+		pud_t *pud, unsigned long addr, unsigned long end,
+		swp_entry_t entry, struct page *page)
 {
-	pmd_t *pmd;
 	unsigned long next;
-	unsigned long foundaddr;
+	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return 0;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return 0;
-	}
-	pmd = pmd_offset(pud, address);
+	pmd = pmd_offset(pud, addr);
 	do {
-		next = (address + PMD_SIZE) & PMD_MASK;
-		if (next > end || !next)
-			next = end;
-		foundaddr = unuse_pmd(vma, pmd, address, next, entry, page);
-		if (foundaddr)
-			return foundaddr;
-		address = next;
-		pmd++;
-	} while (address < end);
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+		if (unuse_pte_range(vma, pmd, addr, next, entry, page))
+			return 1;
+	} while (pmd++, addr = next, addr < end);
 	return 0;
 }
 
-/* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pgd(struct vm_area_struct *vma, pgd_t *pgd,
-	unsigned long address, unsigned long end,
-	swp_entry_t entry, struct page *page)
+static inline int unuse_pud_range(struct vm_area_struct *vma,
+		pgd_t *pgd, unsigned long addr, unsigned long end,
+		swp_entry_t entry, struct page *page)
 {
-	pud_t *pud;
 	unsigned long next;
-	unsigned long foundaddr;
+	pud_t *pud;
 
-	if (pgd_none(*pgd))
-		return 0;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return 0;
-	}
-	pud = pud_offset(pgd, address);
+	pud = pud_offset(pgd, addr);
 	do {
-		next = (address + PUD_SIZE) & PUD_MASK;
-		if (next > end || !next)
-			next = end;
-		foundaddr = unuse_pud(vma, pud, address, next, entry, page);
-		if (foundaddr)
-			return foundaddr;
-		address = next;
-		pud++;
-	} while (address < end);
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+		if (unuse_pmd_range(vma, pud, addr, next, entry, page))
+			return 1;
+	} while (pud++, addr = next, addr < end);
 	return 0;
 }
 
-/* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_vma(struct vm_area_struct *vma,
-	swp_entry_t entry, struct page *page)
+static int unuse_vma(struct vm_area_struct *vma,
+		swp_entry_t entry, struct page *page)
 {
+	unsigned long addr, end, next;
 	pgd_t *pgd;
-	unsigned long address, next, end;
-	unsigned long foundaddr;
 
 	if (page->mapping) {
-		address = page_address_in_vma(page, vma);
-		if (address == -EFAULT)
+		addr = page_address_in_vma(page, vma);
+		if (addr == -EFAULT)
 			return 0;
 		else
-			end = address + PAGE_SIZE;
+			end = addr + PAGE_SIZE;
 	} else {
-		address = vma->vm_start;
+		addr = vma->vm_start;
 		end = vma->vm_end;
 	}
-	pgd = pgd_offset(vma->vm_mm, address);
+
+	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next > end || !next)
-			next = end;
-		foundaddr = unuse_pgd(vma, pgd, address, next, entry, page);
-		if (foundaddr)
-			return foundaddr;
-		address = next;
-		pgd++;
-	} while (address < end);
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+		if (unuse_pud_range(vma, pgd, addr, next, entry, page))
+			return 1;
+	} while (pgd++, addr = next, addr < end);
 	return 0;
 }
 
-static int unuse_process(struct mm_struct * mm,
-			swp_entry_t entry, struct page* page)
+static int unuse_mm(struct mm_struct *mm, swp_entry_t entry, struct page *page)
 {
-	struct vm_area_struct* vma;
-	unsigned long foundaddr = 0;
+	struct vm_area_struct *vma;
 
-	/*
-	 * Go through process' page directory.
-	 */
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		/*
 		 * Our reference to the page stops try_to_unmap_one from
@@ -594,16 +556,19 @@ static int unuse_process(struct mm_struc
 	}
 	spin_lock(&mm->page_table_lock);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
-		if (vma->anon_vma) {
-			foundaddr = unuse_vma(vma, entry, page);
-			if (foundaddr)
-				break;
+		if (vma->anon_vma && unuse_vma(vma, entry, page)) {
+			/*
+			 * Move the page to the active list so it is not
+			 * immediately swapped out again after swapon.
+			 */
+			activate_page(page);
+			break;
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
 	up_read(&mm->mmap_sem);
 	/*
-	 * Currently unuse_process cannot fail, but leave error handling
+	 * Currently unuse_mm cannot fail, but leave error handling
 	 * at call sites for now, since we change it from time to time.
 	 */
 	return 0;
@@ -747,7 +712,7 @@ static int try_to_unuse(unsigned int typ
 			if (start_mm == &init_mm)
 				shmem = shmem_unuse(entry, page);
 			else
-				retval = unuse_process(start_mm, entry, page);
+				retval = unuse_mm(start_mm, entry, page);
 		}
 		if (*swap_map > 1) {
 			int set_start_mm = (*swap_map >= swcount);
@@ -779,7 +744,7 @@ static int try_to_unuse(unsigned int typ
 					set_start_mm = 1;
 					shmem = shmem_unuse(entry, page);
 				} else
-					retval = unuse_process(mm, entry, page);
+					retval = unuse_mm(mm, entry, page);
 				if (set_start_mm && *swap_map < swcount) {
 					mmput(new_start_mm);
 					atomic_inc(&mm->mm_users);
--- 2.6.11-rc4-bk9/mm/vmalloc.c	2005-02-21 12:04:11.000000000 +0000
+++ linux/mm/vmalloc.c	2005-02-22 23:17:29.000000000 +0000
@@ -23,104 +23,90 @@
 DEFINE_RWLOCK(vmlist_lock);
 struct vm_struct *vmlist;
 
-static void unmap_area_pte(pmd_t *pmd, unsigned long address,
-				  unsigned long size)
+static void unmap_area_pte(pmd_t *pmd, unsigned long addr, unsigned long end)
 {
-	unsigned long end;
 	pte_t *pte;
 
-	if (pmd_none(*pmd))
-		return;
-	if (pmd_bad(*pmd)) {
-		pmd_ERROR(*pmd);
-		pmd_clear(pmd);
-		return;
-	}
-
-	pte = pte_offset_kernel(pmd, address);
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
-
+	pte = pte_offset_kernel(pmd, addr);
 	do {
-		pte_t page;
-		page = ptep_get_and_clear(pte);
-		address += PAGE_SIZE;
-		pte++;
-		if (pte_none(page))
-			continue;
-		if (pte_present(page))
-			continue;
-		printk(KERN_CRIT "Whee.. Swapped out page in kernel page table\n");
-	} while (address < end);
+		pte_t entry = ptep_get_and_clear(pte);
+		if (unlikely(!pte_none(entry) && !pte_present(entry))) {
+			printk(KERN_CRIT "ERROR: swapped out kernel page\n");
+			dump_stack();
+		}
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 }
 
-static void unmap_area_pmd(pud_t *pud, unsigned long address,
-				  unsigned long size)
+static inline void unmap_area_pmd(pud_t *pud,
+		unsigned long addr, unsigned long end)
 {
-	unsigned long end;
+	unsigned long next;
 	pmd_t *pmd;
 
-	if (pud_none(*pud))
-		return;
-	if (pud_bad(*pud)) {
-		pud_ERROR(*pud);
-		pud_clear(pud);
-		return;
-	}
-
-	pmd = pmd_offset(pud, address);
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
-
+	pmd = pmd_offset(pud, addr);
 	do {
-		unmap_area_pte(pmd, address, end - address);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address < end);
+		next = pmd_limit(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_bad(*pmd)) {
+			pmd_ERROR(*pmd);
+			pmd_clear(pmd);
+			continue;
+		}
+		unmap_area_pte(pmd, addr, next);
+	} while (pmd++, addr = next, addr < end);
 }
 
-static void unmap_area_pud(pgd_t *pgd, unsigned long address,
-			   unsigned long size)
+static inline void unmap_area_pud(pgd_t *pgd,
+		unsigned long addr, unsigned long end)
 {
+	unsigned long next;
 	pud_t *pud;
-	unsigned long end;
 
-	if (pgd_none(*pgd))
-		return;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
-		return;
-	}
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_limit(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (pud_bad(*pud)) {
+			pud_ERROR(*pud);
+			pud_clear(pud);
+			continue;
+		}
+		unmap_area_pmd(pud, addr, next);
+	} while (pud++, addr = next, addr < end);
+}
 
-	pud = pud_offset(pgd, address);
-	address &= ~PGDIR_MASK;
-	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+void unmap_vm_area(struct vm_struct *area)
+{
+	unsigned long addr = (unsigned long) area->addr;
+	unsigned long end = addr + area->size;
+	unsigned long next;
+	pgd_t *pgd;
 
+	BUG_ON(addr >= end);
+	pgd = pgd_offset_k(addr);
+	flush_cache_vunmap(addr, end);
 	do {
-		unmap_area_pmd(pud, address, end - address);
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && (address < end));
-}
-
-static int map_area_pte(pte_t *pte, unsigned long address,
-			       unsigned long size, pgprot_t prot,
-			       struct page ***pages)
-{
-	unsigned long end;
-
-	address &= ~PMD_MASK;
-	end = address + size;
-	if (end > PMD_SIZE)
-		end = PMD_SIZE;
+		next = pgd_limit(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_bad(*pgd)) {
+			pgd_ERROR(*pgd);
+			pgd_clear(pgd);
+			continue;
+		}
+		unmap_area_pud(pgd, addr, next);
+	} while (pgd++, addr = next, addr < end);
+	flush_tlb_kernel_range((unsigned long) area->addr, end);
+}
+
+static int map_area_pte(pmd_t *pmd, unsigned long addr,
+		unsigned long end, pgprot_t prot, struct page ***pages)
+{
+	pte_t *pte;
 
+	pte = pte_offset_kernel(pmd, addr);
 	do {
 		struct page *page = **pages;
 		WARN_ON(!pte_none(*pte));
@@ -128,108 +114,68 @@ static int map_area_pte(pte_t *pte, unsi
 			return -ENOMEM;
 
 		set_pte(pte, mk_pte(page, prot));
-		address += PAGE_SIZE;
-		pte++;
 		(*pages)++;
-	} while (address < end);
+	} while (pte++, addr += PAGE_SIZE, addr < end);
 	return 0;
 }
 
-static int map_area_pmd(pmd_t *pmd, unsigned long address,
-			       unsigned long size, pgprot_t prot,
-			       struct page ***pages)
-{
-	unsigned long base, end;
-
-	base = address & PUD_MASK;
-	address &= ~PUD_MASK;
-	end = address + size;
-	if (end > PUD_SIZE)
-		end = PUD_SIZE;
+static inline int map_area_pmd(pud_t *pud, unsigned long addr,
+	       unsigned long end, pgprot_t prot, struct page ***pages)
+{
+	unsigned long next;
+	pmd_t *pmd;
 
+	pmd = pmd_offset(pud, addr);
 	do {
-		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, base + address);
-		if (!pte)
+		next = pmd_limit(addr, end);
+		if (!pte_alloc_kernel(&init_mm, pmd, addr))
 			return -ENOMEM;
-		if (map_area_pte(pte, address, end - address, prot, pages))
+		if (map_area_pte(pmd, addr, next, prot, pages))
 			return -ENOMEM;
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
-	} while (address < end);
-
+	} while (pmd++, addr = next, addr < end);
 	return 0;
 }
 
-static int map_area_pud(pud_t *pud, unsigned long address,
-			       unsigned long end, pgprot_t prot,
-			       struct page ***pages)
+static inline int map_area_pud(pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t prot, struct page ***pages)
 {
+	unsigned long next;
+	pud_t *pud;
+
+	pud = pud_offset(pgd, addr);
 	do {
-		pmd_t *pmd = pmd_alloc(&init_mm, pud, address);
-		if (!pmd)
+		next = pud_limit(addr, end);
+		if (!pmd_alloc(&init_mm, pud, addr))
 			return -ENOMEM;
-		if (map_area_pmd(pmd, address, end - address, prot, pages))
+		if (map_area_pmd(pud, addr, next, prot, pages))
 			return -ENOMEM;
-		address = (address + PUD_SIZE) & PUD_MASK;
-		pud++;
-	} while (address && address < end);
-
+	} while (pud++, addr = next, addr < end);
 	return 0;
 }
 
-void unmap_vm_area(struct vm_struct *area)
-{
-	unsigned long address = (unsigned long) area->addr;
-	unsigned long end = (address + area->size);
-	unsigned long next;
-	pgd_t *pgd;
-	int i;
-
-	pgd = pgd_offset_k(address);
-	flush_cache_vunmap(address, end);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next <= address || next > end)
-			next = end;
-		unmap_area_pud(pgd, address, next - address);
-		address = next;
-	        pgd++;
-	}
-	flush_tlb_kernel_range((unsigned long) area->addr, end);
-}
-
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
 {
-	unsigned long address = (unsigned long) area->addr;
-	unsigned long end = address + (area->size-PAGE_SIZE);
+	unsigned long addr = (unsigned long) area->addr;
+	unsigned long end = addr + area->size - PAGE_SIZE;
 	unsigned long next;
 	pgd_t *pgd;
-	int err = 0;
-	int i;
+	int error;
 
-	pgd = pgd_offset_k(address);
+	BUG_ON(addr >= end);
+	pgd = pgd_offset_k(addr);
 	spin_lock(&init_mm.page_table_lock);
-	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
-		pud_t *pud = pud_alloc(&init_mm, pgd, address);
-		if (!pud) {
-			err = -ENOMEM;
-			break;
-		}
-		next = (address + PGDIR_SIZE) & PGDIR_MASK;
-		if (next < address || next > end)
-			next = end;
-		if (map_area_pud(pud, address, next, prot, pages)) {
-			err = -ENOMEM;
+	do {
+		next = pgd_limit(addr, end);
+		if (pud_alloc(&init_mm, pgd, addr))
+			error = map_area_pud(pgd, addr, next, prot, pages);
+		else
+			error = -ENOMEM;
+		if (error)
 			break;
-		}
-
-		address = next;
-		pgd++;
-	}
-
+	} while (pgd++, addr = next, addr < end);
 	spin_unlock(&init_mm.page_table_lock);
 	flush_cache_vmap((unsigned long) area->addr, end);
-	return err;
+	return error;
 }
 
 #define IOREMAP_MAX_ORDER	(7 + PAGE_SHIFT)	/* 128 pages */
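
The pgd_limit()/pud_limit()/pmd_limit() helpers used throughout this patch
are defined elsewhere in the series.  Judging from the loop bounds they
replace, they presumably amount to something like the sketch below (a guess
from context -- the exact form and placement are assumptions, not taken from
the patch itself):

static inline unsigned long pgd_limit(unsigned long addr, unsigned long end)
{
	/*
	 * End of the pgd span containing addr, clamped to the caller's end
	 * and guarding against wrap-around at the top of the address space.
	 */
	unsigned long next = (addr + PGDIR_SIZE) & PGDIR_MASK;

	if (next <= addr || next > end)
		next = end;
	return next;
}

pud_limit() and pmd_limit() would be the same with the PUD_/PMD_ constants.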

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23  2:06                   ` Hugh Dickins
@ 2005-02-23  4:31                     ` David S. Miller
  2005-02-23  4:49                       ` Nick Piggin
  2005-02-23  5:23                       ` Nick Piggin
  2005-02-23 23:52                     ` Nick Piggin
  1 sibling, 2 replies; 33+ messages in thread
From: David S. Miller @ 2005-02-23  4:31 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nickpiggin, ak, benh, torvalds, akpm, linux-kernel

On Wed, 23 Feb 2005 02:06:28 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> I've not seen Dave's bitmap walking functions (for clearing?),
> would they fit in better with my way?

This is what Nick is referring to:

--------------------

I hacked up something slightly different today.  I only
have it being used by clear_page_range() but it is extremely
effective.

Things like fork+exit latencies on my 750Mhz sparc64 box went
from ~490 microseconds to ~367 microseconds.  fork+execve
latency went down from ~1595 microseconds to ~1351 microseconds.

Two issues:

1) I'm not terribly satisfied with the interface.  I think
   with some improvements it can be applied to the two other
   routines this thing really makes sense for, namely copy_page_range
   and unmap_page_range.

2) I don't think it will collapse well for 2-level page tables,
   someone take a look?

It's easy to toy with the sparc64 optimization on other platforms,
just add the necessary hacks to pmd_set and pgd_set and to the allocation
of pmd and pgd tables, use "PAGE_SHIFT - 5" instead of "PAGE_SHIFT - 6"
on 32-bit platforms, and then copy the asm-sparc64/pgwalk.h bits over
into your platform's asm-${ARCH}/pgwalk.h.
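
For a 32-bit platform, where page->index only gives you 32 bits of bitmap,
the setup would presumably look something like the sketch below (untested;
the PGTABLE_BIT_* names are the ones from the sparc64 patch further down,
and __arch_pmd_set() is a stand-in for whatever the architecture's raw
setter is):

#define PGTABLE_BIT_SHIFT	(PAGE_SHIFT - 5)	/* 32 regions per table page */
#define PGTABLE_BIT_MASK	((1UL << PGTABLE_BIT_SHIFT) - 1)
#define PGTABLE_BIT_REGION	(1UL << PGTABLE_BIT_SHIFT)
#define PGTABLE_BIT(ptr) \
	(1UL << (((unsigned long)(ptr) & ~PAGE_MASK) >> PGTABLE_BIT_SHIFT))

/* Mark the region holding this entry as populated, then do the real set. */
#define pmd_set(pmdp, ptep) \
do { \
	virt_to_page(pmdp)->index |= PGTABLE_BIT(pmdp); \
	__arch_pmd_set(pmdp, ptep); \
} while (0)

The pgd and pmd allocation paths also have to start page->index out at zero,
as the pgalloc.h hunks below do for sparc64.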

I also just got reminded that we walk these damn pagetables completely
twice on every exit: once to unmap the VMAs' pte mappings, and once again to
zap the page tables.  It might be fruitful to explore combining
those two steps, perhaps not.

Anyways, comments and improvement suggestions welcome.  Particularly
interesting would be if this thing helps a lot on other platforms
too, such as x86_64, ia64, alpha and ppc64.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/08/10 23:44:24-07:00 davem@nuts.davemloft.net 
#   [MM]: Add arch-overridable page table walking machinery.
#   
#   Currently very rudimentary but is used fully for
#   clear_page_range().  An optimized implementation
#   is there for sparc64 and it is extremely effective
#   particularly for 64-bit processes.
#   
#   For things like lat_fork and friends clear_page_tables()
#   used to be 2nd or 3rd in the kernel profile, now it has
#   dropped to the 20th or so entry.
#   
#   Signed-off-by: David S. Miller <davem@redhat.com>
# 
# mm/memory.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -26
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgtable.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +28 -4
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgalloc.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -2
#   [MM]: Add arch-overridable page table walking machinery.
# 
# arch/sparc64/mm/init.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +2 -2
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-x86_64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-v850/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-um/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +114 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sparc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sh64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-sh/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-s390/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ppc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ppc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-parisc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-mips/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-m68knommu/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-m68k/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-ia64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-i386/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-h8300/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-generic/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +96 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-cris/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-arm26/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-arm/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-x86_64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-x86_64/pgwalk.h
# 
# include/asm-v850/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-v850/pgwalk.h
# 
# include/asm-um/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-um/pgwalk.h
# 
# include/asm-sparc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc64/pgwalk.h
# 
# include/asm-sparc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc/pgwalk.h
# 
# include/asm-sh64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh64/pgwalk.h
# 
# include/asm-sh/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh/pgwalk.h
# 
# include/asm-s390/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-s390/pgwalk.h
# 
# include/asm-ppc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc64/pgwalk.h
# 
# include/asm-ppc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc/pgwalk.h
# 
# include/asm-parisc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-parisc/pgwalk.h
# 
# include/asm-mips/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-mips/pgwalk.h
# 
# include/asm-m68knommu/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68knommu/pgwalk.h
# 
# include/asm-m68k/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68k/pgwalk.h
# 
# include/asm-ia64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-ia64/pgwalk.h
# 
# include/asm-i386/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-i386/pgwalk.h
# 
# include/asm-h8300/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-h8300/pgwalk.h
# 
# include/asm-generic/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-generic/pgwalk.h
# 
# include/asm-cris/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-cris/pgwalk.h
# 
# include/asm-arm26/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm26/pgwalk.h
# 
# include/asm-arm/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm/pgwalk.h
# 
# include/asm-alpha/pgwalk.h
#   2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# 
# include/asm-alpha/pgwalk.h
#   2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +0 -0
#   BitKeeper file /disk1/BK/sparc-2.6/include/asm-alpha/pgwalk.h
# 
diff -Nru a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
--- a/arch/sparc64/mm/init.c	2004-08-10 23:44:47 -07:00
+++ b/arch/sparc64/mm/init.c	2004-08-10 23:44:47 -07:00
@@ -419,7 +419,7 @@
 					if (ptep == NULL)
 						early_pgtable_allocfail("pte");
 					memset(ptep, 0, BASE_PAGE_SIZE);
-					pmd_set(pmdp, ptep);
+					pmd_set_k(pmdp, ptep);
 				}
 				ptep = (pte_t *)__pmd_page(*pmdp) +
 						((vaddr >> 13) & 0x3ff);
@@ -1455,7 +1455,7 @@
 	memset(swapper_pmd_dir, 0, sizeof(swapper_pmd_dir));
 
 	/* Now can init the kernel/bad page tables. */
-	pgd_set(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t)));
+	pgd_set_k(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t)));
 	
 	sparc64_vpte_patchme1[0] |=
 		(((unsigned long)pgd_val(init_mm.pgd[0])) >> 10);
diff -Nru a/include/asm-alpha/pgwalk.h b/include/asm-alpha/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-alpha/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ALPHA_PGWALK_H
+#define _ALPHA_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ALPHA_PGWALK_H */
diff -Nru a/include/asm-arm/pgwalk.h b/include/asm-arm/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-arm/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ARM_PGWALK_H
+#define _ARM_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ARM_PGWALK_H */
diff -Nru a/include/asm-arm26/pgwalk.h b/include/asm-arm26/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-arm26/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ARM26_PGWALK_H
+#define _ARM26_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _ARM26_PGWALK_H */
diff -Nru a/include/asm-cris/pgwalk.h b/include/asm-cris/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-cris/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _CRIS_PGWALK_H
+#define _CRIS_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _CRIS_PGWALK_H */
diff -Nru a/include/asm-generic/pgwalk.h b/include/asm-generic/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-generic/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,96 @@
+#ifndef _GENERIC_PGWALK_H
+#define _GENERIC_PGWALK_H
+
+#include <linux/mm.h>
+
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+struct pte_walk_state;
+typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *);
+typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *);
+typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *);
+
+struct pte_walk_state {
+	void *_client_state;
+	void *first;
+	void *last;
+};
+
+static inline void *pte_walk_client_state(struct pte_walk_state *walk)
+{
+	return walk->_client_state;
+}
+
+static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work)
+{
+	pte_t *ptep = walk->first;
+	pte_t *last = walk->last;
+
+	do {
+		if (pte_none(*ptep))
+			goto next;
+		pte_work(walk, ptep);
+	next:
+		ptep++;
+	} while (ptep < last);
+}
+
+static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work)
+{
+	pmd_t *page_dir = walk->first;
+	pmd_t *last = walk->last;
+
+	do {
+		if (pmd_none(*page_dir))
+			goto next;
+		if (unlikely(pmd_bad(*page_dir))) {
+			pmd_ERROR(*page_dir);
+			pmd_clear(page_dir);
+			goto next;
+		}
+		pmd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last)
+{
+	walk->_client_state = client_state;
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work)
+{
+	pgd_t *page_dir = walk->first;
+	pgd_t *last = walk->last;
+
+	do {
+		if (pgd_none(*page_dir))
+			goto next;
+		if (unlikely(pgd_bad(*page_dir))) {
+			pgd_ERROR(*page_dir);
+			pgd_clear(page_dir);
+			goto next;
+		}
+		pgd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+#endif /* _GENERIC_PGWALK_H */
diff -Nru a/include/asm-h8300/pgwalk.h b/include/asm-h8300/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-h8300/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _H8300_PGWALK_H
+#define _H8300_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _H8300_PGWALK_H */
diff -Nru a/include/asm-i386/pgwalk.h b/include/asm-i386/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-i386/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _I386_PGWALK_H
+#define _I386_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _I386_PGWALK_H */
diff -Nru a/include/asm-ia64/pgwalk.h b/include/asm-ia64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-ia64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _IA64_PGWALK_H
+#define _IA64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _IA64_PGWALK_H */
diff -Nru a/include/asm-m68k/pgwalk.h b/include/asm-m68k/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-m68k/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _M68K_PGWALK_H
+#define _M68K_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _M68K_PGWALK_H */
diff -Nru a/include/asm-m68knommu/pgwalk.h b/include/asm-m68knommu/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-m68knommu/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _M68KNOMMU_PGWALK_H
+#define _M68KNOMMU_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _M68KNOMMU_PGWALK_H */
diff -Nru a/include/asm-mips/pgwalk.h b/include/asm-mips/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-mips/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _MIPS_PGWALK_H
+#define _MIPS_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _MIPS_PGWALK_H */
diff -Nru a/include/asm-parisc/pgwalk.h b/include/asm-parisc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-parisc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PARISC_PGWALK_H
+#define _PARISC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PARISC_PGWALK_H */
diff -Nru a/include/asm-ppc/pgwalk.h b/include/asm-ppc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-ppc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PPC_PGWALK_H
+#define _PPC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PPC_PGWALK_H */
diff -Nru a/include/asm-ppc64/pgwalk.h b/include/asm-ppc64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-ppc64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _PPC64_PGWALK_H
+#define _PPC64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _PPC64_PGWALK_H */
diff -Nru a/include/asm-s390/pgwalk.h b/include/asm-s390/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-s390/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _S390_PGWALK_H
+#define _S390_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _S390_PGWALK_H */
diff -Nru a/include/asm-sh/pgwalk.h b/include/asm-sh/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sh/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SH_PGWALK_H
+#define _SH_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SH_PGWALK_H */
diff -Nru a/include/asm-sh64/pgwalk.h b/include/asm-sh64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sh64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SH64_PGWALK_H
+#define _SH64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SH64_PGWALK_H */
diff -Nru a/include/asm-sparc/pgwalk.h b/include/asm-sparc/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sparc/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _SPARC_PGWALK_H
+#define _SPARC_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _SPARC_PGWALK_H */
diff -Nru a/include/asm-sparc64/pgalloc.h b/include/asm-sparc64/pgalloc.h
--- a/include/asm-sparc64/pgalloc.h	2004-08-10 23:44:47 -07:00
+++ b/include/asm-sparc64/pgalloc.h	2004-08-10 23:44:47 -07:00
@@ -93,6 +93,8 @@
 
 static __inline__ void free_pgd_fast(pgd_t *pgd)
 {
+	virt_to_page(pgd)->index = 0UL;
+
 	preempt_disable();
 	*(unsigned long *)pgd = (unsigned long) pgd_quicklist;
 	pgd_quicklist = (unsigned long *) pgd;
@@ -113,8 +115,10 @@
 	} else {
 		preempt_enable();
 		ret = (unsigned long *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-		if(ret)
+		if (ret) {
 			memset(ret, 0, PAGE_SIZE);
+			virt_to_page(ret)->index = 0UL;
+		}
 	}
 	return (pgd_t *)ret;
 }
@@ -162,8 +166,10 @@
 	pmd = pmd_alloc_one_fast(mm, address);
 	if (!pmd) {
 		pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-		if (pmd)
+		if (pmd) {
 			memset(pmd, 0, PAGE_SIZE);
+			virt_to_page(pmd)->index = 0UL;
+		}
 	}
 	return pmd;
 }
@@ -171,6 +177,8 @@
 static __inline__ void free_pmd_fast(pmd_t *pmd)
 {
 	unsigned long color = DCACHE_COLOR((unsigned long)pmd);
+
+	virt_to_page(pmd)->index = 0UL;
 
 	preempt_disable();
 	*(unsigned long *)pmd = (unsigned long) pte_quicklist[color];
diff -Nru a/include/asm-sparc64/pgtable.h b/include/asm-sparc64/pgtable.h
--- a/include/asm-sparc64/pgtable.h	2004-08-10 23:44:47 -07:00
+++ b/include/asm-sparc64/pgtable.h	2004-08-10 23:44:47 -07:00
@@ -259,10 +259,34 @@
 
 	return __pte;
 }
-#define pmd_set(pmdp, ptep)	\
-	(pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL))
-#define pgd_set(pgdp, pmdp)	\
-	(pgd_val(*(pgdp)) = (__pa((unsigned long) (pmdp)) >> 11UL))
+
+#define PGTABLE_BIT_SHIFT	(PAGE_SHIFT - 6)
+#define PGTABLE_BIT_MASK	((1UL << PGTABLE_BIT_SHIFT) - 1)
+#define PGTABLE_BIT_REGION	(1UL << PGTABLE_BIT_SHIFT)
+#define PGTABLE_BIT(ptr) \
+	(1UL << (((unsigned long)(ptr) & ~PAGE_MASK) >> PGTABLE_BIT_SHIFT))
+#define __PGTABLE_REGION_NEXT(ptr,type) \
+	((type *)(((unsigned long)(ptr) + PGTABLE_BIT_REGION) & \
+		  ~PGTABLE_BIT_MASK))
+#define PMD_REGION_NEXT(pmdp) __PGTABLE_REGION_NEXT(pmdp,pmd_t)
+#define PGD_REGION_NEXT(pgdp) __PGTABLE_REGION_NEXT(pgdp,pgd_t)
+
+#define pmd_set(pmdp, ptep) \
+do { \
+	virt_to_page(pmdp)->index |= PGTABLE_BIT(pmdp); \
+	pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL; \
+} while (0)
+#define pmd_set_k(pmdp, ptep) \
+	(pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL)
+
+#define pgd_set(pgdp, pmdp) \
+do { \
+	virt_to_page(pgdp)->index |= PGTABLE_BIT(pgdp); \
+	pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL; \
+} while (0)
+#define pgd_set_k(pgdp, pmdp) \
+	(pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL)
+
 #define __pmd_page(pmd)		\
 	((unsigned long) __va((((unsigned long)pmd_val(pmd))<<11UL)))
 #define pmd_page(pmd) 			virt_to_page((void *)__pmd_page(pmd))
diff -Nru a/include/asm-sparc64/pgwalk.h b/include/asm-sparc64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-sparc64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,114 @@
+/* pgwalk.h: UltraSPARC fast page table traversal.
+ *
+ * Copyright 2004 David S. Miller <davem@redhat.com>
+ */
+
+#ifndef _SPARC64_PGWALK_H
+#define _SPARC64_PGWALK_H
+
+#include <linux/mm.h>
+
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+struct pte_walk_state;
+typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *);
+typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *);
+typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *);
+
+struct pte_walk_state {
+	void *_client_state;
+	void *first;
+	void *last;
+};
+
+static inline void *pte_walk_client_state(struct pte_walk_state *walk)
+{
+	return walk->_client_state;
+}
+
+static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work)
+{
+	pte_t *ptep = walk->first;
+	pte_t *last = walk->last;
+
+	do {
+		if (pte_none(*ptep))
+			goto next;
+		pte_work(walk, ptep);
+	next:
+		ptep++;
+	} while (ptep < last);
+}
+
+static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last)
+{
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work)
+{
+	pmd_t *page_dir = walk->first;
+	pmd_t *last = walk->last;
+	unsigned long mask;
+
+	mask = virt_to_page(page_dir)->index;
+
+	do {
+		if (likely(!(PGTABLE_BIT(page_dir) & mask))) {
+			page_dir = PMD_REGION_NEXT(page_dir);
+			continue;
+		}
+		if (pmd_none(*page_dir))
+			goto next;
+		if (unlikely(pmd_bad(*page_dir))) {
+			pmd_ERROR(*page_dir);
+			pmd_clear(page_dir);
+			goto next;
+		}
+		pmd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+
+static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last)
+{
+	walk->_client_state = client_state;
+	walk->first = first;
+	walk->last = last;
+}
+
+static inline void pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work)
+{
+	pgd_t *page_dir = walk->first;
+	pgd_t *last = walk->last;
+	unsigned long mask;
+
+	mask = virt_to_page(page_dir)->index;
+
+	do {
+		if (likely(!(PGTABLE_BIT(page_dir) & mask))) {
+			page_dir = PGD_REGION_NEXT(page_dir);
+			continue;
+		}
+		if (pgd_none(*page_dir))
+			goto next;
+		if (unlikely(pgd_bad(*page_dir))) {
+			pgd_ERROR(*page_dir);
+			pgd_clear(page_dir);
+			goto next;
+		}
+		pgd_work(walk, page_dir);
+	next:
+		page_dir++;
+	} while (page_dir < last);
+}
+#endif /* _SPARC64_PGWALK_H */
diff -Nru a/include/asm-um/pgwalk.h b/include/asm-um/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-um/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _UM_PGWALK_H
+#define _UM_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _UM_PGWALK_H */
diff -Nru a/include/asm-v850/pgwalk.h b/include/asm-v850/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-v850/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _V850_PGWALK_H
+#define _V850_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _V850_PGWALK_H */
diff -Nru a/include/asm-x86_64/pgwalk.h b/include/asm-x86_64/pgwalk.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-x86_64/pgwalk.h	2004-08-10 23:44:47 -07:00
@@ -0,0 +1,6 @@
+#ifndef _X86_64_PGWALK_H
+#define _X86_64_PGWALK_H
+
+#include <asm-generic/pgwalk.h>
+
+#endif /* _X86_64_PGWALK_H */
diff -Nru a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c	2004-08-10 23:44:47 -07:00
+++ b/mm/memory.c	2004-08-10 23:44:47 -07:00
@@ -52,6 +52,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
+#include <asm/pgwalk.h>
 
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -100,40 +101,25 @@
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void free_one_pmd(struct mmu_gather *tlb, pmd_t * dir)
+static void free_one_pmd(struct pte_walk_state *walk, pmd_t *dir)
 {
 	struct page *page;
 
-	if (pmd_none(*dir))
-		return;
-	if (unlikely(pmd_bad(*dir))) {
-		pmd_ERROR(*dir);
-		pmd_clear(dir);
-		return;
-	}
 	page = pmd_page(*dir);
 	pmd_clear(dir);
 	dec_page_state(nr_page_table_pages);
-	pte_free_tlb(tlb, page);
+	pte_free_tlb(pte_walk_client_state(walk), page);
 }
 
-static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir)
+static void free_one_pgd(struct pte_walk_state *walk, pgd_t *dir)
 {
-	int j;
 	pmd_t * pmd;
 
-	if (pgd_none(*dir))
-		return;
-	if (unlikely(pgd_bad(*dir))) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
-		return;
-	}
 	pmd = pmd_offset(dir, 0);
 	pgd_clear(dir);
-	for (j = 0; j < PTRS_PER_PMD ; j++)
-		free_one_pmd(tlb, pmd+j);
-	pmd_free_tlb(tlb, pmd);
+	pmd_walk_init(walk, pmd, pmd + PTRS_PER_PMD);
+	pmd_walk(walk, free_one_pmd);
+	pmd_free_tlb(pte_walk_client_state(walk), pmd);
 }
 
 /*
@@ -144,13 +130,11 @@
  */
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr)
 {
+	struct pte_walk_state walk;
 	pgd_t * page_dir = tlb->mm->pgd;
 
-	page_dir += first;
-	do {
-		free_one_pgd(tlb, page_dir);
-		page_dir++;
-	} while (--nr);
+	pgd_walk_init(&walk, tlb, page_dir + first, page_dir + first + nr);
+	pgd_walk(&walk, free_one_pgd);
 }
 
 pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23  4:31                     ` David S. Miller
@ 2005-02-23  4:49                       ` Nick Piggin
  2005-02-23  4:57                         ` David S. Miller
  2005-02-23  5:23                       ` Nick Piggin
  1 sibling, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-23  4:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: Hugh Dickins, ak, benh, torvalds, akpm, linux-kernel

On Tue, 2005-02-22 at 20:31 -0800, David S. Miller wrote:
> On Wed, 23 Feb 2005 02:06:28 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > I've not seen Dave's bitmap walking functions (for clearing?),
> > would they fit in better with my way?
> 

Hugh: I'll have more of a look through your patch when I get
some time... to be honest I'm not too worried either way, so
long as one or the other gets in.

Very trivial point, but I'm not sure that I like the name
p?d_limit... maybe p?d_span or _span_end... hmm, they're not
really pleasing either.

You _are_ repeating a bit of mindless loop accounting in every
page table walk, and it isn't completely clear to me that it is
giving you much more flexibility (than for_each_*). But my loops
_are_ a bit contorted.

> This is what Nick is referring to:
> 

[snip]

> It's easy to toy with the sparc64 optimization on other platforms,
> just add the necessary hacks to pmd_set and pgd_set, allocation
> of pmd and pgd tables

David: just an implementation detail that I had meant to bring
up earlier - would it feel like less of a hack to put these in
pmd_populate and pgd_populate?

Nick





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23  4:49                       ` Nick Piggin
@ 2005-02-23  4:57                         ` David S. Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David S. Miller @ 2005-02-23  4:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: hugh, ak, benh, torvalds, akpm, linux-kernel

On Wed, 23 Feb 2005 15:49:30 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > It's easy to toy with the sparc64 optimization on other platforms,
> > just add the necessary hacks to pmd_set and pgd_set, allocation
> > of pmd and pgd tables
> 
> David: just an implementation detail that I had meant to bring
> up earlier - would it feel like less of a hack to put these in
> pmd_populate and pgd_populate?

Sure, no problem.  They get defined to pmd_set/pgd_set calls
anyways.  But wouldn't that miss pgd_clear() and pmd_clear()?
Someone may find it worthwhile, on a *_clear(), to see if
a set bit can now be cleared because all the neighboring entries
are empty as well.

That might have been the reason I put it there, but I may be
giving myself too much credit :-)
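
For illustration, the *_clear() side being asked about could look something
like the sketch below.  It assumes, as in the pgd_walk above, that
page->index of a page-table page is used as a bitmap with one bit
(PGTABLE_BIT()) per cacheline's worth of entries; the helper name and the
details are made up, not taken from the sparc64 patch:

	static inline void pmd_clear_and_update_bitmap(pmd_t *pmdp)
	{
		struct page *page = virt_to_page(pmdp);
		pmd_t *group = (pmd_t *)((unsigned long)pmdp &
					 ~(L1_CACHE_BYTES - 1));
		int i;

		pmd_clear(pmdp);

		/* if every neighbouring entry in the same cacheline is now
		 * empty, the group's bit can be cleared so the walker will
		 * skip it again */
		for (i = 0; i < L1_CACHE_BYTES / sizeof(pmd_t); i++)
			if (!pmd_none(group[i]))
				return;
		page->index &= ~PGTABLE_BIT(pmdp);
	}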

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23  4:31                     ` David S. Miller
  2005-02-23  4:49                       ` Nick Piggin
@ 2005-02-23  5:23                       ` Nick Piggin
  1 sibling, 0 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-23  5:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: Hugh Dickins, ak, benh, torvalds, akpm, linux-kernel

On Tue, 2005-02-22 at 20:31 -0800, David S. Miller wrote:

> I just got also reminded that we walk these damn pagetables completely
> twice every exit, once to unmap the VMAs pte mappings, once again to
> zap the page tables.  It might be fruitful to explore combining
> those two steps, perhaps not.
> 

I'm going to have a look at refcounting page table pages, which
will hopefully allow us to win back (and then some) the clear_page_range
overhead introduced by the aggressive page table freeing.

It may also allow nice things like dropping file-backed page table
mappings if they get reclaimed, and a single walk to do the
freeing. I haven't looked into the details yet though; these are just
vague hopes.


> Anyways, comments and improvment suggestions welcome.  Particularly
> interesting would be if this thing helps a lot on other platforms
> too, such as x86_64, ia64, alpha and ppc64.
> 

I have a feeling it should provide nice benefits to all archs if
we get it into all the walkers. Downsides are few - the bitmap walk
probably only becomes more expensive when all but a handful of
cachelines are present in a page table page.

I'd like to look at ways to make this patch happen with you soon...
First, for 2.6.12 my main concern is to get pt walking consistent,
and try to claw back some of the clear_page_range regression.

Thanks,
Nick



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23  2:06                   ` Hugh Dickins
  2005-02-23  4:31                     ` David S. Miller
@ 2005-02-23 23:52                     ` Nick Piggin
  2005-02-24  0:00                       ` David S. Miller
  2005-02-24  5:12                       ` Hugh Dickins
  1 sibling, 2 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-23 23:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

Hugh Dickins wrote:

> I'm off to bed, but since your appetite for looking at patches
> is greater than mine, I'll throw what I'm currently testing over
> the wall to you now.  Against 2.6.11-rc4-bk9, but my starting point
> was obviously your patches.  Not yet split up, but clearly should be.

Yeah you've snuck a few other clever things in there ;)

> Includes mm/swapfile.c which you missed.  I'm inlining pmd and pud

Thanks.

> levels, but not pte and pgd levels.  No description yet, sorry.

OK - that's probably sufficient for debugging. There is only so
much that can go wrong in the middle levels... how does it look
performance-wise? (I can give it a test when it gets split out)

> One point worth making, I do believe throughout that whatever the
> address layout, "end" cannot be 0 - BUG_ON(addr >= end) assures.
> 

OK after sleeping on it, I'm warming to your way.

I don't think it makes something like David's modifications any
easier, but mine didn't go a long way to that end either. And
being a more incremental approach gives us more room to move in
future (for example, maybe toward something that really *will*
accommodate the bitmap walking code nicely).

So I'd be pretty happy for you to queue this up with Andrew for
2.6.12. Anyone else?

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23 23:52                     ` Nick Piggin
@ 2005-02-24  0:00                       ` David S. Miller
  2005-02-24  5:12                       ` Hugh Dickins
  1 sibling, 0 replies; 33+ messages in thread
From: David S. Miller @ 2005-02-24  0:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: hugh, ak, benh, torvalds, akpm, linux-kernel

On Thu, 24 Feb 2005 10:52:23 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> So I'd be pretty happy for you to queue this up with Andrew for
> 2.6.12. Anyone else?

No objections from me.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-23 23:52                     ` Nick Piggin
  2005-02-24  0:00                       ` David S. Miller
@ 2005-02-24  5:12                       ` Hugh Dickins
  2005-02-24  5:59                         ` Nick Piggin
  1 sibling, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2005-02-24  5:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Thu, 24 Feb 2005, Nick Piggin wrote:
> Hugh Dickins wrote:
> 
> > I'm inlining pmd and pud levels, but not pte and pgd levels.
> 
> OK - that's probably sufficient for debugging. There is only so
> much that can go wrong in the middle levels... 

Yes, that was my thinking.

> how does it look
> performance wise? (I can give it a test when it gets split out)

Yesterday shattered in various directions, I hope to try today.

> > One point worth making, I do believe throughout that whatever the
> > address layout, "end" cannot be 0 - BUG_ON(addr >= end) assures.

Of course, that does allow some simplifications in your for_each
macros; but it still looked like my p??_limits were better for
shortest codepath, and close to yours for codesize.

> OK after sleeping on it, I'm warming to your way.
> 
> I don't think it makes something like David's modifications any
> easier, but mine didn't go a long way to that end either. And
> being a more incremental approach gives us more room to move in
> future (for example, maybe toward something that really *will*
> accommodate the bitmap walking code nicely).

I'll take a quick look at David's today.
Just so long as we don't make them harder.

> So I'd be pretty happy for you to queue this up with Andrew for
> 2.6.12. Anyone else?

Oh, okay, thanks.  You weren't very happy with p??_limit(addr, end),
and good naming is important to me.  I didn't care for your tentative
p??_span or p??_span_end.  Would p??_end be better?  p??_enda would
be fun for one of them...

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24  5:12                       ` Hugh Dickins
@ 2005-02-24  5:59                         ` Nick Piggin
  2005-02-24 11:58                           ` Hugh Dickins
  0 siblings, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-24  5:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Thu, 2005-02-24 at 05:12 +0000, Hugh Dickins wrote:
> On Thu, 24 Feb 2005, Nick Piggin wrote:

> > OK after sleeping on it, I'm warming to your way.
> > 
> > I don't think it makes something like David's modifications any
> > easier, but mine didn't go a long way to that end either. And
> > being a more incremental approach gives us more room to move in
> > future (for example, maybe toward something that really *will*
> > accommodate the bitmap walking code nicely).
> 
> I'll take a quick look at David's today.
> Just so long as we don't make them harder.
> 

No, I think we may want to move to something better abstracted:
it makes things sufficiently complex that you wouldn't want to
have it open-coded everywhere.

But no, you're not making it harder than the present situation.

> > So I'd be pretty happy for you to queue this up with Andrew for
> > 2.6.12. Anyone else?
> 
> Oh, okay, thanks.  You weren't very happy with p??_limit(addr, end),
> and good naming is important to me.  I didn't care for your tentative
> p??_span or p??_span_end.  Would p??_end be better?  p??_enda would
> be fun for one of them...
> 

pud_addr_end?




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24  5:59                         ` Nick Piggin
@ 2005-02-24 11:58                           ` Hugh Dickins
  2005-02-24 19:33                             ` David S. Miller
  2005-02-24 21:59                             ` Nick Piggin
  0 siblings, 2 replies; 33+ messages in thread
From: Hugh Dickins @ 2005-02-24 11:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Thu, 24 Feb 2005, Nick Piggin wrote:
> 
> pud_addr_end?

		next = pud_addr_end(addr, end);

Hmm, yes, I'll go with that, thanks (unless a better idea follows).

Something I do intend on top of what I sent before, is another set
of three macros, like

		if (pud_none_or_clear_bad(pud))
			continue;

to replace all the p??_none, p??_bad clauses: not to save space,
but just for clarity, those loops now seeming dominated by the
unlikeliest of cases.
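
Concretely, the loop shape this converges on would be roughly the sketch
below (illustration only, not the patch itself; it assumes pud_addr_end()
clamps to the next pud boundary and relies on end != 0, as noted earlier
in the thread):

	#define pud_addr_end(addr, end)						\
	({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
		__boundary < (end) ? __boundary : (end);			\
	})

	static inline void walk_pud_range(pgd_t *pgd, unsigned long addr,
					  unsigned long end)
	{
		pud_t *pud = pud_offset(pgd, addr);
		unsigned long next;

		do {
			next = pud_addr_end(addr, end);
			if (pud_none_or_clear_bad(pud))
				continue;
			/* ... descend into the pmd level for [addr, next) ... */
		} while (pud++, addr = next, addr != end);
	}

The continue still advances pud and addr because the increments live in the
while condition, which is what keeps the none/bad handling down to a single
line per level.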

Has anyone _ever_ seen a p??_ERROR message?  I'm inclined to just
put three functions into mm/memory.c to do the p??_ERROR and p??_clear,
but that way the __FILE__ and __LINE__ will always come out the same.
I think if it ever proves a problem, we'd just add in a dump_stack.

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24 11:58                           ` Hugh Dickins
@ 2005-02-24 19:33                             ` David S. Miller
  2005-02-25 10:44                               ` Andi Kleen
  2005-02-24 21:59                             ` Nick Piggin
  1 sibling, 1 reply; 33+ messages in thread
From: David S. Miller @ 2005-02-24 19:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nickpiggin, ak, benh, torvalds, akpm, linux-kernel

On Thu, 24 Feb 2005 11:58:42 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> Has anyone _ever_ seen a p??_ERROR message?

It triggers when you're writing new platform pagetable support
or making drastic changes in same.  But on sparc64 I've set
them all to nops to make the code output smaller. :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24 11:58                           ` Hugh Dickins
  2005-02-24 19:33                             ` David S. Miller
@ 2005-02-24 21:59                             ` Nick Piggin
  2005-02-24 22:32                               ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Nick Piggin @ 2005-02-24 21:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

Hugh Dickins wrote:
> On Thu, 24 Feb 2005, Nick Piggin wrote:
> 
>>pud_addr_end?
> 
> 
> 		next = pud_addr_end(addr, end);
> 
> Hmm, yes, I'll go with that, thanks (unless a better idea follows).
> 
> Something I do intend on top of what I sent before, is another set
> of three macros, like
> 
> 		if (pud_none_or_clear_bad(pud))
> 			continue;
> 
> to replace all the p??_none, p??_bad clauses: not to save space,
> but just for clarity, those loops now seeming dominated by the
> unlikeliest of cases.
> 
> Has anyone _ever_ seen a p??_ERROR message?  I'm inclined to just
> put three functions into mm/memory.c to do the p??_ERROR and p??_clear,
> but that way the __FILE__ and __LINE__ will always come out the same.
> I think if it ever proves a problem, we'd just add in a dump_stack.
> 

I think a function is the most sensible, and a good idea: it should
reduce the icache pressure in the loops (although gcc does seem to
do a pretty good job of moving unlikely()s away from the fastpath).

I think at the point these things get detected, there is little use
for having a dump_stack. But we may as well add one anyway if it is
an out-of-line function?

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24 21:59                             ` Nick Piggin
@ 2005-02-24 22:32                               ` Hugh Dickins
  2005-02-24 22:52                                 ` Nick Piggin
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2005-02-24 22:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

On Fri, 25 Feb 2005, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > Has anyone _ever_ seen a p??_ERROR message?  I'm inclined to just
> > put three functions into mm/memory.c to do the p??_ERROR and p??_clear,
> > but that way the __FILE__ and __LINE__ will always come out the same.
> > I think if it ever proves a problem, we'd just add in a dump_stack.
> 
> I think a function is the most sensible. And a good idea, it should
> reduce the icache pressure in the loops (although gcc does seem to
> do a pretty good job of moving unlikely()s away from the fastpath).

At one stage I was adding unlikelies to all the p??_bads, then it
seemed more sensible to hide that in a new macro (which of course
must do the none and bad tests inline, before going off to the function).
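
Spelled out, that arrangement would be roughly the following (names are
illustrative; the out-of-line function stands in for the three mm/memory.c
helpers mentioned above, one per level):

	/* out of line in mm/memory.c, so __FILE__/__LINE__ always point here */
	void pud_clear_bad(pud_t *pud)
	{
		pud_ERROR(*pud);
		pud_clear(pud);
	}

	/* inline wrapper: the none/bad tests stay in the caller's fast path,
	 * only the error case goes off to the out-of-line function */
	static inline int pud_none_or_clear_bad(pud_t *pud)
	{
		if (pud_none(*pud))
			return 1;
		if (unlikely(pud_bad(*pud))) {
			pud_clear_bad(pud);
			return 1;
		}
		return 0;
	}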

David's response confirms that __FILE__,__LINE__ shouldn't be an issue.

> I think at the point these things get detected, there is little use
> for having a dump_stack. But we may as well add one anyway if it is
> an out of line function?

We could at little cost.  But I think if these messages come up at all,
they're likely to come up in clumps, where the backtrace won't actually
be giving any interesting info, and the quantity of them would be a
nuisance in itself.  I'd rather leave it to the next person who gets the error and
wants the backtrace to add it.

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24 22:32                               ` Hugh Dickins
@ 2005-02-24 22:52                                 ` Nick Piggin
  0 siblings, 0 replies; 33+ messages in thread
From: Nick Piggin @ 2005-02-24 22:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, David S. Miller, benh, torvalds, akpm, linux-kernel

Hugh Dickins wrote:
> 
> At one stage I was adding unlikelies to all the p??_bads, then it
> seemed more sensible to hide that in a new macro (which of course
> must do the none and bad tests inline, before going off to the function).
> 

Yeah that sounds OK. I think (un)likely can propagate through
inline functions too, if that's any help to you.

> 
> We could at little cost.  But I think if these messages come up at all,
> they're likely to come up in clumps, where the backtrace won't actually
> be giving any interesting info, and the quantity of them be a nuisance
> itself.  I'd rather leave it to the next person who gets the error and
> wants the backtrace to add it.
> 

You're probably right - I know when I see them (from my
hacking up the code) they usually come in clumps :P

Nick


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/2] page table iterators
  2005-02-24 19:33                             ` David S. Miller
@ 2005-02-25 10:44                               ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2005-02-25 10:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: Hugh Dickins, nickpiggin, ak, benh, torvalds, akpm, linux-kernel

On Thu, Feb 24, 2005 at 11:33:50AM -0800, David S. Miller wrote:
> On Thu, 24 Feb 2005 11:58:42 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > Has anyone _ever_ seen a p??_ERROR message?
> 
> It triggers when you're writing new platform pagetable support
> or making drastric changes in same.  But on sparc64 I've set
> them all to nops to make the code output smaller. :-)

I don't think it's useful except for early debugging.

Also, at least on i386/x86-64 the CPU sets a bit in the page fault
error code when it encounters a corrupted page table. On x86-64
that bit is handled; on i386 it is not.
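
(For reference, a rough sketch of the bit being referred to; the value is
the usual x86 reserved-bit flag in the page fault error code, and the helper
name is only illustrative:)

	/* bit 3 of the error code is set by the CPU when it finds a
	 * reserved bit set in a page table entry */
	#define PF_RSVD		(1 << 3)

	static inline int fault_from_corrupt_page_table(unsigned long error_code)
	{
		return error_code & PF_RSVD;
	}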


-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

Thread overview: 33+ messages
2005-02-17 13:53 [PATCH 1/2] optimise copy page range Nick Piggin
2005-02-17 14:03 ` [PATCH 2/2] page table iterators Nick Piggin
2005-02-17 15:56   ` Linus Torvalds
2005-02-17 16:13     ` Nick Piggin
2005-02-17 19:43   ` Andi Kleen
2005-02-17 22:49     ` Benjamin Herrenschmidt
2005-02-17 23:03       ` Andi Kleen
2005-02-17 23:21         ` Benjamin Herrenschmidt
2005-02-17 23:34           ` Andi Kleen
2005-02-17 23:30         ` David S. Miller
2005-02-17 23:57           ` Andi Kleen
2005-02-20 12:35             ` Nick Piggin
2005-02-21  6:35               ` Hugh Dickins
2005-02-21  6:40                 ` Andrew Morton
2005-02-21  7:09                   ` Benjamin Herrenschmidt
2005-02-21  8:09                     ` Nick Piggin
2005-02-21  9:04                       ` Nick Piggin
2005-02-22  9:54                 ` Nick Piggin
2005-02-23  2:06                   ` Hugh Dickins
2005-02-23  4:31                     ` David S. Miller
2005-02-23  4:49                       ` Nick Piggin
2005-02-23  4:57                         ` David S. Miller
2005-02-23  5:23                       ` Nick Piggin
2005-02-23 23:52                     ` Nick Piggin
2005-02-24  0:00                       ` David S. Miller
2005-02-24  5:12                       ` Hugh Dickins
2005-02-24  5:59                         ` Nick Piggin
2005-02-24 11:58                           ` Hugh Dickins
2005-02-24 19:33                             ` David S. Miller
2005-02-25 10:44                               ` Andi Kleen
2005-02-24 21:59                             ` Nick Piggin
2005-02-24 22:32                               ` Hugh Dickins
2005-02-24 22:52                                 ` Nick Piggin
