* copy_page_range()
@ 2004-08-07  7:05 David S. Miller
  2004-08-07  8:07 ` copy_page_range() William Lee Irwin III
  2004-08-09  9:01 ` copy_page_range() David Mosberger
  0 siblings, 2 replies; 16+ messages in thread

From: David S. Miller @ 2004-08-07 7:05 UTC (permalink / raw)
To: torvalds; +Cc: linux-arch

Every couple of months I look at this thing.  The main issue is that
it's very cache unfriendly, especially with how sparsely populated the
page tables are for 64-bit processes.

As a simple example, it's at the top of the kernel profile for 64-bit
lat_proc {fork,exec,shell} on sparc64.  And it's in fact the pmd array
scans that take all of the cache misses, and thus most of the run time.

An idea I've always been entertaining is to associate a bitmask with
each pmd table.  For example, one possible implementation could be to
abuse struct page->index for this bitmask, and use
virt_to_page(pmdp)->index to get at it.

This divides the pmd table into BITS_PER_LONG sections.  If the bit is
set in ->index then we populated at least one of the pmd entries in
that section.  We never clear bits, except at pmd table allocation
time.

Then the pmd scan iterates over ->index, and only actually dereferences
the pmd entries iff it finds a set bit, and it only dereferences the
section of pmd entries represented by that bit.

Another idea I've also considered is to implement the pgd/pmd levels as
a more compact tree, based upon virtual address, such as a radix tree.

I think all of this could be experimented with if we abstracted out the
pmd/pgd/pte iteration.  So much stuff in the kernel mm code is of the
form:

	for_each_pgd(pgdp)
		for_each_pmd(pgdp, pmdp)
			for_each_pte(pmdp, ptep)
				do_something(ptep)

At 2 levels, as on most of the 32-bit platforms, things aren't so bad.

Comments?

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range()
  2004-08-07  7:05 copy_page_range() David S. Miller
@ 2004-08-07  8:07 ` William Lee Irwin III
  2004-08-11  7:07   ` copy_page_range() David S. Miller
  2004-08-09  9:01 ` copy_page_range() David Mosberger
  1 sibling, 1 reply; 16+ messages in thread

From: William Lee Irwin III @ 2004-08-07 8:07 UTC (permalink / raw)
To: David S. Miller; +Cc: torvalds, linux-arch

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> Every couple of months I look at this thing.  The main issue is that
> it's very cache unfriendly, especially with how sparsely populated
> the page tables are for 64-bit processes.
> As a simple example, it's at the top of the kernel profile for 64-bit
> lat_proc {fork,exec,shell} on sparc64.  And it's in fact the pmd
> array scans that take all of the cache misses, and thus most of the
> run time.
> An idea I've always been entertaining is to associate a bitmask with
> each pmd table.  For example, one possible implementation could be to
> abuse struct page->index for this bitmask, and use
> virt_to_page(pmdp)->index to get at it.

Sounds generally reasonable.

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> This divides the pmd table into BITS_PER_LONG sections.  If the bit
> is set in ->index then we populated at least one of the pmd entries
> in that section.  We never clear bits, except at pmd table allocation
> time.
> Then the pmd scan iterates over ->index, and only actually
> dereferences the pmd entries iff it finds a set bit, and it only
> dereferences the section of pmd entries represented by that bit.
> Another idea I've also considered is to implement the pgd/pmd levels
> as a more compact tree, based upon virtual address, such as a radix
> tree.
> I think all of this could be experimented with if we abstracted out
> the pmd/pgd/pte iteration.  So much stuff in the kernel mm code is of
> the form:
> 	for_each_pgd(pgdp)
> 		for_each_pmd(pgdp, pmdp)
> 			for_each_pte(pmdp, ptep)
> 				do_something(ptep)
> At 2 levels, as on most of the 32-bit platforms, things aren't so
> bad.
> Comments?

The number of levels can be abstracted easily.  Something to give an
idea of how might be something like this:

struct pte_walk_state {
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long vaddr;
};

int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
			struct vm_area_struct *vma)
{
	int cow, ret = 0;
	struct pte_walk_state walk_parent, walk_child;

	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	spin_lock(&dst->page_table_lock);
	pte_walk_descend_and_create(dst, &walk_child, vma->vm_start);
	for_each_inuse_pte(src, &walk_parent, vma->vm_start, vma->vm_end) {
		if (pte_walk_move_and_create(&walk_child, walk_parent.vaddr)) {
			ret = -ENOMEM;
			break;
		}
		/*
		 * do stuff to child and parent ptes
		 */
		...
	}
	spin_unlock(&dst->page_table_lock);
	return ret;
}

void zap_page_range(struct vm_area_struct *vma, unsigned long start,
			unsigned long len, struct zap_details *details)
{
	struct pte_walk_state walk;

	spin_lock(&vma->vm_mm->page_table_lock);
	for_each_inuse_pte(vma->vm_mm, &walk, vma->vm_start, vma->vm_end) {
		/*
		 * wipe pte and do stuff
		 */
		...
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
}

where

#define for_each_inuse_pte(mm, walk, start, end)			\
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end);	\
		next_inuse_pte(walk))

etc.

-- wli
* Re: copy_page_range()
  2004-08-07  8:07 ` copy_page_range() William Lee Irwin III
@ 2004-08-11  7:07   ` David S. Miller
  2004-08-11  7:35     ` copy_page_range() William Lee Irwin III
  2004-08-11 16:13     ` copy_page_range() Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread

From: David S. Miller @ 2004-08-11 7:07 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: torvalds, linux-arch

On Sat, 7 Aug 2004 01:07:51 -0700
William Lee Irwin III <wli@holomorphy.com> wrote:

> The number of levels can be abstracted easily.  Something to give an
> idea of how might be something like this:

I hacked up something slightly different today.  I only have it being
used by clear_page_range() but it is extremely effective.

Things like fork+exit latencies on my 750Mhz sparc64 box went from ~490
microseconds to ~367 microseconds.  fork+execve latency went down from
~1595 microseconds to ~1351 microseconds.

Two issues:

1) I'm not terribly satisfied with the interface.  I think with some
   improvements it can be applied to the two other routines this thing
   really makes sense for, namely copy_page_range and unmap_page_range.

2) I don't think it will collapse well for 2-level page tables,
   someone take a look?

It's easy to toy with the sparc64 optimization on other platforms: just
add the necessary hacks to pmd_set and pgd_set, allocation of pmd and
pgd tables, use "PAGE_SHIFT - 5" instead of "PAGE_SHIFT - 6" on 32-bit
platforms, and then copy the asm-sparc64/pgwalk.h bits over into your
platform's asm-${ARCH}/pgwalk.h.

I also just got reminded that we walk these damn page tables completely
twice every exit: once to unmap the VMAs' pte mappings, and once again
to zap the page tables.  It might be fruitful to explore combining
those two steps, perhaps not.

Anyway, comments and improvement suggestions welcome.  Particularly
interesting would be if this thing helps a lot on other platforms too,
such as x86_64, ia64, alpha and ppc64.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/08/10 23:44:24-07:00 davem@nuts.davemloft.net
#   [MM]: Add arch-overridable page table walking machinery.
#
#   Currently very rudimentary but is used fully for
#   clear_page_range().  An optimized implementation
#   is there for sparc64 and it is extremely effective,
#   particularly for 64-bit processes.
#
#   For things like lat_fork and friends clear_page_tables()
#   used to be 2nd or 3rd in the kernel profile; now it has
#   dropped to the 20th or so entry.
#
#   Signed-off-by: David S. Miller <davem@redhat.com>
#
# mm/memory.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -26
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-sparc64/pgtable.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +28 -4
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-sparc64/pgalloc.h
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +10 -2
#   [MM]: Add arch-overridable page table walking machinery.
#
# arch/sparc64/mm/init.c
#   2004/08/10 23:42:42-07:00 davem@nuts.davemloft.net +2 -2
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-x86_64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-v850/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-um/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-sparc64/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +114 -0
#   [MM]: Add arch-overridable page table walking machinery.
#
# include/asm-sparc/pgwalk.h
#   2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0
#   [MM]: Add arch-overridable page table walking machinery.
# # include/asm-sh64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-sh/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-s390/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-ppc64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-ppc/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-parisc/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-mips/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-m68knommu/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-m68k/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-ia64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-i386/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-h8300/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-generic/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +96 -0 # [MM]: Add arch-overridable page table walking machinery. 
# # include/asm-cris/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-arm26/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-arm/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. # # include/asm-x86_64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-x86_64/pgwalk.h # # include/asm-v850/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-v850/pgwalk.h # # include/asm-um/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-um/pgwalk.h # # include/asm-sparc64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc64/pgwalk.h # # include/asm-sparc/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-sparc/pgwalk.h # # include/asm-sh64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh64/pgwalk.h # # include/asm-sh/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-sh/pgwalk.h # # include/asm-s390/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-s390/pgwalk.h # # include/asm-ppc64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc64/pgwalk.h # # include/asm-ppc/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-ppc/pgwalk.h # # include/asm-parisc/pgwalk.h 
# 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-parisc/pgwalk.h # # include/asm-mips/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-mips/pgwalk.h # # include/asm-m68knommu/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68knommu/pgwalk.h # # include/asm-m68k/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-m68k/pgwalk.h # # include/asm-ia64/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-ia64/pgwalk.h # # include/asm-i386/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-i386/pgwalk.h # # include/asm-h8300/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-h8300/pgwalk.h # # include/asm-generic/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-generic/pgwalk.h # # include/asm-cris/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-cris/pgwalk.h # # include/asm-arm26/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm26/pgwalk.h # # include/asm-arm/pgwalk.h # 2004/08/10 23:42:14-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-arm/pgwalk.h # # include/asm-alpha/pgwalk.h # 2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +6 -0 # [MM]: Add arch-overridable page table walking machinery. 
# # include/asm-alpha/pgwalk.h # 2004/08/10 23:42:13-07:00 davem@nuts.davemloft.net +0 -0 # BitKeeper file /disk1/BK/sparc-2.6/include/asm-alpha/pgwalk.h # diff -Nru a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c --- a/arch/sparc64/mm/init.c 2004-08-10 23:44:47 -07:00 +++ b/arch/sparc64/mm/init.c 2004-08-10 23:44:47 -07:00 @@ -419,7 +419,7 @@ if (ptep == NULL) early_pgtable_allocfail("pte"); memset(ptep, 0, BASE_PAGE_SIZE); - pmd_set(pmdp, ptep); + pmd_set_k(pmdp, ptep); } ptep = (pte_t *)__pmd_page(*pmdp) + ((vaddr >> 13) & 0x3ff); @@ -1455,7 +1455,7 @@ memset(swapper_pmd_dir, 0, sizeof(swapper_pmd_dir)); /* Now can init the kernel/bad page tables. */ - pgd_set(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t))); + pgd_set_k(&swapper_pg_dir[0], swapper_pmd_dir + (shift / sizeof(pgd_t))); sparc64_vpte_patchme1[0] |= (((unsigned long)pgd_val(init_mm.pgd[0])) >> 10); diff -Nru a/include/asm-alpha/pgwalk.h b/include/asm-alpha/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-alpha/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _ALPHA_PGWALK_H +#define _ALPHA_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _ALPHA_PGWALK_H */ diff -Nru a/include/asm-arm/pgwalk.h b/include/asm-arm/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-arm/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _ARM_PGWALK_H +#define _ARM_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _ARM_PGWALK_H */ diff -Nru a/include/asm-arm26/pgwalk.h b/include/asm-arm26/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-arm26/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _ARM26_PGWALK_H +#define _ARM26_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _ARM26_PGWALK_H */ diff -Nru a/include/asm-cris/pgwalk.h b/include/asm-cris/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-cris/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _CRIS_PGWALK_H 
+#define _CRIS_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _CRIS_PGWALK_H */ diff -Nru a/include/asm-generic/pgwalk.h b/include/asm-generic/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-generic/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,96 @@ +#ifndef _GENERIC_PGWALK_H +#define _GENERIC_PGWALK_H + +#include <linux/mm.h> + +#include <asm/page.h> +#include <asm/pgtable.h> + +struct pte_walk_state; +typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *); +typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *); +typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *); + +struct pte_walk_state { + void *_client_state; + void *first; + void *last; +}; + +static inline void *pte_walk_client_state(struct pte_walk_state *walk) +{ + return walk->_client_state; +} + +static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last) +{ + walk->first = first; + walk->last = last; +} + +static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work) +{ + pte_t *ptep = walk->first; + pte_t *last = walk->last; + + do { + if (pte_none(*ptep)) + goto next; + pte_work(walk, ptep); + next: + ptep++; + } while (ptep < last); +} + +static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last) +{ + walk->first = first; + walk->last = last; +} + +static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work) +{ + pmd_t *page_dir = walk->first; + pmd_t *last = walk->last; + + do { + if (pmd_none(*page_dir)) + goto next; + if (unlikely(pmd_bad(*page_dir))) { + pmd_ERROR(*page_dir); + pmd_clear(page_dir); + goto next; + } + pmd_work(walk, page_dir); + next: + page_dir++; + } while (page_dir < last); +} + +static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last) +{ + walk->_client_state = client_state; + walk->first = first; + walk->last = last; +} + +static inline void 
pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work) +{ + pgd_t *page_dir = walk->first; + pgd_t *last = walk->last; + + do { + if (pgd_none(*page_dir)) + goto next; + if (unlikely(pgd_bad(*page_dir))) { + pgd_ERROR(*page_dir); + pgd_clear(page_dir); + goto next; + } + pgd_work(walk, page_dir); + next: + page_dir++; + } while (page_dir < last); +} + +#endif /* _GENERIC_PGWALK_H */ diff -Nru a/include/asm-h8300/pgwalk.h b/include/asm-h8300/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-h8300/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _H8300_PGWALK_H +#define _H8300_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _H8300_PGWALK_H */ diff -Nru a/include/asm-i386/pgwalk.h b/include/asm-i386/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-i386/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _I386_PGWALK_H +#define _I386_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _I386_PGWALK_H */ diff -Nru a/include/asm-ia64/pgwalk.h b/include/asm-ia64/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-ia64/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _IA64_PGWALK_H +#define _IA64_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _IA64_PGWALK_H */ diff -Nru a/include/asm-m68k/pgwalk.h b/include/asm-m68k/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-m68k/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _M68K_PGWALK_H +#define _M68K_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _M68K_PGWALK_H */ diff -Nru a/include/asm-m68knommu/pgwalk.h b/include/asm-m68knommu/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-m68knommu/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _M68KNOMMU_PGWALK_H +#define _M68KNOMMU_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _M68KNOMMU_PGWALK_H */ diff -Nru a/include/asm-mips/pgwalk.h b/include/asm-mips/pgwalk.h --- /dev/null Wed Dec
31 16:00:00 196900 +++ b/include/asm-mips/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _MIPS_PGWALK_H +#define _MIPS_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _MIPS_PGWALK_H */ diff -Nru a/include/asm-parisc/pgwalk.h b/include/asm-parisc/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-parisc/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _PARISC_PGWALK_H +#define _PARISC_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _PARISC_PGWALK_H */ diff -Nru a/include/asm-ppc/pgwalk.h b/include/asm-ppc/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-ppc/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _PPC_PGWALK_H +#define _PPC_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _PPC_PGWALK_H */ diff -Nru a/include/asm-ppc64/pgwalk.h b/include/asm-ppc64/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-ppc64/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _PPC64_PGWALK_H +#define _PPC64_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _PPC64_PGWALK_H */ diff -Nru a/include/asm-s390/pgwalk.h b/include/asm-s390/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-s390/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _S390_PGWALK_H +#define _S390_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _S390_PGWALK_H */ diff -Nru a/include/asm-sh/pgwalk.h b/include/asm-sh/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-sh/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _SH_PGWALK_H +#define _SH_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _SH_PGWALK_H */ diff -Nru a/include/asm-sh64/pgwalk.h b/include/asm-sh64/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-sh64/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _SH64_PGWALK_H +#define _SH64_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _SH64_PGWALK_H */ diff -Nru
a/include/asm-sparc/pgwalk.h b/include/asm-sparc/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-sparc/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _SPARC_PGWALK_H +#define _SPARC_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _SPARC_PGWALK_H */ diff -Nru a/include/asm-sparc64/pgalloc.h b/include/asm-sparc64/pgalloc.h --- a/include/asm-sparc64/pgalloc.h 2004-08-10 23:44:47 -07:00 +++ b/include/asm-sparc64/pgalloc.h 2004-08-10 23:44:47 -07:00 @@ -93,6 +93,8 @@ static __inline__ void free_pgd_fast(pgd_t *pgd) { + virt_to_page(pgd)->index = 0UL; + preempt_disable(); *(unsigned long *)pgd = (unsigned long) pgd_quicklist; pgd_quicklist = (unsigned long *) pgd; @@ -113,8 +115,10 @@ } else { preempt_enable(); ret = (unsigned long *) __get_free_page(GFP_KERNEL|__GFP_REPEAT); - if(ret) + if (ret) { memset(ret, 0, PAGE_SIZE); + virt_to_page(ret)->index = 0UL; + } } return (pgd_t *)ret; } @@ -162,8 +166,10 @@ pmd = pmd_alloc_one_fast(mm, address); if (!pmd) { pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pmd) + if (pmd) { memset(pmd, 0, PAGE_SIZE); + virt_to_page(pmd)->index = 0UL; + } } return pmd; } @@ -171,6 +177,8 @@ static __inline__ void free_pmd_fast(pmd_t *pmd) { unsigned long color = DCACHE_COLOR((unsigned long)pmd); + + virt_to_page(pmd)->index = 0UL; preempt_disable(); *(unsigned long *)pmd = (unsigned long) pte_quicklist[color]; diff -Nru a/include/asm-sparc64/pgtable.h b/include/asm-sparc64/pgtable.h --- a/include/asm-sparc64/pgtable.h 2004-08-10 23:44:47 -07:00 +++ b/include/asm-sparc64/pgtable.h 2004-08-10 23:44:47 -07:00 @@ -259,10 +259,34 @@ return __pte; } -#define pmd_set(pmdp, ptep) \ - (pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL)) -#define pgd_set(pgdp, pmdp) \ - (pgd_val(*(pgdp)) = (__pa((unsigned long) (pmdp)) >> 11UL)) + +#define PGTABLE_BIT_SHIFT (PAGE_SHIFT - 6) +#define PGTABLE_BIT_MASK ((1UL << PGTABLE_BIT_SHIFT) - 1) +#define PGTABLE_BIT_REGION (1UL << PGTABLE_BIT_SHIFT) 
+#define PGTABLE_BIT(ptr) \ + (1UL << (((unsigned long)(ptr) & ~PAGE_MASK) >> PGTABLE_BIT_SHIFT)) +#define __PGTABLE_REGION_NEXT(ptr,type) \ + ((type *)(((unsigned long)(ptr) + PGTABLE_BIT_REGION) & \ + ~PGTABLE_BIT_MASK)) +#define PMD_REGION_NEXT(pmdp) __PGTABLE_REGION_NEXT(pmdp,pmd_t) +#define PGD_REGION_NEXT(pgdp) __PGTABLE_REGION_NEXT(pgdp,pgd_t) + +#define pmd_set(pmdp, ptep) \ +do { \ + virt_to_page(pmdp)->index |= PGTABLE_BIT(pmdp); \ + pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL; \ +} while (0) +#define pmd_set_k(pmdp, ptep) \ + (pmd_val(*pmdp) = __pa((unsigned long) (ptep)) >> 11UL) + +#define pgd_set(pgdp, pmdp) \ +do { \ + virt_to_page(pgdp)->index |= PGTABLE_BIT(pgdp); \ + pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL; \ +} while (0) +#define pgd_set_k(pgdp, pmdp) \ + (pgd_val(*pgdp) = __pa((unsigned long) (pmdp)) >> 11UL) + #define __pmd_page(pmd) \ ((unsigned long) __va((((unsigned long)pmd_val(pmd))<<11UL))) #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) diff -Nru a/include/asm-sparc64/pgwalk.h b/include/asm-sparc64/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-sparc64/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,114 @@ +/* pgwalk.h: UltraSPARC fast page table traversal. + * + * Copyright 2004 David S. 
Miller <davem@redhat.com> + */ + +#ifndef _SPARC64_PGWALK_H +#define _SPARC64_PGWALK_H + +#include <linux/mm.h> + +#include <asm/page.h> +#include <asm/pgtable.h> + +struct pte_walk_state; +typedef void (*pgd_work_func_t)(struct pte_walk_state *, pgd_t *); +typedef void (*pmd_work_func_t)(struct pte_walk_state *, pmd_t *); +typedef void (*pte_work_func_t)(struct pte_walk_state *, pte_t *); + +struct pte_walk_state { + void *_client_state; + void *first; + void *last; +}; + +static inline void *pte_walk_client_state(struct pte_walk_state *walk) +{ + return walk->_client_state; +} + +static inline void pte_walk_init(struct pte_walk_state *walk, pte_t *first, pte_t *last) +{ + walk->first = first; + walk->last = last; +} + +static inline void pte_walk(struct pte_walk_state *walk, pte_work_func_t pte_work) +{ + pte_t *ptep = walk->first; + pte_t *last = walk->last; + + do { + if (pte_none(*ptep)) + goto next; + pte_work(walk, ptep); + next: + ptep++; + } while (ptep < last); +} + +static inline void pmd_walk_init(struct pte_walk_state *walk, pmd_t *first, pmd_t *last) +{ + walk->first = first; + walk->last = last; +} + +static inline void pmd_walk(struct pte_walk_state *walk, pmd_work_func_t pmd_work) +{ + pmd_t *page_dir = walk->first; + pmd_t *last = walk->last; + unsigned long mask; + + mask = virt_to_page(page_dir)->index; + + do { + if (likely(!(PGTABLE_BIT(page_dir) & mask))) { + page_dir = PMD_REGION_NEXT(page_dir); + continue; + } + if (pmd_none(*page_dir)) + goto next; + if (unlikely(pmd_bad(*page_dir))) { + pmd_ERROR(*page_dir); + pmd_clear(page_dir); + goto next; + } + pmd_work(walk, page_dir); + next: + page_dir++; + } while (page_dir < last); +} + +static inline void pgd_walk_init(struct pte_walk_state *walk, void *client_state, pgd_t *first, pgd_t *last) +{ + walk->_client_state = client_state; + walk->first = first; + walk->last = last; +} + +static inline void pgd_walk(struct pte_walk_state *walk, pgd_work_func_t pgd_work) +{ + pgd_t *page_dir = 
walk->first; + pgd_t *last = walk->last; + unsigned long mask; + + mask = virt_to_page(page_dir)->index; + + do { + if (likely(!(PGTABLE_BIT(page_dir) & mask))) { + page_dir = PGD_REGION_NEXT(page_dir); + continue; + } + if (pgd_none(*page_dir)) + goto next; + if (unlikely(pgd_bad(*page_dir))) { + pgd_ERROR(*page_dir); + pgd_clear(page_dir); + goto next; + } + pgd_work(walk, page_dir); + next: + page_dir++; + } while (page_dir < last); +} +#endif /* _SPARC64_PGWALK_H */ diff -Nru a/include/asm-um/pgwalk.h b/include/asm-um/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-um/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _UM_PGWALK_H +#define _UM_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _UM_PGWALK_H */ diff -Nru a/include/asm-v850/pgwalk.h b/include/asm-v850/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-v850/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _V850_PGWALK_H +#define _V850_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _V850_PGWALK_H */ diff -Nru a/include/asm-x86_64/pgwalk.h b/include/asm-x86_64/pgwalk.h --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/include/asm-x86_64/pgwalk.h 2004-08-10 23:44:47 -07:00 @@ -0,0 +1,6 @@ +#ifndef _X86_64_PGWALK_H +#define _X86_64_PGWALK_H + +#include <asm-generic/pgwalk.h> + +#endif /* _X86_64_PGWALK_H */ diff -Nru a/mm/memory.c b/mm/memory.c --- a/mm/memory.c 2004-08-10 23:44:47 -07:00 +++ b/mm/memory.c 2004-08-10 23:44:47 -07:00 @@ -52,6 +52,7 @@ #include <asm/tlb.h> #include <asm/tlbflush.h> #include <asm/pgtable.h> +#include <asm/pgwalk.h> #include <linux/swapops.h> #include <linux/elf.h> @@ -100,40 +101,25 @@ * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions.
*/ -static inline void free_one_pmd(struct mmu_gather *tlb, pmd_t * dir) +static void free_one_pmd(struct pte_walk_state *walk, pmd_t *dir) { struct page *page; - if (pmd_none(*dir)) - return; - if (unlikely(pmd_bad(*dir))) { - pmd_ERROR(*dir); - pmd_clear(dir); - return; - } page = pmd_page(*dir); pmd_clear(dir); dec_page_state(nr_page_table_pages); - pte_free_tlb(tlb, page); + pte_free_tlb(pte_walk_client_state(walk), page); } -static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir) +static void free_one_pgd(struct pte_walk_state *walk, pgd_t *dir) { - int j; pmd_t * pmd; - if (pgd_none(*dir)) - return; - if (unlikely(pgd_bad(*dir))) { - pgd_ERROR(*dir); - pgd_clear(dir); - return; - } pmd = pmd_offset(dir, 0); pgd_clear(dir); - for (j = 0; j < PTRS_PER_PMD ; j++) - free_one_pmd(tlb, pmd+j); - pmd_free_tlb(tlb, pmd); + pmd_walk_init(walk, pmd, pmd + PTRS_PER_PMD); + pmd_walk(walk, free_one_pmd); + pmd_free_tlb(pte_walk_client_state(walk), pmd); } /* @@ -144,13 +130,11 @@ */ void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr) { + struct pte_walk_state walk; pgd_t * page_dir = tlb->mm->pgd; - page_dir += first; - do { - free_one_pgd(tlb, page_dir); - page_dir++; - } while (--nr); + pgd_walk_init(&walk, tlb, page_dir + first, page_dir + first + nr); + pgd_walk(&walk, free_one_pgd); } pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
* Re: copy_page_range() 2004-08-11 7:07 ` copy_page_range() David S. Miller @ 2004-08-11 7:35 ` William Lee Irwin III 2004-08-11 16:13 ` copy_page_range() Linus Torvalds 1 sibling, 0 replies; 16+ messages in thread From: William Lee Irwin III @ 2004-08-11 7:35 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-arch On Sat, 7 Aug 2004 01:07:51 -0700 William Lee Irwin III wrote: >> The number of levels can be abstracted easily. Something to give an >> idea of how might be something like this: On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote: > I hacked up something slightly different today. I only > have it being used by clear_page_range() but it is extremely > effective. > Things like fork+exit latencies on my 750Mhz sparc64 box went > from ~490 microseconds to ~367 microseconds. fork+execve > latency went down from ~1595 microseconds to ~1351 microseconds. Nice results! On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote: > Two issues: > 1) I'm not terribly satisfied with the interface. I think > with some improvements it can be applies to the two other > routines this thing really makes sense for, namely copy_page_range > and unmap_page_range I think this involves discriminating between walking in tandem (over instantiated ptes in one and creation in the other), single walking to instantiate new ptes in an address range, and walking over instantiated ptes in an address range, possibly a fourth case for destruction. On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote: > 2) I don't think it will collapse well for 2-level page tables, > someone take a look? This is one of the reasons why I wanted to have the struct to put the handling of levels in the arch bits of the walking. That way, 2-level pagetables can be done without maintaining the extraneous pointer or the extra level of calls. On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. 
Miller wrote:
> It's easy to toy with the sparc64 optimization on other platforms,
> just add the necessary hacks to pmd_set and pgd_set, allocation
> of pmd and pgd tables, use "PAGE_SHIFT - 5" instead of "PAGE_SHIFT - 6"
> on 32-bit platforms, and then copy the asm-sparc64/pgwalk.h bits over
> into your platform's asm-${ARCH}/pgwalk.h
> I also just got reminded that we walk these damn pagetables completely
> twice every exit, once to unmap the VMAs' pte mappings, once again to
> zap the page tables. It might be fruitful to explore combining
> those two steps, perhaps not.

We really need to be freeing up pagetables during unmapping better, since they do "leak" a bit. This is causing pain elsewhere (hugetlb). Once we do that, clear_page_tables() is a nop and all its work is done while unmapping all the vmas. I vaguely remember some patches (associated with the shpte efforts) to do something like this having gone around before, though those were specifically directed at exit() and not at general munmap().

On Wed, Aug 11, 2004 at 12:07:08AM -0700, David S. Miller wrote:
> Anyways, comments and improvement suggestions welcome. Particularly
> interesting would be if this thing helps a lot on other platforms
> too, such as x86_64, ia64, alpha and ppc64.

I need to play with it a little to see what I can do.

-- wli

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-11 7:07 ` copy_page_range() David S. Miller 2004-08-11 7:35 ` copy_page_range() William Lee Irwin III @ 2004-08-11 16:13 ` Linus Torvalds 2004-08-11 20:45 ` copy_page_range() David S. Miller 2004-08-12 3:53 ` copy_page_range() David S. Miller 1 sibling, 2 replies; 16+ messages in thread From: Linus Torvalds @ 2004-08-11 16:13 UTC (permalink / raw) To: David S. Miller; +Cc: William Lee Irwin III, linux-arch On Wed, 11 Aug 2004, David S. Miller wrote: > > I hacked up something slightly different today. I only > have it being used by clear_page_range() but it is extremely > effective. Hmm.. I don't see any of this being arch-dependent, so I wonder why you did it that way. Also, one comment: the page directory "index" is never zeroed, as far as I can tell. > Things like fork+exit latencies on my 750Mhz sparc64 box went > from ~490 microseconds to ~367 microseconds. fork+execve > latency went down from ~1595 microseconds to ~1351 microseconds. That's definitely fascinating, and implies either a bug (hey, who knows?) or that page tables are a lot sparser than I'd have expected them to be. Ahh.. I see. The "clear_page_tables()" interface was really designed for a two-level page table. And even there I was lazy in exit_mmap(). Yeah, I think clear_page_tables() is broken, and it's increasingly broken on three or four-level setups. Your patch really works around the fact that we're extremely lazy about tearing down page tables. Ho humm. Maybe it's the right way to go, but I have to say that I would _really_ prefer to make this generic. There's absolutely nothing architecture-specific anywhere there except for the place where you hide the bitmap ("page->index" depends on a pgd/pmd being one page). I hate "asm-generic" if it's just hiding the fact that it really _is_ generic, but people wanted macros. 
David, could you look at, instead of doing this <asm/page-walk.c> thing, just doing a few _trivial_ macros in the asm page table headers:

	/* We use a bitmap in the pmd page to mark things busy,
	 * where we reduce the pmd index into 64 bits
	 */
	#define PMD_BITMAP_SHIFT (PAGE_SHIFT-6)
	#define pmd_usage_bitmap(pmd) (virt_to_page(pmd)->index)

	#define PGD_BITMAP_SHIFT (PAGE_SHIFT-6)
	#define pgd_usage_bitmap(pgd) (virt_to_page(pgd)->index)

and then the two-level folding can be done in the generic code by not defining the PGD_BITMAP_SHIFT at all or something like that, ie the generic code would have exactly _one_ #ifdef:

	clear_pgd_tables(...)
	{
		do {
	#ifdef PGD_BITMAP_SHIFT
			if (!(pgd_usage_bitmap(pgd) & mask))
				continue;
	#endif
			...
		} while (pgd < end)
	}

(you get the idea). What do you think?

Actually - make the same #ifdef in the pmd case too, since that allows architectures that don't fold things but don't have any good _room_ to hide the bitmap to also just not do this.

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-11 16:13 ` copy_page_range() Linus Torvalds @ 2004-08-11 20:45 ` David S. Miller 2004-08-12 3:53 ` copy_page_range() David S. Miller 1 sibling, 0 replies; 16+ messages in thread From: David S. Miller @ 2004-08-11 20:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 09:13:36 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote:
> Hmm.. I don't see any of this being arch-dependent, so I wonder why you
> did it that way.

I'm trying to achieve two goals. The first I've demonstrated is achievable; the second is still not fully grasped yet.

Firstly, I wanted to get clear_page_tables() out of my profiles. Secondly, I wanted to abstract out completely the page table traversing the generic kernel does.

I want the latter so I can experiment with different data structures for page tables, and the current pgd/pmd/pte array assumptions in the kernel generic vm code disallow any kind of tinkering in that area. If we end up with an interface that says: "walk page tables for vaddr range 'start' to 'end', and do func() for each pte" then anything can be experimented with.

You're absolutely right, and I've mentioned this earlier in this thread, that the current page tables are way too sparse. On 64-bit a simple hello world program with a 3-level page table looks roughly like:

	PGD_BASE:
	...
	X --> PMD_BASE1
	      ...
	      Y --> PTE_BASE1
	            ... some ptes ...
	...
	Z --> PMD_BASE2
	      ...
	      A --> PTE_BASE2
	            ... some ptes ...
	...
	B --> PMD_BASE3
	      ...
	      C --> PTE_BASE3
	            ... some ptes ...
	...

The X-->Y branch is for the program text. The Z-->A branch is for the dynamic mmap() area (shared libraries, anonymous mmaps, etc.) The B-->C branch is for the program stack. We've got maybe 10 to 20 present pte's in this tree.
On sparc64 pgd_t and pmd_t are both 32-bit (this is in order to encode the most address space possible; we can encode the full physical address by simply shifting out the page offset bits). So each pgd_t table holds 2048 entries, as does each pmd_t table.

Therefore, in the above example during clear_page_tables() we'd scan 2048 pgd's, 3 * 2048 pmd's and 3 * 1024 pte's. That's 7 * 8192 (PAGE_SIZE) bytes' worth of pointer derefing. It's no wonder this shows up in the profiles. All of that just for 10 to 20 actual user mappings. This is broken.

I want to try and use a less sparse data structure on sparc just for the pgd/pmd level, and use pages of ptes for the pte_t level as those tend to be well populated. I also need to retain the pte_t level as a full page due to the virtual linear page table stuff I do to speed up TLB miss processing (roughly the same as what ia64 does).

I can't experiment while all the generic code assumes these things are arrays.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-11 16:13 ` copy_page_range() Linus Torvalds 2004-08-11 20:45 ` copy_page_range() David S. Miller @ 2004-08-12 3:53 ` David S. Miller 1 sibling, 0 replies; 16+ messages in thread From: David S. Miller @ 2004-08-12 3:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: wli, linux-arch

On Wed, 11 Aug 2004 09:13:36 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote:
> Also, one comment: the page directory "index" is never zeroed, as far as I
> can tell.

It's done in the pmd/pgd freeing methods.

	static __inline__ void free_pgd_fast(pgd_t *pgd)
	{
	+	virt_to_page(pgd)->index = 0UL;
	+
	...

	static __inline__ void free_pmd_fast(pmd_t *pmd)
	{
		unsigned long color = DCACHE_COLOR((unsigned long)pmd);
	+
	+	virt_to_page(pmd)->index = 0UL;

and also at pmd/pgd allocation time.

> David, could you look at, instead of doing this <asm/page-walk.c> thing,
> just doing a few _trivial_ macros in the asm page table headers:

I assume you mean asm/page-walk.h, and sure I'll whip something up. But please keep in mind what I said in my other email, that I really want to (in the end) abstract away all page table walking, so that the only thing the generic VM code really plays around with are pte's. All page table traversal goes through an interface, so platforms can use whatever data structure (ie. something that isn't a flat out array) they want.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-07 7:05 copy_page_range() David S. Miller 2004-08-07 8:07 ` copy_page_range() William Lee Irwin III @ 2004-08-09 9:01 ` David Mosberger 2004-08-09 9:04 ` copy_page_range() William Lee Irwin III 2004-08-09 17:45 ` copy_page_range() David S. Miller 1 sibling, 2 replies; 16+ messages in thread From: David Mosberger @ 2004-08-09 9:01 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, linux-arch

>>>>> On Sat, 7 Aug 2004 00:05:29 -0700, "David S. Miller" <davem@redhat.com> said:

DaveM> Every couple months I look at this thing.
DaveM> The main issue is that it's very cache unfriendly, especially
DaveM> with how sparsely populated the page tables are for 64-bit
DaveM> processes.
DaveM> As a simple example, it's at the top of the kernel profile
DaveM> for 64-bit lat_proc {fork,exec,shell} on sparc64.

I didn't recall copy_page_range() being so high on ia64, but it's been a while since I looked at this, so I ran it again (this is with a simple fork() loop; lmbench is trying to be too clever for me so I don't like profiling it...):

	% time   self  cumul  calls self/call tot/call  name
	 36.89  10.78  10.78   267k    40.4u    40.4u  clear_page_tables
	 25.99   7.59  18.37   573k    13.2u    13.2u  copy_page
	 11.42   3.34  21.70  2.07M    1.61u    1.61u  clear_page
	  2.26   0.66  22.36  1.71M     385n     428n  copy_page_range
	  1.64   0.48  22.84   546k     878n     898n  finish_task_switch
	  1.50   0.44  23.28   314k    1.39u    1.48u  unmap_vmas
	  1.37   0.40  23.68  4.07M    98.2n    98.2n  __copy_user
	  1.32   0.39  24.06   302k    1.27u    1.38u  release_task
	  1.19   0.35  24.41  5.73M    60.6n    60.6n  page_remove_rmap
	  1.17   0.34  24.75  2.98M     114n     126n  buffered_rmqueue
	  1.01   0.30  25.05   316k     933n    6.37u  copy_process
	  0.92   0.27  25.32  2.68M     101n     107n  free_hot_cold_page
	  0.67   0.20  25.51  6.02M    32.7n    32.7n  put_page

I suspect some reasons for the different profile may be:

	- 16KB page-size vs. 4KB page-size
	- My binary was statically linked

The good news is that your proposal should help clear_page_tables() just as easily as copy_page_range().
;-) --david ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:01 ` copy_page_range() David Mosberger @ 2004-08-09 9:04 ` William Lee Irwin III 2004-08-09 9:27 ` copy_page_range() David Mosberger 2004-08-09 17:08 ` copy_page_range() Linus Torvalds 1 sibling, 2 replies; 16+ messages in thread From: William Lee Irwin III @ 2004-08-09 9:04 UTC (permalink / raw) To: davidm; +Cc: David S. Miller, torvalds, linux-arch

On Mon, Aug 09, 2004 at 02:01:37AM -0700, David Mosberger wrote:
> I didn't recall copy_page_range() being so high on ia64, but it's been
> a while since I looked at this, so I ran it again (this is with a simple
> fork() loop; lmbench is trying to be too clever for me so I don't like
> profiling it...):
> % time   self  cumul  calls self/call tot/call  name
> 36.89  10.78  10.78   267k    40.4u    40.4u  clear_page_tables
> 25.99   7.59  18.37   573k    13.2u    13.2u  copy_page
> 11.42   3.34  21.70  2.07M    1.61u    1.61u  clear_page
> I suspect some reasons for the different profile may be:
> - 16KB page-size vs. 4KB page-size
> - My binary was statically linked
> The good news is that your proposal should help clear_page_tables()
> just as easily as copy_page_range(). ;-)

These results are actually consistent with large-memory ia32. Instruction-level profiles showed that the largest overhead in copy_page_range() on such ia32 boxen appeared to be mm->rss++.

-- wli

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:04 ` copy_page_range() William Lee Irwin III @ 2004-08-09 9:27 ` David Mosberger 2004-08-09 9:29 ` copy_page_range() William Lee Irwin III 2004-08-09 17:46 ` copy_page_range() David S. Miller 1 sibling, 2 replies; 16+ messages in thread From: David Mosberger @ 2004-08-09 9:27 UTC (permalink / raw) To: William Lee Irwin III; +Cc: davidm, David S. Miller, torvalds, linux-arch

>>>>> On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:

William> These results are actually consistent with large-memory
William> ia32. Instruction-level profiles showed that the largest
William> overhead in copy_page_range() on such ia32 boxen appeared
William> to be mm->rss++.

Hmmh, for me, the single biggest stall seems to come from the pmd_none() check in free_one_pmd().

--david

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:27 ` copy_page_range() David Mosberger @ 2004-08-09 9:29 ` William Lee Irwin III 2004-08-09 10:01 ` copy_page_range() David Mosberger 1 sibling, 1 reply; 16+ messages in thread From: William Lee Irwin III @ 2004-08-09 9:29 UTC (permalink / raw) To: davidm; +Cc: David S. Miller, torvalds, linux-arch

On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:
William> These results are actually consistent with large-memory
William> ia32. Instruction-level profiles showed that the largest
William> overhead in copy_page_range() on such ia32 boxen appeared
William> to be mm->rss++.

On Mon, Aug 09, 2004 at 02:27:16AM -0700, David Mosberger wrote:
> Hmmh, for me, the single biggest stall seems to come from the pmd_none()
> check in free_one_pmd().

That was the case in clear_page_tables(); it was copy_page_range() that saw mm->rss++ take an unusual amount of time.

-- wli

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:29 ` copy_page_range() William Lee Irwin III @ 2004-08-09 10:01 ` David Mosberger 0 siblings, 0 replies; 16+ messages in thread From: David Mosberger @ 2004-08-09 10:01 UTC (permalink / raw) To: William Lee Irwin III; +Cc: davidm, David S. Miller, torvalds, linux-arch

>>>>> On Mon, 9 Aug 2004 02:29:43 -0700, William Lee Irwin III <wli@holomorphy.com> said:

William> On Mon, Aug 09, 2004 at 02:27:16AM -0700, David Mosberger
William> wrote:
>> Hmmh, for me, the single biggest stall seems to come from the
>> pmd_none() check in free_one_pmd().

William> That was the case in clear_page_tables(); it was
William> copy_page_range() that saw mm->rss++ take an unusual amount
William> of time.

Sorry, I misread your mail. In my case, the biggest staller in copy_page_range() seems to be page_dup_rmap() (right after dst->rss++). Specifically, the test_and_set_bit() which comes from page_dup_rmap()->page_map_lock()->bit_spin_lock() is causing the stalls.

--david

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:27 ` copy_page_range() David Mosberger 2004-08-09 9:29 ` copy_page_range() William Lee Irwin III @ 2004-08-09 17:46 ` David S. Miller 1 sibling, 0 replies; 16+ messages in thread From: David S. Miller @ 2004-08-09 17:46 UTC (permalink / raw) To: davidm; +Cc: davidm, wli, torvalds, linux-arch

On Mon, 9 Aug 2004 02:27:16 -0700 David Mosberger <davidm@napali.hpl.hp.com> wrote:
> >>>>> On Mon, 9 Aug 2004 02:04:58 -0700, William Lee Irwin III <wli@holomorphy.com> said:
> William> These results are actually consistent with large-memory
> William> ia32. Instruction-level profiles showed that the largest
> William> overhead in copy_page_range() on such ia32 boxen appeared
> William> to be mm->rss++.
> Hmmh, for me, the single biggest stall seems to come from the pmd_none()
> check in free_one_pmd().

Right, that is what gets hit on sparc64 too. On ia32, the tables are half the size, thus half the amount of memory accesses per table traversal.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:04 ` copy_page_range() William Lee Irwin III 2004-08-09 9:27 ` copy_page_range() David Mosberger @ 2004-08-09 17:08 ` Linus Torvalds 2004-08-09 18:49 ` copy_page_range() William Lee Irwin III 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2004-08-09 17:08 UTC (permalink / raw) To: William Lee Irwin III; +Cc: davidm, David S. Miller, linux-arch

On Mon, 9 Aug 2004, William Lee Irwin III wrote:
> These results are actually consistent with large-memory ia32.
> Instruction-level profiles showed that the largest overhead in
> copy_page_range() on such ia32 boxen appeared to be mm->rss++.

That sounds unlikely. Most ia32 instruction profiles will give high profile counts to instructions _following_ the one that was expensive, and in this case I'd strongly suspect that the real expense on x86 is the "get_page(page)" thing.

Which is an atomic increment, and thus very expensive.

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 17:08 ` copy_page_range() Linus Torvalds @ 2004-08-09 18:49 ` William Lee Irwin III 0 siblings, 0 replies; 16+ messages in thread From: William Lee Irwin III @ 2004-08-09 18:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: davidm, David S. Miller, linux-arch

On Mon, 9 Aug 2004, William Lee Irwin III wrote:
>> These results are actually consistent with large-memory ia32.
>> Instruction-level profiles showed that the largest overhead in
>> copy_page_range() on such ia32 boxen appeared to be mm->rss++.

On Mon, Aug 09, 2004 at 10:08:05AM -0700, Linus Torvalds wrote:
> That sounds unlikely. Most ia32 instruction profiles will give high
> profile counts to instructions _following_ the one that was expensive, and
> in this case I'd strongly suspect that the real expense on x86 is the
> "get_page(page)" thing.
> Which is an atomic increment, and thus very expensive.

But it was real. The theory is that mm->rss++; was an off-node memory access, where struct page (due to boot-time remapping voodoo) and pmd's (thanks to my patchwerk) were node-local, and the 40:1 off-node memory access latency for a remote cache miss (i.e. ZONE_NORMAL) killed it all. Thankfully Oracle has me parked on 64-bit machines with cache directories and vaguely speedy interconnects for this kind of work.

-- wli

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: copy_page_range() 2004-08-09 9:01 ` copy_page_range() David Mosberger 2004-08-09 9:04 ` copy_page_range() William Lee Irwin III @ 2004-08-09 17:45 ` David S. Miller 1 sibling, 0 replies; 16+ messages in thread From: David S. Miller @ 2004-08-09 17:45 UTC (permalink / raw) To: davidm; +Cc: davidm, torvalds, linux-arch

On Mon, 9 Aug 2004 02:01:37 -0700 David Mosberger <davidm@napali.hpl.hp.com> wrote:
> I didn't recall copy_page_range() being so high on ia64, but it's been
> a while since I looked at this, so I ran it again

I really meant clear_page_tables(), sorry. :-)

^ permalink raw reply	[flat|nested] 16+ messages in thread