* Re: Anticipatory prefaulting in the page fault handler V2
@ 2004-12-16 7:27 ` Andi Kleen
0 siblings, 0 replies; 3+ messages in thread
From: Andi Kleen @ 2004-12-16 7:27 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
>
> If a fault occurs for page x and is then followed by one for page x+1, it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 are prepared while handling the fault for
> page x+1, the overhead of taking a fault for x+2 is avoided. However,
> page x+2 may never be used, so we may have increased the application's rss
> unnecessarily. The swapper will take care of removing that page if memory
> gets tight.
I would be very careful with this. Windows does something like this
by default, and one application that I know of needs twice as much swap+memory
on Windows as on Linux because of this. Since it uses a lot of memory,
this would be a bad regression.
When you add it, there should at least be some easy way for an application
to turn it off (madvise, and probably a sysctl?), and the heuristic should be
made very conservative.
-Andi
* Re: page fault scalability patch V11 [1/7]: sloppy rss
@ 2004-11-22 15:00 Hugh Dickins
2004-12-08 22:50 ` Anticipatory prefaulting in the page fault handler V1 Martin J. Bligh
0 siblings, 1 reply; 3+ messages in thread
From: Hugh Dickins @ 2004-11-22 15:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm,
linux-ia64, linux-kernel
On Fri, 19 Nov 2004, Christoph Lameter wrote:
> On Fri, 19 Nov 2004, Hugh Dickins wrote:
>
> > Sorry, against what tree do these patches apply?
> > Apparently not linux-2.6.9, nor latest -bk, nor -mm?
>
> 2.6.10-rc2-bk3
Ah, thanks - got it patched now, but your mailer (or something else)
is eating trailing spaces. Better than adding them, but we have to
apply this patch before your set:
--- 2.6.10-rc2-bk3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
#define cmpxchg(ptr,o,n)\
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
-
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
--- 2.6.10-rc2-bk3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m
#endif /* __s390x__ */
-static inline void
+static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
#ifndef __s390x__
--- 2.6.10-rc2-bk3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
/*
* We are called with the MM semaphore and page_table_lock
* spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * multithreaded programs.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
* Fall back to the linear mapping if the fs does not support
* ->populate:
*/
- if (!vma->vm_ops || !vma->vm_ops->populate ||
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
return do_no_page(mm, vma, address, write_access, pte, pmd);
* Re: Anticipatory prefaulting in the page fault handler V1
@ 2004-12-08 22:50 ` Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
0 siblings, 1 reply; 3+ messages in thread
From: Martin J. Bligh @ 2004-12-08 22:50 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function, which is to clear and set up a new page
> table entry for a never-accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers can be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting anonymous page
> faults generated in sequence by an application.
>
> If a fault occurs for page x and is then followed by one for page x+1, it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 are prepared while handling the fault for
> page x+1, the overhead of taking a fault for x+2 is avoided. However,
> page x+2 may never be used, so we may have increased the application's rss
> unnecessarily. The swapper will take care of removing that page if memory
> gets tight.
I tried benchmarking it ... but processes just segfault all the time.
Any chance you could try it out on an SMP ia32 system?
M.
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 22:50 ` Anticipatory prefaulting in the page fault handler V1 Martin J. Bligh
@ 2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
0 siblings, 1 reply; 3+ messages in thread
From: Christoph Lameter @ 2004-12-09 19:32 UTC (permalink / raw)
To: Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> I tried benchmarking it ... but processes just segfault all the time.
> Any chance you could try it out on an SMP ia32 system?
I tried it on my i386 system and it works fine. Sorry about the puny
memory sizes (the system is a PIII-450 with 384MB of memory).
clameter@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500
0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723
clameter@schroedinger:~/pfault/code$ uname -a
Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST
2004 i686 GNU/Linux
Could you send me your .config?
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 19:32 ` Christoph Lameter
@ 2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
0 siblings, 1 reply; 3+ messages in thread
From: Akinobu Mita @ 2004-12-13 14:30 UTC (permalink / raw)
To: Christoph Lameter, Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Friday 10 December 2004 04:32, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> > I tried benchmarking it ... but processes just segfault all the time.
> > Any chance you could try it out on an SMP ia32 system?
>
> I tried it on my i386 system and it works fine. Sorry about the puny
> memory sizes (the system is a PIII-450 with 384MB of memory).
>
I also encountered process segfaults.
The patch below fixes several problems:
1) if no pages could be allocated, return VM_FAULT_OOM
2) fix a duplicated pte_offset_map() call
3) don't set_pte() an entry which has already been set
Actually, 3) fixes my segfault problem.
--- 2.6-rc/mm/memory.c.orig 2004-12-13 22:17:04.000000000 +0900
+++ 2.6-rc/mm/memory.c 2004-12-13 22:22:14.000000000 +0900
@@ -1483,6 +1483,8 @@ do_anonymous_page(struct mm_struct *mm,
} else
break;
}
+ if (a == addr)
+ goto no_mem;
end_addr = a;
spin_lock(&mm->page_table_lock);
@@ -1514,8 +1516,17 @@ do_anonymous_page(struct mm_struct *mm,
}
} else {
/* Read */
+ int first = 1;
+
for(;addr < end_addr; addr += PAGE_SIZE) {
- page_table = pte_offset_map(pmd, addr);
+ if (!first)
+ page_table = pte_offset_map(pmd, addr);
+ first = 0;
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ continue;
+ }
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
set_pte(page_table, entry);
pte_unmap(page_table);
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 14:30 ` Akinobu Mita
@ 2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
0 siblings, 1 reply; 3+ messages in thread
From: Christoph Lameter @ 2004-12-13 17:10 UTC (permalink / raw)
To: Akinobu Mita
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Mon, 13 Dec 2004, Akinobu Mita wrote:
> I also encountered process segfaults.
> The patch below fixes several problems:
>
> 1) if no pages could be allocated, return VM_FAULT_OOM
> 2) fix a duplicated pte_offset_map() call
I also saw these two issues and I think I dealt with them in a forthcoming
patch.
> 3) don't set_pte() an entry which has already been set
Not sure how this could have happened in the patch.
Could you try my updated version:
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-08 15:01:48.801457702 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-08 15:02:04.286479345 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-08 15:01:50.668339751 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-09 14:21:17.090061608 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,52 +1433,99 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
-
- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ unsigned long end_addr;
+
+ addr &= PAGE_MASK;
+
+ if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) {
+ /* Single page */
+ current->anon_fault_order = 0;
+ end_addr = addr + PAGE_SIZE;
+ } else {
+ /* Sequence of faults detect. Perform preallocation */
+ int order = ++current->anon_fault_order;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (PAGE_SIZE << order);
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
- /* ..except if it's a write access */
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+ }
if (write_access) {
- /* Allocate our own private page. */
+
+ unsigned long a;
+ struct page **p;
+ struct pagevec pv;
+
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+ pagevec_init(&pv, 0);
+
if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr; a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (likely(p)) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv, p);
+ } else {
+ if (a == addr)
+ return VM_FAULT_OOM;
+ break;
+ }
+ }
spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
- if (!pte_none(*page_table)) {
+ for(p = pv.pages; addr < a; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (unlikely(!pte_none(*page_table))) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ page_cache_release(*p);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+nextread:
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ addr += PAGE_SIZE;
+ if (unlikely(addr < end_addr)) {
+ pte_offset_map(pmd, addr);
+ goto nextread;
}
- mm->rss++;
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}
-
- set_pte(page_table, entry);
- pte_unmap(page_table);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ current->anon_fault_next_addr = addr;
spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}
/*
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 17:10 ` Christoph Lameter
@ 2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 1:32 ` Anticipatory prefaulting in the page fault handler V2 Christoph Lameter
0 siblings, 1 reply; 3+ messages in thread
From: Martin J. Bligh @ 2004-12-13 22:16 UTC (permalink / raw)
To: Christoph Lameter, Akinobu Mita
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
>> I also encountered process segfaults.
>> The patch below fixes several problems:
>>
>> 1) if no pages could be allocated, return VM_FAULT_OOM
>> 2) fix a duplicated pte_offset_map() call
>
> I also saw these two issues and I think I dealt with them in a forthcoming
> patch.
>
>> 3) don't set_pte() an entry which has already been set
>
> Not sure how this could have happened in the patch.
>
> Could you try my updated version:
Urgle. There was a fix from Hugh too ... any chance you could just stick
a whole new patch somewhere? I'm too idle/stupid to work it out ;-)
M.
* Anticipatory prefaulting in the page fault handler V2
2004-12-13 22:16 ` Martin J. Bligh
@ 2004-12-14 1:32 ` Christoph Lameter
2004-12-14 19:31 ` Adam Litke
0 siblings, 1 reply; 3+ messages in thread
From: Christoph Lameter @ 2004-12-14 1:32 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Akinobu Mita, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
Changes from V1 to V2:
- Eliminate duplicate code and reorganize things
- Use SetPageReferenced instead of mark_page_accessed (Hugh Dickins)
- Fix the problem of the preallocation order increasing out of bounds
(leading to memory being overwritten with pointers to struct page)
- Return VM_FAULT_OOM if not able to allocate a single page
- Tested on i386 and ia64
- New performance test for low CPU counts (up to 8, so that this does not
seem too exotic)
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function, which is to clear and set up a new page
table entry for a never-accessed memory location. This overhead increases
significantly in an SMP environment.
In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers can be processed
concurrently on multiple cpus. This patch attempts to aggregate multiple
page faults into a single one. It does that by noting anonymous page faults
generated in sequence by an application.
If a fault occurs for page x and is then followed by one for page x+1, it may
be reasonable to expect another page fault at x+2 in the future. If page
table entries for x+1 and x+2 are prepared while handling the fault for
page x+1, the overhead of taking a fault for x+2 is avoided. However,
page x+2 may never be used, so we may have increased the application's rss
unnecessarily. The swapper will take care of removing that page if memory
gets tight.
The following patch makes the anonymous fault handler anticipate future
faults. For each fault, a prediction is made of where the next fault will
occur (assuming linear access by the application). If the prediction turns
out to be right (the next fault is where expected), a number of pages are
preallocated in order to avoid a series of future faults. The order of the
preallocation doubles with each success in sequence: the first successful
prediction leads to an additional page being allocated, the second to 2
additional pages, the third to 4 pages, and so on. The max order is 3 by
default. In a large contiguous allocation the number of faults is reduced
by a factor of 8.
Standard kernel on an 8-CPU machine allocating 1 and 4 GB with an increasing
number of threads (and thus increasing parallelism of page faults):
ia64 2.6.10-rc3-bk7
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 3 1 0.047s 2.163s 2.021s 88925.153 88859.030
1 3 2 0.040s 3.215s 1.069s 60385.889 115677.685
1 3 4 0.041s 3.509s 1.023s 55370.338 158971.609
1 3 8 0.047s 4.130s 1.014s 47049.904 172405.990
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.155s 11.277s 11.043s 68788.420 68747.223
4 3 2 0.161s 16.459s 8.061s 47315.277 91322.962
4 3 4 0.170s 14.708s 4.079s 52852.007 164043.773
4 3 8 0.171s 23.257s 4.028s 33565.604 183348.574
ia64 Patched kernel:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 3 1 0.008s 2.080s 2.008s 94121.792 94101.359
1 3 2 0.015s 3.128s 1.064s 62523.771 119563.496
1 3 4 0.008s 2.714s 1.012s 72185.910 175020.971
1 3 8 0.016s 2.963s 0.087s 65965.457 223921.949
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.034s 10.861s 10.089s 72179.444 72181.353
4 3 2 0.050s 14.303s 7.072s 54786.447 101738.901
4 3 4 0.038s 13.478s 4.044s 58182.649 176913.840
4 3 8 0.063s 13.584s 3.007s 57620.638 256109.927
i386 2.6.10-rc3-bk3 256M allocation 2x Pentium III 500 MHz
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.020s 1.566s 1.058s123827.513 123842.098
0 3 2 0.017s 2.439s 1.043s 79999.154 136931.671
i386 2.6.10-rc3-bk3 patched
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.020s 1.527s 1.039s126945.181 140930.664
0 3 2 0.016s 2.417s 1.026s 80754.809 155162.903
Patch against 2.6.10-rc3-bk7:
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-13 15:14:40.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-13 15:15:55.000000000 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-13 15:14:40.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-13 16:49:31.000000000 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,52 +1433,102 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
-
- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ unsigned long end_addr;
+
+ addr &= PAGE_MASK;
+
+ if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) {
+ /* Single page */
+ current->anon_fault_order = 0;
+ end_addr = addr + PAGE_SIZE;
+ } else {
+ /* Sequence of faults detect. Perform preallocation */
+ int order = ++current->anon_fault_order;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (PAGE_SIZE << order);
+ else {
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+ current->anon_fault_order = 3;
+ }
- /* ..except if it's a write access */
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+ }
if (write_access) {
- /* Allocate our own private page. */
+
+ unsigned long a;
+ int i;
+ struct pagevec pv;
+
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+ pagevec_init(&pv, 0);
+
if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr; a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (likely(p)) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv, p);
+ } else {
+ if (a == addr)
+ return VM_FAULT_OOM;
+ break;
+ }
+ }
spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
- if (!pte_none(*page_table)) {
+ for(i = 0; addr < a; addr += PAGE_SIZE, i++) {
+ struct page *p = pv.pages[i];
+
+ page_table = pte_offset_map(pmd, addr);
+ if (unlikely(!pte_none(*page_table))) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ page_cache_release(p);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(p);
+ SetPageReferenced(p);
+ page_add_anon_rmap(p, vma, addr);
+
+ set_pte(page_table, entry);
pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+nextread:
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ addr += PAGE_SIZE;
+ if (unlikely(addr < end_addr)) {
+ pte_offset_map(pmd, addr);
+ goto nextread;
}
- mm->rss++;
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}
-
- set_pte(page_table, entry);
- pte_unmap(page_table);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ current->anon_fault_next_addr = addr;
spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}
/*
* Re: Anticipatory prefaulting in the page fault handler V2
2004-12-14 1:32 ` Anticipatory prefaulting in the page fault handler V2 Christoph Lameter
@ 2004-12-14 19:31 ` Adam Litke
0 siblings, 0 replies; 3+ messages in thread
From: Adam Litke @ 2004-12-14 19:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, Akinobu Mita, nickpiggin, Jeff Garzik, torvalds,
hugh, benh, linux-mm, linux-ia64, linux-kernel
Just to add another data point: This works on my 4-way ppc64 (Power4)
box. I am seeing no degradation when running this on kernbench (which
is expected). For the curious, here are the results:
Kernbench results with anon-prefault:
349.86user 49.64system 1:57.85elapsed 338%CPU (0avgtext+0avgdata 0maxresident)k
349.65user 49.81system 1:58.31elapsed 337%CPU (0avgtext+0avgdata 0maxresident)k
349.48user 50.00system 1:53.70elapsed 351%CPU (0avgtext+0avgdata 0maxresident)k
349.73user 49.69system 1:57.67elapsed 339%CPU (0avgtext+0avgdata 0maxresident)k
349.75user 49.85system 1:52.71elapsed 354%CPU (0avgtext+0avgdata 0maxresident)k
Elapsed: 116.048s User: 349.694s System: 49.798s CPU: 343.8%
Kernbench results without anon-prefault:
350.86user 52.54system 1:53.45elapsed 355%CPU (0avgtext+0avgdata 0maxresident)k
350.99user 52.36system 1:52.05elapsed 359%CPU (0avgtext+0avgdata 0maxresident)k
350.92user 52.68system 1:54.14elapsed 353%CPU (0avgtext+0avgdata 0maxresident)k
350.98user 52.38system 1:56.17elapsed 347%CPU (0avgtext+0avgdata 0maxresident)k
351.16user 52.31system 1:53.90elapsed 354%CPU (0avgtext+0avgdata 0maxresident)k
Elapsed: 113.942s User: 350.982s System: 52.454s CPU: 353.6%
On Mon, 2004-12-13 at 19:32, Christoph Lameter wrote:
> Changes from V1 to V2:
> - Eliminate duplicate code and reorganize things
> - Use SetReferenced instead of mark_accessed (Hugh Dickins)
> - Fix the problem of the preallocation order increasing out of bounds
> (leading to memory being overwritten with pointers to struct page)
> - Return VM_FAULT_OOM if not able to allocate a single page
> - Tested on i386 and ia64
> - New performance test for low cpu counts (up to 8 so that this does not
> seem to be too exotic)
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
end of thread, other threads: [~2004-12-16 7:28 UTC | newest]
Thread overview: 3+ messages
2004-12-16 7:27 ` Anticipatory prefaulting in the page fault handler V2 Andi Kleen
2004-11-22 15:00 page fault scalability patch V11 [1/7]: sloppy rss Hugh Dickins
2004-12-08 22:50 ` Anticipatory prefaulting in the page fault handler V1 Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 1:32 ` Anticipatory prefaulting in the page fault handler V2 Christoph Lameter
2004-12-14 19:31 ` Adam Litke