linux-kernel.vger.kernel.org archive mirror
* page fault fastpath: Increasing SMP scalability by introducing pte locks?
@ 2004-08-15 13:50 Christoph Lameter
  2004-08-15 20:09 ` David S. Miller
  2004-08-15 22:38 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Benjamin Herrenschmidt
  0 siblings, 2 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-15 13:50 UTC (permalink / raw)
  To: linux-ia64; +Cc: linux-kernel

Well, this is more an idea than a real patch yet. The page_table_lock
becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
memory. "pft" is a program that measures the performance of page faults on
an SMP system. It allocates memory simultaneously in multiple threads,
thereby causing lots of page faults for anonymous pages.

Results for a standard 2.6.8.1 kernel. Allocating 2G of RAM in an 8
processor SMP system:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  2   3    1    0.094s      4.500s   4.059s 85561.646  85568.398
  2   3    2    0.092s      6.390s   3.043s 60649.650 114521.474
  2   3    4    0.081s      6.500s   1.093s 59740.813 203552.963
  2   3    8    0.101s     12.001s   2.035s 32487.736 167082.560

Scalability problems set in above 4 CPUs.

With pte locks and the fastpath:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  2   3    1    0.071s      4.535s   4.061s 85362.102  85288.646
  2   3    2    0.073s      4.793s   2.013s 80789.137 184196.199
  2   3    4    0.087s      5.119s   1.057s 75516.326 249716.547
  2   3    8    0.096s      7.089s   1.019s 54715.728 328540.988

Performance in SMP configurations is significantly enhanced by this
patch.

Note that the patch does not address various race conditions that may
result from using a pte lock only in handle_mm_fault. Some rules still
need to be developed for how to coordinate pte locks and the
page_table_lock in order to avoid these.

pte locks are realized by finding a spare bit in the ptes (TLB structures
on IA64 and i386) and setting that bit atomically via bitops for locking.
The fastpath does not acquire the page_table_lock but instead immediately
locks the pte. Thus the logic to release and later reacquire the
page_table_lock is avoided. Multiple page faults can run concurrently
using pte locks, avoiding the page_table_lock. Essentially, pte locks
allow a finer granularity of locking.
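
To illustrate, here is a condensed sketch of the read-fault case as the
fastpath patch attached below implements it (function name invented for
illustration; the write case additionally allocates and maps an anonymous
page):

/*
 * Condensed illustration of the fastpath for a not-present anonymous
 * read fault. ptep_lock() returns nonzero if the pte was already locked.
 */
static int anon_fault_fastpath(struct vm_area_struct *vma,
			       unsigned long address, pte_t *pte)
{
	pte_t entry;

	if (!pte_none(*pte))
		return 0;		/* fall back to the locked slow path */
	if (ptep_lock(pte))
		return VM_FAULT_MINOR;	/* another cpu is handling this pte */

	/* map the zero page read-only; no page_table_lock taken */
	entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
	set_pte(pte, entry);		/* writing the full pte drops the lock bit */
	update_mmu_cache(vma, address, entry);
	return VM_FAULT_MINOR;
}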

I would like to get some feedback on whether people feel that this is the
right way to solve the issue. Most of this is based on work by Ray
Bryant and others at SGI.

Attached are:
1. pte lock patch for i386 and ia64
2. page_fault fastpath
3. page fault test program
4. test script

=========== PTE LOCK PATCH

Index: linux-2.6.8-rc4/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-generic/pgtable.h	2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-generic/pgtable.h	2004-08-13 10:05:36.000000000 -0700
@@ -126,4 +126,11 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_PTE_LOCK
+/* need to fall back to the mm spinlock if PTE locks are not supported */
+#define ptep_lock(ptep)		!spin_trylock(&mm->page_table_lock)
+#define ptep_unlock(ptep)	spin_unlock(&mm->page_table_lock)
+#define pte_locked(pte)		spin_is_locked(&mm->page_table_lock)
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-ia64/pgtable.h	2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-ia64/pgtable.h	2004-08-13 10:19:15.000000000 -0700
@@ -30,6 +30,8 @@
 #define _PAGE_P_BIT		0
 #define _PAGE_A_BIT		5
 #define _PAGE_D_BIT		6
+#define _PAGE_IG_BITS		53
+#define _PAGE_LOCK_BIT		(_PAGE_IG_BITS+3)	/* bit 56. Aligned to 8 bits */

 #define _PAGE_P			(1 << _PAGE_P_BIT)	/* page present bit */
 #define _PAGE_MA_WB		(0x0 <<  2)	/* write back memory attribute */
@@ -58,6 +60,7 @@
 #define _PAGE_PPN_MASK		(((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
 #define _PAGE_ED		(__IA64_UL(1) << 52)	/* exception deferral */
 #define _PAGE_PROTNONE		(__IA64_UL(1) << 63)
+#define _PAGE_LOCK		(__IA64_UL(1) << _PAGE_LOCK_BIT)

 /* Valid only for a PTE with the present bit cleared: */
 #define _PAGE_FILE		(1 << 1)		/* see swap & file pte remarks below */
@@ -282,6 +285,13 @@
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_D))

+/*
+ * Lock functions for pte's
+*/
+#define ptep_lock(ptep)		test_and_set_bit(_PAGE_LOCK_BIT,ptep)
+#define ptep_unlock(ptep)	{ clear_bit(_PAGE_LOCK_BIT,ptep);smp_mb__after_clear_bit(); }
+#define pte_locked(pte)		((pte_val(pte) & _PAGE_LOCK)!=0)
+
 /*
  * Macro to a page protection value as "uncacheable".  Note that "protection" is really a
  * misnomer here as the protection value contains the memory attribute bits, dirty bits,
@@ -558,6 +568,7 @@
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-i386/pgtable.h	2004-08-09 19:23:35.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-i386/pgtable.h	2004-08-13 10:04:19.000000000 -0700
@@ -101,7 +101,7 @@
 #define _PAGE_BIT_DIRTY		6
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1	9	/* available for programmer */
+#define _PAGE_BIT_LOCK		9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
 #define _PAGE_BIT_NX		63
@@ -115,7 +115,7 @@
 #define _PAGE_DIRTY	0x040
 #define _PAGE_PSE	0x080	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_GLOBAL	0x100	/* Global TLB entry PPro+ */
-#define _PAGE_UNUSED1	0x200	/* available for programmer */
+#define _PAGE_LOCK	0x200	/* available for programmer */
 #define _PAGE_UNUSED2	0x400
 #define _PAGE_UNUSED3	0x800

@@ -260,6 +260,10 @@
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }

+#define ptep_lock(ptep) test_and_set_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define ptep_unlock(ptep) clear_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define pte_locked(pte) (((pte).pte_low & _PAGE_LOCK) != 0)
+
 /*
  * Macro to mark a page protection value as "uncacheable".  On processors which do not support
  * it, this is a no-op.
@@ -419,6 +423,7 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */

======= PAGEFAULT FASTPATH
Index: linux-2.6.8-rc4/mm/memory.c
===================================================================
--- linux-2.6.8-rc4.orig/mm/memory.c	2004-08-09 19:23:02.000000000 -0700
+++ linux-2.6.8-rc4/mm/memory.c	2004-08-13 10:19:21.000000000 -0700
@@ -1680,6 +1680,10 @@
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
+#ifdef __HAVE_ARCH_PTE_LOCK
+	pte_t *pte;
+	pte_t entry;
+#endif

 	__set_current_state(TASK_RUNNING);
 	pgd = pgd_offset(mm, address);
@@ -1688,7 +1692,64 @@

 	if (is_vm_hugetlb_page(vma))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+#ifdef __HAVE_ARCH_PTE_LOCK
+	/*
+	 * Fast path for anonymous pages, not found faults bypassing
+	 * the necessity to acquire the page_table_lock
+	 */
+
+	if ((vma->vm_ops && vma->vm_ops->nopage) || pgd_none(*pgd)) goto use_page_table_lock;
+	pmd = pmd_offset(pgd,address);
+	if (pmd_none(*pmd)) goto use_page_table_lock;
+	pte = pte_offset_kernel(pmd,address);
+	if (pte_locked(*pte)) return VM_FAULT_MINOR;
+	if (!pte_none(*pte)) goto use_page_table_lock;
+
+	/*
+	 * Page not present, so kswapd and PTE updates will not touch the pte
+	 * so we are able to just use a pte lock.
+	 */
+
+	if (ptep_lock(pte)) return VM_FAULT_MINOR;
+		/*
+		 * PTE already locked so this code is already running on another processor. Wait
+		 * until that processor does our work and then return. If something went
+		 * wrong in the handling of the other processor then we will get another page fault
+		 * that may then handle the error condition
+		 */
+
+	/* Read-only mapping of ZERO_PAGE. */
+	entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
+
+	if (write_access) {
+		struct page *page;
+
+		if (unlikely(anon_vma_prepare(vma))) goto no_mem;
+
+		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!page)  goto no_mem;
+		clear_user_highpage(page, address);
+
+		mm->rss++;
+		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,vma->vm_page_prot)),vma);
+		lru_cache_add_active(page);
+		mark_page_accessed(page);
+		page_add_anon_rmap(page, vma, address);
+	}
+	/* Setting the pte clears the pte lock so there is no need for unlocking */
+	set_pte(pte, entry);
+	pte_unmap(pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	return VM_FAULT_MINOR;		/* Minor fault */

+no_mem:
+	ptep_unlock(pte);
+	return VM_FAULT_OOM;
+
+use_page_table_lock:
+#endif
 	/*
 	 * We need the page table lock to synchronize with kswapd
 	 * and the SMP-safe atomic PTE updates.

======= PFT.C test program
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#include <sys/mman.h>
#include <time.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/wait.h>

extern int      optind, opterr;
extern char     *optarg;

long     bytes=16384;
long     sleepsec=0;
long     verbose=0;
long     forkcnt=1;
long     repeatcount=1;
long     do_bzero=0;
long     mypid;
int 	title=0;

volatile int    go, state[128];

struct timespec wall;
struct rusage ruse;
long faults;
long pages;
long gbyte;
double faults_per_sec;
double faults_per_sec_per_cpu;

#define perrorx(s)      (perror(s), exit(1))
#define NBPP            16384

void* test(void*);
void  launch(void);


int main(int argc, char *argv[])
{
        int                     i, j, c, stat, er=0;
        static  char            optstr[] = "b:f:g:r:s:vzHt";


        opterr=1;
        while ((c = getopt(argc, argv, optstr)) != EOF)
                switch (c) {
                case 'g':
                        bytes = atol(optarg)*1024*1024*1024;
                        break;
                case 'b':
                        bytes = atol(optarg);
                        break;
                case 'f':
                        forkcnt = atol(optarg);
                        break;
                case 'r':
                        repeatcount = atol(optarg);
                        break;
                case 's':
                        sleepsec = atol(optarg);
                        break;
                case 'v':
                        verbose++;
                        break;
                case 'z':
                        do_bzero++;
                        break;
                case 'H':
                        er++;
                        break;
		case 't' :
			title++;
			break;
                case '?':
                        er = 1;
                        break;
                }

        if (er) {
                printf("usage: %s %s\n", argv[0], optstr);
                exit(1);
        }

	pages = bytes*repeatcount/getpagesize();
	gbyte = bytes/(1024*1024*1024);
	bytes = bytes/forkcnt;

	if (verbose) printf("Calculated pages=%ld pagesize=%ld.\n",pages,getpagesize());

        mypid = getpid();
        setpgid(0, mypid);

        for (i=0; i<repeatcount; i++) {
                if (fork() == 0)
                        launch();
                while (wait(&stat) > 0);
        }

	getrusage(RUSAGE_CHILDREN,&ruse);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&wall);
	if (verbose) printf("Calculated faults=%ld. Real minor faults=%ld, major faults=%ld\n",pages,ruse.ru_minflt+ruse.ru_majflt);
	faults_per_sec=(double) pages / ((double) wall.tv_sec + (double) wall.tv_nsec / 1000000000.0);
	faults_per_sec_per_cpu=(double) pages /  (
		(double) (ruse.ru_utime.tv_sec + ruse.ru_stime.tv_sec) + ((double) (ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec) / 1000000.0));
	if (title) printf(" Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec\n");
	printf("%3ld %3ld %4ld %4ld.%03lds%7ld.%03lds%4ld.%03lds%10.3f %10.3f\n",
		gbyte,repeatcount,forkcnt,
		ruse.ru_utime.tv_sec,ruse.ru_utime.tv_usec/1000,
		ruse.ru_stime.tv_sec,ruse.ru_stime.tv_usec/1000,
		wall.tv_sec,wall.tv_nsec/1000000,
		faults_per_sec_per_cpu,faults_per_sec);
        exit(0);
}

char *
do_shm(long shmlen) {
        char    *p;
        int     shmid;

        printf ("Try to allocate TOTAL shm segment of %ld bytes\n", shmlen);

        if ((shmid = shmget(IPC_PRIVATE, shmlen, SHM_R|SHM_W))  == -1)
                perrorx("shmget faiiled");

        p=(char*)shmat(shmid, (void*)0, SHM_R|SHM_W);
	printf("  created, adr: 0x%lx\n", (long)p);
	printf("  attached\n");
        bzero(p, shmlen);
	printf("  zeroed\n");

        // if (shmctl(shmid,IPC_RMID,0) == -1)
        //        perrorx("shmctl failed");
	// printf("  deleted\n");

	return p;


}

void
launch()
{
        pthread_t                       ptid[128];
        int     i, j;

        for (j=0; j<forkcnt; j++)
                if (pthread_create(&ptid[j], NULL, test, (void*) (long)j) < 0)
                        perrorx("pthread create");

        if(0) for (j=0; j<forkcnt; j++)
                while(state[j] == 0);
        go = 1;
        if(0) for (j=0; j<forkcnt; j++)
                while(state[j] == 1);
        for (j=0; j<forkcnt; j++)
                pthread_join(ptid[j], NULL);
        exit(0);
}

void*
test(void *arg)
{
        char    *p, *pe;
        long    id;

        id = (long) arg;
        state[id] = 1;
        while(!go);
        p = malloc(bytes);
        // p = do_shm(bytes);
	if (p == 0) {
	    printf("malloc of %ld bytes failed.\n",bytes);
	    exit(1);
	} else
	    if (verbose) printf("malloc of %ld bytes succeeded\n",bytes);
        if (do_bzero)
                bzero(p, bytes);
        else {
                for(pe=p+bytes; p<pe; p+=16384)
                        *p = 'r';
        }
        sleep(sleepsec);
        state[id] = 2;
        pthread_exit(0);
}

===== Test script

./pft -t -g2 -r3 -f1
./pft -g2 -r3 -f2
./pft -g2 -r3 -f4
./pft -g2 -r3 -f8



* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 13:50 page fault fastpath: Increasing SMP scalability by introducing pte locks? Christoph Lameter
@ 2004-08-15 20:09 ` David S. Miller
  2004-08-15 22:58   ` Christoph Lameter
  2004-08-15 22:38 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Benjamin Herrenschmidt
  1 sibling, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-15 20:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, linux-kernel


Is the read lock in the VMA semaphore enough to let you do
the pgd/pmd walking without the page_table_lock?
I think it is, but just checking.


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 13:50 page fault fastpath: Increasing SMP scalability by introducing pte locks? Christoph Lameter
  2004-08-15 20:09 ` David S. Miller
@ 2004-08-15 22:38 ` Benjamin Herrenschmidt
  2004-08-16 17:28   ` Christoph Lameter
  1 sibling, 1 reply; 106+ messages in thread
From: Benjamin Herrenschmidt @ 2004-08-15 22:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard

On Sun, 2004-08-15 at 23:50, Christoph Lameter wrote:
> Well this is more an idea than a real patch yet. The page_table_lock
> becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
> memory. "pft" is a program that measures the performance of page faults on
> SMP system. It allocates memory simultaneously in multiple threads thereby
> causing lots of page faults for anonymous pages.

Just a note: on ppc64 we already have a PTE lock bit; we use it to
guard against concurrent hash table insertion. It could be extended
to the whole page fault path provided we can guarantee we will never
fault in the hash table on that PTE while it is held. This shouldn't
be a problem as long as only user pages are locked that way (which
should be the case with do_page_fault), provided update_mmu_cache()
is updated to not take this lock but to assume it is already held.

Ben.




* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 20:09 ` David S. Miller
@ 2004-08-15 22:58   ` Christoph Lameter
  2004-08-15 23:58     ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-15 22:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-ia64, linux-kernel

On Sun, 15 Aug 2004, David S. Miller wrote:

>
> Is the read lock in the VMA semaphore enough to let you do
> the pgd/pmd walking without the page_table_lock?
> I think it is, but just checking.

That would be great.... May I change the page_table_lock to
be a read-write spinlock instead?

I would then convert all spin_locks to write_locks and then use
read locks to switch to a "pte locking mode". The read lock would
allow simultaneous threads to operate on the page table, modifying
individual pte's only via pte locks. Write locks still exclude the
readers, and thus the whole scheme should allow a gradual transition.

Maybe such a locking policy could do some good.

However, performance is only increased somewhat. Scalability
is still bad with more than 32 CPUs despite my hack. More
extensive work is needed <sigh>:

Regular kernel, 512 CPUs, 16G allocation per thread:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.748s     67.200s  67.098s 46295.921  46270.533
 16   3    2    0.899s    100.189s  52.021s 31118.426  60242.544
 16   3    4    1.517s    103.467s  31.021s 29963.479 100777.788
 16   3    8    1.268s    166.023s  26.035s 18803.807 119350.434
 16   3   16    6.296s    453.445s  33.082s  6842.371  92987.774
 16   3   32   22.434s   1341.205s  48.026s  2306.860  65174.913
 16   3   64   54.189s   4633.748s  81.089s   671.026  38411.466
 16   3  128  244.333s  17584.111s 152.026s   176.444  20659.132
 16   3  256  222.936s   8167.241s  73.018s   374.930  42983.366
 16   3  512  207.464s   4259.264s  39.044s   704.258  79741.366

Modified kernel:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.884s     64.241s  65.014s 48302.177  48287.787
 16   3    2    0.931s     99.156s  51.058s 31429.640  60979.126
 16   3    4    1.028s     88.451s  26.096s 35155.837 116669.999
 16   3    8    1.957s     61.395s  12.099s 49654.307 242078.305
 16   3   16    5.701s     81.382s   9.039s 36122.904 334774.381
 16   3   32   15.207s    163.893s   9.094s 17564.021 316284.690
 16   3   64   76.056s    440.771s  13.037s  6086.601 235120.800
 16   3  128  203.843s   1535.909s  19.084s  1808.145 158495.679
 16   3  256  274.815s    755.764s  12.058s  3052.387 250010.942
 16   3  512  205.505s    381.106s   7.060s  5362.531 413531.352




* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 22:58   ` Christoph Lameter
@ 2004-08-15 23:58     ` David S. Miller
  2004-08-16  0:11       ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-15 23:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, linux-kernel

On Sun, 15 Aug 2004 15:58:27 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Sun, 15 Aug 2004, David S. Miller wrote:
> 
> >
> > Is the read lock in the VMA semaphore enough to let you do
> > the pgd/pmd walking without the page_table_lock?
> > I think it is, but just checking.
> 
> That would be great.... May I change the page_table lock to
> be a read write spinlock instead?

No, I meant "is the read lock _ON_ the VMA semaphore".
The VMA semaphore is a read/write semaphore, and we grab
it for reading in the code path you're modifying.

Please don't change page_table_lock to a rwlock, it's
only needed for write accesses.


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 23:58     ` David S. Miller
@ 2004-08-16  0:11       ` Christoph Lameter
  2004-08-16  1:56         ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-16  0:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-ia64, linux-kernel


On Sun, 15 Aug 2004, David S. Miller wrote:
> > On Sun, 15 Aug 2004, David S. Miller wrote:
> > > Is the read lock in the VMA semaphore enough to let you do
> > > the pgd/pmd walking without the page_table_lock?
> > > I think it is, but just checking.
> >
> > That would be great.... May I change the page_table lock to
> > be a read write spinlock instead?
>
> No, I means "is the read long _ON_ the VMA semaphore".
> The VMA semaphore is a read/write semaphore, and we grab
> it for reading in the code path you're modifying.
>
> Please don't change page_table_lock to a rwlock, it's
> only needed for write accesses.

pgd/pmd walking should always be possible even without the vma semaphore,
since the CPU can potentially walk the chain at any time.

Modifying the pte is not without issues, though, since there are other
code paths that may modify pte's and rely on the page_table_lock to
exclude others from doing so. One known problem is the swap
code, which sets the pte to the not-present condition to ensure that
nothing else touches the page while it is figuring out where to put it. A
page fault during that time (skipping the check of the
page_table_lock) will cause the fastpath to be taken, which will then
assign new memory to it.

We need some kind of scheme for how finer-granularity locks could be
realized.

One possibility is to abuse the rw spinlock to not only allow exclusive
access to the page tables (as done right now with the spinlock) but also
allow shared access with pte locking after a read lock.

Is there any other way to realize this?




* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16  0:11       ` Christoph Lameter
@ 2004-08-16  1:56         ` David S. Miller
  2004-08-16  3:29           ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-16  1:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, linux-kernel

On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> pgd/pmd walking should be possible always even without the vma semaphore
> since the CPU can potentially walk the chain at anytime.

munmap() can destroy pmd and pte tables.  somehow we have
to protect against that, and currently that is having the
VMA semaphore held for reading, see free_pgtables().



* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16  1:56         ` David S. Miller
@ 2004-08-16  3:29           ` Christoph Lameter
  2004-08-16  7:00             ` Ray Bryant
  2004-08-16 14:39             ` William Lee Irwin III
  0 siblings, 2 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-16  3:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-ia64, linux-kernel

On Sun, 15 Aug 2004, David S. Miller wrote:

> On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > pgd/pmd walking should be possible always even without the vma semaphore
> > since the CPU can potentially walk the chain at anytime.
>
> munmap() can destroy pmd and pte tables.  somehow we have
> to protect against that, and currently that is having the
> VMA semaphore held for reading, see free_pgtables().

It looks to me like the code takes care to provide the correct
sequencing so that the integrity of the pgd, pmd and pte links is
guaranteed from the viewpoint of the MMU in the CPUs. The locking around
munmap is there to protect against one kernel thread messing with the
addresses of these entities that might be stored in another thread's
registers.

Therefore, is it safe to walk the chain while holding only the semaphore
read lock?

If the mmap lock already guarantees the integrity of the pgd/pmd/pte
system, then pte locking would be okay as long as the integrity of the
pgd, pmd and pte's is always guaranteed. Then adding a lock bit would
also work.

So then there are two ways of modifying the pgd,pmd and pte's.

A) Processor obtains vma semaphore write lock and does large scale
modifications to pgd,pmd,pte.

B) Processor obtains vma semaphore read lock but is still free to do
modifications on individual pte's while holding that vma lock. There is no
need to acquire the page_table_lock. These changes must be atomic.

Is the role of the page_table_lock restricted *only* to the "struct
page" stuff? The comments regarding handle_mm_fault say that the
lock is taken for synchronization with kswapd with regard to the pte
entries. It seems that this use of the page_table_lock is wrong; A or B
should have been used.

We could simply remove the page_table_lock from handle_mm_fault and
provide the synchronization with kswapd via pte locks, right? Both
processes are essentially modifying pte's while holding the
vma read lock, and I would be changing the way of synchronization between
these two processes.

F.e. something along these lines, removing the page_table_lock from
handle_mm_fault and friends. Surprisingly, this also avoids many
rereads of the pte's since the pte's are really locked. This is just for
illustrative purposes and unfinished...

Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c	2004-08-15 06:03:04.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c	2004-08-15 20:26:29.000000000 -0700
@@ -1035,8 +1035,7 @@
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  *
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore.
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
 	unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1051,10 +1050,10 @@
 		 * at least the kernel stops what it's doing before it corrupts
 		 * data, but for the moment just pretend this is OOM.
 		 */
+		ptep_unlock(page_table);
 		pte_unmap(page_table);
 		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
 				address);
-		spin_unlock(&mm->page_table_lock);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1069,7 +1068,7 @@
 			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
-			spin_unlock(&mm->page_table_lock);
+			/* pte lock unlocked by ptep_set_access */
 			return VM_FAULT_MINOR;
 		}
 	}
@@ -1080,7 +1079,7 @@
 	 */
 	if (!PageReserved(old_page))
 		page_cache_get(old_page);
-	spin_unlock(&mm->page_table_lock);
+	ptep_unlock(page_table);

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
@@ -1090,26 +1089,21 @@
 	copy_cow_page(old_page,new_page,address);

 	/*
-	 * Re-check the pte - we dropped the lock
+	 * There is no need to recheck. The pte was locked
 	 */
-	spin_lock(&mm->page_table_lock);
-	page_table = pte_offset_map(pmd, address);
-	if (likely(pte_same(*page_table, pte))) {
-		if (PageReserved(old_page))
-			++mm->rss;
-		else
-			page_remove_rmap(old_page);
-		break_cow(vma, new_page, address, page_table);
-		lru_cache_add_active(new_page);
-		page_add_anon_rmap(new_page, vma, address);
+	if (PageReserved(old_page))
+		++mm->rss;
+	else
+		page_remove_rmap(old_page);
+	break_cow(vma, new_page, address, page_table);
+	lru_cache_add_active(new_page);
+	page_add_anon_rmap(new_page, vma, address);

-		/* Free the old page.. */
-		new_page = old_page;
-	}
+	/* Free the old page.. */
+	new_page = old_page;
 	pte_unmap(page_table);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;

 no_new_page:
@@ -1314,8 +1308,8 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and a pte lock on entry and
+ * should release the pte lock on exit..
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,27 +1321,10 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
-		if (!page) {
-			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
-			 */
-			spin_lock(&mm->page_table_lock);
-			page_table = pte_offset_map(pmd, address);
-			if (likely(pte_same(*page_table, orig_pte)))
-				ret = VM_FAULT_OOM;
-			else
-				ret = VM_FAULT_MINOR;
-			pte_unmap(page_table);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		inc_page_state(pgmajfault);
@@ -1356,21 +1333,6 @@
 	mark_page_accessed(page);
 	lock_page(page);

-	/*
-	 * Back out if somebody else faulted in this pte while we
-	 * released the page table lock.
-	 */
-	spin_lock(&mm->page_table_lock);
-	page_table = pte_offset_map(pmd, address);
-	if (unlikely(!pte_same(*page_table, orig_pte))) {
-		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);
-		unlock_page(page);
-		page_cache_release(page);
-		ret = VM_FAULT_MINOR;
-		goto out;
-	}
-
 	/* The page isn't present yet, go ahead with the fault. */

 	swap_free(entry);
@@ -1398,8 +1360,8 @@

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, pte);
+	ptep_unlock(page_table);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return ret;
 }
@@ -1424,7 +1386,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1433,13 +1394,12 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);

 		if (!pte_none(*page_table)) {
+			ptep_unlock(page_table);
 			pte_unmap(page_table);
 			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
 		mm->rss++;
@@ -1456,7 +1416,6 @@

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1472,8 +1431,8 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and pte lock
+ * held. Exit with the pte lock released.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1489,9 +1448,9 @@
 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
 					pmd, write_access, address);
+	ptep_unlock(page_table);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
-
+
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		sequence = atomic_read(&mapping->truncate_count);
@@ -1523,7 +1482,7 @@
 		anon = 1;
 	}

-	spin_lock(&mm->page_table_lock);
+	while (ptep_lock(page_table)) ;
 	/*
 	 * For a file-backed vma, someone could have truncated or otherwise
 	 * invalidated this page.  If unmap_mapping_range got called,
@@ -1532,7 +1491,7 @@
 	if (mapping &&
 	      (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
 		sequence = atomic_read(&mapping->truncate_count);
-		spin_unlock(&mm->page_table_lock);
+		ptep_unlock(page_table);
 		page_cache_release(new_page);
 		goto retry;
 	}
@@ -1565,15 +1524,15 @@
 		pte_unmap(page_table);
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		ptep_unlock(page_table);
 		pte_unmap(page_table);
 		page_cache_release(new_page);
-		spin_unlock(&mm->page_table_lock);
 		goto out;
 	}

 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
-	spin_unlock(&mm->page_table_lock);
+	ptep_unlock(page_table);
 out:
 	return ret;
 oom:
@@ -1606,8 +1565,8 @@

 	pgoff = pte_to_pgoff(*pte);

+	ptep_unlock(pte);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1644,13 +1603,11 @@
 {
 	pte_t entry;

-	entry = *pte;
+	entry = *pte;	/* get the unlocked value so that we do not write the lock bit back */
+
+	if (ptep_lock(pte)) return VM_FAULT_MINOR;
+
 	if (!pte_present(entry)) {
-		/*
-		 * If it truly wasn't present, we know that kswapd
-		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
-		 */
 		if (pte_none(entry))
 			return do_no_page(mm, vma, address, write_access, pte, pmd);
 		if (pte_file(entry))
@@ -1668,7 +1625,6 @@
 	ptep_set_access_flags(vma, address, pte, entry, write_access);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1688,12 +1644,6 @@

 	if (is_vm_hugetlb_page(vma))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
-
-	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
-	 */
-	spin_lock(&mm->page_table_lock);
 	pmd = pmd_alloc(mm, pgd, address);

 	if (pmd) {
@@ -1701,7 +1651,6 @@
 		if (pte)
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c	2004-08-14 03:56:22.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c	2004-08-15 19:59:32.000000000 -0700
@@ -494,8 +494,14 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
+#ifdef __HAVE_ARCH_PTE_LOCK
+	/* If we would simply zero the pte then handle_mm_fault might
+	 * race against this code and reinstate an anonymous mapping
+	 */
+	pteval = ptep_clear_and_lock_flush(vma, address, pte);
+#else
 	pteval = ptep_clear_flush(vma, address, pte);
-
+#endif
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
 		set_page_dirty(page);
@@ -508,9 +514,13 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
+		/* This is going to clear the lock that may have been set on the pte */
 		set_pte(pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
 	}
+#ifdef __HAVE_ARCH_PTE_LOCK
+	else ptep_unlock(pte);
+#endif

 	mm->rss--;
 	BUG_ON(!page->mapcount);


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16  3:29           ` Christoph Lameter
@ 2004-08-16  7:00             ` Ray Bryant
  2004-08-16 15:18               ` Christoph Lameter
  2004-08-16 14:39             ` William Lee Irwin III
  1 sibling, 1 reply; 106+ messages in thread
From: Ray Bryant @ 2004-08-16  7:00 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David S. Miller, linux-ia64, linux-kernel

Christoph,

Something else to worry about here is mm->rss.  Previously, this was updated
only with the page_table_lock held, so concurrent increments were not a
problem.  rss may need to be converted to an atomic_t if you use pte_locks.
It may be that an approximate value for rss is good enough, but I'm not sure
how to bound the error that could be introduced by a couple of hundred
processors handling page faults in parallel and updating rss without locking
it or making it an atomic_t.
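
For illustration, the atomic_t variant would essentially reduce to the
following (hypothetical; it assumes struct mm_struct is changed to carry
"atomic_t rss", which neither posted patch does):

	atomic_inc(&mm->rss);	/* fault path installs a new anonymous page */
	atomic_dec(&mm->rss);	/* rmap/unmap side tears a mapping down */

with readers using atomic_read(&mm->rss) and accepting that the value is
only approximately current while faults are in flight.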

Christoph Lameter wrote:
> On Sun, 15 Aug 2004, David S. Miller wrote:
> 
> 
>>On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
>>Christoph Lameter <clameter@sgi.com> wrote:
>>
>>
>>>pgd/pmd walking should be possible always even without the vma semaphore
>>>since the CPU can potentially walk the chain at anytime.
>>
>>munmap() can destroy pmd and pte tables.  somehow we have
>>to protect against that, and currently that is having the
>>VMA semaphore held for reading, see free_pgtables().
> 
> 
> It looks to me like the code takes care to provide the correct
> sequencing so that the integrity of pgd,pmd and pte links is
> guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to
> protect one kernel thread messing with the addresses of these entities
> that might be stored in another threads register.
> 
> Therefore it is safe to walk the chain only holding the semaphore read
> lock?
> 
> If the mmap lock already guarantees the integrity of the pgd,pmd,pte
> system, then pte locking would be okay as long as integrity of the
> pgd,pmd and pte's is always guaranteed. Then also adding a lock bit would
> work.
> 
> So then there are two ways of modifying the pgd,pmd and pte's.
> 
> A) Processor obtains vma semaphore write lock and does large scale
> modifications to pgd,pmd,pte.
> 
> B) Processor obtains vma semaphore read lock but is still free to do
> modifications on individual pte's while holding that vma lock. There is no
> need to acquire the page_table_lock. These changes must be atomic.
> 
> The role of the page_table_lock is restricted *only* to the "struct
> page" stuff? It says in the comments regarding handle_mm_fault that the
> lock is taken for synchronization with kswapd in regards to the pte
> entries. Seems that this use of the page_table_lock is wrong. A or B
> should have been used.
> 
> We could simply remove the page_table_lock from handle_mm_fault and
> provide the synchronization with kswapd with pte locks right? Both
> processes are essentially doing modifications on pte's while holding the
> vma read lock and I would be changing the way of synchronization between
> these two processes.
> 
> F.e. something along these lines removing the page_table_lock from
> handle_mm_fault and friends. Surprisingly this will also avoid many
> rereads of the pte's since the pte's are really locked. This is just for
> illustrative purpose and unfinished...

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------



* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16  3:29           ` Christoph Lameter
  2004-08-16  7:00             ` Ray Bryant
@ 2004-08-16 14:39             ` William Lee Irwin III
  2004-08-17 15:28               ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
       [not found]               ` <B6E8046E1E28D34EB815A11AC8CA3129027B679F@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 2 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-08-16 14:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David S. Miller, linux-ia64, linux-kernel

On Sun, 15 Aug 2004, David S. Miller wrote:
>> munmap() can destroy pmd and pte tables.  somehow we have
>> to protect against that, and currently that is having the
>> VMA semaphore held for reading, see free_pgtables().

On Sun, Aug 15, 2004 at 08:29:11PM -0700, Christoph Lameter wrote:
> It looks to me like the code takes care to provide the correct
> sequencing so that the integrity of pgd,pmd and pte links is
> guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to
> protect one kernel thread messing with the addresses of these entities
> that might be stored in another threads register.
> Therefore it is safe to walk the chain only holding the semaphore read
> lock?

Detached pagetables are assumed to be freeable after a TLB flush IPI.
Previously holding ->page_table_lock would prevent the shootdowns of
links to the pagetable page from executing concurrently with
modifications to the pagetable page. Disabling interrupts or otherwise
inhibiting the progress of the IPI'ing cpu is needed to prevent
dereferencing freed pagetables and incorrect accounting based on
contents of about-to-be-freed pagetables. Reference counting pagetable
pages may help here, where the final put would be responsible for
unaccounting the various things in the pagetable page.


-- wli


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16  7:00             ` Ray Bryant
@ 2004-08-16 15:18               ` Christoph Lameter
  2004-08-16 16:18                 ` William Lee Irwin III
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-16 15:18 UTC (permalink / raw)
  To: Ray Bryant; +Cc: David S. Miller, linux-ia64, linux-kernel

On Mon, 16 Aug 2004, Ray Bryant wrote:

> Something else to worry about here is mm->rss.  Previously, this was updated
> only with the page_table_lock held, so concurrent increments were not a
> problem.  rss may need to converted be an atomic_t if you use pte_locks.
> It may be that an approximate value for rss is good enough, but I'm not sure
> how to bound the error that could be introduced by a couple of hundred
> processers handling page faults in parallel and updating rss without locking
> it or making it an atomic_t.

Correct. There are a number of issues that may have to be addressed, but
first we need to agree on a general idea of how to proceed.



* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16 15:18               ` Christoph Lameter
@ 2004-08-16 16:18                 ` William Lee Irwin III
  0 siblings, 0 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-08-16 16:18 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Ray Bryant, David S. Miller, linux-ia64, linux-kernel

On Mon, 16 Aug 2004, Ray Bryant wrote:
>> Something else to worry about here is mm->rss.  Previously, this was updated
>> only with the page_table_lock held, so concurrent increments were not a
>> problem.  rss may need to converted be an atomic_t if you use pte_locks.
>> It may be that an approximate value for rss is good enough, but I'm not sure
>> how to bound the error that could be introduced by a couple of hundred
>> processers handling page faults in parallel and updating rss without locking
>> it or making it an atomic_t.

On Mon, Aug 16, 2004 at 08:18:11AM -0700, Christoph Lameter wrote:
> Correct. There are a number of issues that may have to be addressed but
> first we need to agree on a general idea how to proceed.

I'd favor a per-cpu counter so the cacheline doesn't bounce.
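
A rough sketch of that idea (hypothetical, with invented names): each cpu
bumps only its own cacheline-aligned slot, and readers sum the slots,
accepting an approximate value while faults are in flight.

#include <linux/cache.h>
#include <linux/smp.h>
#include <linux/threads.h>

struct rss_cpu {
	long count;
} ____cacheline_aligned_in_smp;

struct mm_rss {
	struct rss_cpu cpu[NR_CPUS];
};

static inline void rss_add(struct mm_rss *rss, long delta)
{
	rss->cpu[get_cpu()].count += delta;	/* no lock, no shared cacheline */
	put_cpu();
}

static inline long rss_read(struct mm_rss *rss)
{
	long sum = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += rss->cpu[i].count;
	return sum;
}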


-- wli


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 22:38 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Benjamin Herrenschmidt
@ 2004-08-16 17:28   ` Christoph Lameter
  2004-08-17  8:01     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-16 17:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard

On Mon, 16 Aug 2004, Benjamin Herrenschmidt wrote:

> On Sun, 2004-08-15 at 23:50, Christoph Lameter wrote:
> > Well this is more an idea than a real patch yet. The page_table_lock
> > becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
> > memory. "pft" is a program that measures the performance of page faults on
> > SMP system. It allocates memory simultaneously in multiple threads thereby
> > causing lots of page faults for anonymous pages.
>
> Just a note: on ppc64, we already have a PTE lock bit, we use it to
> guard against concurrent hash table insertion, it could be extended
> to the whole page fault path provided we can guarantee we will never
> fault in the hash table on that PTE while it is held. This shouldn't
> be a problem as long as only user pages are locked that way (which
> should be the case with do_page_fault) provided update_mmu_cache()
> is updated to not take this lock, but assume it already held.

Is this the _PAGE_BUSY bit? The pte update routines on PPC64 seem to spin
on that bit when it is set, waiting for the hash value update to complete.
It looks very specific to the PPC64 architecture.


* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-16 17:28   ` Christoph Lameter
@ 2004-08-17  8:01     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 106+ messages in thread
From: Benjamin Herrenschmidt @ 2004-08-17  8:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard


> Is this the _PAGE_BUSY bit? The pte update routines on PPC64 seem to spin
> on that bit when it is set waiting for the hash value update to complete.
> Looks very specific to the PPC64 architecture.

Yes, it is. I was thinking its use could be extended, though.

Ben.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-16 14:39             ` William Lee Irwin III
@ 2004-08-17 15:28               ` Christoph Lameter
  2004-08-17 15:37                 ` Christoph Hellwig
                                   ` (2 more replies)
       [not found]               ` <B6E8046E1E28D34EB815A11AC8CA3129027B679F@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 3 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-17 15:28 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

This is the second release of the page fault fastpath patch. The fast path
avoids locking during the creation of page table entries for anonymous
memory in a threaded application running on a SMP system. The performance
increases significantly for more than 4 threads running concurrently.

Changes:
- Ensure that it is safe to call the various functions without holding
the page_table_lock.
- Fix cases in rmap.c where a pte could be cleared for a very short time
before being set to another value by introducing a pte_xchg function. This
created a potential race condition with the fastpath code which checks for
a cleared pte without holding the page_table_lock.
- i386 support
- Various cleanups

Issue remaining:
- The fastpath increments mm->rss without acquiring the page_table_lock.
Introducing the page_table_lock even for a short time makes performance
drop to the level before the patch.

Ideas:
- One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
seems to be supported by all ia64 and i386 cpus except the original 80386.
A rough sketch of the idea follows below.
- Make rss atomic or eliminate rss?
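
To illustrate the cmpxchg idea, here is a rough, untested fragment of what
do_anonymous_page could do instead of taking the pte lock (ptep_cmpxchg would
be a new arch helper built on cmpxchg, e.g. ia64_cmpxchg8_acq on ia64):

	/*
	 * Sketch: install the new anonymous pte without the page_table_lock.
	 * orig_entry is the pte value sampled earlier; if another thread
	 * changed the pte in the meantime the cmpxchg fails and we back out
	 * and simply let the fault be retried.
	 */
	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma);
	if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
		pte_unmap(page_table);
		page_cache_release(page);
		return VM_FAULT_MINOR;
	}
	pte_unmap(page_table);
	update_mmu_cache(vma, address, entry);
	return VM_FAULT_MINOR;

The failure path behaves like the current fastpath does when the pte lock is
already taken: back out and let the fault happen again.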

==== 8 CPU SMP system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  2   3    1    0.094s      4.500s   4.059s 85561.646  85568.398
  2   3    2    0.092s      6.390s   3.043s 60649.650 114521.474
  2   3    4    0.081s      6.500s   1.093s 59740.813 203552.963
  2   3    8    0.101s     12.001s   2.035s 32487.736 167082.560

With page fault fastpath patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  2   3    1    0.095s      4.544s   4.064s 84733.378  84699.952
  2   3    2    0.080s      4.749s   2.056s 81426.302 153163.463
  2   3    4    0.081s      5.173s   1.057s 74828.674 249792.084
  2   3    8    0.093s      7.097s   1.021s 54678.576 324072.260

==== 16 CPU system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.627s     61.749s  62.038s 50430.908  50427.364
 16   3    2    0.579s     64.237s  33.068s 48532.874  93375.083
 16   3    4    0.608s     87.579s  28.011s 35670.888 111900.261
 16   3    8    0.612s    122.913s  19.074s 25466.233 159343.342
 16   3   16    0.617s    383.727s  26.091s  8184.648 116868.093
 16   3   32    2.492s    753.081s  25.031s  4163.364 124275.119

With page fault fastpath patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.572s     61.460s  62.003s 50710.367  50705.490
 16   3    2    0.571s     63.951s  33.057s 48753.975  93679.565
 16   3    4    0.593s     72.737s  24.078s 42897.603 126927.505
 16   3    8    0.625s     85.085s  15.008s 36701.575 208502.061
 16   3   16    0.560s     67.191s   6.096s 46430.048 451954.271
 16   3   32    1.599s    162.986s   5.079s 19112.972 543031.652

==== 512 CPU system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.748s     67.200s  67.098s 46295.921  46270.533
 16   3    2    0.899s    100.189s  52.021s 31118.426  60242.544
 16   3    4    1.517s    103.467s  31.021s 29963.479 100777.788
 16   3    8    1.268s    166.023s  26.035s 18803.807 119350.434
 16   3   16    6.296s    453.445s  33.082s  6842.371  92987.774
 16   3   32   22.434s   1341.205s  48.026s  2306.860  65174.913
 16   3   64   54.189s   4633.748s  81.089s   671.026  38411.466
 16   3  128  244.333s  17584.111s 152.026s   176.444  20659.132
 16   3  256  222.936s   8167.241s  73.018s   374.930  42983.366
 16   3  512  207.464s   4259.264s  39.044s   704.258  79741.366

With page fault fastpath patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.884s     64.241s  65.014s 48302.177  48287.787
 16   3    2    0.931s     99.156s  51.058s 31429.640  60979.126
 16   3    4    1.028s     88.451s  26.096s 35155.837 116669.999
 16   3    8    1.957s     61.395s  12.099s 49654.307 242078.305
 16   3   16    5.701s     81.382s   9.039s 36122.904 334774.381
 16   3   32   15.207s    163.893s   9.094s 17564.021 316284.690
 16   3   64   76.056s    440.771s  13.037s  6086.601 235120.800
 16   3  128  203.843s   1535.909s  19.084s  1808.145 158495.679
 16   3  256  274.815s    755.764s  12.058s  3052.387 250010.942
 16   3  512  205.505s    381.106s   7.060s  5362.531 413531.352

Test program and scripts were posted with the first release of this patch.

Feedback welcome. I will be at a conference for the rest of the week and
may reply late to feedback.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

==== FASTPATH PATCH

Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c	2004-08-14 03:55:24.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c	2004-08-16 21:37:39.000000000 -0700
@@ -1680,6 +1680,10 @@
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
+#ifdef __HAVE_ARCH_PTE_LOCK
+	pte_t *pte;
+	pte_t entry;
+#endif

 	__set_current_state(TASK_RUNNING);
 	pgd = pgd_offset(mm, address);
@@ -1688,7 +1692,81 @@

 	if (is_vm_hugetlb_page(vma))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+#ifdef __HAVE_ARCH_PTE_LOCK
+	/*
+	 * Fast path for not-present faults on anonymous pages, bypassing
+	 * the need to acquire the page_table_lock
+	 */
+
+	if ((vma->vm_ops && vma->vm_ops->nopage) || pgd_none(*pgd)) goto use_page_table_lock;
+	pmd = pmd_offset(pgd,address);
+	if (pmd_none(*pmd)) goto use_page_table_lock;
+	pte = pte_offset_kernel(pmd,address);
+	if (pte_locked(*pte)) return VM_FAULT_MINOR;
+	if (!pte_none(*pte)) goto use_page_table_lock;
+
+	/*
+	 * Page not present, so kswapd and PTE updates will not touch the pte
+	 * so we are able to just use a pte lock.
+	 */
+
+	/* Returning from the fault handler may cause another fault if the pte is still locked */
+	if (ptep_lock(pte)) return VM_FAULT_MINOR;
+	/* Someone could have set the pte to something else before we acquired the lock, so check */
+	if (!pte_none(pte_mkunlocked(*pte))) {
+		ptep_unlock(pte);
+		return VM_FAULT_MINOR;
+	}
+	/* Read-only mapping of ZERO_PAGE. */
+	entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
+
+	if (write_access) {
+		struct page *page;
+
+		/*
+		 * anon_vma_prepare only requires the mmap_sem and
+		 * will acquire the page_table_lock if necessary
+		 */
+		if (unlikely(anon_vma_prepare(vma))) goto no_mem;
+
+		/* alloc_page_vma only requires the mmap_sem */
+		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!page)  goto no_mem;
+
+		clear_user_highpage(page, address);
+
+		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,vma->vm_page_prot)),vma);
+		/* lru_cache_add_active uses a cpu_var */
+		lru_cache_add_active(page);
+		mark_page_accessed(page);
+
+		/*
+		 * Incrementing rss usually requires the page_table_lock
+		 * We need something to make this atomic!
+		 * Adding a lock here will hurt performance significantly
+		 */
+		mm->rss++;
+
+		/*
+		 * Invoking page_add_anon_rmap without the page_table_lock since
+		 * page is a newly allocated page not yet managed by VM
+		 */
+		page_add_anon_rmap(page, vma, address);
+	}
+	/* Setting the pte clears the pte lock so there is no need for unlocking */
+	set_pte(pte, entry);
+	pte_unmap(pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	return VM_FAULT_MINOR;		/* Minor fault */

+no_mem:
+	ptep_unlock(pte);
+	return VM_FAULT_OOM;
+
+use_page_table_lock:
+#endif
 	/*
 	 * We need the page table lock to synchronize with kswapd
 	 * and the SMP-safe atomic PTE updates.
Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c	2004-08-14 03:56:22.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c	2004-08-16 21:41:19.000000000 -0700
@@ -333,7 +333,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -494,11 +497,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -508,9 +506,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);

 	mm->rss--;
 	BUG_ON(!page->mapcount);
@@ -602,11 +605,12 @@

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))

===== PTE LOCK PATCH

Index: linux-2.6.8.1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-generic/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-generic/pgtable.h	2004-08-16 21:36:11.000000000 -0700
@@ -85,6 +85,15 @@
 }
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG
+static inline pte_t ptep_xchg(pte_t *ptep,pte_t pteval)
+{
+	pte_t pte = *ptep;
+	set_pte(ptep, pteval);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
@@ -94,6 +103,16 @@
 })
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);			\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
Index: linux-2.6.8.1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-ia64/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-ia64/pgtable.h	2004-08-16 20:36:12.000000000 -0700
@@ -30,6 +30,8 @@
 #define _PAGE_P_BIT		0
 #define _PAGE_A_BIT		5
 #define _PAGE_D_BIT		6
+#define _PAGE_IG_BITS		53
+#define _PAGE_LOCK_BIT		(_PAGE_IG_BITS+3)	/* bit 56. Aligned to 8 bits */

 #define _PAGE_P			(1 << _PAGE_P_BIT)	/* page present bit */
 #define _PAGE_MA_WB		(0x0 <<  2)	/* write back memory attribute */
@@ -58,6 +60,7 @@
 #define _PAGE_PPN_MASK		(((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
 #define _PAGE_ED		(__IA64_UL(1) << 52)	/* exception deferral */
 #define _PAGE_PROTNONE		(__IA64_UL(1) << 63)
+#define _PAGE_LOCK		(__IA64_UL(1) << _PAGE_LOCK_BIT)

 /* Valid only for a PTE with the present bit cleared: */
 #define _PAGE_FILE		(1 << 1)		/* see swap & file pte remarks below */
@@ -281,6 +284,13 @@
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte)	(__pte(pte_val(pte) & ~_PAGE_LOCK))
+/*
+ * Lock functions for ptes
+ */
+#define ptep_lock(ptep)		test_and_set_bit(_PAGE_LOCK_BIT,ptep)
+#define ptep_unlock(ptep)	{ clear_bit(_PAGE_LOCK_BIT,ptep);smp_mb__after_clear_bit(); }
+#define pte_locked(pte)		((pte_val(pte) & _PAGE_LOCK)!=0)

 /*
  * Macro to a page protection value as "uncacheable".  Note that "protection" is really a
@@ -387,6 +397,18 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +576,12 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.8.1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-i386/pgtable.h	2004-08-14 03:55:48.000000000 -0700
+++ linux-2.6.8.1/include/asm-i386/pgtable.h	2004-08-16 20:36:12.000000000 -0700
@@ -101,7 +101,7 @@
 #define _PAGE_BIT_DIRTY		6
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1	9	/* available for programmer */
+#define _PAGE_BIT_LOCK		9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
 #define _PAGE_BIT_NX		63
@@ -115,7 +115,7 @@
 #define _PAGE_DIRTY	0x040
 #define _PAGE_PSE	0x080	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_GLOBAL	0x100	/* Global TLB entry PPro+ */
-#define _PAGE_UNUSED1	0x200	/* available for programmer */
+#define _PAGE_LOCK	0x200	/* available for programmer */
 #define _PAGE_UNUSED2	0x400
 #define _PAGE_UNUSED3	0x800

@@ -201,6 +201,7 @@
 extern unsigned long pg0[];

 #define pte_present(x)	((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))
+#define pte_locked(x) ((x).pte_low & _PAGE_LOCK)
 #define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)

 #define pmd_none(x)	(!pmd_val(x))
@@ -236,6 +237,7 @@
 static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
+static inline pte_t pte_mkunlocked(pte_t pte)	{ (pte).pte_low &= ~_PAGE_LOCK; return pte; }

 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
@@ -260,6 +262,9 @@
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }

+#define ptep_lock(ptep) test_and_set_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define ptep_unlock(ptep) clear_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+
 /*
  * Macro to mark a page protection value as "uncacheable".  On processors which do not support
  * it, this is a no-op.
@@ -416,9 +421,11 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-17 15:28               ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
@ 2004-08-17 15:37                 ` Christoph Hellwig
  2004-08-17 15:51                 ` William Lee Irwin III
  2004-08-18 17:55                 ` Hugh Dickins
  2 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2004-08-17 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel

On Tue, Aug 17, 2004 at 08:28:44AM -0700, Christoph Lameter wrote:
> This is the second release of the page fault fastpath path. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.

Please reformat your patch according to Documentation/CodingStyle
(or just look at the surrounding code..).

Also you're duplicating far too much code of the regular pagefault code,
this probably wants some inlined helpers.

Your ptep_lock should be called ptep_trylock.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-17 15:28               ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
  2004-08-17 15:37                 ` Christoph Hellwig
@ 2004-08-17 15:51                 ` William Lee Irwin III
  2004-08-18 17:55                 ` Hugh Dickins
  2 siblings, 0 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-08-17 15:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

On Tue, Aug 17, 2004 at 08:28:44AM -0700, Christoph Lameter wrote:
> This is the second release of the page fault fastpath path. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.
> Changes:
> - Insure that it is safe to call the various functions without holding
> the page_table_lock.
> - Fix cases in rmap.c where a pte could be cleared for a very short time
> before being set to another value by introducing a pte_xchg function. This
> created a potential race condition with the fastpath code which checks for
> a cleared pte without holding the page_table_lock.
> - i386 support
> - Various cleanups
> Issue remaining:
> - The fastpath increments mm->rss without acquiring the page_table_lock.
> Introducing the page_table_lock even for a short time makes performance
> drop to the level before the patch.

Hmm. I'm suspicious but I can't immediately poke a hole in it as it
leaves most uses of ->page_table_lock in place. I can't help thinking
there's a more comprehensive attack on the locking in this area, either.

-- wli

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and    512 cpu SMP
  2004-08-17 15:28               ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
  2004-08-17 15:37                 ` Christoph Hellwig
  2004-08-17 15:51                 ` William Lee Irwin III
@ 2004-08-18 17:55                 ` Hugh Dickins
  2004-08-18 20:20                   ` William Lee Irwin III
  2004-08-19  1:19                   ` Christoph Lameter
  2 siblings, 2 replies; 106+ messages in thread
From: Hugh Dickins @ 2004-08-18 17:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel

On Tue, 17 Aug 2004, Christoph Lameter wrote:

> This is the second release of the page fault fastpath path. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.

It is interesting.  I don't like it at all in its current state,
#ifdef'ed special casing for one particular path through the code,
but it does seem worth taking further.

Just handling that one anonymous case is not worth it, when we know
that the next day someone else from SGI will post a similar test
which shows the same on file pages ;)

Your ptep lock bit avoids collision with pte bits, but does it not
also need to avoid collision with pte swap entry bits?  And the
pte_file bit too, at least once it's extended to nopage areas.

I'm very suspicious of the way you just return VM_FAULT_MINOR when
you find the lock bit already set.  Yes, you can do that, but the
lock bit is held right across the alloc_page_vma, so other threads
trying to fault the same pte will be spinning back out to user and
refaulting back into kernel while they wait: we'd usually use a
waitqueue and wakeup with that kind of lock; or not hold it across,
and make it a bitspin lock.

It's a realistic case, which I guess your test program won't be trying.
Feels livelocky to me, but I may be overreacting against: it's not as
if you're changing the page_table_lock to be treated that way.
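
By bitspin lock I mean something roughly like this sketch, reusing your
_PAGE_LOCK_BIT, instead of bouncing out with VM_FAULT_MINOR:

static inline void ptep_lock_spin(pte_t *ptep)
{
	/* spin in-kernel until the holder clears the pte lock bit */
	while (test_and_set_bit(_PAGE_LOCK_BIT, (unsigned long *)ptep))
		cpu_relax();
}

That still burns cpu while the holder is off in alloc_page_vma, which is why
a waitqueue and wakeup would be the politer answer if the lock stays held
across the allocation.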

> Introducing the page_table_lock even for a short time makes performance
> drop to the level before the patch.

That's interesting, and disappointing.

The main lesson I took from your patch (I think wli was hinting at
the same) is that we ought now to question page_table_lock usage,
should be possible to cut it a lot.

I recall from exchanges with Dave McCracken 18 months ago that the
page_table_lock is _almost_ unnecessary in rmap.c, should be possible
to avoid it there and in some other places.

We take page_table_lock when making absent present and when making
present absent: I like your observation that those are exclusive cases.

But you've found that narrowing the width of the page_table_lock
in a particular path does not help.  You sound surprised, me too.
Did you find out why that was?
 
> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
> seems to be supported by all ia64 and i386 cpus except the original 80386.

I do think this will be a more fruitful direction than pte locking:
just looking through the arches for spare bits puts me off pte locking.

Hugh


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and    512 cpu SMP
  2004-08-18 17:55                 ` Hugh Dickins
@ 2004-08-18 20:20                   ` William Lee Irwin III
  2004-08-19  1:19                   ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-08-18 20:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, linux-kernel

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> It is interesting.  I don't like it at all in its current state,
> #ifdef'ed special casing for one particular path through the code,
> but it does seem worth taking further.
> Just handling that one anonymous case is not worth it, when we know
> that the next day someone else from SGI will post a similar test
> which shows the same on file pages ;)
> Your ptep lock bit avoids collision with pte bits, but does it not
> also need to avoid collision with pte swap entry bits?  And the
> pte_file bit too, at least once it's extended to nopage areas.
> I'm very suspicious of the way you just return VM_FAULT_MINOR when
> you find the lock bit already set.  Yes, you can do that, but the
> lock bit is held right across the alloc_page_vma, so other threads
> trying to fault the same pte will be spinning back out to user and
> refaulting back into kernel while they wait: we'd usually use a
> waitqueue and wakeup with that kind of lock; or not hold it across,
> and make it a bitspin lock.
> It's a realistic case, which I guess your test program won't be trying.
> Feels livelocky to me, but I may be overreacting against: it's not as
> if you're changing the page_table_lock to be treated that way.

Both points are valid; it should retry in-kernel for the pte lock bit
and arrange to use a bit not used for swap (there are at least
PAGE_SHIFT of these on all 64-bit arches).


On Tue, 17 Aug 2004, Christoph Lameter wrote:
>> Introducing the page_table_lock even for a short time makes performance
>> drop to the level before the patch.

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> That's interesting, and disappointing.
> The main lesson I took from your patch (I think wli was hinting at
> the same) is that we ought now to question page_table_lock usage,
> should be possible to cut it a lot.
> I recall from exchanges with Dave McCracken 18 months ago that the
> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
> to avoid it there and in some other places.
> We take page_table_lock when making absent present and when making
> present absent: I like your observation that those are exclusive cases.
> But you've found that narrowing the width of the page_table_lock
> in a particular path does not help.  You sound surprised, me too.
> Did you find out why that was?

It also protects against vma tree modifications in mainline, but rmap.c
shouldn't need it for vmas anymore, as the vma is rooted to the spot by
mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.


On Tue, 17 Aug 2004, Christoph Lameter wrote:
>> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>> seems to be supported by all ia64 and i386 cpus except the original 80386.

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

Fortunately, spare bits aren't strictly necessary, and neither is
cmpxchg. A single invalid value can serve in place of a bitflag. When
using such an invalid value, just xchg()'ing it and looping when the
invalid value is seen should suffice. This holds more generally for all
radix trees, not just pagetables, and happily xchg() or emulation
thereof is required by core code for all arches.
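
Something like this sketch, where PTE_BUSY stands for a made-up reserved
value that no present, swap or file pte can ever take, and ptep_xchg is an
atomic exchange like the one in the patch:

static inline pte_t pte_take(pte_t *ptep)
{
	pte_t old;

	for (;;) {
		old = ptep_xchg(ptep, __pte(PTE_BUSY));
		if (pte_val(old) != PTE_BUSY)
			return old;	/* slot owned; old is the prior contents */
		cpu_relax();	/* someone else holds it; retry */
	}
}

/* ... compute the replacement pte from 'old', then set_pte(ptep, new) ... */

Anything that inspects ptes locklessly would of course have to treat the
busy value specially.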


-- wli

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and    512 cpu SMP
  2004-08-18 17:55                 ` Hugh Dickins
  2004-08-18 20:20                   ` William Lee Irwin III
@ 2004-08-19  1:19                   ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-19  1:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel

> > - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
> > seems to be supported by all ia64 and i386 cpus except the original 80386.
>
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

Thanks for the support. Got a V3 here (not ready to post yet) that throws
out the locks and uses cmpxchg instead. It also removes the use of
page_table_lock completely from handle_mm_fault.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch v3: use cmpxchg, make rss atomic
       [not found]               ` <B6E8046E1E28D34EB815A11AC8CA3129027B679F@mtv-atc-605e--n.corp.sgi.com>
@ 2004-08-24  4:43                 ` Christoph Lameter
  2004-08-24  5:49                   ` Christoph Lameter
       [not found]                 ` <B6E8046E1E28D34EB815A11AC8CA3129027B67A9@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-24  4:43 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh

This is the third release of the page fault scalability patches. The scalability
patches avoid locking during the creation of page table entries for anonymous
memory in a threaded application running on a SMP system. The performance
increases significantly for more than 2 threads running concurrently.

Changes:
- use cmpxchg instead of pte_locking
- modify the existing function instead of creating a fastpath
- make rss in mm_struct atomic

Issues remaining:
- Support is only provided for i386 and ia64. Other architectures
  may need to be updated if the provided generic functions do not
  fit. This is especially necessary for architectures supporting SMP.
- The i386 version builds fine but is untested.
- Figure out why performance drops for single thread.
- More testing needed. Is this really addressing all issues?

Ideas:
- Remove page_table_lock from __pmd_alloc and pte_alloc_map.
- Find further code paths that could benefit from removing the page_table_lock.

==== Test results on an 8 CPU SMP system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  2   3    1    0.094s      4.500s   4.059s 85561.646  85568.398
  2   3    2    0.092s      6.390s   3.043s 60649.650 114521.474
  2   3    4    0.081s      6.500s   1.093s 59740.813 203552.963
  2   3    8    0.101s     12.001s   2.035s 32487.736 167082.560

With page fault fastpath patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.176s     11.323s  11.050s 68385.368  68345.055
  4   3    2    0.174s     10.716s   5.096s 72205.329 131848.322
  4   3    4    0.170s     10.694s   3.040s 72380.552 231128.569
  4   3    8    0.177s     14.717s   2.064s 52796.567 297380.041

The patchset consists of three parts:

1. pte cmpxchg patch
 Provides the necessary changes to asm-generic, asm-ia64 and asm-i386 to implement
 ptep_xchg and ptep_cmpxchg.

2. page_table_lock reduction patch
- Do not take the page_table_lock in handle_mm_fault for the most frequently used
  path. Changes the functions used by handle_mm_fault so that everything works
  properly without it.
- Implement do_anonymous_page using ptep_cmpxchg.
- Eliminate periods where ptes would be set to zero in the swapout code by using ptep_xchg.

3. rss-atomic
- Make all uses of mm->rss atomic.

====== PTE_CMPXCHG

Index: linux-2.6.8.1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-generic/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-generic/pgtable.h	2004-08-23 17:52:53.000000000 -0700
@@ -85,6 +85,15 @@
 }
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG
+static inline pte_t ptep_xchg(pte_t *ptep,pte_t pteval)
+{
+	pte_t pte = *ptep;
+	set_pte(ptep, pteval);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
@@ -94,6 +103,28 @@
 })
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);			\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+/* Fallback for arches without cmpxchg; only safe when pte updates are serialized (UP, or under page_table_lock) */
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	if (pte_val(*ptep) == pte_val(oldval)) {
+		ptep_xchg(ptep, newval);
+		return 1;
+	}
+	return 0;
+}
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
Index: linux-2.6.8.1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-i386/pgtable.h	2004-08-14 03:55:48.000000000 -0700
+++ linux-2.6.8.1/include/asm-i386/pgtable.h	2004-08-23 17:52:53.000000000 -0700
@@ -416,9 +416,13 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.8.1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-ia64/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-ia64/pgtable.h	2004-08-23 20:50:56.000000000 -0700
@@ -387,6 +387,26 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
+#ifdef CONFIG_SMP
+static inline int
+ptep_cmpxchg (pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+#endif
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +574,14 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#ifdef CONFIG_SMP
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.8.1/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-i386/pgtable-3level.h	2004-08-14 03:55:59.000000000 -0700
+++ linux-2.6.8.1/include/asm-i386/pgtable-3level.h	2004-08-23 17:52:53.000000000 -0700
@@ -88,6 +88,11 @@
 	return res;
 }

+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(&ptep->pte_low, oldval.pte_low, newval.pte_low) == oldval.pte_low;
+}
+
 static inline int pte_same(pte_t a, pte_t b)
 {
 	return a.pte_low == b.pte_low && a.pte_high == b.pte_high;
Index: linux-2.6.8.1/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-i386/pgtable-2level.h	2004-08-14 03:55:33.000000000 -0700
+++ linux-2.6.8.1/include/asm-i386/pgtable-2level.h	2004-08-23 17:52:53.000000000 -0700
@@ -40,6 +40,10 @@
 	return (pmd_t *) dir;
 }
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+#define ptep_xchg(xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#ifdef CONFIG_X86_CMPXCHG
+#define ptep_cmpxchg(xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+#endif
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)


====== PAGE_TABLE_LOCK_REDUCTION


Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c	2004-08-23 17:52:49.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c	2004-08-23 17:52:57.000000000 -0700
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1405,14 +1402,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1424,7 +1419,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1433,30 +1427,41 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		mm->rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		/* FIXME: rss++ needs page_table_lock */
+		mm->rss++;
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1472,12 +1477,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1488,9 +1493,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1588,7 +1592,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1601,13 +1605,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1626,17 +1629,12 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1649,15 +1647,29 @@
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
+not_present:
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}
-
+
+	/*
+	 * We really need the table_lock since we currently modify the pte
+	 * without the use of atomic operations.
+	 * FIXME: rewrite to use atomic operations
+	 */
+	spin_lock(&mm->page_table_lock);
+	/* Check again in case a swapout happened before we acquired the lock */
+	entry= *pte;
+	if (!pte_present(entry)) {
+		spin_unlock(&mm->page_table_lock);
+		goto not_present;
+	}
+
 	if (write_access) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
@@ -1686,22 +1698,37 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+		/*
+		 * This is a rare case. We need to satisfy the entry and exit requirements
+		 * of __pmd_alloc which will immediately drop the table lock
+		 */
+		spin_lock(&mm->page_table_lock);
+		pmd = __pmd_alloc(mm, pgd, address);
+		spin_unlock(&mm->page_table_lock);
+	} else
+		pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t * pte;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (unlikely(!pmd_present(*pmd))) {
+			spin_lock(&mm->page_table_lock);
+			pte = pte_alloc_map(mm, pmd, address);
+			spin_unlock(&mm->page_table_lock);
+		} else
+			pte = pte_offset_map(pmd, address);
+
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c	2004-08-14 03:56:22.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c	2004-08-23 17:52:57.000000000 -0700
@@ -333,7 +333,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -494,11 +497,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -508,9 +506,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);

 	mm->rss--;
 	BUG_ON(!page->mapcount);
@@ -602,11 +605,12 @@

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))



====== RSS-ATOMIC




Index: linux-2.6.8.1/include/linux/sched.h
===================================================================
--- linux-2.6.8.1.orig/include/linux/sched.h	2004-08-14 03:54:49.000000000 -0700
+++ linux-2.6.8.1/include/linux/sched.h	2004-08-23 21:05:27.000000000 -0700
@@ -197,9 +197,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -209,7 +210,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm;
+	unsigned long total_vm, locked_vm;
 	unsigned long def_flags;

 	unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
Index: linux-2.6.8.1/kernel/fork.c
===================================================================
--- linux-2.6.8.1.orig/kernel/fork.c	2004-08-14 03:54:49.000000000 -0700
+++ linux-2.6.8.1/kernel/fork.c	2004-08-23 21:05:27.000000000 -0700
@@ -281,7 +281,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.8.1/mm/mmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/mmap.c	2004-08-14 03:55:35.000000000 -0700
+++ linux-2.6.8.1/mm/mmap.c	2004-08-23 21:05:27.000000000 -0700
@@ -1719,7 +1719,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c	2004-08-23 20:51:03.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c	2004-08-23 21:05:53.000000000 -0700
@@ -325,7 +325,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1096,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1374,7 +1374,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1444,6 +1444,7 @@
 			unlock_page(page);
 			page_cache_release(page);
 		}
+		printk(KERN_INFO "do_anon_page: cmpxchg failed. Backing out.\n");
 		goto out;
 	}
 	if (write_access) {
@@ -1454,8 +1455,7 @@
 		 */
 		lru_cache_add_active(page);
 		page_add_anon_rmap(page, vma, addr);
-		/* FIXME: rss++ needs page_table_lock */
-		mm->rss++;
+		atomic_inc(&mm->mm_rss);
 		unlock_page(page);
 	}
 	pte_unmap(page_table);
@@ -1555,7 +1555,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
Index: linux-2.6.8.1/fs/exec.c
===================================================================
--- linux-2.6.8.1.orig/fs/exec.c	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/fs/exec.c	2004-08-23 21:05:27.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.8.1/fs/binfmt_flat.c
===================================================================
--- linux-2.6.8.1.orig/fs/binfmt_flat.c	2004-08-14 03:54:46.000000000 -0700
+++ linux-2.6.8.1/fs/binfmt_flat.c	2004-08-23 21:05:27.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.8.1/mm/fremap.c
===================================================================
--- linux-2.6.8.1.orig/mm/fremap.c	2004-08-14 03:54:47.000000000 -0700
+++ linux-2.6.8.1/mm/fremap.c	2004-08-23 21:05:27.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.8.1/fs/binfmt_som.c
===================================================================
--- linux-2.6.8.1.orig/fs/binfmt_som.c	2004-08-14 03:55:19.000000000 -0700
+++ linux-2.6.8.1/fs/binfmt_som.c	2004-08-23 21:05:27.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.8.1/mm/swapfile.c
===================================================================
--- linux-2.6.8.1.orig/mm/swapfile.c	2004-08-23 17:52:49.000000000 -0700
+++ linux-2.6.8.1/mm/swapfile.c	2004-08-23 21:05:27.000000000 -0700
@@ -434,7 +434,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.8.1/fs/binfmt_aout.c
===================================================================
--- linux-2.6.8.1.orig/fs/binfmt_aout.c	2004-08-14 03:54:51.000000000 -0700
+++ linux-2.6.8.1/fs/binfmt_aout.c	2004-08-23 21:05:27.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.8.1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.8.1.orig/fs/binfmt_elf.c	2004-08-14 03:55:23.000000000 -0700
+++ linux-2.6.8.1/fs/binfmt_elf.c	2004-08-23 21:05:27.000000000 -0700
@@ -705,7 +705,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c	2004-08-23 20:51:03.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c	2004-08-23 21:05:27.000000000 -0700
@@ -203,7 +203,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -435,7 +435,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -515,7 +515,7 @@
 	if (pte_dirty(pteval))
 		set_page_dirty(page);

-	mm->rss--;
+	atomic_dec(&mm->mm_rss);
 	BUG_ON(!page->mapcount);
 	page->mapcount--;
 	page_cache_release(page);
@@ -618,7 +618,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -719,7 +719,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				ret = try_to_unmap_cluster(
Index: linux-2.6.8.1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.8.1.orig/fs/proc/task_mmu.c	2004-08-14 03:54:50.000000000 -0700
+++ linux-2.6.8.1/fs/proc/task_mmu.c	2004-08-23 21:05:27.000000000 -0700
@@ -37,7 +37,7 @@
 		"VmLib:\t%8lu kB\n",
 		mm->total_vm << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data - stack, stack,
 		exec - lib, lib);
 	up_read(&mm->mmap_sem);
@@ -55,7 +55,7 @@
 	struct vm_area_struct *vma;
 	int size = 0;

-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

Index: linux-2.6.8.1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.8.1.orig/arch/ia64/mm/hugetlbpage.c	2004-08-14 03:55:32.000000000 -0700
+++ linux-2.6.8.1/arch/ia64/mm/hugetlbpage.c	2004-08-23 21:05:27.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.8.1/fs/proc/array.c
===================================================================
--- linux-2.6.8.1.orig/fs/proc/array.c	2004-08-14 03:55:34.000000000 -0700
+++ linux-2.6.8.1/fs/proc/array.c	2004-08-23 21:05:27.000000000 -0700
@@ -384,7 +384,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.8.1/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-ia64/tlb.h	2004-08-14 03:55:19.000000000 -0700
+++ linux-2.6.8.1/include/asm-ia64/tlb.h	2004-08-23 21:05:27.000000000 -0700
@@ -45,6 +45,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
+#include <asm/atomic.h>

 #ifdef CONFIG_SMP
 # define FREE_PTE_NR		2048
@@ -160,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.8.1/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-generic/tlb.h	2004-08-14 03:54:46.000000000 -0700
+++ linux-2.6.8.1/include/asm-generic/tlb.h	2004-08-23 21:05:27.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch v3: use cmpxchg, make rss atomic
  2004-08-24  4:43                 ` page fault scalability patch v3: use cmpxchg, make rss atomic Christoph Lameter
@ 2004-08-24  5:49                   ` Christoph Lameter
  2004-08-24 12:34                     ` Matthew Wilcox
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-24  5:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel, vrajesh

On Mon, 23 Aug 2004, Christoph Lameter wrote:

> Issue remaining:
> - Figure out why performance drops for single thread.

Sorry, wrong baseline... There is no real issue here, other than that I
compared a 2GB test against a 4GB one.

Unpatched:
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.157s     11.197s  11.035s 69261.721  69239.940
  4   3    2    0.145s     11.445s   6.079s 67849.528 115681.409
  4   3    4    0.182s     13.894s   4.027s 55865.834 184108.856
  4   3    8    0.196s     24.874s   4.025s 31369.039 184790.767

With page fault scalability patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.176s     11.323s  11.050s 68385.368  68345.055
  4   3    2    0.174s     10.716s   5.096s 72205.329 131848.322
  4   3    4    0.170s     10.694s   3.040s 72380.552 231128.569
  4   3    8    0.177s     14.717s   2.064s 52796.567 297380.041

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch v3: use cmpxchg, make rss atomic
  2004-08-24  5:49                   ` Christoph Lameter
@ 2004-08-24 12:34                     ` Matthew Wilcox
  2004-08-24 14:47                       ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2004-08-24 12:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Lameter, William Lee Irwin III, David S. Miller,
	raybry, ak, benh, manfred, linux-ia64, linux-kernel, vrajesh

On Mon, Aug 23, 2004 at 10:49:31PM -0700, Christoph Lameter wrote:
> Unpatched:
> Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   4   3    1    0.157s     11.197s  11.035s 69261.721  69239.940
>   4   3    2    0.145s     11.445s   6.079s 67849.528 115681.409
>   4   3    4    0.182s     13.894s   4.027s 55865.834 184108.856
>   4   3    8    0.196s     24.874s   4.025s 31369.039 184790.767
> 
> With page fault scalability patch:
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   4   3    1    0.176s     11.323s  11.050s 68385.368  68345.055
>   4   3    2    0.174s     10.716s   5.096s 72205.329 131848.322
>   4   3    4    0.170s     10.694s   3.040s 72380.552 231128.569
>   4   3    8    0.177s     14.717s   2.064s 52796.567 297380.041

What kind of variance are you seeing with this benchmark?  I'm suspicious
that your 2 thread case is faster than your single thread case.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch v3: use cmpxchg, make rss atomic
  2004-08-24 12:34                     ` Matthew Wilcox
@ 2004-08-24 14:47                       ` Christoph Lameter
  0 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-24 14:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, William Lee Irwin III, David S. Miller,
	raybry, ak, benh, manfred, linux-ia64, linux-kernel, vrajesh

On Tue, 24 Aug 2004, Matthew Wilcox wrote:

> On Mon, Aug 23, 2004 at 10:49:31PM -0700, Christoph Lameter wrote:
> > Unpatched:
> > Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
> >   4   3    1    0.157s     11.197s  11.035s 69261.721  69239.940
> >   4   3    2    0.145s     11.445s   6.079s 67849.528 115681.409
> >   4   3    4    0.182s     13.894s   4.027s 55865.834 184108.856
> >   4   3    8    0.196s     24.874s   4.025s 31369.039 184790.767
> >
> > With page fault scalability patch:
> >  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
> >   4   3    1    0.176s     11.323s  11.050s 68385.368  68345.055
> >   4   3    2    0.174s     10.716s   5.096s 72205.329 131848.322
> >   4   3    4    0.170s     10.694s   3.040s 72380.552 231128.569
> >   4   3    8    0.177s     14.717s   2.064s 52796.567 297380.041
>
> What kind of variance are you seeing with this benchmark?  I'm suspicious
> that your 2 thread case is faster than your single thread case.

What is so suspicious about it? Two CPUs can do the job faster than a
single one. That's the way it should be, and the point of these
patches is to reduce locking so that this can happen more efficiently.

There is some variance in these tests, especially with low memory
settings such as 1GB or 2GB, and to some extent also at 4GB. 16GB gives quite
stable results, but the machine I had available for this only had 8GB. See
my earlier posts on the subject for other test results.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch v4: reduce page_table_lock use, atomic pmd,pgd handlin
       [not found]                 ` <B6E8046E1E28D34EB815A11AC8CA3129027B67A9@mtv-atc-605e--n.corp.sgi.com>
@ 2004-08-26 15:20                   ` Christoph Lameter
       [not found]                   ` <B6E8046E1E28D34EB815A11AC8CA3129027B67B4@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-26 15:20 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

This is the fourth release of the page fault scalability patches. The patches
avoid locking during the creation of page table entries for anonymous
memory in a threaded application running on an SMP system. Performance
increases significantly with more than 2 threads running concurrently.

Changes:
- Expanded the use of cmpxchg. page_table_lock removed from
  pmd_alloc, pte_alloc_map, handle_mm_fault and handle_pte_fault.
- Integrated into a single patch.

It seems that no further progress can be made unless the lock semantics
of mmap_sem, page_table_lock and atomic pte operations are changed. If
that is done, some kernel subsystems will require major surgery.

The patches result in a bypass of the page_table_lock (but not the mmap_sem!)
and therefore an increase in the ability to concurrently execute the
page fault handler for:

1. Operations where an empty pte or pmd entry is populated.
(This is safe since the swapper may only depopulate them, and the
swapper code has been changed to never set a pte to empty until the
page has been evicted.)

2. Modifications of flags in a pte entry (write/accessed).
These modifications are done by the CPU or by low-level handlers
on various platforms, which also bypass all locks. So this
seems to be safe too.
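
To illustrate case 1 outside the kernel, here is a minimal userspace sketch
(not part of the patch; pte_t, PTE_NONE and pte_cmpxchg_install are made-up
names) of how an empty slot can be populated with cmpxchg so that two racing
faults resolve without taking a lock:

/*
 * Sketch of case 1: install a value only if the slot is still empty.
 * Two threads faulting on the same address race benignly; the loser
 * simply observes the winner's entry.  Illustrative names only.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pte_t;
#define PTE_NONE ((pte_t)0)

static int pte_cmpxchg_install(pte_t *ptep, pte_t newval)
{
	pte_t expected = PTE_NONE;

	/* succeeds only if *ptep was still PTE_NONE */
	return __atomic_compare_exchange_n(ptep, &expected, newval, 0,
					   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

int main(void)
{
	pte_t slot = PTE_NONE;
	pte_t mine = 0x1000 | 1;	/* pretend: frame number + present bit */

	if (pte_cmpxchg_install(&slot, mine))
		printf("installed %#llx\n", (unsigned long long)slot);
	else
		printf("lost the race, reusing %#llx\n", (unsigned long long)slot);
	return 0;
}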

Issues remaining:
- Support is only provided for i386 and ia64. Other architectures
  need to be updated; *_test_and_populate() must be provided for each.
- The i386 version builds fine but is untested.

Ideas:
- Remove page_table_lock entirely and use atomic operations?
- Rely on locking via struct page, mmap_sem or on a special invalid
  pte/pmd value if necessary.

==== Test results on an 8 CPU SMP system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.191s     11.100s  11.029s 69645.077  69631.148
  4   3    2    0.170s     15.063s   8.016s 51622.179  96298.094
  4   3    4    0.169s     13.791s   4.026s 56330.865 184439.165
  4   3    8    0.180s     24.694s   4.011s 31615.342 190917.461

With the patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.155s     11.348s  11.050s 68362.142  68349.273
  4   3    2    0.165s     10.697s   5.053s 72400.070 142032.437
  4   3    4    0.162s     10.682s   3.041s 72517.428 230441.818
  4   3    8    0.189s     14.843s   2.064s 52312.979 296975.799

==== Test results on a 32 CPU SMP system

Unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.587s     61.225s  61.081s 50890.625  50891.649
 16   3    2    0.649s     85.250s  44.029s 36621.046  71017.720
 16   3    4    0.677s     82.429s  28.021s 37851.330 111475.333
 16   3    8    0.630s    133.244s  22.052s 23497.490 139665.178
 16   3   16    0.661s    365.279s  26.007s  8596.280 120653.071
 16   3   32    0.921s    806.164s  28.092s  3897.635 108767.881

With the patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.610s     61.557s  62.016s 50600.438  50599.822
 16   3    2    0.640s     83.116s  43.016s 37557.847  72869.978
 16   3    4    0.621s     73.897s  26.023s 42214.002 119908.246
 16   3    8    0.596s     86.587s  14.098s 36081.229 209962.059
 16   3   16    0.646s     69.601s   7.000s 44780.269 448823.690
 16   3   32    0.903s    185.609s   8.085s 16866.018 355301.694

Test results may fluctuate; this may be a result of the NUMA
architecture, where the assignment of threads to CPUs at varying
distances from one another can have an influence.

While there is a major improvement, the numbers still show that
the page fault handler does not scale well beyond 16 CPUs.

==== Patch

Index: linux-2.6.9-rc1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc1.orig/kernel/fork.c	2004-08-25 10:50:17.000000000 -0700
+++ linux-2.6.9-rc1/kernel/fork.c	2004-08-25 10:53:03.000000000 -0700
@@ -282,7 +282,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-08-25 10:50:12.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/sched.h	2004-08-25 10:53:03.000000000 -0700
@@ -197,9 +197,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -209,7 +210,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm;
+	unsigned long total_vm, locked_vm;
 	unsigned long def_flags;

 	unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-08-25 10:50:01.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/task_mmu.c	2004-08-25 10:53:03.000000000 -0700
@@ -37,7 +37,7 @@
 		"VmLib:\t%8lu kB\n",
 		mm->total_vm << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data - stack, stack,
 		exec - lib, lib);
 	up_read(&mm->mmap_sem);
@@ -55,7 +55,7 @@
 	struct vm_area_struct *vma;
 	int size = 0;

-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

Index: linux-2.6.9-rc1/mm/mmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/mmap.c	2004-08-25 10:50:14.000000000 -0700
+++ linux-2.6.9-rc1/mm/mmap.c	2004-08-25 10:53:09.000000000 -0700
@@ -1718,7 +1718,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc1/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/tlb.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/tlb.h	2004-08-25 10:53:09.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc1/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_flat.c	2004-08-25 10:50:04.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_flat.c	2004-08-25 10:53:03.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc1/fs/exec.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/exec.c	2004-08-25 10:50:04.000000000 -0700
+++ linux-2.6.9-rc1/fs/exec.c	2004-08-25 10:53:03.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc1/mm/memory.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/memory.c	2004-08-25 10:50:14.000000000 -0700
+++ linux-2.6.9-rc1/mm/memory.c	2004-08-25 10:53:03.000000000 -0700
@@ -158,9 +158,7 @@
 	if (!pmd_present(*pmd)) {
 		struct page *new;

-		spin_unlock(&mm->page_table_lock);
 		new = pte_alloc_one(mm, address);
-		spin_lock(&mm->page_table_lock);
 		if (!new)
 			return NULL;

@@ -172,8 +170,11 @@
 			pte_free(new);
 			goto out;
 		}
+		if (!pmd_test_and_populate(mm, pmd, new)) {
+			pte_free(new);
+			goto out;
+		}
 		inc_page_state(nr_page_table_pages);
-		pmd_populate(mm, pmd, new);
 	}
 out:
 	return pte_offset_map(pmd, address);
@@ -325,7 +326,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1097,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1314,8 +1315,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1327,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1378,7 +1376,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1406,14 +1404,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1421,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1429,40 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		mm->rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1478,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1494,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1552,7 +1556,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1589,7 +1593,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1606,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1630,54 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 *
+	 * The following statement was removed
+	 * ptep_set_access_flags(vma, address, pte, new_entry, write_access);
+	 * Not sure if all the side effects are replicated here for all platforms.
+	 *
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,30 +1695,27 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+	 * to synchronize with kswapd
 	 */
-	spin_lock(&mm->page_table_lock);
 	pmd = pmd_alloc(mm, pgd, address);

-	if (pmd) {
+	if (likely(pmd)) {
 		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

 /*
  * Allocate page middle directory.
  *
- * We've already handled the fast-path in-line, and we own the
- * page table lock.
+ * We've already handled the fast-path in-line.
  *
  * On a two-level page table, this ends up actually being entirely
  * optimized away.
@@ -1719,9 +1724,7 @@
 {
 	pmd_t *new;

-	spin_unlock(&mm->page_table_lock);
 	new = pmd_alloc_one(mm, address);
-	spin_lock(&mm->page_table_lock);
 	if (!new)
 		return NULL;

@@ -1733,7 +1736,11 @@
 		pmd_free(new);
 		goto out;
 	}
-	pgd_populate(mm, pgd, new);
+	/* Ensure that the update is done atomically */
+	if (!pgd_test_and_populate(mm, pgd, new)) {
+		pmd_free(new);
+		goto out;
+	}
 out:
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgalloc.h	2004-08-25 10:50:09.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgalloc.h	2004-08-25 10:53:09.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -84,6 +88,11 @@
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +141,12 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linux-2.6.9-rc1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable.h	2004-08-25 10:53:09.000000000 -0700
@@ -416,9 +416,13 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-3level.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h	2004-08-25 10:53:09.000000000 -0700
@@ -88,6 +88,11 @@
 	return res;
 }

+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(&ptep->pte_low, (long)oldval, (long)newval)==(long)oldval;
+}
+
 static inline int pte_same(pte_t a, pte_t b)
 {
 	return a.pte_low == b.pte_low && a.pte_high == b.pte_high;
Index: linux-2.6.9-rc1/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_som.c	2004-08-25 10:50:04.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_som.c	2004-08-25 10:53:09.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc1/mm/fremap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/fremap.c	2004-08-25 10:50:14.000000000 -0700
+++ linux-2.6.9-rc1/mm/fremap.c	2004-08-25 10:53:09.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc1/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/swapfile.c	2004-08-25 10:50:17.000000000 -0700
+++ linux-2.6.9-rc1/mm/swapfile.c	2004-08-25 10:53:09.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc1/include/linux/mm.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/mm.h	2004-08-25 10:50:14.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/mm.h	2004-08-25 10:53:09.000000000 -0700
@@ -593,7 +593,7 @@
  */
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
-	if (pgd_none(*pgd))
+	if (unlikely(pgd_none(*pgd)))
 		return __pmd_alloc(mm, pgd, address);
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/pgtable.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/pgtable.h	2004-08-25 10:53:09.000000000 -0700
@@ -85,6 +85,15 @@
 }
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG
+static inline pte_t ptep_xchg(pte_t *ptep,pte_t pteval)
+{
+	pte_t pte = *ptep;
+	set_pte(ptep, pteval);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
@@ -94,6 +103,28 @@
 })
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);			\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	if (pte_val(*ptep) == pte_val(oldval)) {
+		ptep_xchg(ptep, newval);
+		return 1;
+	}
+	else
+		return 0;
+}
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
Index: linux-2.6.9-rc1/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_aout.c	2004-08-25 10:50:04.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_aout.c	2004-08-25 10:53:09.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-2level.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h	2004-08-25 10:53:09.000000000 -0700
@@ -40,6 +40,10 @@
 	return (pmd_t *) dir;
 }
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+#define ptep_xchg(xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#ifdef CONFIG_X86_CMPXCHG
+#define ptep_cmpxchg(xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+#endif
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)
Index: linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc1.orig/arch/ia64/mm/hugetlbpage.c	2004-08-25 10:49:28.000000000 -0700
+++ linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c	2004-08-25 10:53:09.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgtable.h	2004-08-25 10:50:09.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgtable.h	2004-08-25 10:53:09.000000000 -0700
@@ -387,6 +387,26 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
+#ifdef CONFIG_SMP
+static inline int
+ptep_cmpxchg (pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+#endif
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +574,14 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#ifdef CONFIG_SMP
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.9-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/array.c	2004-08-25 10:50:01.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/array.c	2004-08-25 10:53:09.000000000 -0700
@@ -389,7 +389,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_elf.c	2004-08-25 10:50:04.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_elf.c	2004-08-25 10:53:09.000000000 -0700
@@ -705,7 +705,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc1/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/tlb.h	2004-08-25 10:50:10.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/tlb.h	2004-08-25 10:53:09.000000000 -0700
@@ -45,6 +45,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
+#include <asm/atomic.h>

 #ifdef CONFIG_SMP
 # define FREE_PTE_NR		2048
@@ -160,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc1/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgalloc.h	2004-08-25 10:50:11.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgalloc.h	2004-08-25 11:11:15.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PMD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,14 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(((unsigned long *)pmd), PMD_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT)) == PMD_NONE;
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +59,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)

Index: linux-2.6.9-rc1/mm/rmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/rmap.c	2004-08-25 10:50:14.000000000 -0700
+++ linux-2.6.9-rc1/mm/rmap.c	2004-08-25 10:53:09.000000000 -0700
@@ -203,7 +203,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -335,7 +335,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -434,7 +437,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -496,11 +499,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -510,11 +508,16 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

-	mm->rss--;
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
+
+	atomic_dec(&mm->mm_rss);
 	BUG_ON(!page->mapcount);
 	page->mapcount--;
 	page_cache_release(page);
@@ -604,11 +607,12 @@

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
@@ -616,7 +620,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -716,7 +720,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				ret = try_to_unmap_cluster(


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch final : i386 tested, x86_64 support added
       [not found]                   ` <B6E8046E1E28D34EB815A11AC8CA3129027B67B4@mtv-atc-605e--n.corp.sgi.com>
@ 2004-08-27 23:20                     ` Christoph Lameter
  2004-08-27 23:36                       ` Andi Kleen
  2004-09-01  4:24                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-27 23:20 UTC (permalink / raw)
  To: akpm, William Lee Irwin III
  Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

Signed-off-by: Christoph Lameter <clameter@sgi.com>

This is the fifth (and hopefully final) release of the page fault
scalability patches. The patches avoid locking during the creation of
page table entries for anonymous memory in a threaded application.
Performance increases significantly with more than 2 threads running
concurrently.

Typical performance increases in the page fault rate are:
2 CPUs -> 10%
4 CPUs -> 30%
8 CPUs -> 50%

This is accomplished by avoiding the page_table_lock spinlock (but not
mm->mmap_sem!) and instead providing new atomic operations on ptes (ptep_xchg,
ptep_cmpxchg) and on pmds and pgds (pgd_test_and_populate, pmd_test_and_populate).
The page_table_lock can be avoided in the following situations:

1. Operations where an empty pte or pmd entry is populated
This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted.

2. Modifications of flags in a pte entry (write/accessed).
These modifications are done by the CPU or by low-level handlers
on various platforms, which also bypass all locks. So this
seems to be safe too.
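
For case 2, the same primitive is enough: read the entry, compute the new
flag bits, and cmpxchg them back; if the cmpxchg fails, someone else updated
the entry and the fault can simply be retried. A minimal userspace sketch
(illustrative names only, not the kernel's macros):

/*
 * Sketch of case 2: update accessed/dirty-style flag bits with cmpxchg
 * instead of taking a lock.  If another thread changed the entry in the
 * meantime the update is dropped, mirroring how a failed ptep_cmpxchg()
 * in the fault path just lets the fault be retried.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pte_t;
#define PTE_ACCESSED	((pte_t)1 << 5)
#define PTE_DIRTY	((pte_t)1 << 6)

static int pte_set_flags(pte_t *ptep, pte_t flags)
{
	pte_t old = __atomic_load_n(ptep, __ATOMIC_ACQUIRE);
	pte_t new = old | flags;

	/* one attempt is enough: a failure means someone else raced us */
	return __atomic_compare_exchange_n(ptep, &old, new, 0,
					   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

int main(void)
{
	pte_t pte = 0x2000 | 1;

	if (pte_set_flags(&pte, PTE_ACCESSED | PTE_DIRTY))
		printf("pte is now %#llx\n", (unsigned long long)pte);
	else
		printf("raced; leaving pte as %#llx\n", (unsigned long long)pte);
	return 0;
}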

It was necessary to make mm->rss atomic since the page_table_lock was also used
to protect incrementing and decrementing rss.
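
The need for this is easy to demonstrate in userspace: once faults are no
longer serialized by a lock, a plain counter drops increments while an atomic
one does not. A small sketch using GCC atomic builtins (mm_rss and
fault_handler are made-up stand-ins, not kernel code; build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

static long mm_rss;	/* stand-in for mm->mm_rss */

static void *fault_handler(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < 100000; i++)
		__atomic_add_fetch(&mm_rss, 1, __ATOMIC_RELAXED);	/* atomic_inc() */
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, fault_handler, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);

	/* with a plain mm_rss++ this would typically print less than 400000 */
	printf("rss = %ld\n", __atomic_load_n(&mm_rss, __ATOMIC_RELAXED));
	return 0;
}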

Changes from V4:
- Debugged and tested i386 support.
- Provided a cmpxchg that works on all i386 CPUs.
- Provided x86_64 support (untested).

All architectures must at least provide pgd_test_and_populate and pmd_test_and_populate.
The patch only includes these for x86_64, i386 and ia64 since I have no other platform
available.
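
For an architecture that cannot (or does not yet) provide an atomic cmpxchg
on page table entries, a fallback that keeps the required contract could look
roughly like the kernel-context sketch below. This is only an illustration of
the semantics (populate if and only if the entry is still empty) and is not
part of the patch:

/*
 * Possible generic fallback (NOT part of this patch): emulate the
 * populate-if-still-empty contract under the page_table_lock for an
 * architecture without a usable cmpxchg.  Assumes the usual pmd_none()
 * and pmd_populate() helpers from the architecture's pgtable headers.
 */
static inline int
pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
	int populated = 0;

	spin_lock(&mm->page_table_lock);
	if (pmd_none(*pmd)) {
		pmd_populate(mm, pmd, pte);
		populated = 1;
	}
	spin_unlock(&mm->page_table_lock);
	return populated;
}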

Scalability could be increased further if the locking scheme (mmap_sem, page_table_lock,
etc.) were changed, but this would require significant changes to the memory subsystem.
This patch lays the groundwork for future work by providing a way to handle page table
entries via xchg and cmpxchg instead of locks.

Index: linux-2.6.9-rc1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc1.orig/kernel/fork.c	2004-08-25 10:50:17.079219000 -0700
+++ linux-2.6.9-rc1/kernel/fork.c	2004-08-27 12:14:09.561009080 -0700
@@ -282,7 +282,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-08-25 10:50:12.534021000 -0700
+++ linux-2.6.9-rc1/include/linux/sched.h	2004-08-27 12:14:09.564008624 -0700
@@ -197,9 +197,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -209,7 +210,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm;
+	unsigned long total_vm, locked_vm;
 	unsigned long def_flags;

 	unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-08-25 10:50:01.908971000 -0700
+++ linux-2.6.9-rc1/fs/proc/task_mmu.c	2004-08-27 12:14:09.602002848 -0700
@@ -37,7 +37,7 @@
 		"VmLib:\t%8lu kB\n",
 		mm->total_vm << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data - stack, stack,
 		exec - lib, lib);
 	up_read(&mm->mmap_sem);
@@ -55,7 +55,7 @@
 	struct vm_area_struct *vma;
 	int size = 0;

-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

Index: linux-2.6.9-rc1/mm/mmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/mmap.c	2004-08-25 10:50:14.808371000 -0700
+++ linux-2.6.9-rc1/mm/mmap.c	2004-08-27 12:14:09.607002088 -0700
@@ -1718,7 +1718,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc1/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/tlb.h	2004-08-25 10:50:11.009716000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/tlb.h	2004-08-27 12:14:09.609001784 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc1/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_flat.c	2004-08-25 10:50:04.733798000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_flat.c	2004-08-27 12:14:09.612001328 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc1/fs/exec.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/exec.c	2004-08-25 10:50:04.285860000 -0700
+++ linux-2.6.9-rc1/fs/exec.c	2004-08-27 12:14:09.615000872 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc1/mm/memory.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/memory.c	2004-08-25 10:50:14.795679000 -0700
+++ linux-2.6.9-rc1/mm/memory.c	2004-08-27 12:14:09.622999656 -0700
@@ -158,9 +158,7 @@
 	if (!pmd_present(*pmd)) {
 		struct page *new;

-		spin_unlock(&mm->page_table_lock);
 		new = pte_alloc_one(mm, address);
-		spin_lock(&mm->page_table_lock);
 		if (!new)
 			return NULL;

@@ -172,8 +170,11 @@
 			pte_free(new);
 			goto out;
 		}
+		if (!pmd_test_and_populate(mm, pmd, new)) {
+			pte_free(new);
+			goto out;
+		}
 		inc_page_state(nr_page_table_pages);
-		pmd_populate(mm, pmd, new);
 	}
 out:
 	return pte_offset_map(pmd, address);
@@ -325,7 +326,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1097,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1314,8 +1315,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1327,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1378,7 +1376,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1406,14 +1404,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1421,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1429,40 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		mm->rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1478,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1494,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1552,7 +1556,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1589,7 +1593,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1606,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1630,54 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 *
+	 * The following statement was removed
+	 * ptep_set_access_flags(vma, address, pte, new_entry, write_access);
+	 * Not sure if all the side effects are replicated here for all platforms.
+	 *
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,30 +1695,27 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
 	pmd = pmd_alloc(mm, pgd, address);

-	if (pmd) {
+	if (likely(pmd)) {
 		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

 /*
  * Allocate page middle directory.
  *
- * We've already handled the fast-path in-line, and we own the
- * page table lock.
+ * We've already handled the fast-path in-line.
  *
  * On a two-level page table, this ends up actually being entirely
  * optimized away.
@@ -1719,9 +1724,7 @@
 {
 	pmd_t *new;

-	spin_unlock(&mm->page_table_lock);
 	new = pmd_alloc_one(mm, address);
-	spin_lock(&mm->page_table_lock);
 	if (!new)
 		return NULL;

@@ -1733,7 +1736,11 @@
 		pmd_free(new);
 		goto out;
 	}
-	pgd_populate(mm, pgd, new);
+	/* Ensure that the update is done atomically */
+	if (!pgd_test_and_populate(mm, pgd, new)) {
+		pmd_free(new);
+		goto out;
+	}
 out:
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgalloc.h	2004-08-25 10:50:09.536971000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgalloc.h	2004-08-27 12:39:06.007514632 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PTE_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -84,6 +88,11 @@
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PMD_NONE) == PMD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +141,12 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PTE_NONE) == PTE_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linux-2.6.9-rc1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable.h	2004-08-25 10:50:11.261746000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable.h	2004-08-27 12:14:09.641996768 -0700
@@ -416,9 +416,11 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTEP_CMPXCHG
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-3level.h	2004-08-25 10:50:11.260034000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h	2004-08-27 12:14:09.643996464 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+*/

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -88,6 +89,26 @@
 	return res;
 }

+static inline pte_t ptep_xchg(pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	  */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 static inline int pte_same(pte_t a, pte_t b)
 {
 	return a.pte_low == b.pte_low && a.pte_high == b.pte_high;
Index: linux-2.6.9-rc1/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_som.c	2004-08-25 10:50:04.738698000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_som.c	2004-08-27 12:14:09.645996160 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc1/mm/fremap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/fremap.c	2004-08-25 10:50:14.788450000 -0700
+++ linux-2.6.9-rc1/mm/fremap.c	2004-08-27 12:14:09.646996008 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc1/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/swapfile.c	2004-08-25 10:50:17.352090000 -0700
+++ linux-2.6.9-rc1/mm/swapfile.c	2004-08-27 12:14:09.650995400 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc1/include/linux/mm.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/mm.h	2004-08-25 10:50:14.940562000 -0700
+++ linux-2.6.9-rc1/include/linux/mm.h	2004-08-27 12:14:09.653994944 -0700
@@ -593,7 +593,7 @@
  */
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
-	if (pgd_none(*pgd))
+	if (unlikely(pgd_none(*pgd)))
 		return __pmd_alloc(mm, pgd, address);
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/pgtable.h	2004-08-25 10:50:11.005634000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/pgtable.h	2004-08-27 12:14:09.655994640 -0700
@@ -85,6 +85,20 @@
 }
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG
+/* Implementation is not suitable for SMP */
+static inline pte_t ptep_xchg(pte_t *ptep, pte_t pteval)
+{
+	unsigned long flags;
+	pte_t pte;
+	local_irq_save(flags);
+	pte = *ptep;
+	set_pte(ptep, pteval);
+	local_irq_restore(flags);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
@@ -94,6 +108,31 @@
 })
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);			\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+/* Implementation is not suitable for SMP */
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	unsigned long flags;
+	pte_t val;
+
+	local_irq_save(flags);
+	val = *ptep;
+	if (pte_same(val, oldval)) set_pte(ptep, newval);
+	local_irq_restore(flags);
+	return pte_same(val, oldval);
+}
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
Index: linux-2.6.9-rc1/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_aout.c	2004-08-25 10:50:04.726365000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_aout.c	2004-08-27 12:14:09.657994336 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-2level.h	2004-08-25 10:50:11.253628000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h	2004-08-27 12:14:09.658994184 -0700
@@ -40,6 +40,8 @@
 	return (pmd_t *) dir;
 }
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+#define ptep_xchg(xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)
Index: linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc1.orig/arch/ia64/mm/hugetlbpage.c	2004-08-25 10:49:28.016410000 -0700
+++ linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c	2004-08-27 12:14:09.660993880 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgtable.h	2004-08-25 10:50:09.538643000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgtable.h	2004-08-27 12:14:09.663993424 -0700
@@ -387,6 +387,26 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
+#ifdef CONFIG_SMP
+static inline int
+ptep_cmpxchg (pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+#endif
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +574,14 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#ifdef CONFIG_SMP
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.9-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/array.c	2004-08-25 10:50:01.877908000 -0700
+++ linux-2.6.9-rc1/fs/proc/array.c	2004-08-27 12:14:09.665993120 -0700
@@ -389,7 +389,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_elf.c	2004-08-25 10:50:04.729208000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_elf.c	2004-08-27 12:14:09.669992512 -0700
@@ -705,7 +705,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc1/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/tlb.h	2004-08-25 10:50:10.045321000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/tlb.h	2004-08-27 12:14:09.671992208 -0700
@@ -45,6 +45,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
+#include <asm/atomic.h>

 #ifdef CONFIG_SMP
 # define FREE_PTE_NR		2048
@@ -160,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc1/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgalloc.h	2004-08-25 10:50:11.248995000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgalloc.h	2004-08-27 12:40:21.221080432 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PTE_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,18 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PTE_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PTE_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PTE_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PTE_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +63,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)

Index: linux-2.6.9-rc1/mm/rmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/rmap.c	2004-08-25 10:50:14.849241000 -0700
+++ linux-2.6.9-rc1/mm/rmap.c	2004-08-27 12:14:09.676991448 -0700
@@ -203,7 +203,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -335,7 +335,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -434,7 +437,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -496,11 +499,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -510,11 +508,16 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

-	mm->rss--;
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
+
+	atomic_dec(&mm->mm_rss);
 	BUG_ON(!page->mapcount);
 	page->mapcount--;
 	page_cache_release(page);
@@ -604,11 +607,12 @@

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
@@ -616,7 +620,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -716,7 +720,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				ret = try_to_unmap_cluster(
Index: linux-2.6.9-rc1/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/system.h	2004-08-25 10:50:09.442480000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/system.h	2004-08-27 12:14:09.678991144 -0700
@@ -203,77 +203,6 @@
  __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
  __set_64bit(ptr, ll_low(value), ll_high(value)) )

-/*
- * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
- * Note 2: xchg has side effect, so that attribute volatile is necessary,
- *	  but generally the primitive is invalid, *ptr is output argument. --ANK
- */
-static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
-{
-	switch (size) {
-		case 1:
-			__asm__ __volatile__("xchgb %b0,%1"
-				:"=q" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 2:
-			__asm__ __volatile__("xchgw %w0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 4:
-			__asm__ __volatile__("xchgl %0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-	}
-	return x;
-}
-
-/*
- * Atomic compare and exchange.  Compare OLD with MEM, if identical,
- * store NEW in MEM.  Return the initial value in MEM.  Success is
- * indicated by comparing RETURN with OLD.
- */
-
-#ifdef CONFIG_X86_CMPXCHG
-#define __HAVE_ARCH_CMPXCHG 1
-#endif
-
-static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
-				      unsigned long new, int size)
-{
-	unsigned long prev;
-	switch (size) {
-	case 1:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 2:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 4:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	}
-	return old;
-}
-
-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc1/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc1.orig/arch/i386/Kconfig	2004-08-25 10:49:33.470457000 -0700
+++ linux-2.6.9-rc1/arch/i386/Kconfig	2004-08-27 12:14:09.682990536 -0700
@@ -341,6 +341,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc1/include/asm-i386/processor.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/processor.h	2004-08-25 10:50:06.054711000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/processor.h	2004-08-27 12:14:09.685990080 -0700
@@ -652,4 +652,138 @@
 #define ARCH_HAS_SCHED_WAKE_IDLE
 #endif

+/*
+ * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
+ * Note 2: xchg has side effect, so that attribute volatile is necessary,
+ *	  but generally the primitive is invalid, *ptr is output argument. --ANK
+ */
+static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
+{
+	switch (size) {
+		case 1:
+			__asm__ __volatile__("xchgb %b0,%1"
+				:"=q" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 2:
+			__asm__ __volatile__("xchgw %w0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 4:
+			__asm__ __volatile__("xchgl %0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+	}
+	return x;
+}
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_CMPXCHG 1
+#endif
+
+static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+#ifndef CONFIG_X86_CMPXCHG
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all
+	 */
+	unsigned long flags;
+
+	/* All CPUs except 386 support CMPXCHG */
+	if (cpu_data->x86 > 3) goto have_cmpxchg;
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new;
+		break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+have_cmpxchg:
+#endif
+	switch (size) {
+	case 1:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 2:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 4:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	}
+	return prev;
+}
+
+static inline unsigned long long cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+#ifndef CONFIG_X86_CMPXCHG8B
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are running really on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data,X86_FEATURE_CX8)) goto have_cmpxchg8b;
+
+	/* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+
+have_cmpxchg8b:
+#endif
+
+	 __asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+       		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (ptr)
+	: "memory");
+	return prev ;
+}
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
 #endif /* __ASM_I386_PROCESSOR_H */
Index: linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgalloc.h	2004-08-25 10:50:10.677424000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h	2004-08-27 12:39:15.464077016 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PTE_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg(pgd, PMD_NONE, _PAGE_TABLE | __pa(pmd)) == PMD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PTE_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PTE_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9-rc1/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgtable.h	2004-08-25 10:50:10.679171000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgtable.h	2004-08-27 12:21:55.676148808 -0700
@@ -102,6 +102,8 @@
 ((unsigned long) __va(pgd_val(pgd) & PHYSICAL_PAGE_MASK))

 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte, 0))
+#define ptep_xchg(xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
 #define pte_same(a, b)		((a).pte == (b).pte)

 #define PML4_SIZE	(1UL << PML4_SHIFT)
@@ -442,6 +444,8 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTEP_XCHG
+#define __HAVE_ARCH_PTEP_CMPXCHG
 #include <asm-generic/pgtable.h>

 #endif /* _X86_64_PGTABLE_H */



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-27 23:20                     ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
@ 2004-08-27 23:36                       ` Andi Kleen
  2004-08-27 23:43                         ` David S. Miller
  2004-08-28  0:19                         ` Christoph Lameter
  2004-09-01  4:24                       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-27 23:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

> Index: linux-2.6.9-rc1/include/linux/sched.h
> ===================================================================
> --- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-08-25 10:50:12.534021000 -0700
> +++ linux-2.6.9-rc1/include/linux/sched.h	2004-08-27 12:14:09.564008624 -0700
> @@ -197,9 +197,10 @@
>  	pgd_t * pgd;
>  	atomic_t mm_users;			/* How many users with user space? */
>  	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
> +	atomic_t mm_rss;			/* Number of pages used by this mm struct */

atomic_t is normally 32bit, even on a 64bit arch.  This will limit the max 
memory size per process to 2^(32+PAGE_SHIFT). I don't think that's a good idea.

On some architectures it used to be 24bit only even, but I think that
has been fixed.

I think you need atomic64_t

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-27 23:36                       ` Andi Kleen
@ 2004-08-27 23:43                         ` David S. Miller
  2004-08-28  0:19                         ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: David S. Miller @ 2004-08-27 23:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: clameter, akpm, wli, davem, raybry, ak, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Sat, 28 Aug 2004 01:36:02 +0200
Andi Kleen <ak@suse.de> wrote:

> On some architectures it used to be 24bit only even, but I think that
> has been fixed.

It has, on sparc32 the hashed spinlock scheme is being used
so it's a full 32-bit counter now.

Well, it's not even 32-bits, it's actually 31-bits since the
value is declared as signed.

Only 64-bit platforms provide the atomic64_t implementation.
We'd need to deal with that before making your suggested
change.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-27 23:36                       ` Andi Kleen
  2004-08-27 23:43                         ` David S. Miller
@ 2004-08-28  0:19                         ` Christoph Lameter
  2004-08-28  0:23                           ` David S. Miller
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-28  0:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, William Lee Irwin III, David S. Miller, raybry, ak, benh,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

That is still 2^(32+12) = 2^44 = 16TB.

On Sat, 28 Aug 2004, Andi Kleen wrote:

> > Index: linux-2.6.9-rc1/include/linux/sched.h
> > ===================================================================
> > --- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-08-25 10:50:12.534021000 -0700
> > +++ linux-2.6.9-rc1/include/linux/sched.h	2004-08-27 12:14:09.564008624 -0700
> > @@ -197,9 +197,10 @@
> >  	pgd_t * pgd;
> >  	atomic_t mm_users;			/* How many users with user space? */
> >  	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
> > +	atomic_t mm_rss;			/* Number of pages used by this mm struct */
>
> atomic_t is normally 32bit, even on a 64bit arch.  This will limit the max
> memory size per process to 2^(32+PAGE_SHIFT). I don't think that's a good idea.
>
> On some architectures it used to be 24bit only even, but I think that
> has been fixed.
>
> I think you need atomic64_t
>
> -Andi
>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  0:19                         ` Christoph Lameter
@ 2004-08-28  0:23                           ` David S. Miller
  2004-08-28  0:36                             ` Andrew Morton
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-28  0:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: ak, akpm, wli, davem, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004 17:19:11 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> That is still 2^(32+12) = 2^44 = 16TB.

It's actually:

	2 ^ (31 + PAGE_SHIFT)

'31' because atomic_t is 'signed' and PAGE_SHIFT should
be obvious.

Christoph definitely has a point, this is even more virtual space
than most of the 64-bit platforms even support.  (Sparc64 is
2^43 and I believe ia64 is similar)  and this limit actually
mostly comes from the 3-level page table limits.
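
For concreteness, with 4 kB pages (PAGE_SHIFT == 12) the rss limit works
out as below; the macro names are purely illustrative, not from the patch:

	/* A signed 32-bit counter can account at most 2^31 - 1 resident pages. */
	#define MAX_RSS_PAGES	((1UL << 31) - 1)
	/* With 4 kB pages that is just under 2^(31 + 12) = 2^43 bytes, i.e. 8 TB per mm. */
	#define MAX_RSS_BYTES	((unsigned long long)MAX_RSS_PAGES << 12)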


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  0:23                           ` David S. Miller
@ 2004-08-28  0:36                             ` Andrew Morton
  2004-08-28  0:40                               ` David S. Miller
  2004-08-28  1:02                               ` Andi Kleen
  0 siblings, 2 replies; 106+ messages in thread
From: Andrew Morton @ 2004-08-28  0:36 UTC (permalink / raw)
  To: David S. Miller
  Cc: clameter, ak, wli, davem, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

"David S. Miller" <davem@davemloft.net> wrote:
>
> On Fri, 27 Aug 2004 17:19:11 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > That is still 2^(32+12) = 2^44 = 16TB.
> 
> It's actually:
> 
> 	2 ^ (31 + PAGE_SHIFT)
> 
> '31' because atomic_t is 'signed' and PAGE_SHIFT should
> be obvious.
> 
> Christoph definitely has a point, this is even more virtual space
> than most of the 64-bit platforms even support.  (Sparc64 is
> 2^43 and I believe ia64 is similar)

When can we reasonably expect someone to blow this out of the water? 
Within the next couple of years, I suspect?

It does look like we need a new type which is atomic64 on 64-bit and
atomic32 on 32-bit.  That could be used to fix the
mmaping-the-same-page-4G-times-kills-the-kernel bug too.

> and this limit actually
> mostly comes from the 3-level page table limits.

This reminds me - where's that 4-level pagetable patch got to?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  0:36                             ` Andrew Morton
@ 2004-08-28  0:40                               ` David S. Miller
  2004-08-28  1:05                                 ` Andi Kleen
  2004-08-28  1:02                               ` Andi Kleen
  1 sibling, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-28  0:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, ak, wli, davem, raybry, ak, benh, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004 17:36:41 -0700
Andrew Morton <akpm@osdl.org> wrote:

> This reminds me - where's that 4-level pagetable patch got to?

I've never seen that.

Wow, with that thing we'd _REALLY_ need the clear_page_range()
optimizations as 4-levels will be extremely sparse to access
on address space teardown.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  0:36                             ` Andrew Morton
  2004-08-28  0:40                               ` David S. Miller
@ 2004-08-28  1:02                               ` Andi Kleen
  2004-08-28  1:39                                 ` Andrew Morton
  2004-08-28 21:41                                 ` Daniel Phillips
  1 sibling, 2 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-28  1:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David S. Miller, clameter, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, Aug 27, 2004 at 05:36:41PM -0700, Andrew Morton wrote:
> "David S. Miller" <davem@davemloft.net> wrote:
> >
> > On Fri, 27 Aug 2004 17:19:11 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> > 
> > > That is still 2^(32+12) = 2^44 = 16TB.
> > 
> > It's actually:
> > 
> > 	2 ^ (31 + PAGE_SHIFT)
> > 
> > '31' because atomic_t is 'signed' and PAGE_SHIFT should
> > be obvious.
> > 
> > Christoph definitely has a point, this is even more virtual space
> > than most of the 64-bit platforms even support.  (Sparc64 is
> > 2^43 and I believe ia64 is similar)
> 
> When can we reasonably expect someone to blow this out of the water? 
> Within the next couple of years, I suspect?

With 4 level page tables x86-64 will support 47 bits virtual theoretical.
They cannot be used right now because the current x86-64 CPUs have
40 bits physical max and it is currently even hardcoded to 40bits,
but I planned to drop that in the 4 level patch (in fact I already did) 
so that the kernel will in theory support CPUs with more physical memory.


> It does look like we need a new type which is atomic64 on 64-bit and
> atomic32 on 32-bit.  That could be used to fix the
> mmaping-the-same-page-4G-times-kills-the-kernel bug too.

Yep.  Good plan. atomic_long_t ? 
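
A minimal sketch of what such a type could look like; the name
atomic_long_t and the CONFIG_64BIT switch are assumptions here, nothing
like this is in the posted patch:

	#ifdef CONFIG_64BIT
	typedef atomic64_t atomic_long_t;
	#define atomic_long_read(v)	atomic64_read(v)
	#define atomic_long_set(v, i)	atomic64_set(v, i)
	#define atomic_long_inc(v)	atomic64_inc(v)
	#define atomic_long_dec(v)	atomic64_dec(v)
	#else
	typedef atomic_t atomic_long_t;
	#define atomic_long_read(v)	atomic_read(v)
	#define atomic_long_set(v, i)	atomic_set(v, i)
	#define atomic_long_inc(v)	atomic_inc(v)
	#define atomic_long_dec(v)	atomic_dec(v)
	#endif

mm_rss could then be declared atomic_long_t and the atomic_inc()/atomic_dec()
calls in the patch routed through these wrappers, giving a full-width counter
on 64-bit without penalizing 32-bit.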

> 
> > and this limit actually
> > mostly comes from the 3-level page table limits.
> 
> This reminds me - where's that 4-level pagetable patch got to?

It exists on my HD, but is not really finished yet.

I was on vacation and travelling and had some other things to do, so it got 
delayed a bit, but I hope to work on it soon again.

-Andi


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  0:40                               ` David S. Miller
@ 2004-08-28  1:05                                 ` Andi Kleen
  2004-08-28  1:11                                   ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Andi Kleen @ 2004-08-28  1:05 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andrew Morton, clameter, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, Aug 27, 2004 at 05:40:38PM -0700, David S. Miller wrote:
> On Fri, 27 Aug 2004 17:36:41 -0700
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > This reminds me - where's that 4-level pagetable patch got to?
> 
> I've never seen that.

It's not really finished yet...

> 
> Wow, with that thing we'd _REALLY_ need the clear_page_range()
> optimizations as 4-levels will be extremely sparse to access
> on address space teardown.

I would expect most programs to not have that many holes, so
it will probably not make that much difference for them. But for extreme
cases like ElectricFence'd programs, agreed, it may need some optimizations later.
The first implementation is minimal changes only, though.

Also BTW most archs will continue to use 2 or 3 levels, you only have
to switch to 4 levels if you want to.

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  1:05                                 ` Andi Kleen
@ 2004-08-28  1:11                                   ` David S. Miller
  2004-08-28  1:17                                     ` Andi Kleen
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-08-28  1:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, clameter, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On 28 Aug 2004 03:05:42 +0200
Andi Kleen <ak@muc.de> wrote:

> I would expect most programs to not have that many holes,

Holes are not the issue.

clear_page_tables() doesn't even use the VMA list as a guide
(it actually can't), it just walks the page tables one pgd at a
time, one pmd at a time, one pte at a time.  And this has the
worst cache behavior even for simple cases like lat_proc
in lmbench.

Each pgd/pmd scan is a data reading walk of a whole page
(or whatever size the particular page table level blocks
are for the platform, usually they are PAGE_SIZE).
It's very costly.
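
Roughly, the walk being described has this shape (a simplified sketch of
the idea, not the actual clear_page_tables(); the freeing steps are elided):

	for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
		if (pgd_none(*pgd))
			continue;
		pmd = pmd_offset(pgd, 0);
		/* the whole pmd page is read, however sparse it is */
		for (j = 0; j < PTRS_PER_PMD; j++, pmd++) {
			if (pmd_none(*pmd))
				continue;
			/* ... free the pte page hanging off this pmd ... */
		}
		/* ... then free the pmd page itself ... */
	}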

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  1:11                                   ` David S. Miller
@ 2004-08-28  1:17                                     ` Andi Kleen
  0 siblings, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-28  1:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: akpm, clameter, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, Aug 27, 2004 at 06:11:42PM -0700, David S. Miller wrote:
> On 28 Aug 2004 03:05:42 +0200
> Andi Kleen <ak@muc.de> wrote:
> 
> > I would expect most programs to be not have that many holes,
> 
> Holes are not the issue.

Holey parts will be quickly skipped in pml4.

> 
> clear_page_tables() doesn't even use the VMA list as a guide
> (it actually can't), it just walks the page tables one pgd at a
> time, one pmd at a time, one pte at a time.  And this has the
> worst cache behavior even for simple cases like lat_proc
> in lmbench.
> 
> Each pgd/pmd scan is a data reading walk of a whole page
> (or whatever size the particular page table level blocks
> are for the platform, usually they are PAGE_SIZE).
> It's very costly.

Ok, haven't done measurements yet. I would hope though that
on any arch that needs 4 levels reading another PAGE_SIZE worth 
of memory is not prohibitive.

That said any optimizations are welcome of course.

-Andi



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  1:02                               ` Andi Kleen
@ 2004-08-28  1:39                                 ` Andrew Morton
  2004-08-28  2:08                                   ` Paul Mackerras
  2004-08-28 21:41                                 ` Daniel Phillips
  1 sibling, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2004-08-28  1:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: davem, clameter, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Andi Kleen <ak@muc.de> wrote:
>
> > When can we reasonably expect someone to blow this out of the water? 
>  > Within the next couple of years, I suspect?
> 
>  With 4 level page tables x86-64 will support 47 bits virtual theoretical.
>  They cannot be used right now because the current x86-64 CPUs have
>  40 bits physical max and it is currently even hardcoded to 40bits,
>  but I planned to drop that in the 4 level patch (in fact I already did) 
>  so that the kernel will in theory support CPUs with more physical memory.
> 

hm.  What's the maximum virtual size on power5?

>  > It does look like we need a new type which is atomic64 on 64-bit and
>  > atomic32 on 32-bit.  That could be used to fix the
>  > mmaping-the-same-page-4G-times-kills-the-kernel bug too.
> 
>  Yep.  Good plan. atomic_long_t ? 

Sounds good.  Converting page->_count should be fairly straightforward now
too, as it's all done via wrappers.

>  > 
>  > > and this limit actually
>  > > mostly comes from the 3-level page table limits.
>  > 
>  > This reminds me - where's that 4-level pagetable patch got to?
> 
>  It exists on my HD, but is not really finished yet.
> 
>  I was on vacation and travelling and had some other things to do, so it got 
>  delayed a bit, but I hope to work on it soon again.

OK, thanks.  There's no rush on this one.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  1:39                                 ` Andrew Morton
@ 2004-08-28  2:08                                   ` Paul Mackerras
  2004-08-28  3:32                                     ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Paul Mackerras @ 2004-08-28  2:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, davem, clameter, ak, wli, davem, raybry, benh,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

Andrew Morton writes:

> hm.  What's the maximum virtual size on power5?

The hardware MMU maps a full 64-bit effective address to a physical
address (one of the (few) advantages of using a hash table :).  That's
true on all the ppc64 processors.  I'm not sure how many bits of
physical address the power5 chip uses, but it is around 50.

Under Linux we are currently limited to a 41-bit virtual address space
(2TB) for user processes, because of the three-level page tables and
the 4kB page size (the pgd and pmd entries are 32 bits).  Due to
various things the linear mapping (and thus the amount of RAM we
can use) is currently also limited to 2TB.  We can increase that
without too much pain, and we'll have to do that at some stage (no one
has yet offered us a 2TB box to play with, but the time will come, for
sure :).
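
For reference, the 41 bits follow directly from the level widths described
above (4 kB pages, 8-byte pte entries, 4-byte pmd and pgd entries); the
breakdown below is just that arithmetic spelled out:

	  12 bits  page offset  (4 kB pages)
	+  9 bits  pte level    (4096 / 8-byte entries = 512)
	+ 10 bits  pmd level    (4096 / 4-byte entries = 1024)
	+ 10 bits  pgd level    (4096 / 4-byte entries = 1024)
	  ----
	  41 bits of user virtual address space = 2^41 bytes = 2 TB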

Regards,
Paul.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  2:08                                   ` Paul Mackerras
@ 2004-08-28  3:32                                     ` Christoph Lameter
  2004-08-28  3:42                                       ` Andrew Morton
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-28  3:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, Andi Kleen, davem, ak, wli, davem, raybry, benh,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

So I take it the move to atomic for rss is acceptable?

The patch follows against 2.6.9-rc1-mm1 (the earlier post was against
2.6.9-rc1).

Also here are some numbers (64GB allocations) on a 512 CPU system with
500GB of memory. The patch roughly doubles the performance at a high
number of CPUs:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 64   3    1    2.940s    308.449s 311.024s 40408.876  40427.880
 64   3    2    3.146s    347.536s 181.031s 35881.179  69397.783
 64   3    4    3.104s    356.892s 119.032s 34952.800 105447.468
 64   3    8    3.009s    334.970s  63.070s 37229.683 197513.752
 64   3   16    3.150s    419.547s  44.016s 29768.126 284913.189
 64   3   32    4.471s    713.090s  40.088s 17535.647 307728.613
 64   3   64    9.967s   1777.146s  45.027s  7040.911 277906.805
 64   3  128   23.077s   5925.452s  63.064s  2115.298 197709.241
 64   3  256   27.838s   3185.475s  34.005s  3915.867 369527.746
 64   3  512   25.287s   1500.589s  16.078s  8246.349 749845.256

Without the patch:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 64   3    1    2.906s    306.330s 309.027s 40690.257  40684.877
 64   3    2    3.162s    406.475s 211.037s 30717.157  59528.843
 64   3    4    2.988s    468.129s 143.018s 26708.653  87878.947
 64   3    8    3.096s    643.947s  95.081s 19446.756 131318.325
 64   3   16    3.635s   1211.389s  96.014s 10356.093 130875.623
 64   3   32   20.920s   1277.679s  98.037s  9689.596 127908.850
 64   3   64   66.025s   1327.081s 104.032s  9032.264 120615.748
 64   3  128  285.900s   1302.198s 114.060s  7923.252 109790.867
 64   3  256  307.222s    672.339s  69.094s 12845.447 179906.645
 64   3  512  255.730s    318.023s  40.042s 21930.841 311235.794

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc1.orig/kernel/fork.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/kernel/fork.c	2004-08-27 19:10:34.000000000 -0700
@@ -302,7 +302,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/sched.h	2004-08-27 19:12:01.000000000 -0700
@@ -213,9 +213,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -225,7 +226,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rlimit_rss, rss, total_vm, locked_vm, shared_vm;
+	unsigned long rlimit_rss, total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

 	unsigned long saved_auxv[40]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-08-27 19:09:54.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/task_mmu.c	2004-08-27 19:31:28.000000000 -0700
@@ -21,7 +21,7 @@
 		"VmLib:\t%8lu kB\n",
 		(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long) atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib);
 	return buffer;
@@ -38,7 +38,7 @@
 	*shared = mm->shared_vm;
 	*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm - *text;
-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	return mm->total_vm;
 }

Index: linux-2.6.9-rc1/mm/mmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/mmap.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/mm/mmap.c	2004-08-27 19:10:34.000000000 -0700
@@ -1843,7 +1843,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc1/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/tlb.h	2004-08-24 00:01:50.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/tlb.h	2004-08-27 19:10:34.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc1/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_flat.c	2004-08-24 00:01:50.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_flat.c	2004-08-27 19:10:34.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc1/fs/exec.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/exec.c	2004-08-27 19:09:53.000000000 -0700
+++ linux-2.6.9-rc1/fs/exec.c	2004-08-27 19:10:34.000000000 -0700
@@ -320,7 +320,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc1/mm/memory.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/memory.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/mm/memory.c	2004-08-27 19:10:34.000000000 -0700
@@ -158,9 +158,7 @@
 	if (!pmd_present(*pmd)) {
 		struct page *new;

-		spin_unlock(&mm->page_table_lock);
 		new = pte_alloc_one(mm, address);
-		spin_lock(&mm->page_table_lock);
 		if (!new)
 			return NULL;

@@ -172,8 +170,11 @@
 			pte_free(new);
 			goto out;
 		}
+		if (!pmd_test_and_populate(mm, pmd, new)) {
+			pte_free(new);
+			goto out;
+		}
 		inc_page_state(nr_page_table_pages);
-		pmd_populate(mm, pmd, new);
 	}
 out:
 	return pte_offset_map(pmd, address);
@@ -325,7 +326,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1117,7 +1118,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1335,8 +1336,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1348,15 +1348,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1399,7 +1397,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1427,14 +1425,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1446,7 +1442,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1455,30 +1450,40 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		mm->rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1494,12 +1499,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1510,9 +1515,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1573,7 +1577,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1610,7 +1614,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1623,13 +1627,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1648,49 +1651,54 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 *
+	 * The following statement was removed
+	 * ptep_set_access_flags(vma, address, pte, new_entry, write_access);
+	 * Not sure if all the side effects are replicated here for all platforms.
+	 *
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1708,30 +1716,27 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
 	pmd = pmd_alloc(mm, pgd, address);

-	if (pmd) {
+	if (likely(pmd)) {
 		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

 /*
  * Allocate page middle directory.
  *
- * We've already handled the fast-path in-line, and we own the
- * page table lock.
+ * We've already handled the fast-path in-line.
  *
  * On a two-level page table, this ends up actually being entirely
  * optimized away.
@@ -1740,9 +1745,7 @@
 {
 	pmd_t *new;

-	spin_unlock(&mm->page_table_lock);
 	new = pmd_alloc_one(mm, address);
-	spin_lock(&mm->page_table_lock);
 	if (!new)
 		return NULL;

@@ -1754,7 +1757,11 @@
 		pmd_free(new);
 		goto out;
 	}
-	pgd_populate(mm, pgd, new);
+	/* Ensure that the update is done atomically */
+	if (!pgd_test_and_populate(mm, pgd, new)) {
+		pmd_free(new);
+		goto out;
+	}
 out:
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgalloc.h	2004-08-24 00:01:52.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgalloc.h	2004-08-27 19:10:34.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PTE_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -84,6 +88,11 @@
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PMD_NONE) == PMD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +141,12 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PTE_NONE) == PTE_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linux-2.6.9-rc1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable.h	2004-08-27 19:10:34.000000000 -0700
@@ -409,9 +409,11 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTEP_CMPXCHG
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-3level.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h	2004-08-27 19:10:34.000000000 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+ */

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -88,6 +89,26 @@
 	return res;
 }

+static inline pte_t ptep_xchg(pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	  */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	/* A PAE pte is 64 bits wide, so this must use cmpxchg8b, not cmpxchg */
+	return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 static inline int pte_same(pte_t a, pte_t b)
 {
 	return a.pte_low == b.pte_low && a.pte_high == b.pte_high;
Index: linux-2.6.9-rc1/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_som.c	2004-08-24 00:02:26.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_som.c	2004-08-27 19:10:34.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc1/mm/fremap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/fremap.c	2004-08-24 00:01:51.000000000 -0700
+++ linux-2.6.9-rc1/mm/fremap.c	2004-08-27 19:10:34.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc1/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/swapfile.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/mm/swapfile.c	2004-08-27 19:10:34.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc1/include/linux/mm.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/mm.h	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/mm.h	2004-08-27 19:10:34.000000000 -0700
@@ -622,7 +622,7 @@
  */
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
-	if (pgd_none(*pgd))
+	if (unlikely(pgd_none(*pgd)))
 		return __pmd_alloc(mm, pgd, address);
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/pgtable.h	2004-08-24 00:02:25.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/pgtable.h	2004-08-27 19:10:34.000000000 -0700
@@ -85,6 +85,20 @@
 }
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG
+/* Implementation is not suitable for SMP */
+static inline pte_t ptep_xchg(pte_t *ptep, pte_t pteval)
+{
+	unsigned long flags;
+	pte_t pte;
+
+	local_irq_save(flags);
+	pte = *ptep;
+	set_pte(ptep, pteval);
+	local_irq_restore(flags);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
@@ -94,6 +108,31 @@
 })
 #endif

+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);			\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+/* Implementation is not suitable for SMP */
+static inline int ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	unsigned long flags;
+	pte_t val;
+
+	local_irq_save(flags);
+	val = *ptep;
+	if (pte_same(val, oldval))
+		set_pte(ptep, newval);
+	local_irq_restore(flags);
+	return pte_same(val, oldval);
+}
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
Index: linux-2.6.9-rc1/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_aout.c	2004-08-27 19:09:51.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_aout.c	2004-08-27 19:10:34.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = current->mm->mmap_base;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-2level.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h	2004-08-27 19:10:34.000000000 -0700
@@ -40,6 +40,8 @@
 	return (pmd_t *) dir;
 }
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+#define ptep_xchg(xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define pte_none(x)		(!(x).pte_low)
Index: linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc1.orig/arch/ia64/mm/hugetlbpage.c	2004-08-24 00:02:47.000000000 -0700
+++ linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c	2004-08-27 19:10:34.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgtable.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgtable.h	2004-08-27 19:10:34.000000000 -0700
@@ -387,6 +387,26 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
+#ifdef CONFIG_SMP
+static inline int
+ptep_cmpxchg (pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+#endif
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +574,14 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#ifdef CONFIG_SMP
+#define __HAVE_ARCH_PTEP_CMPXCHG
+#endif
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.9-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/array.c	2004-08-27 19:09:54.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/array.c	2004-08-27 19:10:34.000000000 -0700
@@ -387,7 +387,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_elf.c	2004-08-27 19:09:53.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_elf.c	2004-08-27 19:18:15.000000000 -0700
@@ -708,7 +708,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc1/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/tlb.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/tlb.h	2004-08-27 19:10:34.000000000 -0700
@@ -46,6 +46,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
+#include <asm/atomic.h>

 #ifdef CONFIG_SMP
 # define FREE_PTE_NR		2048
@@ -161,11 +162,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc1/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgalloc.h	2004-08-24 00:01:53.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgalloc.h	2004-08-27 19:10:34.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PTE_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,18 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PTE_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PTE_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PTE_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PTE_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +63,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)

Index: linux-2.6.9-rc1/mm/rmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/rmap.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/mm/rmap.c	2004-08-27 19:33:38.000000000 -0700
@@ -262,7 +262,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -291,7 +291,7 @@
 	if (mm != current->mm && has_swap_token(mm))
 		referenced++;

-	if (mm->rss > mm->rlimit_rss)
+	if (atomic_read(&mm->mm_rss) > mm->rlimit_rss)
 		referenced = 0;

 	(*mapcount)--;
@@ -422,7 +422,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if the page
+ * is already visible to the rest of the VM.
+ * The lock does not need to be held if the page was just
+ * allocated and is not yet mapped anywhere else.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -503,7 +506,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -564,11 +567,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -578,11 +576,16 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

-	mm->rss--;
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
+
+	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);

@@ -670,11 +673,12 @@

 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
@@ -682,7 +686,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -781,7 +785,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.9-rc1/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/system.h	2004-08-24 00:01:51.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/system.h	2004-08-27 19:10:34.000000000 -0700
@@ -203,77 +203,6 @@
  __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
  __set_64bit(ptr, ll_low(value), ll_high(value)) )

-/*
- * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
- * Note 2: xchg has side effect, so that attribute volatile is necessary,
- *	  but generally the primitive is invalid, *ptr is output argument. --ANK
- */
-static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
-{
-	switch (size) {
-		case 1:
-			__asm__ __volatile__("xchgb %b0,%1"
-				:"=q" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 2:
-			__asm__ __volatile__("xchgw %w0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 4:
-			__asm__ __volatile__("xchgl %0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-	}
-	return x;
-}
-
-/*
- * Atomic compare and exchange.  Compare OLD with MEM, if identical,
- * store NEW in MEM.  Return the initial value in MEM.  Success is
- * indicated by comparing RETURN with OLD.
- */
-
-#ifdef CONFIG_X86_CMPXCHG
-#define __HAVE_ARCH_CMPXCHG 1
-#endif
-
-static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
-				      unsigned long new, int size)
-{
-	unsigned long prev;
-	switch (size) {
-	case 1:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 2:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 4:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	}
-	return old;
-}
-
-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc1/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc1.orig/arch/i386/Kconfig	2004-08-27 19:09:43.000000000 -0700
+++ linux-2.6.9-rc1/arch/i386/Kconfig	2004-08-27 19:10:34.000000000 -0700
@@ -341,6 +341,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc1/include/asm-i386/processor.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/processor.h	2004-08-27 19:09:56.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/processor.h	2004-08-27 19:10:34.000000000 -0700
@@ -651,4 +651,138 @@

 #define cache_line_size() (boot_cpu_data.x86_cache_alignment)

+/*
+ * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
+ * Note 2: xchg has side effect, so that attribute volatile is necessary,
+ *	  but generally the primitive is invalid, *ptr is output argument. --ANK
+ */
+static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
+{
+	switch (size) {
+		case 1:
+			__asm__ __volatile__("xchgb %b0,%1"
+				:"=q" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 2:
+			__asm__ __volatile__("xchgw %w0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 4:
+			__asm__ __volatile__("xchgl %0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+	}
+	return x;
+}
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_CMPXCHG 1
+#endif
+
+static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+#ifndef CONFIG_X86_CMPXCHG
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all
+	 */
+	unsigned long flags;
+
+	/* All CPUs except 386 support CMPXCHG */
+	if (cpu_data->x86 > 3) goto have_cmpxchg;
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new;
+		break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+have_cmpxchg:
+#endif
+	switch (size) {
+	case 1:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 2:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 4:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	}
+	return prev;
+}
+
+static inline unsigned long long cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+#ifndef CONFIG_X86_CMPXCHG8B
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are running really on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data,X86_FEATURE_CX8)) goto have_cmpxchg8b;
+
+	/* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+
+have_cmpxchg8b:
+#endif
+
+	 __asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+       		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (*ptr)
+	: "memory");
+	return prev;
+}
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
 #endif /* __ASM_I386_PROCESSOR_H */
Index: linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgalloc.h	2004-08-24 00:02:47.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h	2004-08-27 19:10:34.000000000 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PTE_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg(pgd, PMD_NONE, _PAGE_TABLE | __pa(pmd)) == PMD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PTE_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PTE_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9-rc1/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgtable.h	2004-08-24 00:03:19.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgtable.h	2004-08-27 19:10:35.000000000 -0700
@@ -102,6 +102,8 @@
 ((unsigned long) __va(pgd_val(pgd) & PHYSICAL_PAGE_MASK))

 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte, 0))
+#define ptep_xchg(xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
 #define pte_same(a, b)		((a).pte == (b).pte)

 #define PML4_SIZE	(1UL << PML4_SHIFT)
@@ -442,6 +444,8 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTEP_XCHG
+#define __HAVE_ARCH_PTEP_CMPXCHG
 #include <asm-generic/pgtable.h>

 #endif /* _X86_64_PGTABLE_H */
Index: linux-2.6.9-rc1/mm/thrash.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/thrash.c	2004-08-27 19:09:57.000000000 -0700
+++ linux-2.6.9-rc1/mm/thrash.c	2004-08-27 19:28:20.000000000 -0700
@@ -35,7 +35,7 @@
 		ret = SWAP_TOKEN_ENOUGH_RSS;
 	else if (time_after(jiffies, swap_token_timeout))
 		ret = SWAP_TOKEN_TIMED_OUT;
-	else if (mm->rss > mm->rlimit_rss)
+	else if (atomic_read(&mm->mm_rss) > mm->rlimit_rss)
 		ret = SWAP_TOKEN_ENOUGH_RSS;
 	mm->recent_pagein = 0;
 	return ret;
@@ -61,7 +61,7 @@
 	if (time_after(jiffies, swap_token_check)) {

 		/* Can't get swapout protection if we exceed our RSS limit. */
-		if (current->mm->rss > current->mm->rlimit_rss)
+		if (atomic_read(&current->mm->mm_rss) > current->mm->rlimit_rss)
 			return;

 		/* ... or if we recently held the token. */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  3:32                                     ` Christoph Lameter
@ 2004-08-28  3:42                                       ` Andrew Morton
  2004-08-28  4:24                                         ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2004-08-28  3:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: paulus, ak, davem, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Christoph Lameter <clameter@sgi.com> wrote:
>
>  So I think the move to atomic for rss acceptable?

Short-term, yes.  Longer term (within 12 months), no - 50-bit addresses on
power5 will cause it to overflow.

We may as well fix it up now, but let's set that aside during consideration
of the rest of your patch.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  3:42                                       ` Andrew Morton
@ 2004-08-28  4:24                                         ` Christoph Lameter
  2004-08-28  5:39                                           ` Andrew Morton
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-28  4:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: paulus, ak, davem, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004, Andrew Morton wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> >
> >  So I think the move to atomic for rss acceptable?
>
> Short-term, yes.  Longer term (within 12 months), no - 50-bit addresses on
> power5 will cause it to overflow.

I would expect the page size to rise as well. On IA64 we already have
16KB-64KB pages corresponding to 256TB - 1PB. Having to manage a couple of
billion pages could have a significant performance impact. Better to increase
the page size.

I still would also like to see atomic64_t. I think there was a patch
posted to linux-ia64 a couple of months back introducing atomic64_t but it
was rejected since it would not be supportable on other arches.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  4:24                                         ` Christoph Lameter
@ 2004-08-28  5:39                                           ` Andrew Morton
  2004-08-28  5:58                                             ` Christoph Lameter
                                                               ` (4 more replies)
  0 siblings, 5 replies; 106+ messages in thread
From: Andrew Morton @ 2004-08-28  5:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: paulus, ak, davem, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Christoph Lameter <clameter@sgi.com> wrote:
>
> On Fri, 27 Aug 2004, Andrew Morton wrote:
> 
> > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > >  So I think the move to atomic for rss acceptable?
> >
> > Short-term, yes.  Longer term (within 12 months), no - 50-bit addresses on
> > power5 will cause it to overflow.
> 
> I would expect the page size to rise as well. On IA64 we already have
> 16KB-64KB pages corresponding to 256TB - 1PB. Having to manage a couple of
> billion pages could be a significant performance impact. Better increase
> the page size.

I don't know if that's an option on the power architecture.

And we need larger atomic types _anyway_ for page->_count.  An unprivileged
app can mmap the same page 4G times and can then munmap it once.  Do it on
purpose and it's a security hole.  Do it by accident and it's a crash.

> I still would also like to see atomic64_t. I think there was a patch
> posted to linux-ia64 a couple of months back introducing atomic64_t but it
> was rejected since it would not be supportable on other arches.

atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
sparc64.

As I said - for both these applications we need a new type which is
atomic64_t on 64-bit and atomic_t on 32-bit.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:39                                           ` Andrew Morton
@ 2004-08-28  5:58                                             ` Christoph Lameter
  2004-08-28  6:03                                               ` William Lee Irwin III
  2004-08-28  6:06                                               ` Andrew Morton
  2004-08-28 13:19                                             ` Andi Kleen
                                                               ` (3 subsequent siblings)
  4 siblings, 2 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-08-28  5:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: paulus, ak, davem, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004, Andrew Morton wrote:

> atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
> sparc64.
>
> As I said - for both these applications we need a new type which is
> atomic64_t on 64-bit and atomic_t on 32-bit.

That is simply a new definition in include/asm-*/atomic.h

so

#define atomic_long atomic64_t

on 64 bit

and

#define atomic_long atomic_t

on 32bit?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:58                                             ` Christoph Lameter
@ 2004-08-28  6:03                                               ` William Lee Irwin III
  2004-08-28  6:06                                               ` Andrew Morton
  1 sibling, 0 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-08-28  6:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, paulus, ak, davem, ak, davem, raybry, benh,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004, Andrew Morton wrote:
>> atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
>> sparc64.
>> As I said - for both these applications we need a new type which is
>> atomic64_t on 64-bit and atomic_t on 32-bit.

On Fri, Aug 27, 2004 at 10:58:09PM -0700, Christoph Lameter wrote:
> That is simply a new definition in include/asm-*/atomic.h
> so
> #define atomic_long atomic64_t
> on 64 bit
> and
> #define atomic_long atomic_t
> on 32bit?

The operations must also be defined on them, and the remaining 64-bit
architectures must implement them as well.


-- wli

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:58                                             ` Christoph Lameter
  2004-08-28  6:03                                               ` William Lee Irwin III
@ 2004-08-28  6:06                                               ` Andrew Morton
  2004-08-30 17:02                                                 ` Herbert Poetzl
  1 sibling, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2004-08-28  6:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: paulus, ak, davem, ak, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Christoph Lameter <clameter@sgi.com> wrote:
>
> > As I said - for both these applications we need a new type which is
>  > atomic64_t on 64-bit and atomic_t on 32-bit.
> 
>  That is simply a new definition in include/asm-*/atomic.h
> 
>  so
> 
>  #define atomic_long atomic64_t
> 
>  on 64 bit
> 
>  and
> 
>  #define atomic_long atomic_t
> 
>  on 32bit?

No, a whole host of wrappers are needed - atomic_long_inc/dec/set/read,
etc.  For page->_count we'll also need the fancier functions such as
atomic_long_add_return().

As I said: let's address this later on.  It's probably not an issue for RSS
until 4-level pagetables come along.
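
Purely as a sketch of the shape of that layer (the BITS_PER_LONG test and
the exact set of wrappers are illustrative assumptions, not code from an
actual patch):

#if BITS_PER_LONG == 64
typedef atomic64_t atomic_long_t;
#define ATOMIC_LONG_INIT(i)	ATOMIC64_INIT(i)
#define atomic_long_read(v)	atomic64_read(v)
#define atomic_long_set(v,i)	atomic64_set((v),(i))
#define atomic_long_inc(v)	atomic64_inc(v)
#define atomic_long_dec(v)	atomic64_dec(v)
#define atomic_long_add(i,v)	atomic64_add((i),(v))
#define atomic_long_sub(i,v)	atomic64_sub((i),(v))
#else
typedef atomic_t atomic_long_t;
#define ATOMIC_LONG_INIT(i)	ATOMIC_INIT(i)
#define atomic_long_read(v)	atomic_read(v)
#define atomic_long_set(v,i)	atomic_set((v),(i))
#define atomic_long_inc(v)	atomic_inc(v)
#define atomic_long_dec(v)	atomic_dec(v)
#define atomic_long_add(i,v)	atomic_add((i),(v))
#define atomic_long_sub(i,v)	atomic_sub((i),(v))
#endif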

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:39                                           ` Andrew Morton
  2004-08-28  5:58                                             ` Christoph Lameter
@ 2004-08-28 13:19                                             ` Andi Kleen
  2004-08-28 15:48                                             ` Matt Mackall
                                                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-28 13:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, paulus, ak, davem, wli, davem, raybry, benh, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Fri, 27 Aug 2004 22:39:54 -0700
Andrew Morton <akpm@osdl.org> wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > On Fri, 27 Aug 2004, Andrew Morton wrote:
> > 
> > > Christoph Lameter <clameter@sgi.com> wrote:
> > > >
> > > >  So I think the move to atomic for rss acceptable?
> > >
> > > Short-term, yes.  Longer term (within 12 months), no - 50-bit addresses on
> > > power5 will cause it to overflow.
> > 
> > I would expect the page size to rise as well. On IA64 we already have
> > 16KB-64KB pages corresponding to 256TB - 1PB. Having to manage a couple of
> > billion pages could be a significant performance impact. Better increase
> > the page size.
> 
> I don't know if that's an option on the power architecture.

On x86-64 it isn't an option.

> 
> And we need larger atomic types _anyway_ for page->_count.  An unprivileged
> app can mmap the same page 4G times and can then munmap it once.  Do it on
> purpose and it's a security hole.  Do it by accident and it's a crash.
> 
> > I still would also like to see atomic64_t. I think there was a patch
> > posted to linux-ia64 a couple of months back introducing atomic64_t but it
> > was rejected since it would not be supportable on other arches.
> 
> atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
> sparc64.

Here's a patch to add it to x86-64: 

-Andi

-------------------------------------------------------------------------------------

Add atomic64_t type to x86-64


diff -u linux-2.6.8-work/include/asm-x86_64/atomic.h-o linux-2.6.8-work/include/asm-x86_64/atomic.h
--- linux-2.6.8-work/include/asm-x86_64/atomic.h-o	2004-03-21 21:11:54.000000000 +0100
+++ linux-2.6.8-work/include/asm-x86_64/atomic.h	2004-08-28 15:15:46.000000000 +0200
@@ -178,6 +178,166 @@
 	return c;
 }
 
+/* A 64-bit atomic type */
+
+typedef struct { volatile long counter; } atomic64_t;
+
+#define ATOMIC64_INIT(i)	{ (i) }
+
+/**
+ * atomic64_read - read atomic64 variable
+ * @v: pointer of type atomic64_t
+ * 
+ * Atomically reads the value of @v.
+ * Doesn't imply a read memory barrier.
+ */ 
+#define atomic64_read(v)		((v)->counter)
+
+/**
+ * atomic64_set - set atomic64 variable
+ * @v: pointer to type atomic64_t
+ * @i: required value
+ * 
+ * Atomically sets the value of @v to @i.
+ */ 
+#define atomic64_set(v,i)		(((v)->counter) = (i))
+
+/**
+ * atomic64_add - add integer to atomic64 variable
+ * @i: integer value to add
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically adds @i to @v.
+ */
+static __inline__ void atomic64_add(long i, atomic64_t *v)
+{
+	__asm__ __volatile__(
+		LOCK "addq %1,%0"
+		:"=m" (v->counter)
+		:"ir" (i), "m" (v->counter));
+}
+
+/**
+ * atomic64_sub - subtract the atomic64 variable
+ * @i: integer value to subtract
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically subtracts @i from @v.
+ */
+static __inline__ void atomic64_sub(long i, atomic64_t *v)
+{
+	__asm__ __volatile__(
+		LOCK "subq %1,%0"
+		:"=m" (v->counter)
+		:"ir" (i), "m" (v->counter));
+}
+
+/**
+ * atomic64_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static __inline__ int atomic64_sub_and_test(long i, atomic64_t *v)
+{
+	unsigned char c;
+
+	__asm__ __volatile__(
+		LOCK "subq %2,%0; sete %1"
+		:"=m" (v->counter), "=qm" (c)
+		:"ir" (i), "m" (v->counter) : "memory");
+	return c;
+}
+
+/**
+ * atomic64_inc - increment atomic64 variable
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically increments @v by 1.
+ */ 
+static __inline__ void atomic64_inc(atomic64_t *v)
+{
+	__asm__ __volatile__(
+		LOCK "incq %0"
+		:"=m" (v->counter)
+		:"m" (v->counter));
+}
+
+/**
+ * atomic64_dec - decrement atomic64 variable
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically decrements @v by 1.
+ */ 
+static __inline__ void atomic64_dec(atomic64_t *v)
+{
+	__asm__ __volatile__(
+		LOCK "decq %0"
+		:"=m" (v->counter)
+		:"m" (v->counter));
+}
+
+/**
+ * atomic64_dec_and_test - decrement and test
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */ 
+static __inline__ int atomic64_dec_and_test(atomic64_t *v)
+{
+	unsigned char c;
+
+	__asm__ __volatile__(
+		LOCK "decq %0; sete %1"
+		:"=m" (v->counter), "=qm" (c)
+		:"m" (v->counter) : "memory");
+	return c != 0;
+}
+
+/**
+ * atomic64_inc_and_test - increment and test 
+ * @v: pointer to type atomic64_t
+ * 
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */ 
+static __inline__ int atomic64_inc_and_test(atomic64_t *v)
+{
+	unsigned char c;
+
+	__asm__ __volatile__(
+		LOCK "incq %0; sete %1"
+		:"=m" (v->counter), "=qm" (c)
+		:"m" (v->counter) : "memory");
+	return c != 0;
+}
+
+/**
+ * atomic64_add_negative - add and test if negative
+ * @v: pointer to atomic64_t
+ * @i: integer value to add
+ * 
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */ 
+static __inline__ long atomic64_add_negative(long i, atomic64_t *v)
+{
+	unsigned char c;
+
+	__asm__ __volatile__(
+		LOCK "addq %2,%0; sets %1"
+		:"=m" (v->counter), "=qm" (c)
+		:"ir" (i), "m" (v->counter) : "memory");
+	return c;
+}
+
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
 __asm__ __volatile__(LOCK "andl %0,%1" \

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:39                                           ` Andrew Morton
  2004-08-28  5:58                                             ` Christoph Lameter
  2004-08-28 13:19                                             ` Andi Kleen
@ 2004-08-28 15:48                                             ` Matt Mackall
  2004-09-01  4:13                                             ` Benjamin Herrenschmidt
  2004-09-01 18:03                                             ` Matthew Wilcox
  4 siblings, 0 replies; 106+ messages in thread
From: Matt Mackall @ 2004-08-28 15:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, paulus, ak, davem, ak, wli, davem, raybry,
	benh, manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Fri, Aug 27, 2004 at 10:39:54PM -0700, Andrew Morton wrote:

> As I said - for both these applications we need a new type which is
> atomic64_t on 64-bit and atomic_t on 32-bit.

atomic_long_t -> longest available atomic type?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  1:02                               ` Andi Kleen
  2004-08-28  1:39                                 ` Andrew Morton
@ 2004-08-28 21:41                                 ` Daniel Phillips
  1 sibling, 0 replies; 106+ messages in thread
From: Daniel Phillips @ 2004-08-28 21:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, David S. Miller, clameter, ak, wli, davem, raybry,
	benh, manfred, linux-ia64, linux-kernel, vrajesh, hugh

Hi Andi,

On Friday 27 August 2004 21:02, Andi Kleen wrote:
> Yep.  Good plan. atomic_long_t ?

Would it not be more C-ish as long_atomic_t?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  6:06                                               ` Andrew Morton
@ 2004-08-30 17:02                                                 ` Herbert Poetzl
  2004-08-30 17:05                                                   ` Andi Kleen
  0 siblings, 1 reply; 106+ messages in thread
From: Herbert Poetzl @ 2004-08-30 17:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, paulus, ak, davem, ak, wli, davem, raybry,
	benh, manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Fri, Aug 27, 2004 at 11:06:37PM -0700, Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > As I said - for both these applications we need a new type which is
> >  > atomic64_t on 64-bit and atomic_t on 32-bit.
> > 
> >  That is simply a new definition in include/asm-*/atomic.h
> > 
> >  so
> > 
> >  #define atomic_long atomic64_t
> > 
> >  on 64 bit
> > 
> >  and
> > 
> >  #define atomic_long atomic_t
> > 
> >  on 32bit?
> 
> No, a whole host of wrappers are needed - atomic_long_inc/dec/set/read,
> etc.  For page->_count we'll also need the fancier functions such as
> atomic_long_add_return().

hmm, please correct me, but last time I checked
atomic_add_return() wasn't even available for i386
for example ...

best,
Herbert

> As I said: let's address this later on.  It's probably not an issue for RSS
> until 4-level pagetables come along.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-30 17:02                                                 ` Herbert Poetzl
@ 2004-08-30 17:05                                                   ` Andi Kleen
  0 siblings, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-30 17:05 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, paulus, davem, ak, wli, davem,
	raybry, benh, manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Mon, Aug 30, 2004 at 07:02:11PM +0200, Herbert Poetzl wrote:
> 
> hmm, please correct me, but last time I checked
> atomic_add_return() wasn't even available for i386
> for example ...

There is a patch pending to add it.
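
Purely as an illustration of the shape of such a helper (this is a sketch,
not the pending patch, and it ignores the 386 case, which has no xadd
instruction), an i386 atomic_add_return() can be built on xadd roughly
like this:

static __inline__ int atomic_add_return(int i, atomic_t *v)
{
	int __i = i;

	/* xaddl adds i to v->counter and leaves the old counter value in i */
	__asm__ __volatile__(
		LOCK "xaddl %0, %1"
		:"+r" (i), "+m" (v->counter)
		: : "memory");
	return i + __i;
}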


-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:39                                           ` Andrew Morton
                                                               ` (2 preceding siblings ...)
  2004-08-28 15:48                                             ` Matt Mackall
@ 2004-09-01  4:13                                             ` Benjamin Herrenschmidt
  2004-09-02 21:26                                               ` Andi Kleen
  2004-09-01 18:03                                             ` Matthew Wilcox
  4 siblings, 1 reply; 106+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-01  4:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Paul Mackerras, ak, davem, Andi Kleen, wli,
	David S. Miller, raybry, manfred, linux-ia64, Linux Kernel list,
	vrajesh, hugh

On Sat, 2004-08-28 at 15:39, Andrew Morton wrote:

> atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
> sparc64.
> 
> As I said - for both these applications we need a new type which is
> atomic64_t on 64-bit and atomic_t on 32-bit.

Implementing it on ppc64 is trivial. I'd vote for atomic_long_t though,
which would be 32 bits on 32-bit archs and 64 bits on 64-bit archs, as
it would be a real pain (spinlock & all) to get a 64-bit atomic on
ppc32.
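
To make the "spinlock & all" point concrete, a 32-bit port without a
64-bit cmpxchg would have to emulate the type with a lock, roughly along
these lines (a hypothetical sketch; the type and helper names are made up):

typedef struct {
	long long counter;
	spinlock_t lock;
} atomic64_emul_t;

static inline void atomic64_emul_add(long long i, atomic64_emul_t *v)
{
	unsigned long flags;

	/* every operation, even a plain read, must take the per-counter lock */
	spin_lock_irqsave(&v->lock, flags);
	v->counter += i;
	spin_unlock_irqrestore(&v->lock, flags);
}

static inline long long atomic64_emul_read(atomic64_emul_t *v)
{
	unsigned long flags;
	long long ret;

	spin_lock_irqsave(&v->lock, flags);
	ret = v->counter;
	spin_unlock_irqrestore(&v->lock, flags);
	return ret;
}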

Ben.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-27 23:20                     ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
  2004-08-27 23:36                       ` Andi Kleen
@ 2004-09-01  4:24                       ` Benjamin Herrenschmidt
  2004-09-01  5:22                         ` David S. Miller
  2004-09-01 16:43                         ` Christoph Lameter
  1 sibling, 2 replies; 106+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-01  4:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, William Lee Irwin III, David S. Miller, raybry,
	ak, manfred, linux-ia64, Linux Kernel list, vrajesh, hugh

On Sat, 2004-08-28 at 09:20, Christoph Lameter wrote:
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> This is the fifth (and hopefully final) release of the page fault
> scalability patches. The scalability patches avoid locking during the
> creation of page table entries for anonymous memory in a threaded
> application. The performance increases significantly for more than 2
> threads running concurrently.

Sorry for "waking up" late on this one but we've been kept busy by
a lot of other things.

The removal of the page table lock has other more subtle side effects
on ppc64 (and ppc32 too) that aren't trivial to solve. Typically, due
to the way we use the hash table as a TLB cache.

For example, our ptep_test_and_clear will first clear the PTE and then
flush the hash table entry. If in the meantime another CPU gets in,
takes a fault, re-populates the PTE and fills the hash table via
update_mmu_cache, we may end up with 2 hash PTEs for the same linux
PTE at least for a short while. This is a potential cause of checkstop
on ppc CPUs.

There may be other subtle races of that sort I haven't uncovered yet.

We need to spend more time on our (ppc/ppc64) side to figure out what
is the extent of the problem. We may have a cheap way to fix most of the
issues using the PAGE_BUSY bit we have in the PTEs as a lock, but we
don't have that facility on ppc32.

I think there wouldn't be a problem if we could guarantee exclusion
between page fault and clearing of a PTE (that is basically having the
swapper take the mm write sem) but I don't think that's realistic, oh
well, not that I understand anything about the swap code anyways...

Ben.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01  4:24                       ` Benjamin Herrenschmidt
@ 2004-09-01  5:22                         ` David S. Miller
  2004-09-01 16:43                         ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: David S. Miller @ 2004-09-01  5:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: clameter, akpm, wli, davem, raybry, ak, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Wed, 01 Sep 2004 14:24:50 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> The removal of the page table lock has other more subtle side effects
> on ppc64 (and ppc32 too) that aren't trivial to solve. Typically, due
> to the way we use the hash table as a TLB cache.

True on sparc64 as well where the page table lock is what
synchronizes TLB context allocation for a process.
While the lock is held, we know that the TLB context cannot
change and this allows all kinds of TLB flush optimizations.

We also have the pseudo-invariant that flush_tlb_page() is
always called with the page table lock held.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01  4:24                       ` Benjamin Herrenschmidt
  2004-09-01  5:22                         ` David S. Miller
@ 2004-09-01 16:43                         ` Christoph Lameter
  2004-09-01 23:09                           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-01 16:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, William Lee Irwin III, David S. Miller, raybry,
	ak, manfred, linux-ia64, Linux Kernel list, vrajesh, hugh

On Wed, 1 Sep 2004, Benjamin Herrenschmidt wrote:

> The removal of the page table lock has other more subtle side effects
> on ppc64 (and ppc32 too) that aren't trivial to solve. Typically, due
> to the way we use the hash table as a TLB cache.
>
> For example, our ptep_test_and_clear will first clear the PTE and then
> flush the hash table entry. If in the meantime another CPU gets in,
> takes a fault, re-populates the PTE and fills the hash table via
> update_mmu_cache, we may end up with 2 hash PTEs for the same linux
> PTE at least for a short while. This is a potential cause of checkstop
> on ppc CPUs.
>
> There may be other subtle races of that sort I haven't uncovered yet.
>
> We need to spend more time on our (ppc/ppc64) side to figure out what
> is the extent of the problem. We may have a cheap way to fix most of the
> issues using the PAGE_BUSY bit we have in the PTEs as a lock, but we
> don't have that facility on ppc32.
>
> I think there wouldn't be a problem if we could guarantee exclusion
> between page fault and clearing of a PTE (that is basically having the
> swapper take the mm write sem) but I don't think that's realistic, oh
> well, not that I understand anything about the swap code anyways...

We may be able to accomplish that by generic routines for
ptep_cmpxchg and so on that would use the page table lock for platforms
that do not support atomic pte operations.

Something along the lines of:

pte_t ptep_xchg(struct mm_struct *mm, pte_t *ptep, pte_t new)
{
	pte_t old;

	spin_lock(&mm->page_table_lock);
	old = *ptep;
	set_pte(ptep, new);
	/* Do rehashing */
	spin_unlock(&mm->page_table_lock);
	return old;
}

This would limit the time that the page_table_lock is held to a minimum
and may still offer some of the performance improvements.

Would that be acceptable?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-08-28  5:39                                           ` Andrew Morton
                                                               ` (3 preceding siblings ...)
  2004-09-01  4:13                                             ` Benjamin Herrenschmidt
@ 2004-09-01 18:03                                             ` Matthew Wilcox
  2004-09-01 18:19                                               ` Andrew Morton
  4 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2004-09-01 18:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, Aug 27, 2004 at 10:39:54PM -0700, Andrew Morton wrote:
> And we need larger atomic types _anyway_ for page->_count.  An unprivileged
> app can mmap the same page 4G times and can then munmap it once.  Do it on
> > purpose and it's a security hole.  Do it by accident and it's a crash.

Sure, but the same kind of app can also do this on 32-bit architectures.
Assuming there's only 2.5GB of address space available per process,
you'd need 1638 cooperating processes to do it.  OK, that's a lot but
the lowest limit I can spy on a quick poll of multiuser boxes I have a
login on is 3064.  Most are above 10,000 (poll sample includes Debian,
RHAS and Fedora).

I think it would be better to check for overflow of the atomic_t (atomic_t
is signed) in the mmap routines.  Then kill the process that caused the
overflow.  OK, this is a local denial-of-service if someone does it to
glibc, but at least the admin should be able to reboot the box.
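
Something along these lines, roughly (untested sketch; the helper name,
the 1000-page margin and the exact call sites are only illustrative):

	/* called wherever mmap/fork would take yet another reference on the page */
	static inline int page_ref_would_overflow(struct page *page)
	{
		if (page_count(page) > INT_MAX - 1000) {
			force_sig(SIGKILL, current);	/* kill the offender */
			return -ENOMEM;
		}
		return 0;
	}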

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01 18:03                                             ` Matthew Wilcox
@ 2004-09-01 18:19                                               ` Andrew Morton
  2004-09-01 19:06                                                 ` William Lee Irwin III
  0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2004-09-01 18:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel

Matthew Wilcox <willy@debian.org> wrote:
>
> On Fri, Aug 27, 2004 at 10:39:54PM -0700, Andrew Morton wrote:
> > And we need larger atomic types _anyway_ for page->_count.  An unprivileged
> > app can mmap the same page 4G times and can then munmap it once.  Do it on
> > purpose and it's a security hole.  Do it by accident and it's a crash.
> 
> Sure, but the same kind of app can also do this on 32-bit architectures.
> Assuming there's only 2.5GB of address space available per process,
> you'd need 1638 cooperating processes to do it.  OK, that's a lot but
> the lowest limit I can spy on a quick poll of multiuser boxes I have a
> login on is 3064.  Most are above 10,000 (poll sample includes Debian,
> RHAS and Fedora).

It requires 32GB's worth of pte's.

So yeah, it might be possible on a 64GB ia32 box.

> I think it would be better to check for overflow of the atomic_t (atomic_t
> is signed) in the mmap routines.  Then kill the process that caused the
> overflow.  OK, this is a local denial-of-service if someone does it to
> glibc, but at least the admin should be able to reboot the box.

The overflow can happen in any get_page(), anywhere.  If we wanted to check
for it in this manner I guess we could do

	if (page_count(page) > (unsigned)-1000)
		barf();

in the pagefault handler.

But I don't think it's a serious issue on 32-bit machines, and on 64 bit
machines the 64-bit counter kinda makes sense?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01 18:19                                               ` Andrew Morton
@ 2004-09-01 19:06                                                 ` William Lee Irwin III
  0 siblings, 0 replies; 106+ messages in thread
From: William Lee Irwin III @ 2004-09-01 19:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox, linux-kernel

Matthew Wilcox <willy@debian.org> wrote:
>> Sure, but the same kind of app can also do this on 32-bit architectures.
>> Assuming there's only 2.5GB of address space available per process,
>> you'd need 1638 cooperating processes to do it.  OK, that's a lot but
>> the lowest limit I can spy on a quick poll of multiuser boxes I have a
>> login on is 3064.  Most are above 10,000 (poll sample includes Debian,
>> RHAS and Fedora).

On Wed, Sep 01, 2004 at 11:19:11AM -0700, Andrew Morton wrote:
> It requires 32GB's worth of pte's.
> So yeah, it might be possible on a 64GB ia32 box.

This only requires approximately 10922.666666666666 processes, which
has surprisingly been done in practice.


-- wli

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01 16:43                         ` Christoph Lameter
@ 2004-09-01 23:09                           ` Benjamin Herrenschmidt
       [not found]                             ` <Pine.LNX.4.58.0409012140440.23186@schroedinger.engr.sgi.com>
  0 siblings, 1 reply; 106+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-01 23:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, William Lee Irwin III, David S. Miller, raybry,
	ak, manfred, linux-ia64, Linux Kernel list, vrajesh, hugh

On Thu, 2004-09-02 at 02:43, Christoph Lameter wrote:

> This would limit the time that the page_table_lock is held to a minimum
> and may still offer some of the performance improvements.
> 
> Would that be acceptable?

Not sure... You probably want to have the set_pte and the later flush_*
in the same lock to maintain the expected semantics on those platforms...
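
Something like this, say (rough, untested sketch only, just to show the
flush staying under the same lock):

	pte_t ptep_xchg_flush(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, pte_t new)
	{
		pte_t old;

		spin_lock(&vma->vm_mm->page_table_lock);
		old = *ptep;
		set_pte(ptep, new);
		flush_tlb_page(vma, addr);	/* hash/TLB flush inside the lock */
		spin_unlock(&vma->vm_mm->page_table_lock);
		return old;
	}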

It's not that simple an issue. I have ways to do this sort-of lock-less by
using my PAGE_BUSY lock bit in the PTE instead on ppc64 and I think
doing that properly would result in almost no overhead over what we have
now, so I'm still interested. ppc32 would have to take a global
spinlock, but that's fine as we aren't looking for scalability on this
arch.

So while I like your idea, I think it needs a bit more thinking & work
on some platforms. David wrote about potential issues on sparc64, and I
wonder if it would be worth re-thinking some of the pte invalidation
semantics a bit (pushing more logic into set_pte, that is, making it
higher level, rather than having the common code split the changing of PTEs
and invalidations, possibly with some begin/end semantics for batches).

BTW. We should get David's patch in first thing before tackling this
complicated issue (the one adding mm & addr to set_pte & friends).

Ben.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
       [not found]                               ` <20040901215741.3538bbf4.davem@davemloft.net>
@ 2004-09-02  5:18                                 ` William Lee Irwin III
  2004-09-09 15:38                                   ` page fault scalability patch: V7 (+fallback for atomic page table ops) Christoph Lameter
  2004-09-02 16:24                                 ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
  1 sibling, 1 reply; 106+ messages in thread
From: William Lee Irwin III @ 2004-09-02  5:18 UTC (permalink / raw)
  To: David S. Miller
  Cc: Christoph Lameter, benh, akpm, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Wed, 1 Sep 2004 21:45:20 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>> Where would I find that patch?

On Wed, Sep 01, 2004 at 09:57:41PM -0700, David S. Miller wrote:
> Attached.
> It's held up because it needs to be ported to all platforms
> before we can consider it seriously for inclusion and
> only sparc64 and ppc{,64} are converted.

Nice, I guess I can port this to a few arches. Maybe this is a good
excuse to get my new (to me) Octane running, too.


-- wli

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
       [not found]                               ` <20040901215741.3538bbf4.davem@davemloft.net>
  2004-09-02  5:18                                 ` William Lee Irwin III
@ 2004-09-02 16:24                                 ` Christoph Lameter
  2004-09-02 20:10                                   ` David S. Miller
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-02 16:24 UTC (permalink / raw)
  To: David S. Miller
  Cc: benh, akpm, wli, davem, raybry, ak, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

Why was it done that way? Would it not be better to add the new
functionality by giving the function another name?

Like f.e. set_pte_mm()

then one could add the following in asm-generic/pgtable.h

#ifndef __HAVE_ARCH_SET_PTE_MM
#define set_pte_mm(mm, address, ptep, pte) set_pte(ptep, pte)
#endif

which would avoid having to update the other platforms and would allow a
gradual transition.
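
A converted arch would then simply provide its own version before including
asm-generic/pgtable.h, f.e. (sketch only, the body is illustrative):

	#define __HAVE_ARCH_SET_PTE_MM
	static inline void set_pte_mm(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		set_pte(ptep, pte);
		/* mm and addr are now available for arch specific work,
		 * f.e. flushing a hash table entry for this address.
		 */
	}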

My patch does something like that in order to avoid having to update all
arches. Arches without __HAVE_ARCH_ATOMIC_TABLE_OPS will have their
atomicity simulated via the page_table_lock.

On Wed, 1 Sep 2004, David S. Miller wrote:

> On Wed, 1 Sep 2004 21:45:20 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> > > BTW. We should get David's patch in first thing before tackling with
> > > this complicated issue (the one adding mm & addr to set_pte & friends).
> >
> > Where would I find that patch?
> Attached.
> It's held up because it needs to be ported to all platforms
> before we can consider it seriously for inclusion and
> only sparc64 and ppc{,64} are converted.
>
>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-02 16:24                                 ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
@ 2004-09-02 20:10                                   ` David S. Miller
  2004-09-02 21:02                                     ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2004-09-02 20:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: benh, akpm, wli, davem, raybry, ak, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Thu, 2 Sep 2004 09:24:47 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> Why was it done that way? Would it not be better to add the new
> functionality by giving the function another name?
> 
> Like f.e. set_pte_mm()
> 
> then one could add the following in asm-generic/pgtable.h
> 
> #ifndef __HAVE_ARCH_SET_PTE_MM
> #define set_pte_mm(mm, address, ptep, pte) set_pte(ptep, pte)
> #endif
> 
> which would avoid having to update the other platforms and would allow a
> gradual transition.

In order for it to be useful, every set_pte() call has to get the
new args.  If there are exceptions, then it doesn't work out cleanly.

All of the call sites of set_pte() have the information available,
so implementing it properly in all cases is nearly trivial, it's
just a lot of busy work.  So we should get it over with now.

I did all of the generic code, it's just each platform's code that
needs updating.

And BTW it's not just set_pte(), it's also pte_clear() and some of
the other routines that need the added mm and address args.
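
I.e. the interfaces end up looking roughly like this (sketch only; exact
names and argument order aside):

	void set_pte(struct mm_struct *mm, unsigned long addr,
		     pte_t *ptep, pte_t pte);
	void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep);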

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-02 20:10                                   ` David S. Miller
@ 2004-09-02 21:02                                     ` Christoph Lameter
  2004-09-02 21:07                                       ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-02 21:02 UTC (permalink / raw)
  To: David S. Miller
  Cc: benh, akpm, wli, davem, raybry, ak, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Thu, 2 Sep 2004, David S. Miller wrote:

> On Thu, 2 Sep 2004 09:24:47 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Why was it done that way? Would it not be better to add the new
> > functionality by giving the function another name?
> >
> > Like f.e. set_pte_mm()
> >
> > then one could add the following in asm-generic/pgtable.h
> >
> > #ifndef __HAVE_ARCH_SET_PTE_MM
> > #define set_pte_mm(mm, address, ptep, pte) set_pte(ptep, pte)
> > #endif
> >
> > which would avoid having to update the other platforms and would allow a
> > gradual transition.
>
> In order for it to be useful, every set_pte() call has to get the
> new args.  If there are exceptions, then it doesn't work out cleanly.

Yes. The mechanism that I proposed allows one to provide the info at each
call of set_pte_mm(). set_pte() would only be used for the arch specific
stuff and would become a legacy thing.

> I did all of the generic code, it's just each platform's code that
> needs updating.
>
> And BTW it's not just set_pte(), it's also pte_clear() and some of
> the other routines that need the added mm and address args.

Would not the generic code, if done the way I suggested, make updating
each platform's code unnecessary?

I have similar issues with the page scalability patch. Should I not do
the legacy thing for platforms that do not have atomic pte operations?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-02 21:02                                     ` Christoph Lameter
@ 2004-09-02 21:07                                       ` David S. Miller
  2004-09-18 23:23                                         ` page fault scalability patch V8: [0/7] Description Christoph Lameter
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
  0 siblings, 2 replies; 106+ messages in thread
From: David S. Miller @ 2004-09-02 21:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: benh, akpm, wli, davem, raybry, ak, manfred, linux-ia64,
	linux-kernel, vrajesh, hugh

On Thu, 2 Sep 2004 14:02:47 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> I have similar issues with the page scalability patch. Should I not do
> the legacy thing for platforms that do not have atomic pte operations?

I think your situation is different.  The set_pte() changes are modifying
the arguments of an existing interface.

Your changes are adding support for taking advantage of a facility
that may or may not exist on a platform.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-01  4:13                                             ` Benjamin Herrenschmidt
@ 2004-09-02 21:26                                               ` Andi Kleen
  2004-09-02 21:55                                                 ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Andi Kleen @ 2004-09-02 21:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Christoph Lameter, Paul Mackerras, ak, davem,
	Andi Kleen, wli, David S. Miller, raybry, manfred, linux-ia64,
	Linux Kernel list, vrajesh, hugh

On Wed, Sep 01, 2004 at 02:13:49PM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2004-08-28 at 15:39, Andrew Morton wrote:
> 
> > atomic64_t already appears to be implemented on alpha, ia64, mips, s390 and
> > sparc64.
> > 
> > As I said - for both these applications we need a new type which is
> > atomic64_t on 64-bit and atomic_t on 32-bit.
> 
Implementing it on ppc64 is trivial. I'd vote for atomic_long_t though,
that is, either 32 bits on 32-bit archs or 64 bits on 64-bit archs, as
it would be a real pain (spinlock & all) to get a 64-bit atomic on
ppc32

I would do atomic64 on 64bit archs only and then do a wrapper 
somewhere that defines atomiclongt based on BITSPERLONG 

-Andi

P.S. sorry for the missing underscores, but i am typing this
on a japanese keyboard and i just cannot find it.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch final : i386 tested, x86_64 support added
  2004-09-02 21:26                                               ` Andi Kleen
@ 2004-09-02 21:55                                                 ` David S. Miller
  0 siblings, 0 replies; 106+ messages in thread
From: David S. Miller @ 2004-09-02 21:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: benh, akpm, clameter, paulus, ak, ak, wli, davem, raybry,
	manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Thu, 2 Sep 2004 23:26:34 +0200
Andi Kleen <ak@suse.de> wrote:

> I would do atomic64 on 64bit archs only and then do a wrapper 
> somewhere that defines atomiclongt based on BITSPERLONG 

We do have CONFIG_64BIT, might as well use it.
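
I.e. something like (untested sketch, only a couple of wrappers shown):

	#ifdef CONFIG_64BIT
	typedef atomic64_t atomic_long_t;
	#define atomic_long_read(v)	atomic64_read(v)
	#define atomic_long_inc(v)	atomic64_inc(v)
	#else
	typedef atomic_t atomic_long_t;
	#define atomic_long_read(v)	atomic_read(v)
	#define atomic_long_inc(v)	atomic_inc(v)
	#endif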

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch: V7 (+fallback for atomic page table ops)
  2004-09-02  5:18                                 ` William Lee Irwin III
@ 2004-09-09 15:38                                   ` Christoph Lameter
  0 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-09 15:38 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: David S. Miller, benh, akpm, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

The code for x86_64 and s390 needs some testing but I do not have the
appropriate hardware. I could remove the support for both and have those
archs fall back to the page table lock? Could this get into the -mm tree
to get more testing?

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Changelog:
 * Add atomic page table operations for i386, ia64, x86_64 and s390
 * Use atomic page table operations in handle_mm_fault and only
   acquire the page_table_lock in some cases. Modify the swapper
   code to not clear a PTE until the writeout is committed.
 * Anonymous memory allocation without acquiring the page table
   lock allowing parallel execution of page faults on SMP systems.
 * i386: Add cmpxchg8b. Move cmpxchg functionality into processor.h and
   emulate cmpxchg if the processor is a 486 or 386 so that
   cmpxchg functionality is generally available on i386.
 * Make mm->rss atomic and rename to mm->mm_rss
 * Fall back to inline code simulating atomic instructions through the use
   of the page_table_lock for platforms that do not define
   __HAVE_ARCH_ATOMIC_TABLE_OPS. The page_table_lock is then only
   held for very short time periods.

Further details (patch against 2.6.9-rc1-bk15 follows at the end):

With a high number of CPUs (16..512) we are seeing the page fault rate
roughly doubling.

This is accomplished by avoiding the use of the page_table_lock spinlock
(but not mm->mmap_sem!) through providing new atomic operations on pte's
(ptep_xchg, ptep_cmpxchg) and on pmd and pgd's (pgd_test_and_populate,
pmd_test_and_populate). The page table lock can be avoided in the following
situations:

1. Operations where an empty pte or pmd entry is populated
This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted.

2. Modifications of flags in a pte entry (write/accessed).
These modifications are done by the CPU or by low level handlers
on various platforms, which also bypass all locks, so this
seems to be safe too.

It was necessary to make mm->rss atomic since the page_table_lock was also
used to protect incrementing and decrementing rss.

Scalability could be further increased if the locking scheme (mmap_sem,
page_table_lock etc.) were changed, but this would require significant
changes to the memory subsystem. This patch hopefully lays the groundwork for
future work by providing a way to handle page table entries via xchg and
cmpxchg.
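
For the common case of merely updating bits in an existing pte the pattern
boils down to the following (compare handle_pte_fault in the patch below):

	entry = *pte;
	new_entry = pte_mkdirty(pte_mkyoung(entry));
	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
		update_mmu_cache(vma, address, new_entry);
	/* otherwise another cpu updated the pte first and we simply back out */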

Index: linux-2.6.9-rc1/kernel/fork.c
===================================================================
--- linux-2.6.9-rc1.orig/kernel/fork.c	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/kernel/fork.c	2004-09-08 14:43:21.000000000 -0700
@@ -296,7 +296,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/sched.h	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/sched.h	2004-09-08 14:43:53.000000000 -0700
@@ -212,9 +212,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -224,7 +225,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm, shared_vm;
+	unsigned long total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-09-08 14:42:33.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/task_mmu.c	2004-09-08 19:39:11.000000000 -0700
@@ -21,7 +21,7 @@
 		"VmLib:\t%8lu kB\n",
 		(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib);
 	return buffer;
@@ -38,7 +38,7 @@
 	*shared = mm->shared_vm;
 	*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm - *text;
-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	return mm->total_vm;
 }

Index: linux-2.6.9-rc1/mm/mmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/mmap.c	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/mm/mmap.c	2004-09-08 14:43:21.000000000 -0700
@@ -1845,7 +1845,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc1/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/tlb.h	2004-08-24 00:01:50.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/tlb.h	2004-09-08 14:43:21.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc1/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_flat.c	2004-08-24 00:01:50.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_flat.c	2004-09-08 14:43:21.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc1/fs/exec.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/exec.c	2004-09-08 14:42:32.000000000 -0700
+++ linux-2.6.9-rc1/fs/exec.c	2004-09-08 14:43:21.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc1/mm/memory.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/memory.c	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/mm/memory.c	2004-09-08 14:43:21.000000000 -0700
@@ -325,7 +325,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1096,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1378,7 +1375,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1406,14 +1403,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1420,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1428,40 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		mm->rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1477,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1493,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1552,7 +1555,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1589,7 +1592,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1605,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1629,49 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that we handle that case properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,22 +1689,42 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+	 * to synchronize with kswapd
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+       		pmd_t *new = pmd_alloc_one(mm, address);
+		if (!new) return VM_FAULT_OOM;
+
+		/* Ensure that the update is done in an atomic way */
+		if (!pgd_test_and_populate(mm, pgd, new)) pmd_free(new);
+	}
+
+        pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t *pte;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (!pmd_present(*pmd)) {
+			struct page *new;
+
+			new = pte_alloc_one(mm, address);
+			if (!new) return VM_FAULT_OOM;
+
+			if (!pmd_test_and_populate(mm, pmd, new))
+				pte_free(new);
+                        else
+				inc_page_state(nr_page_table_pages);
+		}
+
+		pte = pte_offset_map(pmd, address);
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6.9-rc1/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgalloc.h	2004-09-08 13:42:45.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgalloc.h	2004-09-08 14:43:21.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -78,12 +82,19 @@
 	preempt_enable();
 }

+
 static inline void
 pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
 {
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linux-2.6.9-rc1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable.h	2004-09-08 14:43:21.000000000 -0700
@@ -412,6 +412,7 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-3level.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-3level.h	2004-09-08 14:43:21.000000000 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+*/

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -141,4 +142,26 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+static inline pte_t ptep_xchg(struct mm_struct *mm, pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	  */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9-rc1/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_som.c	2004-08-24 00:02:26.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_som.c	2004-09-08 14:43:21.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc1/mm/fremap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/fremap.c	2004-08-24 00:01:51.000000000 -0700
+++ linux-2.6.9-rc1/mm/fremap.c	2004-09-08 14:43:21.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc1/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/swapfile.c	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/mm/swapfile.c	2004-09-08 14:43:21.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc1/include/linux/mm.h
===================================================================
--- linux-2.6.9-rc1.orig/include/linux/mm.h	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/include/linux/mm.h	2004-09-08 14:43:21.000000000 -0700
@@ -630,7 +630,7 @@
  */
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
-	if (pgd_none(*pgd))
+	if (unlikely(pgd_none(*pgd)))
 		return __pmd_alloc(mm, pgd, address);
 	return pmd_offset(pgd, address);
 }
Index: linux-2.6.9-rc1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-generic/pgtable.h	2004-09-08 13:42:45.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-generic/pgtable.h	2004-09-08 14:43:21.000000000 -0700
@@ -126,4 +126,81 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = *(__ptep);						\
+	set_pte(__ptep, __pteval);					\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_get_clear_flush(__vma, __address, __ptep)			\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = ptep_get_and_clear(__ptep);				\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)		\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = pte_same(*(__ptep), __oldval);				\
+	if (__rc) set_pte(__ptep, __newval);				\
+	flush_tlb_page(__vma, __addr);					\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pgd_test_and_populate(__mm, __pgd, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pgd_present(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pmd);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+
+#else
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+
+#endif
+
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9-rc1/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_aout.c	2004-09-08 14:42:32.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_aout.c	2004-09-08 14:43:21.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = current->mm->mmap_base;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgtable-2level.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgtable-2level.h	2004-09-08 14:43:21.000000000 -0700
@@ -82,4 +82,8 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg(mm,xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(mm,a,xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc1.orig/arch/ia64/mm/hugetlbpage.c	2004-08-24 00:02:47.000000000 -0700
+++ linux-2.6.9-rc1/arch/ia64/mm/hugetlbpage.c	2004-09-08 14:43:21.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/proc/array.c	2004-09-08 14:42:33.000000000 -0700
+++ linux-2.6.9-rc1/fs/proc/array.c	2004-09-08 14:43:21.000000000 -0700
@@ -386,7 +386,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc1.orig/fs/binfmt_elf.c	2004-09-08 14:42:32.000000000 -0700
+++ linux-2.6.9-rc1/fs/binfmt_elf.c	2004-09-08 19:40:50.000000000 -0700
@@ -708,7 +708,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc1/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/tlb.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/tlb.h	2004-09-08 14:43:21.000000000 -0700
@@ -46,6 +46,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
+#include <asm/atomic.h>

 #ifdef CONFIG_SMP
 # define FREE_PTE_NR		2048
@@ -161,11 +162,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc1/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/pgalloc.h	2004-08-24 00:01:53.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/pgalloc.h	2004-09-08 14:43:21.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PMD_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,19 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +64,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)

Index: linux-2.6.9-rc1/mm/rmap.c
===================================================================
--- linux-2.6.9-rc1.orig/mm/rmap.c	2004-09-08 14:42:35.000000000 -0700
+++ linux-2.6.9-rc1/mm/rmap.c	2004-09-08 19:44:19.000000000 -0700
@@ -262,7 +262,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -419,7 +419,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -500,7 +503,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -561,11 +564,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -575,11 +573,15 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

-	mm->rss--;
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
+	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);

@@ -665,21 +667,30 @@
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);
+		/*
+		 * There would be a race here with the handle_mm_fault code
+		 * that bypasses the page_table_lock to allow fast creation
+		 * of ptes if we zapped the pte before putting something into
+		 * it. On the other hand we need to preserve the dirty flag
+		 * when we replace the value. The dirty flag may be set by
+		 * the processor, so we had better use an atomic operation
+		 * here.
+		 */

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_get_and_clear(pte);

-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -778,7 +789,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.9-rc1/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/system.h	2004-08-24 00:01:51.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/system.h	2004-09-08 14:43:21.000000000 -0700
@@ -203,77 +203,6 @@
  __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
  __set_64bit(ptr, ll_low(value), ll_high(value)) )

-/*
- * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
- * Note 2: xchg has side effect, so that attribute volatile is necessary,
- *	  but generally the primitive is invalid, *ptr is output argument. --ANK
- */
-static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
-{
-	switch (size) {
-		case 1:
-			__asm__ __volatile__("xchgb %b0,%1"
-				:"=q" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 2:
-			__asm__ __volatile__("xchgw %w0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 4:
-			__asm__ __volatile__("xchgl %0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-	}
-	return x;
-}
-
-/*
- * Atomic compare and exchange.  Compare OLD with MEM, if identical,
- * store NEW in MEM.  Return the initial value in MEM.  Success is
- * indicated by comparing RETURN with OLD.
- */
-
-#ifdef CONFIG_X86_CMPXCHG
-#define __HAVE_ARCH_CMPXCHG 1
-#endif
-
-static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
-				      unsigned long new, int size)
-{
-	unsigned long prev;
-	switch (size) {
-	case 1:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 2:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 4:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	}
-	return old;
-}
-
-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc1/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc1.orig/arch/i386/Kconfig	2004-08-24 00:01:55.000000000 -0700
+++ linux-2.6.9-rc1/arch/i386/Kconfig	2004-09-08 14:43:21.000000000 -0700
@@ -341,6 +341,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc1/include/asm-i386/processor.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-i386/processor.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-i386/processor.h	2004-09-08 14:43:21.000000000 -0700
@@ -649,4 +649,137 @@

 #define cache_line_size() (boot_cpu_data.x86_cache_alignment)

+/*
+ * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
+ * Note 2: xchg has side effect, so that attribute volatile is necessary,
+ *	  but generally the primitive is invalid, *ptr is output argument. --ANK
+ */
+static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
+{
+	switch (size) {
+		case 1:
+			__asm__ __volatile__("xchgb %b0,%1"
+				:"=q" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 2:
+			__asm__ __volatile__("xchgw %w0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 4:
+			__asm__ __volatile__("xchgl %0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+	}
+	return x;
+}
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_CMPXCHG 1
+#endif
+
+static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+#ifndef CONFIG_X86_CMPXCHG
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all
+	 */
+	unsigned long flags;
+
+	/* All CPUs except 386 support CMPXCHG */
+	if (cpu_data->x86 > 3) goto have_cmpxchg;
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new; break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+have_cmpxchg:
+#endif
+	switch (size) {
+	case 1:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 2:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 4:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	}
+	return prev;
+}
+
+static inline unsigned long long cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+#ifndef CONFIG_X86_CMPXCHG8B
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are running really on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data, X86_FEATURE_CX8)) goto have_cmpxchg8b;
+
+	/* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+
+have_cmpxchg8b:
+#endif
+
+	 __asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+       		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (ptr)
+	: "memory");
+	return prev ;
+}
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
 #endif /* __ASM_I386_PROCESSOR_H */
Index: linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgalloc.h	2004-08-24 00:02:47.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgalloc.h	2004-09-08 14:43:21.000000000 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9-rc1/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-x86_64/pgtable.h	2004-08-24 00:03:19.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-x86_64/pgtable.h	2004-09-08 14:43:21.000000000 -0700
@@ -436,6 +436,11 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_xchg(mm,addr,xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(mm,addr,xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
Index: linux-2.6.9-rc1/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-s390/pgtable.h	2004-08-24 00:03:30.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-s390/pgtable.h	2004-09-08 14:43:21.000000000 -0700
@@ -783,6 +783,19 @@

 #define kern_addr_valid(addr)   (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline pte_t ptep_xchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg(ptep, pte_val(pteval)));
+}
+
+static inline int ptep_cmpxchg (struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 /*
  * No page table caches to initialise
  */
Index: linux-2.6.9-rc1/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-s390/pgalloc.h	2004-08-24 00:02:58.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-s390/pgalloc.h	2004-09-08 14:43:21.000000000 -0700
@@ -97,6 +97,10 @@
 	pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
 }

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+	return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
 #endif /* __s390x__ */

 static inline void
@@ -119,6 +123,18 @@
 	pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
 }

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+	int rc;
+	spin_lock(&mm->page_table_lock);
+
+	rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+	if (rc) pmd_populate(mm, pmd, page);
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
+
 /*
  * page table entry allocation/free routines.
  */
Index: linux-2.6.9-rc1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9-rc1.orig/include/asm-ia64/pgtable.h	2004-09-08 14:42:34.000000000 -0700
+++ linux-2.6.9-rc1/include/asm-ia64/pgtable.h	2004-09-08 14:43:21.000000000 -0700
@@ -423,6 +423,19 @@
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);

+/* Atomic PTE operations */
+static inline pte_t
+ptep_xchg (struct mm_struct *mm, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg((long *) ptep, pteval.pte));
+}
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
  *	 bits in the swap-type field of the swap pte.  It would be nice to
@@ -558,6 +571,7 @@
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [0/7] Description
  2004-09-02 21:07                                       ` David S. Miller
@ 2004-09-18 23:23                                         ` Christoph Lameter
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:23 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Signed-off-by: Christoph Lameter <clameter@sgi.com>

This is a series of patches that increases the scalability of
the page fault handler for SMP. Typical performance increases in the page
fault rate are:

2 CPUs -> 10%
4 CPUs -> 50%
8 CPUs -> 70%

With a high number of CPUs (16..512) we are seeing the page fault rate
roughly doubling.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper only ever depopulates these entries, and
the swapper code has been changed to never clear a pte until the page
has actually been evicted. Populating an empty pte is the common case
when a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are already performed by the CPU, or by low level
handlers on various platforms, without taking the page_table_lock, so
handling them through atomic operations appears to be safe as well.
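
As an illustration of both cases, here is a minimal sketch of a pte
update done without the page_table_lock (the real code is in patch 2/7;
ptep_cmpxchg only succeeds if the pte still contains the value that was
read initially):

	pte_t entry, new_entry;

	entry = *pte;			/* snapshot the current pte */
	new_entry = pte_mkyoung(entry);
	if (write_access)
		new_entry = pte_mkdirty(new_entry);

	/*
	 * Install the new value only if nobody changed the pte since
	 * we read it. If the cmpxchg fails another CPU raced with us
	 * and the fault is simply retried.
	 */
	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
		update_mmu_cache(vma, address, new_entry);

Populating an empty pte (case 1) works the same way: the value read
initially is the empty pte, so the update only succeeds if no other CPU
has installed a mapping in the meantime.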

The set of patches is composed of 7 patches:

1/7: Make mm->rss atomic

   The page table lock is used to protect mm->rss and the first patch
   makes rss atomic so that it may be changed without holding the
   page_table_lock.
   Generic atomic variables are only 32 bit under Linux. However, 32 bits
   are sufficient for rss even on a 64 bit machine, since rss counts pages:
   2^31 pages of 4KB each still allow up to 2^(31+12) bytes = 8 terabytes
   of memory to be in use by a single process. A 64 bit atomic would of
   course be better.
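
   A minimal sketch of the conversion at a typical call site (the full
   diff is in patch 1/7):

	/* Before: callers rely on the page_table_lock being held */
	mm->rss++;

	/* After: no lock needed for the counter itself. A 32 bit
	 * atomic_t still covers 2^31 pages of 4KB each, i.e. 8TB
	 * of resident memory per process.
	 */
	atomic_inc(&mm->mm_rss);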

2/7: Avoid page_table_lock in handle_mm_fault

   This patch defers the acquisition of the page_table_lock as much as
   possible and uses atomic operations for allocating anonymous memory.
   These atomic operations are simulated by acquiring the page_table_lock
   for very small time frames if an architecture does not define
   __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
   pte will not be set to empty if a page is in transition to swap.

   If only the first two patches are applied then the time that the page_table_lock
   is held is simply reduced. The lock may then be acquired multiple
   times during a page fault.

   The remaining patches introduce the necessary atomic pte operations to avoid
   the page_table_lock.
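
   The simulated variant mentioned above, for architectures that do not
   define __HAVE_ARCH_ATOMIC_TABLE_OPS, boils down to a very short
   critical section. A sketch (the name and function form are just for
   this description; the actual patch implements it as a macro in
   asm-generic/pgtable.h):

	static inline int ptep_cmpxchg_fallback(struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t oldval, pte_t newval)
	{
		int rc;

		spin_lock(&mm->page_table_lock);
		rc = pte_same(*ptep, oldval);
		if (rc)
			set_pte(ptep, newval);
		flush_tlb_page(vma, addr);
		spin_unlock(&mm->page_table_lock);
		return rc;
	}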

3/7: Atomic pte operations for ia64

   This patch adds atomic pte operations to the IA64 platform. The page_table_lock
   will then no longer be acquired for page faults that create pte's for anonymous
   memory.

4/7: Make cmpxchg generally available on i386

   The atomic operations on the page table rely heavily on cmpxchg instructions.
   This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
   cpus. The emulations are only included if the kernel is built for these
   old cpus. If a kernel built for a 386 or 486 is then run on a more recent
   cpu, the emulation is skipped and the real cmpxchg instructions are used.

   This patch may be used independently of the other patches.
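
   For example, with this patch applied cmpxchg() can be used for simple
   lock-free updates outside the page fault path as well. A hypothetical
   usage sketch (not part of the patch):

	/* Retry until no other CPU modified the counter between
	 * the read and the cmpxchg.
	 */
	static inline void counter_inc(unsigned long *counter)
	{
		unsigned long old;

		do {
			old = *counter;
		} while (cmpxchg(counter, old, old + 1) != old);
	}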

5/7: Atomic pte operations for i386

   Add atomic PTE operations for i386. This requires the generally available
   cmpxchg from the previous patch.

6/7: Atomic pte operation for x86_64

   Add atomic pte operations for x86_64. This has not been tested yet since
   I have no x86_64 system available.

7/7: Atomic pte operations for s390

   Add atomic PTE operations for S390. This has also not been tested yet
   since I have no S/390 system available. Feedback from the S/390 people
   suggests, though, that the approach is fine.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [1/7] make mm->rss atomic
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
@ 2004-09-18 23:24                                           ` Christoph Lameter
  2004-09-18 23:26                                           ` page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault Christoph Lameter
                                                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:24 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Make mm->rss atomic, so that rss may be incremented or decremented
	  without holding the page table lock.
	* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/kernel/fork.c
===================================================================
--- linus.orig/kernel/fork.c	2004-09-18 14:25:22.000000000 -0700
+++ linus/kernel/fork.c	2004-09-18 14:56:47.000000000 -0700
@@ -296,7 +296,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linus/include/linux/sched.h
===================================================================
--- linus.orig/include/linux/sched.h	2004-09-18 14:25:22.000000000 -0700
+++ linus/include/linux/sched.h	2004-09-18 14:56:47.000000000 -0700
@@ -213,9 +213,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -225,7 +226,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm, shared_vm;
+	unsigned long total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
Index: linus/fs/proc/task_mmu.c
===================================================================
--- linus.orig/fs/proc/task_mmu.c	2004-09-18 14:25:22.000000000 -0700
+++ linus/fs/proc/task_mmu.c	2004-09-18 14:56:47.000000000 -0700
@@ -21,7 +21,7 @@
 		"VmLib:\t%8lu kB\n",
 		(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib);
 	return buffer;
@@ -38,7 +38,7 @@
 	*shared = mm->shared_vm;
 	*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm - *text;
-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	return mm->total_vm;
 }

Index: linus/mm/mmap.c
===================================================================
--- linus.orig/mm/mmap.c	2004-09-18 14:25:22.000000000 -0700
+++ linus/mm/mmap.c	2004-09-18 14:56:47.000000000 -0700
@@ -1845,7 +1845,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linus/include/asm-generic/tlb.h
===================================================================
--- linus.orig/include/asm-generic/tlb.h	2004-09-18 14:25:22.000000000 -0700
+++ linus/include/asm-generic/tlb.h	2004-09-18 14:56:47.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linus/fs/binfmt_flat.c
===================================================================
--- linus.orig/fs/binfmt_flat.c	2004-09-18 14:25:22.000000000 -0700
+++ linus/fs/binfmt_flat.c	2004-09-18 14:56:47.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linus/fs/exec.c
===================================================================
--- linus.orig/fs/exec.c	2004-09-18 14:25:22.000000000 -0700
+++ linus/fs/exec.c	2004-09-18 14:56:47.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linus/mm/memory.c
===================================================================
--- linus.orig/mm/memory.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/mm/memory.c	2004-09-18 15:01:05.000000000 -0700
@@ -325,7 +325,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1096,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1378,7 +1378,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1443,7 +1443,7 @@
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-		mm->rss++;
+		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
@@ -1552,7 +1552,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
Index: linus/fs/binfmt_som.c
===================================================================
--- linus.orig/fs/binfmt_som.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/fs/binfmt_som.c	2004-09-18 14:56:47.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linus/mm/fremap.c
===================================================================
--- linus.orig/mm/fremap.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/mm/fremap.c	2004-09-18 14:56:47.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linus/mm/swapfile.c
===================================================================
--- linus.orig/mm/swapfile.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/mm/swapfile.c	2004-09-18 14:56:47.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linus/fs/binfmt_aout.c
===================================================================
--- linus.orig/fs/binfmt_aout.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/fs/binfmt_aout.c	2004-09-18 14:56:47.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = current->mm->mmap_base;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linus/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linus.orig/arch/ia64/mm/hugetlbpage.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/arch/ia64/mm/hugetlbpage.c	2004-09-18 14:56:47.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linus/fs/proc/array.c
===================================================================
--- linus.orig/fs/proc/array.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/fs/proc/array.c	2004-09-18 14:56:47.000000000 -0700
@@ -388,7 +388,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linus/fs/binfmt_elf.c
===================================================================
--- linus.orig/fs/binfmt_elf.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/fs/binfmt_elf.c	2004-09-18 14:56:47.000000000 -0700
@@ -708,7 +708,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linus/mm/rmap.c
===================================================================
--- linus.orig/mm/rmap.c	2004-09-18 14:25:23.000000000 -0700
+++ linus/mm/rmap.c	2004-09-18 15:13:05.000000000 -0700
@@ -262,7 +262,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -501,7 +501,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -580,7 +580,7 @@
 		BUG_ON(pte_file(*pte));
 	}

-	mm->rss--;
+	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);

@@ -680,7 +680,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -779,7 +779,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linus/include/asm-ia64/tlb.h
===================================================================
--- linus.orig/include/asm-ia64/tlb.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-ia64/tlb.h	2004-09-18 15:07:23.000000000 -0700
@@ -161,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
  2004-09-18 23:24                                           ` page fault scalability patch V8: [1/7] make mm->rss atomic Christoph Lameter
@ 2004-09-18 23:26                                           ` Christoph Lameter
  2004-09-19  9:04                                             ` Christoph Hellwig
  2004-09-18 23:27                                           ` page fault scalability patch V8: [3/7] atomic pte operations for ia64 Christoph Lameter
                                                             ` (4 subsequent siblings)
  6 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:26 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Increase parallelism in SMP configurations by deferring
	  the acquisition of page_table_lock in handle_mm_fault
	* Anonymous memory page faults bypass the page_table_lock
	  through the use of atomic page table operations
	* Swapper does not set pte to empty in transition to swap
	* Simulate atomic page table operations using the
	  page_table_lock if an arch does not define
	  __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
	  a performance benefit since the page_table_lock
	  is held for shorter periods of time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/mm/memory.c
===================================================================
--- linus.orig/mm/memory.c	2004-09-18 15:01:05.000000000 -0700
+++ linus/mm/memory.c	2004-09-18 15:40:02.000000000 -0700
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1406,14 +1403,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1420,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1428,40 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry))
+	{
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1477,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1493,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1589,7 +1592,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1605,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1629,49 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that this case is handled properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,22 +1689,42 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+	 * to synchronize with kswapd
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+       		pmd_t *new = pmd_alloc_one(mm, address);
+		if (!new) return VM_FAULT_OOM;
+
+		/* Ensure that the update is done in an atomic way */
+		if (!pgd_test_and_populate(mm, pgd, new)) pmd_free(new);
+	}
+
+        pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t *pte;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+		if (!pmd_present(*pmd)) {
+			struct page *new;
+
+			new = pte_alloc_one(mm, address);
+			if (!new) return VM_FAULT_OOM;
+
+			if (!pmd_test_and_populate(mm, pmd, new))
+				pte_free(new);
+                        else
+				inc_page_state(nr_page_table_pages);
+		}
+
+		pte = pte_offset_map(pmd, address);
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linus/include/asm-generic/pgtable.h
===================================================================
--- linus.orig/include/asm-generic/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-generic/pgtable.h	2004-09-18 15:34:59.000000000 -0700
@@ -126,4 +126,81 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = *(__ptep);						\
+	set_pte(__ptep, __pteval);					\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_get_clear_flush(__vma, __address, __ptep)			\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = ptep_get_and_clear(__ptep);				\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)		\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = pte_same(*(__ptep), __oldval);				\
+	if (__rc) set_pte(__ptep, __newval);				\
+	flush_tlb_page(__vma, __addr);					\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pgd_test_and_populate(__mm, __pgd, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pgd_present(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pmd);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+
+#else
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+
+#endif
+
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
Index: linus/mm/rmap.c
===================================================================
--- linus.orig/mm/rmap.c	2004-09-18 15:13:05.000000000 -0700
+++ linus/mm/rmap.c	2004-09-18 15:41:15.000000000 -0700
@@ -420,7 +420,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -562,11 +565,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -576,10 +574,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -666,15 +668,24 @@
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);
+		/*
+		 * There would be a race here with the handle_mm_fault code that
+		 * bypasses the page_table_lock to allow a fast creation of ptes
+		 * if we would zap the pte before
+		 * putting something into it. On the other hand we need to
+		 * have the dirty flag when we replaced the value.
+		 * The dirty flag may be handled by a processor so we better
+		 * use an atomic operation here.
+		 */

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_get_and_clear(pte);

-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [3/7] atomic pte operations for ia64
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
  2004-09-18 23:24                                           ` page fault scalability patch V8: [1/7] make mm->rss atomic Christoph Lameter
  2004-09-18 23:26                                           ` page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault Christoph Lameter
@ 2004-09-18 23:27                                           ` Christoph Lameter
  2004-09-18 23:28                                           ` page fault scalability patch V8: [4/7] universally available cmpxchg on i386 Christoph Lameter
                                                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:27 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Provide atomic pte operations for ia64
	* Enhanced parallelism in page fault handler if applied together
	  with the generic patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-ia64/pgalloc.h
===================================================================
--- linus.orig/include/asm-ia64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-ia64/pgalloc.h	2004-09-18 15:43:25.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -78,12 +82,19 @@
 	preempt_enable();
 }

+
 static inline void
 pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
 {
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linus/include/asm-ia64/pgtable.h
===================================================================
--- linus.orig/include/asm-ia64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-ia64/pgtable.h	2004-09-18 15:43:25.000000000 -0700
@@ -423,6 +423,19 @@
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);

+/* Atomic PTE operations */
+static inline pte_t
+ptep_xchg (struct mm_struct *mm, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg((long *) ptep, pteval.pte));
+}
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
  *	 bits in the swap-type field of the swap pte.  It would be nice to
@@ -558,6 +571,7 @@
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
                                                             ` (2 preceding siblings ...)
  2004-09-18 23:27                                           ` page fault scalability patch V8: [3/7] atomic pte operations for ia64 Christoph Lameter
@ 2004-09-18 23:28                                           ` Christoph Lameter
       [not found]                                             ` <200409191430.37444.vda@port.imtp.ilyichevsk.odessa.ua>
  2004-09-18 23:29                                           ` page fault scalability patch V8: [5/7] atomic pte operations for i386 Christoph Lameter
                                                             ` (2 subsequent siblings)
  6 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:28 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Make cmpxchg and cmpxchg8b generally available on i386.
	* Provide emulation of cmpxchg suitable for UP if built and
	  run on a 386.
	* Provide emulation of cmpxchg8b suitable for UP if built
	  and run on a 386 or 486.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-i386/system.h
===================================================================
--- linus.orig/include/asm-i386/system.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/system.h	2004-09-18 14:56:59.000000000 -0700
@@ -203,77 +203,6 @@
  __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
  __set_64bit(ptr, ll_low(value), ll_high(value)) )

-/*
- * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
- * Note 2: xchg has side effect, so that attribute volatile is necessary,
- *	  but generally the primitive is invalid, *ptr is output argument. --ANK
- */
-static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
-{
-	switch (size) {
-		case 1:
-			__asm__ __volatile__("xchgb %b0,%1"
-				:"=q" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 2:
-			__asm__ __volatile__("xchgw %w0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-		case 4:
-			__asm__ __volatile__("xchgl %0,%1"
-				:"=r" (x)
-				:"m" (*__xg(ptr)), "0" (x)
-				:"memory");
-			break;
-	}
-	return x;
-}
-
-/*
- * Atomic compare and exchange.  Compare OLD with MEM, if identical,
- * store NEW in MEM.  Return the initial value in MEM.  Success is
- * indicated by comparing RETURN with OLD.
- */
-
-#ifdef CONFIG_X86_CMPXCHG
-#define __HAVE_ARCH_CMPXCHG 1
-#endif
-
-static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
-				      unsigned long new, int size)
-{
-	unsigned long prev;
-	switch (size) {
-	case 1:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 2:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	case 4:
-		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
-				     : "=a"(prev)
-				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
-				     : "memory");
-		return prev;
-	}
-	return old;
-}
-
-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linus/arch/i386/Kconfig
===================================================================
--- linus.orig/arch/i386/Kconfig	2004-09-18 14:25:23.000000000 -0700
+++ linus/arch/i386/Kconfig	2004-09-18 14:56:59.000000000 -0700
@@ -345,6 +345,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linus/include/asm-i386/processor.h
===================================================================
--- linus.orig/include/asm-i386/processor.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/processor.h	2004-09-18 14:56:59.000000000 -0700
@@ -657,4 +657,137 @@

 #define cache_line_size() (boot_cpu_data.x86_cache_alignment)

+/*
+ * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
+ * Note 2: xchg has side effect, so that attribute volatile is necessary,
+ *	  but generally the primitive is invalid, *ptr is output argument. --ANK
+ */
+static inline unsigned long __xchg(unsigned long x, volatile void * ptr, int size)
+{
+	switch (size) {
+		case 1:
+			__asm__ __volatile__("xchgb %b0,%1"
+				:"=q" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 2:
+			__asm__ __volatile__("xchgw %w0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+		case 4:
+			__asm__ __volatile__("xchgl %0,%1"
+				:"=r" (x)
+				:"m" (*__xg(ptr)), "0" (x)
+				:"memory");
+			break;
+	}
+	return x;
+}
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+
+#ifdef CONFIG_X86_CMPXCHG
+#define __HAVE_ARCH_CMPXCHG 1
+#endif
+
+static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+#ifndef CONFIG_X86_CMPXCHG
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all
+	 */
+	unsigned long flags;
+
+	/* All CPUs except 386 support CMPXCHG */
+	if (cpu_data->x86 > 3) goto have_cmpxchg;
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new; break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+have_cmpxchg:
+#endif
+	switch (size) {
+	case 1:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 2:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	case 4:
+		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
+				     : "=a"(prev)
+				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
+				     : "memory");
+		return prev;
+	}
+	return prev;
+}
+
+static inline unsigned long long cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+#ifndef CONFIG_X86_CMPXCHG8B
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are running really on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data, X86_FEATURE_CX8)) goto have_cmpxchg8b;
+
+	/* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+
+have_cmpxchg8b:
+#endif
+
+	 __asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+       		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (ptr)
+	: "memory");
+	return prev ;
+}
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
 #endif /* __ASM_I386_PROCESSOR_H */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [5/7] atomic pte operations for i386
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
                                                             ` (3 preceding siblings ...)
  2004-09-18 23:28                                           ` page fault scalability patch V8: [4/7] universally available cmpxchg on i386 Christoph Lameter
@ 2004-09-18 23:29                                           ` Christoph Lameter
  2004-09-18 23:30                                           ` page fault scalability patch V8: [6/7] atomic pte operations for x86_64 Christoph Lameter
  2004-09-18 23:31                                           ` page fault scalability patch V8: [7/7] atomic pte operations for s390 Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:29 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Atomic pte operations for i386
	* Needs the general cmpxchg patch for i386

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-i386/pgtable.h
===================================================================
--- linus.orig/include/asm-i386/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable.h	2004-09-18 15:41:52.000000000 -0700
@@ -412,6 +412,7 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linus/include/asm-i386/pgtable-3level.h
===================================================================
--- linus.orig/include/asm-i386/pgtable-3level.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable-3level.h	2004-09-18 15:41:52.000000000 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+*/

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -141,4 +142,26 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+static inline pte_t ptep_xchg(struct mm_struct *mm, pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	  */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
Index: linus/include/asm-i386/pgtable-2level.h
===================================================================
--- linus.orig/include/asm-i386/pgtable-2level.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable-2level.h	2004-09-18 15:41:52.000000000 -0700
@@ -82,4 +82,8 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg(mm,xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(mm,a,xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
Index: linus/include/asm-i386/pgalloc.h
===================================================================
--- linus.orig/include/asm-i386/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgalloc.h	2004-09-18 15:41:52.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PMD_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,19 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +64,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [6/7] atomic pte operations for x86_64
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
                                                             ` (4 preceding siblings ...)
  2004-09-18 23:29                                           ` page fault scalability patch V8: [5/7] atomic pte operations for i386 Christoph Lameter
@ 2004-09-18 23:30                                           ` Christoph Lameter
  2004-09-18 23:31                                           ` page fault scalability patch V8: [7/7] atomic pte operations for s390 Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:30 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-x86_64/pgalloc.h
===================================================================
--- linus.orig/include/asm-x86_64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-x86_64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linus/include/asm-x86_64/pgtable.h
===================================================================
--- linus.orig/include/asm-x86_64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-x86_64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
@@ -436,6 +436,11 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_xchg(mm,addr,xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(mm,addr,xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V8: [7/7] atomic pte operations for s390
       [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
                                                             ` (5 preceding siblings ...)
  2004-09-18 23:30                                           ` page fault scalability patch V8: [6/7] atomic pte operations for x86_64 Christoph Lameter
@ 2004-09-18 23:31                                           ` Christoph Lameter
       [not found]                                             ` <200409191435.09445.vda@port.imtp.ilyichevsk.odessa.ua>
  6 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-18 23:31 UTC (permalink / raw)
  To: akpm
  Cc: David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

Changelog
	* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-s390/pgtable.h
===================================================================
--- linus.orig/include/asm-s390/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-s390/pgtable.h	2004-09-18 14:47:30.000000000 -0700
@@ -783,6 +783,19 @@

 #define kern_addr_valid(addr)   (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline pte_t ptep_xchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg(ptep, pte_val(pteval)));
+}
+
+static inline int ptep_cmpxchg (struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 /*
  * No page table caches to initialise
  */
Index: linus/include/asm-s390/pgalloc.h
===================================================================
--- linus.orig/include/asm-s390/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-s390/pgalloc.h	2004-09-18 14:47:30.000000000 -0700
@@ -97,6 +97,10 @@
 	pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
 }

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+	return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
 #endif /* __s390x__ */

 static inline void
@@ -119,6 +123,18 @@
 	pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
 }

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+	int rc;
+	spin_lock(&mm->page_table_lock);
+
+	rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+	if (rc) pmd_populate(mm, pmd, page);
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
+
 /*
  * page table entry allocation/free routines.
  */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault
  2004-09-18 23:26                                           ` page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault Christoph Lameter
@ 2004-09-19  9:04                                             ` Christoph Hellwig
  0 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2004-09-19  9:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

> +	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry))
> +	{

wrong brace placement.

> +			if (!pmd_test_and_populate(mm, pmd, new))
> +				pte_free(new);
> +                        else

indentation using spaces instead of tabs

> +			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));

please make sure lines are never longer than 80 characters
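
For reference, a minimal sketch of the same fragments written in the style
asked for above (braces on the line of the condition, tabs for indentation,
long calls wrapped below 80 columns). The identifiers are the ones from the
patch under review; the backout body is only a placeholder:

	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
		/* another thread updated the pte first; back out */
		pte_unmap(page_table);
		return VM_FAULT_MINOR;
	}

	if (!pmd_test_and_populate(mm, pmd, new))
		pte_free(new);
	else
		inc_page_state(nr_page_table_pages);

	pteval = ptep_xchg_flush(vma, address, pte,
				 pgoff_to_pte(page->index));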

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                             ` <200409191430.37444.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-19 12:11                                               ` Andi Kleen
  2004-09-20 15:45                                               ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-09-19 12:11 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Christoph Lameter, akpm, David S. Miller, benh, wli, davem,
	raybry, manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Sun, Sep 19, 2004 at 02:30:37PM +0300, Denis Vlasenko wrote:
> Far too large for inline

It's much smaller than it looks - the switch will be optimized away by the
compiler. For the X86_CMPXCHG case it is even a single instruction.
For the other case it should be < 10 instructions, which is still reasonable.
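
A quick illustration of that point: the size argument is sizeof(*(ptr)) and
therefore a compile-time constant at every call site, so only one arm of the
switch in __cmpxchg() survives (sketch, not from the patch):

	unsigned long val = 0;
	unsigned long prev;

	/*
	 * sizeof(*(&val)) is 4 and known at compile time, so the compiler
	 * keeps only the 4-byte case of __cmpxchg()'s switch; with
	 * CONFIG_X86_CMPXCHG the whole call reduces to a single locked
	 * cmpxchg instruction.
	 */
	prev = cmpxchg(&val, 0UL, 1UL);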

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [7/7] atomic pte operations for s390
       [not found]                                             ` <200409191435.09445.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-20 15:44                                               ` Christoph Lameter
  0 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-20 15:44 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: akpm, David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Sun, 19 Sep 2004, Denis Vlasenko wrote:

> On Sunday 19 September 2004 02:31, Christoph Lameter wrote:
> > +static inline int
> > +pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
> > +{
> > +	int rc;
> > +	spin_lock(&mm->page_table_lock);
> > +
> > +	rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
> > +	if (rc) pmd_populate(mm, pmd, page);
> > +	spin_unlock(&mm->page_table_lock);
> > +	return rc;
> > +}
>
> Considering that spin_lock and spin_unlock are inline functions,
> this function may end up being large.
>
> I didn't see a single non-inlined function in these patches yet.
> Please think about code size.

This function is only used once.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                             ` <200409191430.37444.vda@port.imtp.ilyichevsk.odessa.ua>
  2004-09-19 12:11                                               ` Andi Kleen
@ 2004-09-20 15:45                                               ` Christoph Lameter
       [not found]                                                 ` <200409202043.00580.vda@port.imtp.ilyichevsk.odessa.ua>
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-20 15:45 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: akpm, David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Sun, 19 Sep 2004, Denis Vlasenko wrote:

> Far too large for inline
> Ditto.

Umm... The code was inline before, and for non-80386 it's the same size as
before.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                                 ` <200409202043.00580.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-20 20:49                                                   ` Christoph Lameter
  2004-09-20 20:57                                                     ` Andi Kleen
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-20 20:49 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: akpm, David S. Miller, benh, wli, davem, raybry, ak, manfred,
	linux-ia64, linux-kernel, vrajesh, hugh

On Mon, 20 Sep 2004, Denis Vlasenko wrote:

> I think it shouldn't be this way.
>
> OTOH for !CONFIG_386 case it makes perfect sense to have it inlined.

Would the following revised patch be acceptable?

Index: linux-2.6.9-rc2/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-i386/system.h	2004-09-12 22:31:26.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-i386/system.h	2004-09-20 13:44:49.000000000 -0700
@@ -240,7 +240,24 @@
  */

 #ifdef CONFIG_X86_CMPXCHG
+
 #define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on an 80386. It may be necessary to
+ * emulate the cmpxchg instruction on the 80386 CPU.
+ */
+
+extern unsigned long cmpxchg_386(volatile void *, unsigned long, unsigned long, int);
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))cmpxchg_386((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
 #endif

 static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +287,32 @@
 	return old;
 }

-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	__asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (*ptr)
+	: "memory");
+	return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on the 80386 and 80486. Neither
+ * supports cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(unsigned long long *,
+				unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc2/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc2.orig/arch/i386/Kconfig	2004-09-20 08:57:47.000000000 -0700
+++ linux-2.6.9-rc2/arch/i386/Kconfig	2004-09-20 10:11:45.000000000 -0700
@@ -345,6 +345,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc2/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/i386/kernel/cpu/intel.c	2004-09-12 22:31:59.000000000 -0700
+++ linux-2.6.9-rc2/arch/i386/kernel/cpu/intel.c	2004-09-20 13:44:08.000000000 -0700
@@ -415,5 +415,68 @@
 	return 0;
 }

+#ifndef CONFIG_X86_CMPXCHG
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+
+unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all
+	 */
+	unsigned long flags;
+
+	/* All CPUs except 386 support CMPXCHG */
+	if (cpu_data->x86 > 3) return __cmpxchg(ptr, old, new, size);
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new;
+		break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+}
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are running really on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data, X86_FEATURE_CX8)) return __cmpxchg8b(ptr, old, newv);
+
+	/* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+}
+#endif
+
 // arch_initcall(intel_cpu_init);


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
  2004-09-20 20:49                                                   ` Christoph Lameter
@ 2004-09-20 20:57                                                     ` Andi Kleen
       [not found]                                                       ` <200409211841.25507.vda@port.imtp.ilyichevsk.odessa.ua>
  0 siblings, 1 reply; 106+ messages in thread
From: Andi Kleen @ 2004-09-20 20:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Denis Vlasenko, akpm, David S. Miller, benh, wli, davem, raybry,
	ak, manfred, linux-ia64, linux-kernel, vrajesh, hugh

On Mon, Sep 20, 2004 at 01:49:20PM -0700, Christoph Lameter wrote:
> On Mon, 20 Sep 2004, Denis Vlasenko wrote:
> 
> > I think it shouldn't be this way.
> >
> > OTOH for !CONFIG_386 case it makes perfect sense to have it inlined.
> 
> Would the following revised patch be acceptable?

You would need an EXPORT_SYMBOL at least. But to be honest your
original patch was much simpler and nicer and cmpxchg is not called
that often that it really matters. I would just ignore Denis' 
suggestion and stay with the old patch.

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                                       ` <200409211841.25507.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-21 15:45                                                         ` Andi Kleen
       [not found]                                                           ` <200409212306.38800.vda@port.imtp.ilyichevsk.odessa.ua>
  2004-09-23  7:17                                                           ` Andy Lutomirski
  0 siblings, 2 replies; 106+ messages in thread
From: Andi Kleen @ 2004-09-21 15:45 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Andi Kleen, Christoph Lameter, akpm, David S. Miller, benh, wli,
	davem, raybry, ak, manfred, linux-ia64, linux-kernel, vrajesh,
	hugh

On Tue, Sep 21, 2004 at 06:41:25PM +0300, Denis Vlasenko wrote:
> On Monday 20 September 2004 23:57, Andi Kleen wrote:
> > On Mon, Sep 20, 2004 at 01:49:20PM -0700, Christoph Lameter wrote:
> > > On Mon, 20 Sep 2004, Denis Vlasenko wrote:
> > > 
> > > > I think it shouldn't be this way.
> > > >
> > > > OTOH for !CONFIG_386 case it makes perfect sense to have it inlined.
> > > 
> > > Would the following revised patch be acceptable?
> > 
> > You would need an EXPORT_SYMBOL at least. But to be honest your
> > original patch was much simpler and nicer and cmpxchg is not called
> > that often that it really matters. I would just ignore Denis' 
> > suggestion and stay with the old patch.
> 
> A bit faster approach (for CONFIG_386 case) would be using

It's actually slower. Many x86 CPUs cannot predict indirect jumps
and those that do cannot predict them as well as a test and jump.
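
Roughly, the two dispatch styles being compared look like this (a sketch that
reuses the ptr/old/new/size variables of cmpxchg_386(); cmpxchg_impl and
cmpxchg_emulate_irqoff are hypothetical names):

	/*
	 * Indirect jump: dispatch through a pointer chosen at boot time.
	 * Every call is an indirect branch, which many x86 cores predict
	 * poorly.
	 */
	unsigned long (*cmpxchg_impl)(volatile void *, unsigned long,
				      unsigned long, int);
	prev = cmpxchg_impl(ptr, old, new, size);

	/*
	 * Test and jump, as cmpxchg_386() already does: a conditional
	 * branch on a predictable condition, then inline code or a
	 * direct call.
	 */
	if (cpu_data->x86 > 3)
		prev = __cmpxchg(ptr, old, new, size);
	else
		prev = cmpxchg_emulate_irqoff(ptr, old, new, size);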

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
       [not found]                                                           ` <200409212306.38800.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-21 20:14                                                             ` Andi Kleen
  0 siblings, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-09-21 20:14 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Andi Kleen, Christoph Lameter, akpm, David S. Miller, benh, wli,
	davem, raybry, ak, manfred, linux-ia64, linux-kernel, vrajesh,
	hugh

> Looks like indirect jump is only slightly slower (on this CPU).

K7/K8 can predict indirect jumps. But most P3 and P4s can't (except for
the new Prescotts and Centrinos). And in all cases their jump predictor works 
worse.

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
  2004-09-21 15:45                                                         ` Andi Kleen
       [not found]                                                           ` <200409212306.38800.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-23  7:17                                                           ` Andy Lutomirski
  2004-09-23  9:03                                                             ` Andi Kleen
  1 sibling, 1 reply; 106+ messages in thread
From: Andy Lutomirski @ 2004-09-23  7:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, akpm, David S. Miller, benh, wli, davem,
	raybry, ak, manfred, linux-ia64, linux-kernel, vrajesh, hugh

Andi Kleen wrote:
> On Tue, Sep 21, 2004 at 06:41:25PM +0300, Denis Vlasenko wrote:
> 
>>On Monday 20 September 2004 23:57, Andi Kleen wrote:
>>
>>>On Mon, Sep 20, 2004 at 01:49:20PM -0700, Christoph Lameter wrote:
>>>
>>>>On Mon, 20 Sep 2004, Denis Vlasenko wrote:
>>>>
>>>>
>>>>>I think it shouldn't be this way.
>>>>>
>>>>>OTOH for !CONFIG_386 case it makes perfect sense to have it inlined.
>>>>
>>>>Would the following revised patch be acceptable?
>>>
>>>You would need an EXPORT_SYMBOL at least. But to be honest your
>>>original patch was much simpler and nicer and cmpxchg is not called
>>>that often that it really matters. I would just ignore Denis' 
>>>suggestion and stay with the old patch.
>>
>>A bit faster approach (for CONFIG_386 case) would be using
> 
> 
> It's actually slower. Many x86 CPUs cannot predict indirect jumps
> and those that do cannot predict them as well as a test and jump.

Wouldn't alternative_input() choosing between a cmpxchg and a call be 
the way to go here?  Or is the overhead too high in an inline function?

(No patch included since I don't pretend to understand gcc's asm syntax 
at all.)

--Andy

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V8: [4/7] universally available cmpxchg on i386
  2004-09-23  7:17                                                           ` Andy Lutomirski
@ 2004-09-23  9:03                                                             ` Andi Kleen
  2004-09-27 19:06                                                               ` page fault scalability patch V9: [0/7] overview Christoph Lameter
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
  0 siblings, 2 replies; 106+ messages in thread
From: Andi Kleen @ 2004-09-23  9:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Christoph Lameter, akpm, David S. Miller, benh, wli,
	davem, raybry, ak, manfred, linux-ia64, linux-kernel, vrajesh,
	hugh

> Wouldn't alternative_input() choosing between a cmpxchg and a call be 
> the way to go here?  Or is the overhead too high in an inline function?

It would if you want the absolute micro optimization yes. Disadvantage
is that you would waste some more space for nops in the !CONFIG_I386 case.
I personally don't think it matters much and that Christian's original
code was just fine.

-Andi (last post on the thread) 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [0/7] overview
  2004-09-23  9:03                                                             ` Andi Kleen
@ 2004-09-27 19:06                                                               ` Christoph Lameter
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
  1 sibling, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:06 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Changes from V8->V9 of this patch:
- Verify that mm->rss is changed to atomic on all arches
- Fixes to the i386 cmpxchg support. Make it as small as possible
  by using a function instead of inlining.
- Patches against 2.6.9-rc2-bk15

This is a series of patches that increases the scalability of
the page fault handler for SMP. Typical performance increases in the page
fault rate are:

2 CPUs -> 10%
4 CPUs -> 50%
8 CPUs -> 70%

With a high number of CPUs (16..512) we are seeing the page fault rate
roughly doubling.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper only ever depopulates these entries, and the
swapper code has been changed to never set a pte to empty until the page has
been evicted. Populating an empty pte is frequent when a process touches
newly allocated memory (a sketch of this lock-free update follows below).

2. Modifications of flags in a pte entry (write/accessed).

These modifications are already performed by the CPU, or by low-level handlers
on various platforms, without taking the page_table_lock, so bypassing the
lock here appears to be safe as well.
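
As an illustration of case 1, the core of the lock-free update in
do_anonymous_page() (patch 2/7) boils down to the following sketch, with
error handling and the write_access details trimmed:

	pte_t orig_entry, entry;

	orig_entry = *page_table;	/* the (empty) pte we observed */
	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)),
			      vma);

	/*
	 * Publish the new pte only if nobody changed it in the meantime.
	 * If the cmpxchg fails another thread faulted the page in first
	 * and we simply back out.
	 */
	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
		page_cache_release(page);
		return VM_FAULT_MINOR;
	}
	atomic_inc(&mm->mm_rss);
	update_mmu_cache(vma, addr, entry);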

The patchset is composed of 7 patches:

1/7: Make mm->rss atomic

   The page table lock is used to protect mm->rss and the first patch
   makes rss atomic so that it may be changed without holding the
   page_table_lock.
   Generic atomic variables are only 32 bit under Linux. However, 32 bits are
   sufficient for rss even on a 64 bit machine, since rss counts pages: this still
   allows up to 2^(31+12) bytes = 8 terabytes of memory to be in use by a single
   process. A 64 bit atomic would of course be better.

2/7: Avoid page_table_lock in handle_mm_fault

   This patch defers the acquisition of the page_table_lock as much as
   possible and uses atomic operations for allocating anonymous memory.
   These atomic operations are simulated by acquiring the page_table_lock
   for very small time frames if an architecture does not define
   __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
   pte will not be set to empty if a page is in transition to swap.

   If only the first two patches are applied then the time that the page_table_lock
   is held is simply reduced. The lock may then be acquired multiple
   times during a page fault.

   The remaining patches introduce the necessary atomic pte operations to avoid
   the page_table_lock.

3/7: Atomic pte operations for ia64

   This patch adds atomic pte operations to the IA64 platform. The page_table_lock
   will then no longer be acquired for page faults that create pte's for anonymous
   memory.

4/7: Make cmpxchg generally available on i386

   The atomic operations on the page table rely heavily on cmpxchg instructions.
   This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
   cpus. The emulations are only included if a kernel is built for these old
   cpus. At run time the emulations fall back to the real cmpxchg instructions if a
   kernel built for a 386 or 486 is run on a more recent cpu.

   This patch may be used independently of the other patches.

5/7: Atomic pte operations for i386

   Add atomic PTE operations for i386. The generally available cmpxchg added by
   the previous patch is required for this patch.

6/7: Atomic pte operations for x86_64

   Add atomic pte operations for x86_64. This has not been tested yet since I have
   no x86_64 hardware available.

7/7: Atomic pte operations for s390

   Add atomic PTE operations for S390. This has also not been tested yet since I
   have no S/390 available. Feedback from the S/390 people indicates, though, that
   the approach is fine.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [1/7] make mm->rss atomic
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
@ 2004-09-27 19:07                                                                 ` Christoph Lameter
  2004-09-27 19:08                                                                 ` page fault scalability patch V9: [2/7] defer/remove page_table_lock Christoph Lameter
                                                                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:07 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Make mm->rss atomic, so that rss may be incremented or decremented
	  without holding the page table lock.
	* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc2/kernel/fork.c
===================================================================
--- linux-2.6.9-rc2.orig/kernel/fork.c	2004-09-21 13:12:26.000000000 -0700
+++ linux-2.6.9-rc2/kernel/fork.c	2004-09-21 13:12:29.000000000 -0700
@@ -296,7 +296,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc2/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc2.orig/include/linux/sched.h	2004-09-21 13:12:26.000000000 -0700
+++ linux-2.6.9-rc2/include/linux/sched.h	2004-09-21 13:12:29.000000000 -0700
@@ -213,9 +213,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -225,7 +226,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm, shared_vm;
+	unsigned long total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc2/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/proc/task_mmu.c	2004-09-12 22:31:44.000000000 -0700
+++ linux-2.6.9-rc2/fs/proc/task_mmu.c	2004-09-21 13:12:29.000000000 -0700
@@ -21,7 +21,7 @@
 		"VmLib:\t%8lu kB\n",
 		(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib);
 	return buffer;
@@ -38,7 +38,7 @@
 	*shared = mm->shared_vm;
 	*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm - *text;
-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	return mm->total_vm;
 }

Index: linux-2.6.9-rc2/mm/mmap.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/mmap.c	2004-09-12 22:32:54.000000000 -0700
+++ linux-2.6.9-rc2/mm/mmap.c	2004-09-21 13:12:29.000000000 -0700
@@ -1845,7 +1845,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc2/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-generic/tlb.h	2004-09-12 22:31:26.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-generic/tlb.h	2004-09-21 13:12:29.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc2/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/binfmt_flat.c	2004-09-12 22:31:26.000000000 -0700
+++ linux-2.6.9-rc2/fs/binfmt_flat.c	2004-09-21 13:12:29.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc2/fs/exec.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/exec.c	2004-09-21 13:12:26.000000000 -0700
+++ linux-2.6.9-rc2/fs/exec.c	2004-09-21 13:12:29.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc2/mm/memory.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/memory.c	2004-09-12 22:32:26.000000000 -0700
+++ linux-2.6.9-rc2/mm/memory.c	2004-09-21 13:12:29.000000000 -0700
@@ -325,7 +325,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1096,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1378,7 +1378,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1443,7 +1443,7 @@
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-		mm->rss++;
+		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
@@ -1552,7 +1552,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
Index: linux-2.6.9-rc2/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/binfmt_som.c	2004-09-12 22:32:17.000000000 -0700
+++ linux-2.6.9-rc2/fs/binfmt_som.c	2004-09-21 13:12:29.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc2/mm/fremap.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/fremap.c	2004-09-12 22:31:27.000000000 -0700
+++ linux-2.6.9-rc2/mm/fremap.c	2004-09-21 13:12:29.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc2/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/swapfile.c	2004-09-12 22:31:57.000000000 -0700
+++ linux-2.6.9-rc2/mm/swapfile.c	2004-09-21 13:12:29.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc2/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/binfmt_aout.c	2004-09-12 22:31:57.000000000 -0700
+++ linux-2.6.9-rc2/fs/binfmt_aout.c	2004-09-21 13:12:29.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = current->mm->mmap_base;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc2/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/ia64/mm/hugetlbpage.c	2004-09-12 22:32:48.000000000 -0700
+++ linux-2.6.9-rc2/arch/ia64/mm/hugetlbpage.c	2004-09-21 13:12:29.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc2/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/proc/array.c	2004-09-12 22:32:54.000000000 -0700
+++ linux-2.6.9-rc2/fs/proc/array.c	2004-09-21 13:12:29.000000000 -0700
@@ -388,7 +388,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc2/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc2.orig/fs/binfmt_elf.c	2004-09-12 22:32:26.000000000 -0700
+++ linux-2.6.9-rc2/fs/binfmt_elf.c	2004-09-21 13:12:29.000000000 -0700
@@ -708,7 +708,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc2/mm/rmap.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/rmap.c	2004-09-21 13:12:26.000000000 -0700
+++ linux-2.6.9-rc2/mm/rmap.c	2004-09-21 13:12:29.000000000 -0700
@@ -262,7 +262,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -501,7 +501,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -580,7 +580,7 @@
 		BUG_ON(pte_file(*pte));
 	}

-	mm->rss--;
+	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);

@@ -680,7 +680,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -779,7 +779,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.9-rc2/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-ia64/tlb.h	2004-09-12 22:32:26.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-ia64/tlb.h	2004-09-21 13:12:29.000000000 -0700
@@ -161,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc2/include/asm-arm/tlb.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-arm/tlb.h	2004-09-12 22:31:27.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-arm/tlb.h	2004-09-21 13:12:29.000000000 -0700
@@ -54,11 +54,11 @@
 {
 	struct mm_struct *mm = tlb->mm;
 	unsigned long freed = tlb->freed;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);

 	if (freed) {
 		flush_tlb_mm(mm);
Index: linux-2.6.9-rc2/include/asm-arm26/tlb.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-arm26/tlb.h	2004-09-12 22:33:38.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-arm26/tlb.h	2004-09-21 13:12:29.000000000 -0700
@@ -35,11 +35,11 @@
 {
         struct mm_struct *mm = tlb->mm;
         unsigned long freed = tlb->freed;
-        int rss = mm->rss;
+        int rss = atomic_read(&mm->mm_rss);

         if (rss < freed)
                 freed = rss;
-        mm->rss = rss - freed;
+        atomic_sub(freed, &mm->mm_rss);

         if (freed) {
                 flush_tlb_mm(mm);
Index: linux-2.6.9-rc2/include/asm-sparc64/tlb.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-sparc64/tlb.h	2004-09-12 22:33:11.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-sparc64/tlb.h	2004-09-21 13:12:29.000000000 -0700
@@ -80,11 +80,11 @@
 {
 	unsigned long freed = mp->freed;
 	struct mm_struct *mm = mp->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);

 	tlb_flush_mmu(mp);

Index: linux-2.6.9-rc2/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/sh/mm/hugetlbpage.c	2004-09-12 22:32:54.000000000 -0700
+++ linux-2.6.9-rc2/arch/sh/mm/hugetlbpage.c	2004-09-21 13:19:27.000000000 -0700
@@ -62,7 +62,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -206,7 +206,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc2/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/x86_64/ia32/ia32_aout.c	2004-09-12 22:33:38.000000000 -0700
+++ linux-2.6.9-rc2/arch/x86_64/ia32/ia32_aout.c	2004-09-21 13:15:41.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc2/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/ppc64/mm/hugetlbpage.c	2004-09-21 13:12:26.000000000 -0700
+++ linux-2.6.9-rc2/arch/ppc64/mm/hugetlbpage.c	2004-09-21 13:22:41.000000000 -0700
@@ -125,7 +125,7 @@
 	hugepte_t entry;
 	int i;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	entry = mk_hugepte(page, write_access);
 	for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
 		set_hugepte(ptep+i, entry);
@@ -287,7 +287,7 @@
 			/* This is the first hugepte in a batch */
 			ptepage = hugepte_page(entry);
 			get_page(ptepage);
-			dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+			atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		}
 		set_hugepte(dst_pte, entry);

@@ -410,7 +410,7 @@
 	}
 	put_cpu();

-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 }

 int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
Index: linux-2.6.9-rc2/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/sh64/mm/hugetlbpage.c	2004-09-12 22:32:00.000000000 -0700
+++ linux-2.6.9-rc2/arch/sh64/mm/hugetlbpage.c	2004-09-21 13:20:44.000000000 -0700
@@ -62,7 +62,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -206,7 +206,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc2/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/sparc64/mm/hugetlbpage.c	2004-09-12 22:32:55.000000000 -0700
+++ linux-2.6.9-rc2/arch/sparc64/mm/hugetlbpage.c	2004-09-21 13:17:58.000000000 -0700
@@ -59,7 +59,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -112,7 +112,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -203,7 +203,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc2/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/mips/kernel/irixelf.c	2004-09-12 22:33:39.000000000 -0700
+++ linux-2.6.9-rc2/arch/mips/kernel/irixelf.c	2004-09-21 13:25:46.000000000 -0700
@@ -686,7 +686,7 @@
 	/* Do this so that we can load the interpreter, if need be.  We will
 	 * change some of these later.
 	 */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	setup_arg_pages(bprm, EXSTACK_DEFAULT);
 	current->mm->start_stack = bprm->p;

Index: linux-2.6.9-rc2/arch/m68k/atari/stram.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/m68k/atari/stram.c	2004-09-12 22:33:11.000000000 -0700
+++ linux-2.6.9-rc2/arch/m68k/atari/stram.c	2004-09-21 13:23:38.000000000 -0700
@@ -635,7 +635,7 @@
 	set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	get_page(page);
-	++vma->vm_mm->rss;
+	atomic_inc(&vma->vm_mm->mm_rss);
 }

 static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
Index: linux-2.6.9-rc2/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/i386/mm/hugetlbpage.c	2004-09-12 22:33:36.000000000 -0700
+++ linux-2.6.9-rc2/arch/i386/mm/hugetlbpage.c	2004-09-21 13:24:57.000000000 -0700
@@ -42,7 +42,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -82,7 +82,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -218,7 +218,7 @@
 		page = pte_page(pte);
 		put_page(page);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc2/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/sparc64/kernel/binfmt_aout32.c	2004-09-12 22:32:44.000000000 -0700
+++ linux-2.6.9-rc2/arch/sparc64/kernel/binfmt_aout32.c	2004-09-21 13:21:23.000000000 -0700
@@ -239,7 +239,7 @@
 	current->mm->brk = ex.a_bss +
 		(current->mm->start_brk = N_BSSADDR(ex));

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [2/7] defer/remove page_table_lock
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
  2004-09-27 19:07                                                                 ` page fault scalability patch V9: [1/7] make mm->rss atomic Christoph Lameter
@ 2004-09-27 19:08                                                                 ` Christoph Lameter
  2004-09-27 19:10                                                                 ` page fault scalability patch V9: [3/7] atomic pte operatios for ia64 Christoph Lameter
                                                                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:08 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Increase parallelism in SMP configurations by deferring
	  the acquisition of page_table_lock in handle_mm_fault
	* Anonymous memory page faults bypass the page_table_lock
	  through the use of atomic page table operations
	* Swapper does not set pte to empty in transition to swap
	* Simulate atomic page table operations using the
	  page_table_lock if an arch does not define
	  __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
	  a performance benefit since the page_table_lock
	  is held for shorter periods of time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc2/mm/memory.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/memory.c	2004-09-20 08:57:52.000000000 -0700
+++ linux-2.6.9-rc2/mm/memory.c	2004-09-20 09:04:38.000000000 -0700
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1406,14 +1403,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1420,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1428,39 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1476,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1492,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1589,7 +1591,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1604,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1628,49 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that this case is handled properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,22 +1688,42 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+		pmd_t *new = pmd_alloc_one(mm, address);
+		if (!new) return VM_FAULT_OOM;
+
+		/* Ensure that the update is done atomically */
+		if (!pgd_test_and_populate(mm, pgd, new)) pmd_free(new);
+	}
+
+	pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t *pte;
+
+		if (!pmd_present(*pmd)) {
+			struct page *new;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+			new = pte_alloc_one(mm, address);
+			if (!new) return VM_FAULT_OOM;
+
+			if (!pmd_test_and_populate(mm, pmd, new))
+				pte_free(new);
+			else
+				inc_page_state(nr_page_table_pages);
+		}
+
+		pte = pte_offset_map(pmd, address);
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6.9-rc2/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-generic/pgtable.h	2004-09-12 22:32:00.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-generic/pgtable.h	2004-09-20 08:57:59.000000000 -0700
@@ -126,4 +126,81 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = *(__ptep);						\
+	set_pte(__ptep, __pteval);					\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_get_clear_flush(__vma, __address, __ptep)			\
+({									\
+	pte_t __pte;							\
+	spin_lock(&mm->page_table_lock);				\
+	__pte = ptep_get_and_clear(__ptep);				\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&mm->page_table_lock);				\
+	__pte;								\
+})
+
+
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)		\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = pte_same(*(__ptep), __oldval);				\
+	if (__rc) set_pte(__ptep, __newval);				\
+	flush_tlb_page(__vma, __addr);					\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pgd_test_and_populate(__mm, __pgd, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pgd_present(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pmd);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+
+
+#else
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+
+#endif
+
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9-rc2/mm/rmap.c
===================================================================
--- linux-2.6.9-rc2.orig/mm/rmap.c	2004-09-20 08:57:52.000000000 -0700
+++ linux-2.6.9-rc2/mm/rmap.c	2004-09-20 08:57:59.000000000 -0700
@@ -420,7 +420,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -562,11 +565,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -576,10 +574,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -666,15 +668,24 @@
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);
+		/*
+		 * There would be a race with the handle_mm_fault code, which
+		 * bypasses the page_table_lock to create ptes quickly, if we
+		 * zapped the pte before putting something into it. We also
+		 * need the dirty flag of the value that we replace.
+		 * Since the processor may set the dirty bit at any time, we
+		 * had better use an atomic operation that picks up the old
+		 * pte and installs the new one in a single step.
+		 */

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_get_and_clear(pte);

-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [3/7] atomic pte operations for ia64
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
  2004-09-27 19:07                                                                 ` page fault scalability patch V9: [1/7] make mm->rss atomic Christoph Lameter
  2004-09-27 19:08                                                                 ` page fault scalability patch V9: [2/7] defer/remove page_table_lock Christoph Lameter
@ 2004-09-27 19:10                                                                 ` Christoph Lameter
  2004-09-27 19:10                                                                 ` page fault scalability patch V9: [4/7] generally available cmpxchg on i386 Christoph Lameter
                                                                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:10 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Provide atomic pte operations for ia64
	* Enhanced parallelism in page fault handler if applied together
	  with the generic patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-ia64/pgalloc.h
===================================================================
--- linus.orig/include/asm-ia64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-ia64/pgalloc.h	2004-09-18 15:43:25.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -78,12 +82,19 @@
 	preempt_enable();
 }

+
 static inline void
 pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
 {
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linus/include/asm-ia64/pgtable.h
===================================================================
--- linus.orig/include/asm-ia64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-ia64/pgtable.h	2004-09-18 15:43:25.000000000 -0700
@@ -423,6 +423,19 @@
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);

+/* Atomic PTE operations */
+static inline pte_t
+ptep_xchg (struct mm_struct *mm, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg((long *) ptep, pteval.pte));
+}
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
  *	 bits in the swap-type field of the swap pte.  It would be nice to
@@ -558,6 +571,7 @@
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [4/7] generally available cmpxchg on i386
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
                                                                                   ` (2 preceding siblings ...)
  2004-09-27 19:10                                                                 ` page fault scalability patch V9: [3/7] atomic pte operatios for ia64 Christoph Lameter
@ 2004-09-27 19:10                                                                 ` Christoph Lameter
  2004-09-27 19:11                                                                 ` page fault scalability patch V9: [5/7] atomic pte operations for i386 Christoph Lameter
                                                                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:10 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Make cmpxchg and cmpxchg8b generally available on i386.
	* Provide emulation of cmpxchg suitable for UP if built and
	  run on 386.
	* Provide emulation of cmpxchg8b suitable for UP if built
	  and run on 386 or 486.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc2/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc2.orig/include/asm-i386/system.h	2004-09-12 22:31:26.000000000 -0700
+++ linux-2.6.9-rc2/include/asm-i386/system.h	2004-09-21 13:37:06.000000000 -0700
@@ -240,7 +240,24 @@
  */

 #ifdef CONFIG_X86_CMPXCHG
+
 #define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on an 80386. It may be necessary
+ * to emulate the cmpxchg instruction on the 80386 CPU.
+ */
+
+extern unsigned long cmpxchg_386(volatile void *, unsigned long, unsigned long, int);
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))cmpxchg_386((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
 #endif

 static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +287,32 @@
 	return old;
 }

-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+		unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	__asm__ __volatile__(
+		LOCK_PREFIX "cmpxchg8b %4\n"
+		: "=A" (prev)
+		: "0" (old), "c" ((unsigned long)(newv >> 32)),
+		  "b" ((unsigned long)(newv & 0xffffffffLL)), "m" (*ptr)
+		: "memory");
+	return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on the 80386 and 80486. Neither
+ * supports cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(unsigned long long *,
+				unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc2/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc2.orig/arch/i386/Kconfig	2004-09-21 13:12:25.000000000 -0700
+++ linux-2.6.9-rc2/arch/i386/Kconfig	2004-09-21 13:32:25.000000000 -0700
@@ -345,6 +345,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc2/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9-rc2.orig/arch/i386/kernel/cpu/intel.c	2004-09-12 22:31:59.000000000 -0700
+++ linux-2.6.9-rc2/arch/i386/kernel/cpu/intel.c	2004-09-21 13:32:25.000000000 -0700
@@ -415,5 +415,65 @@
 	return 0;
 }

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+	unsigned long flags;
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all.
+	 * All CPUs except the 386 support cmpxchg.
+	 */
+	if (cpu_data->x86 > 3) return __cmpxchg(ptr, old, new, size);
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are actually running on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data, X86_FEATURE_CX8)) return __cmpxchg8b(ptr, old, newv);
+
+	/* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
 // arch_initcall(intel_cpu_init);



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [5/7] atomic pte operations for i386
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
                                                                                   ` (3 preceding siblings ...)
  2004-09-27 19:10                                                                 ` page fault scalability patch V9: [4/7] generally available cmpxchg on i386 Christoph Lameter
@ 2004-09-27 19:11                                                                 ` Christoph Lameter
  2004-09-27 19:12                                                                 ` page fault scalability patch V9: [6/7] atomic pte operations for x86_64 Christoph Lameter
  2004-09-27 19:13                                                                 ` page fault scalability patch V9: [7/7] atomic pte operatiosn for s390 Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:11 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Atomic pte operations for i386
	* Needs the general cmpxchg patch for i386

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-i386/pgtable.h
===================================================================
--- linus.orig/include/asm-i386/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable.h	2004-09-18 15:41:52.000000000 -0700
@@ -412,6 +412,7 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linus/include/asm-i386/pgtable-3level.h
===================================================================
--- linus.orig/include/asm-i386/pgtable-3level.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable-3level.h	2004-09-18 15:41:52.000000000 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+ */

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -141,4 +142,26 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+static inline pte_t ptep_xchg(struct mm_struct *mm, pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	  */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
Index: linus/include/asm-i386/pgtable-2level.h
===================================================================
--- linus.orig/include/asm-i386/pgtable-2level.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgtable-2level.h	2004-09-18 15:41:52.000000000 -0700
@@ -82,4 +82,8 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg(mm,xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(mm,a,xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
Index: linus/include/asm-i386/pgalloc.h
===================================================================
--- linus.orig/include/asm-i386/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-i386/pgalloc.h	2004-09-18 15:41:52.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PMD_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,19 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +64,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [6/7] atomic pte operations for x86_64
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
                                                                                   ` (4 preceding siblings ...)
  2004-09-27 19:11                                                                 ` page fault scalability patch V9: [5/7] atomic pte operations for i386 Christoph Lameter
@ 2004-09-27 19:12                                                                 ` Christoph Lameter
  2004-09-27 19:13                                                                 ` page fault scalability patch V9: [7/7] atomic pte operatiosn for s390 Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:12 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-x86_64/pgalloc.h
===================================================================
--- linus.orig/include/asm-x86_64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-x86_64/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linus/include/asm-x86_64/pgtable.h
===================================================================
--- linus.orig/include/asm-x86_64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-x86_64/pgtable.h	2004-09-18 14:25:23.000000000 -0700
@@ -436,6 +436,11 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_xchg(mm,addr,xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(mm,addr,xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR



^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V9: [7/7] atomic pte operations for s390
       [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
                                                                                   ` (5 preceding siblings ...)
  2004-09-27 19:12                                                                 ` page fault scalability patch V9: [6/7] atomic pte operations for x86_64 Christoph Lameter
@ 2004-09-27 19:13                                                                 ` Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-09-27 19:13 UTC (permalink / raw)
  To: akpm; +Cc: Andy Lutomirski, ak, nickpiggin, linux-kernel

Changelog
	* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linus/include/asm-s390/pgtable.h
===================================================================
--- linus.orig/include/asm-s390/pgtable.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-s390/pgtable.h	2004-09-18 14:47:30.000000000 -0700
@@ -783,6 +783,19 @@

 #define kern_addr_valid(addr)   (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline pte_t ptep_xchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg(ptep, pte_val(pteval)));
+}
+
+static inline int ptep_cmpxchg (struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 /*
  * No page table caches to initialise
  */
Index: linus/include/asm-s390/pgalloc.h
===================================================================
--- linus.orig/include/asm-s390/pgalloc.h	2004-09-18 14:25:23.000000000 -0700
+++ linus/include/asm-s390/pgalloc.h	2004-09-18 14:47:30.000000000 -0700
@@ -97,6 +97,10 @@
 	pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
 }

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+	return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
 #endif /* __s390x__ */

 static inline void
@@ -119,6 +123,18 @@
 	pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
 }

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+	int rc;
+	spin_lock(&mm->page_table_lock);
+
+	rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+	if (rc) pmd_populate(mm, pmd, page);
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
+
 /*
  * page table entry allocation/free routines.
  */


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [0/7] overview
  2004-09-27 19:06                                                               ` page fault scalability patch V9: [0/7] overview Christoph Lameter
@ 2004-10-15 19:02                                                                 ` Christoph Lameter
  2004-10-15 19:03                                                                   ` page fault scalability patch V10: [1/7] make rss atomic Christoph Lameter
                                                                                     ` (6 more replies)
  0 siblings, 7 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:02 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Changes from V9->V10 of this patch:
- generic: fixes and updates
- S390: changes after feedback from Martin Schwidefsky
- x86_64: tested and now works fine.
- i386: stable
- ia64: stable. Added support for pte_locking necessary
	for a planned parallelization of COW.

This is a series of patches that increases the scalability of
the page fault handler for SMP. Typical performance increases in the page
fault rate are:

2 CPUs -> 30%
4 CPUs -> 45%
8 CPUs -> 60%

With a high number of CPUs (16..512) we are seeing the page fault rate
roughly doubling.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.
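
As a rough illustration of the pattern (a simplified sketch only, not the
exact code of patch 2/7, which also handles the lru/rmap bookkeeping and
the read-only zero page case; the helper name sketch_anonymous_fault is
made up): the fault handler prepares the new page first and then installs
it with a single compare-and-exchange against the pte value it originally
read. If another CPU won the race, the cmpxchg fails and the prepared page
is simply dropped.

	/* Sketch of populating an empty pte without the page_table_lock,
	 * using the ptep_cmpxchg operation introduced by this patchset.
	 */
	static int sketch_anonymous_fault(struct mm_struct *mm,
			struct vm_area_struct *vma, unsigned long addr,
			pte_t *ptep, pte_t orig_entry)
	{
		struct page *page = alloc_page(GFP_HIGHUSER);
		pte_t entry;

		if (!page)
			return VM_FAULT_OOM;
		clear_user_highpage(page, addr);
		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma);

		/* Install the pte only if it is still in its original state */
		if (!ptep_cmpxchg(vma, addr, ptep, orig_entry, entry)) {
			/* Someone else faulted this pte in first; back out */
			page_cache_release(page);
			return VM_FAULT_MINOR;
		}
		atomic_inc(&mm->mm_rss);
		update_mmu_cache(vma, addr, entry);
		return VM_FAULT_MINOR;
	}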

The patchset is composed of 7 patches:

1/7: Make mm->rss atomic

   The page table lock is used to protect mm->rss and the first patch
   makes rss atomic so that it may be changed without holding the
   page_table_lock.
   Generic atomic variables are only 32 bit under Linux. However, 32 bits
   are sufficient for rss even on a 64 bit machine since rss counts pages:
   2^31 pages of 2^12 bytes each still allow up to 2^43 bytes = 8 terabytes
   of memory to be in use by a single process. A 64 bit atomic would of
   course be better.

2/7: Avoid page_table_lock in handle_mm_fault

   This patch defers the acquisition of the page_table_lock as much as
   possible and uses atomic operations for allocating anonymous memory.
   These atomic operations are simulated by acquiring the page_table_lock
   for very small time frames if an architecture does not define
   __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
   pte will not be set to empty if a page is in transition to swap.

   If only the first two patches are applied then the time that the page_table_lock
   is held is simply reduced. The lock may then be acquired multiple
   times during a page fault.

   The remaining patches introduce the necessary atomic pte operations to avoid
   the page_table_lock.

3/7: Atomic pte operations for ia64

4/7: Make cmpxchg generally available on i386

   The atomic operations on the page table rely heavily on cmpxchg instructions.
   This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
   cpus. The emulations are only included if a kernel is built for these old
   cpus, and they are skipped in favor of the real cmpxchg instructions if a
   kernel built for the 386 or 486 is then run on a more recent cpu.

   This patch may be used independently of the other patches.
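
   As a hypothetical usage example (the names reserved_slots and claim_slot
   are made up for illustration and are not part of the patchset), any
   kernel code can then rely on cmpxchg for a simple lock-free update; if
   the kernel was built for a 386, the cmpxchg_386 emulation is used and
   falls back to the real instruction at run time when the cpu supports it:

	static unsigned long reserved_slots;

	/* Atomically claim one slot. Returns 1 on success, 0 if another
	 * CPU updated reserved_slots between the read and the cmpxchg.
	 */
	static int claim_slot(void)
	{
		unsigned long old = reserved_slots;

		return cmpxchg(&reserved_slots, old, old + 1) == old;
	}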

5/7: Atomic pte operations for i386

   The generally available cmpxchg (previous patch) is required for this patch
   in order to preserve the ability to build kernels for the 386 and 486.

6/7: Atomic pte operation for x86_64

7/7: Atomic pte operations for s390

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [1/7] make rss atomic
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
@ 2004-10-15 19:03                                                                   ` Christoph Lameter
  2004-10-15 19:04                                                                   ` page fault scalability patch V10: [2/7] defer/omit taking page_table_lock Christoph Lameter
                                                                                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:03 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Make mm->rss atomic, so that rss may be incremented or decremented
	  without holding the page table lock.
	* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/kernel/fork.c
===================================================================
--- linux-2.6.9-rc4.orig/kernel/fork.c	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/kernel/fork.c	2004-10-14 12:22:14.000000000 -0700
@@ -296,7 +296,7 @@
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	cpus_clear(mm->cpu_vm_mask);
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
Index: linux-2.6.9-rc4/include/linux/sched.h
===================================================================
--- linux-2.6.9-rc4.orig/include/linux/sched.h	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/include/linux/sched.h	2004-10-14 12:22:14.000000000 -0700
@@ -213,9 +213,10 @@
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
+	atomic_t mm_rss;			/* Number of pages used by this mm struct */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
+	spinlock_t page_table_lock;		/* Protects task page tables */

 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -225,7 +226,7 @@
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, total_vm, locked_vm, shared_vm;
+	unsigned long total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
Index: linux-2.6.9-rc4/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/proc/task_mmu.c	2004-10-10 19:57:06.000000000 -0700
+++ linux-2.6.9-rc4/fs/proc/task_mmu.c	2004-10-14 12:22:14.000000000 -0700
@@ -21,7 +21,7 @@
 		"VmLib:\t%8lu kB\n",
 		(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->rss << (PAGE_SHIFT-10),
+		(unsigned long)atomic_read(&mm->mm_rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib);
 	return buffer;
@@ -38,7 +38,7 @@
 	*shared = mm->shared_vm;
 	*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm - *text;
-	*resident = mm->rss;
+	*resident = atomic_read(&mm->mm_rss);
 	return mm->total_vm;
 }

Index: linux-2.6.9-rc4/mm/mmap.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/mmap.c	2004-10-10 19:58:06.000000000 -0700
+++ linux-2.6.9-rc4/mm/mmap.c	2004-10-14 12:22:14.000000000 -0700
@@ -1847,7 +1847,7 @@
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;
-	mm->rss = 0;
+	atomic_set(&mm->mm_rss, 0);
 	mm->total_vm = 0;
 	mm->locked_vm = 0;

Index: linux-2.6.9-rc4/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-generic/tlb.h	2004-10-10 19:56:36.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-generic/tlb.h	2004-10-14 12:22:14.000000000 -0700
@@ -88,11 +88,11 @@
 {
 	int freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_set(&mm->mm_rss, rss - freed);
 	tlb_flush_mmu(tlb, start, end);

 	/* keep the page table cache within bounds */
Index: linux-2.6.9-rc4/fs/binfmt_flat.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/binfmt_flat.c	2004-10-10 19:56:36.000000000 -0700
+++ linux-2.6.9-rc4/fs/binfmt_flat.c	2004-10-14 12:22:14.000000000 -0700
@@ -650,7 +650,7 @@
 		current->mm->start_brk = datapos + data_len + bss_len;
 		current->mm->brk = (current->mm->start_brk + 3) & ~3;
 		current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
-		current->mm->rss = 0;
+		atomic_set(&current->mm->mm_rss, 0);
 	}

 	if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.9-rc4/fs/exec.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/exec.c	2004-10-10 19:57:30.000000000 -0700
+++ linux-2.6.9-rc4/fs/exec.c	2004-10-14 12:22:14.000000000 -0700
@@ -319,7 +319,7 @@
 		pte_unmap(pte);
 		goto out;
 	}
-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
Index: linux-2.6.9-rc4/mm/memory.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/memory.c	2004-10-10 19:57:50.000000000 -0700
+++ linux-2.6.9-rc4/mm/memory.c	2004-10-14 12:22:14.000000000 -0700
@@ -325,7 +325,7 @@
 					pte = pte_mkclean(pte);
 				pte = pte_mkold(pte);
 				get_page(page);
-				dst->rss++;
+				atomic_inc(&dst->mm_rss);
 				set_pte(dst_pte, pte);
 				page_dup_rmap(page);
 cont_copy_pte_range_noset:
@@ -1096,7 +1096,7 @@
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
 		if (PageReserved(old_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		else
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
@@ -1378,7 +1378,7 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1443,7 +1443,7 @@
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-		mm->rss++;
+		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
@@ -1552,7 +1552,7 @@
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
 		if (!PageReserved(new_page))
-			++mm->rss;
+			atomic_inc(&mm->mm_rss);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
Index: linux-2.6.9-rc4/fs/binfmt_som.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/binfmt_som.c	2004-10-10 19:57:30.000000000 -0700
+++ linux-2.6.9-rc4/fs/binfmt_som.c	2004-10-14 12:22:14.000000000 -0700
@@ -259,7 +259,7 @@
 	create_som_tables(bprm);

 	current->mm->start_stack = bprm->p;
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);

 #if 0
 	printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.9-rc4/mm/fremap.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/fremap.c	2004-10-10 19:56:40.000000000 -0700
+++ linux-2.6.9-rc4/mm/fremap.c	2004-10-14 12:22:14.000000000 -0700
@@ -38,7 +38,7 @@
 					set_page_dirty(page);
 				page_remove_rmap(page);
 				page_cache_release(page);
-				mm->rss--;
+				atomic_dec(&mm->mm_rss);
 			}
 		}
 	} else {
@@ -86,7 +86,7 @@

 	zap_pte(mm, vma, addr, pte);

-	mm->rss++;
+	atomic_inc(&mm->mm_rss);
 	flush_icache_page(vma, page);
 	set_pte(pte, mk_pte(page, prot));
 	page_add_file_rmap(page);
Index: linux-2.6.9-rc4/mm/swapfile.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/swapfile.c	2004-10-10 19:57:07.000000000 -0700
+++ linux-2.6.9-rc4/mm/swapfile.c	2004-10-14 12:22:14.000000000 -0700
@@ -430,7 +430,7 @@
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
-	vma->vm_mm->rss++;
+	atomic_inc(&vma->vm_mm->mm_rss);
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, address);
Index: linux-2.6.9-rc4/fs/binfmt_aout.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/binfmt_aout.c	2004-10-10 19:57:06.000000000 -0700
+++ linux-2.6.9-rc4/fs/binfmt_aout.c	2004-10-14 12:22:14.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = current->mm->mmap_base;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc4/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/ia64/mm/hugetlbpage.c	2004-10-10 19:57:59.000000000 -0700
+++ linux-2.6.9-rc4/arch/ia64/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -65,7 +65,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -108,7 +108,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -249,7 +249,7 @@
 		put_page(page);
 		pte_clear(pte);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc4/fs/proc/array.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/proc/array.c	2004-10-10 19:58:06.000000000 -0700
+++ linux-2.6.9-rc4/fs/proc/array.c	2004-10-14 12:22:14.000000000 -0700
@@ -388,7 +388,7 @@
 		jiffies_to_clock_t(task->it_real_value),
 		start_time,
 		vsize,
-		mm ? mm->rss : 0, /* you might want to shift this left 3 */
+		mm ? (unsigned long)atomic_read(&mm->mm_rss) : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
 		mm ? mm->start_code : 0,
 		mm ? mm->end_code : 0,
Index: linux-2.6.9-rc4/fs/binfmt_elf.c
===================================================================
--- linux-2.6.9-rc4.orig/fs/binfmt_elf.c	2004-10-10 19:57:50.000000000 -0700
+++ linux-2.6.9-rc4/fs/binfmt_elf.c	2004-10-14 12:22:14.000000000 -0700
@@ -716,7 +716,7 @@

 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
Index: linux-2.6.9-rc4/mm/rmap.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/rmap.c	2004-10-10 19:58:49.000000000 -0700
+++ linux-2.6.9-rc4/mm/rmap.c	2004-10-14 12:22:14.000000000 -0700
@@ -262,7 +262,7 @@
 	pte_t *pte;
 	int referenced = 0;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -501,7 +501,7 @@
 	pte_t pteval;
 	int ret = SWAP_AGAIN;

-	if (!mm->rss)
+	if (!atomic_read(&mm->mm_rss))
 		goto out;
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -580,7 +580,7 @@
 		BUG_ON(pte_file(*pte));
 	}

-	mm->rss--;
+	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);

@@ -680,7 +680,7 @@

 		page_remove_rmap(page);
 		page_cache_release(page);
-		mm->rss--;
+		atomic_dec(&mm->mm_rss);
 		(*mapcount)--;
 	}

@@ -779,7 +779,7 @@
 			if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
-			while (vma->vm_mm->rss &&
+			while (atomic_read(&vma->vm_mm->mm_rss) &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.9-rc4/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-ia64/tlb.h	2004-10-10 19:57:44.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-ia64/tlb.h	2004-10-14 12:22:14.000000000 -0700
@@ -161,11 +161,11 @@
 {
 	unsigned long freed = tlb->freed;
 	struct mm_struct *mm = tlb->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
 	 * tlb->end_addr.
Index: linux-2.6.9-rc4/include/asm-arm/tlb.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-arm/tlb.h	2004-10-10 19:56:40.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-arm/tlb.h	2004-10-14 12:22:14.000000000 -0700
@@ -54,11 +54,11 @@
 {
 	struct mm_struct *mm = tlb->mm;
 	unsigned long freed = tlb->freed;
-	int rss = mm->rss;
+	int rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);

 	if (freed) {
 		flush_tlb_mm(mm);
Index: linux-2.6.9-rc4/include/asm-arm26/tlb.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-arm26/tlb.h	2004-10-10 19:58:56.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-arm26/tlb.h	2004-10-14 12:22:14.000000000 -0700
@@ -35,11 +35,11 @@
 {
         struct mm_struct *mm = tlb->mm;
         unsigned long freed = tlb->freed;
-        int rss = mm->rss;
+        int rss = atomic_read(&mm->mm_rss);

         if (rss < freed)
                 freed = rss;
-        mm->rss = rss - freed;
+        atomic_sub(freed, &mm->mm_rss);

         if (freed) {
                 flush_tlb_mm(mm);
Index: linux-2.6.9-rc4/include/asm-sparc64/tlb.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-sparc64/tlb.h	2004-10-10 19:58:24.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-sparc64/tlb.h	2004-10-14 12:22:14.000000000 -0700
@@ -80,11 +80,11 @@
 {
 	unsigned long freed = mp->freed;
 	struct mm_struct *mm = mp->mm;
-	unsigned long rss = mm->rss;
+	unsigned long rss = atomic_read(&mm->mm_rss);

 	if (rss < freed)
 		freed = rss;
-	mm->rss = rss - freed;
+	atomic_sub(freed, &mm->mm_rss);

 	tlb_flush_mmu(mp);

Index: linux-2.6.9-rc4/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/sh/mm/hugetlbpage.c	2004-10-10 19:58:06.000000000 -0700
+++ linux-2.6.9-rc4/arch/sh/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -62,7 +62,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -206,7 +206,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc4/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/x86_64/ia32/ia32_aout.c	2004-10-10 19:58:56.000000000 -0700
+++ linux-2.6.9-rc4/arch/x86_64/ia32/ia32_aout.c	2004-10-14 12:22:14.000000000 -0700
@@ -309,7 +309,7 @@
 		(current->mm->start_brk = N_BSSADDR(ex));
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.9-rc4/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/ppc64/mm/hugetlbpage.c	2004-10-10 19:57:59.000000000 -0700
+++ linux-2.6.9-rc4/arch/ppc64/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -125,7 +125,7 @@
 	hugepte_t entry;
 	int i;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	entry = mk_hugepte(page, write_access);
 	for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
 		set_hugepte(ptep+i, entry);
@@ -287,7 +287,7 @@
 			/* This is the first hugepte in a batch */
 			ptepage = hugepte_page(entry);
 			get_page(ptepage);
-			dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+			atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		}
 		set_hugepte(dst_pte, entry);

@@ -410,7 +410,7 @@
 	}
 	put_cpu();

-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 }

 int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
Index: linux-2.6.9-rc4/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/sh64/mm/hugetlbpage.c	2004-10-10 19:57:30.000000000 -0700
+++ linux-2.6.9-rc4/arch/sh64/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -62,7 +62,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -206,7 +206,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc4/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/sparc64/mm/hugetlbpage.c	2004-10-10 19:58:07.000000000 -0700
+++ linux-2.6.9-rc4/arch/sparc64/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -59,7 +59,7 @@
 	unsigned long i;
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);

 	if (write_access)
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -112,7 +112,7 @@
 			pte_val(entry) += PAGE_SIZE;
 			dst_pte++;
 		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -203,7 +203,7 @@
 			pte++;
 		}
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc4/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/mips/kernel/irixelf.c	2004-10-10 19:58:56.000000000 -0700
+++ linux-2.6.9-rc4/arch/mips/kernel/irixelf.c	2004-10-14 12:22:14.000000000 -0700
@@ -686,7 +686,7 @@
 	/* Do this so that we can load the interpreter, if need be.  We will
 	 * change some of these later.
 	 */
-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	setup_arg_pages(bprm, EXSTACK_DEFAULT);
 	current->mm->start_stack = bprm->p;

Index: linux-2.6.9-rc4/arch/m68k/atari/stram.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/m68k/atari/stram.c	2004-10-10 19:58:24.000000000 -0700
+++ linux-2.6.9-rc4/arch/m68k/atari/stram.c	2004-10-14 12:22:14.000000000 -0700
@@ -635,7 +635,7 @@
 	set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	get_page(page);
-	++vma->vm_mm->rss;
+	atomic_inc(&vma->vm_mm->mm_rss);
 }

 static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
Index: linux-2.6.9-rc4/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/i386/mm/hugetlbpage.c	2004-10-10 19:58:49.000000000 -0700
+++ linux-2.6.9-rc4/arch/i386/mm/hugetlbpage.c	2004-10-14 12:22:14.000000000 -0700
@@ -42,7 +42,7 @@
 {
 	pte_t entry;

-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	atomic_add(HPAGE_SIZE / PAGE_SIZE, &mm->mm_rss);
 	if (write_access) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -82,7 +82,7 @@
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		atomic_add(HPAGE_SIZE / PAGE_SIZE, &dst->mm_rss);
 		addr += HPAGE_SIZE;
 	}
 	return 0;
@@ -218,7 +218,7 @@
 		page = pte_page(pte);
 		put_page(page);
 	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
+	atomic_sub((end - start) >> PAGE_SHIFT, &mm->mm_rss);
 	flush_tlb_range(vma, start, end);
 }

Index: linux-2.6.9-rc4/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/sparc64/kernel/binfmt_aout32.c	2004-10-10 19:57:59.000000000 -0700
+++ linux-2.6.9-rc4/arch/sparc64/kernel/binfmt_aout32.c	2004-10-14 12:22:14.000000000 -0700
@@ -239,7 +239,7 @@
 	current->mm->brk = ex.a_bss +
 		(current->mm->start_brk = N_BSSADDR(ex));

-	current->mm->rss = 0;
+	atomic_set(&current->mm->mm_rss, 0);
 	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [2/7] defer/omit taking page_table_lock
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
  2004-10-15 19:03                                                                   ` page fault scalability patch V10: [1/7] make rss atomic Christoph Lameter
@ 2004-10-15 19:04                                                                   ` Christoph Lameter
  2004-10-15 20:00                                                                     ` Marcelo Tosatti
  2004-10-15 19:05                                                                   ` page fault scalability patch V10: [3/7] IA64 atomic pte operations Christoph Lameter
                                                                                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Increase parallelism in SMP configurations by deferring
	  the acquisition of page_table_lock in handle_mm_fault
	* Anonymous memory page faults bypass the page_table_lock
	  through the use of atomic page table operations
	* Swapper does not set pte to empty in transition to swap
	* Simulate atomic page table operations using the
	  page_table_lock if an arch does not define
	  __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
	  a performance benefit since the page_table_lock
	  is held for shorter periods of time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/mm/memory.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/memory.c	2004-10-14 12:22:14.000000000 -0700
+++ linux-2.6.9-rc4/mm/memory.c	2004-10-14 12:22:14.000000000 -0700
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore.
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1406,14 +1403,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1420,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1428,39 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
+		lock_page(page);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+		if (write_access) {
+			pte_unmap(page_table);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * The following two functions are safe to use without
+		 * the page_table_lock but do they need to come before
+		 * the cmpxchg?
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+		unlock_page(page);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1476,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1492,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1589,7 +1591,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1604,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1628,49 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that case is handled properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,22 +1688,42 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+		pmd_t *new = pmd_alloc_one(mm, address);
+		if (!new) return VM_FAULT_OOM;
+
+		/* Ensure that the update is done in an atomic way */
+		if (!pgd_test_and_populate(mm, pgd, new)) pmd_free(new);
+	}
+
+	pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t *pte;
+
+		if (!pmd_present(*pmd)) {
+			struct page *new;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+			new = pte_alloc_one(mm, address);
+			if (!new) return VM_FAULT_OOM;
+
+			if (!pmd_test_and_populate(mm, pmd, new))
+				pte_free(new);
+			else
+				inc_page_state(nr_page_table_pages);
+		}
+
+		pte = pte_offset_map(pmd, address);
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }

Index: linux-2.6.9-rc4/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-generic/pgtable.h	2004-10-10 19:57:30.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-generic/pgtable.h	2004-10-14 12:22:14.000000000 -0700
@@ -126,4 +126,75 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low level operations as well as the
+ * page table handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte;							\
+	spin_lock(&__vma->vm_mm->page_table_lock);			\
+	__pte = *(__ptep);						\
+	set_pte(__ptep, __pteval);					\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&__vma->vm_mm->page_table_lock);			\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)		\
+({									\
+	int __rc;							\
+	spin_lock(&__vma->vm_mm->page_table_lock);			\
+	__rc = pte_same(*(__ptep), __oldval);				\
+	if (__rc) set_pte(__ptep, __newval);				\
+	spin_unlock(&__vma->vm_mm->page_table_lock);			\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&__mm->page_table_lock);				\
+	__rc = !pgd_present(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pmd);			\
+	spin_unlock(&__mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&__mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&__mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#else
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+
+#endif
+
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9-rc4/mm/rmap.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/rmap.c	2004-10-14 12:22:14.000000000 -0700
+++ linux-2.6.9-rc4/mm/rmap.c	2004-10-14 12:22:14.000000000 -0700
@@ -420,7 +420,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -562,11 +565,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -576,10 +574,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -666,15 +668,24 @@
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);
+		/*
+		 * Zapping the pte before putting something into it would
+		 * race with the handle_mm_fault code, which bypasses the
+		 * page_table_lock in order to create ptes quickly.
+		 * We also need the dirty flag as it stood at the moment
+		 * the value was replaced, and the processor may set that
+		 * flag at any time, so we better use an atomic operation
+		 * here.
+		 */

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_get_and_clear(pte);

-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [3/7] IA64 atomic pte operations
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
  2004-10-15 19:03                                                                   ` page fault scalability patch V10: [1/7] make rss atomic Christoph Lameter
  2004-10-15 19:04                                                                   ` page fault scalability patch V10: [2/7] defer/omit taking page_table_lock Christoph Lameter
@ 2004-10-15 19:05                                                                   ` Christoph Lameter
  2004-10-15 19:06                                                                   ` page fault scalability patch V10: [4/7] cmpxchg for 386 and 486 Christoph Lameter
                                                                                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:05 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Provide atomic pte operations for ia64
	* pte lock operations
	* Enhanced parallelism in page fault handler if applied together
	  with the generic patch
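
The pte lock operations boil down to a test-and-set on a spare bit of the
entry (the patch uses ignored bit 56 via _PAGE_LOCK). Here is a rough
userspace sketch of that idea; it uses C11 atomics and invented names
(pte_trylock/pte_unlock), not the kernel's test_and_set_bit/clear_bit
interface.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define FAKE_PAGE_LOCK_BIT	56
#define FAKE_PAGE_LOCK		(1ULL << FAKE_PAGE_LOCK_BIT)

/* Returns 1 if we took the per-entry lock, 0 if somebody else holds it. */
static int pte_trylock(_Atomic uint64_t *ptep)
{
	return !(atomic_fetch_or(ptep, FAKE_PAGE_LOCK) & FAKE_PAGE_LOCK);
}

static void pte_unlock(_Atomic uint64_t *ptep)
{
	atomic_fetch_and(ptep, ~FAKE_PAGE_LOCK);
}

int main(void)
{
	_Atomic uint64_t pte = 0;

	if (pte_trylock(&pte)) {
		/* ... handle the fault for this single entry ... */
		pte_unlock(&pte);
		printf("locked and unlocked one entry\n");
	}
	return 0;
}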

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-ia64/pgalloc.h	2004-10-10 19:56:40.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-ia64/pgalloc.h	2004-10-14 11:32:38.000000000 -0700
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -78,12 +82,19 @@
 	preempt_enable();
 }

+
 static inline void
 pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
 {
 	pgd_val(*pgd_entry) = __pa(pmd);
 }

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pgd_entry, __pa(pmd), PGD_NONE) == PGD_NONE;
+}

 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
 static inline void
 pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
 {
Index: linux-2.6.9-rc4/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-ia64/pgtable.h	2004-10-10 19:57:17.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-ia64/pgtable.h	2004-10-14 12:21:06.000000000 -0700
@@ -30,6 +30,8 @@
 #define _PAGE_P_BIT		0
 #define _PAGE_A_BIT		5
 #define _PAGE_D_BIT		6
+#define _PAGE_IG_BITS          53
+#define _PAGE_LOCK_BIT         (_PAGE_IG_BITS+3)       /* bit 56. Aligned to 8 bits */

 #define _PAGE_P			(1 << _PAGE_P_BIT)	/* page present bit */
 #define _PAGE_MA_WB		(0x0 <<  2)	/* write back memory attribute */
@@ -58,6 +60,7 @@
 #define _PAGE_PPN_MASK		(((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
 #define _PAGE_ED		(__IA64_UL(1) << 52)	/* exception deferral */
 #define _PAGE_PROTNONE		(__IA64_UL(1) << 63)
+#define _PAGE_LOCK		(__IA64_UL(1) << _PAGE_LOCK_BIT)

 /* Valid only for a PTE with the present bit cleared: */
 #define _PAGE_FILE		(1 << 1)		/* see swap & file pte remarks below */
@@ -270,6 +273,8 @@
 #define pte_dirty(pte)		((pte_val(pte) & _PAGE_D) != 0)
 #define pte_young(pte)		((pte_val(pte) & _PAGE_A) != 0)
 #define pte_file(pte)		((pte_val(pte) & _PAGE_FILE) != 0)
+#define pte_locked(pte)		((pte_val(pte) & _PAGE_LOCK)!=0)
+
 /*
  * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the
  * access rights:
@@ -281,8 +286,15 @@
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte)	(__pte(pte_val(pte) & ~_PAGE_LOCK))

 /*
+ * Lock functions for pte's
+ */
+#define ptep_lock(ptep)		test_and_set_bit(_PAGE_LOCK_BIT, ptep)
+#define ptep_unlock(ptep)	do { clear_bit(_PAGE_LOCK_BIT, ptep); smp_mb__after_clear_bit(); } while (0)
+#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val))
+/*
  * Macro to a page protection value as "uncacheable".  Note that "protection" is really a
  * misnomer here as the protection value contains the memory attribute bits, dirty bits,
  * and various other bits as well.
@@ -342,7 +354,6 @@
 #define pte_unmap_nested(pte)		do { } while (0)

 /* atomic versions of the some PTE manipulations: */
-
 static inline int
 ptep_test_and_clear_young (pte_t *ptep)
 {
@@ -414,6 +425,18 @@
 #endif
 }

+static inline pte_t
+ptep_xchg (struct mm_struct *mm, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg((long *) ptep, pteval.pte));
+}
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
 static inline int
 pte_same (pte_t a, pte_t b)
 {
@@ -558,6 +581,8 @@
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _ASM_IA64_PGTABLE_H */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [4/7] cmpxchg for 386 and 486
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
                                                                                     ` (2 preceding siblings ...)
  2004-10-15 19:05                                                                   ` page fault scalability patch V10: [3/7] IA64 atomic pte operations Christoph Lameter
@ 2004-10-15 19:06                                                                   ` Christoph Lameter
  2004-10-15 19:06                                                                   ` page fault scalability patch V10: [5/7] i386 atomic pte operations Christoph Lameter
                                                                                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:06 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Make cmpxchg and cmpxchg8b generally available on i386.
	* Provide emulation of cmpxchg suitable for UP if built and
	  run on a 386.
	* Provide emulation of cmpxchg8b suitable for UP if built
	  and run on a 386 or 486.
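
The emulation itself is just load, compare, conditionally store, with
interrupts off so nothing can interleave on a UP machine. Below is a rough
userspace analogue; a pthread mutex stands in for
local_irq_save()/local_irq_restore(), which have no userspace equivalent,
so treat it as a sketch of the idea rather than the kernel code.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fake_irq_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long cmpxchg_emulated(unsigned long *ptr,
				      unsigned long old, unsigned long new)
{
	unsigned long prev;

	pthread_mutex_lock(&fake_irq_lock);	/* ~ local_irq_save() */
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	pthread_mutex_unlock(&fake_irq_lock);	/* ~ local_irq_restore() */
	return prev;				/* caller checks prev == old */
}

int main(void)
{
	unsigned long v = 3;

	if (cmpxchg_emulated(&v, 3, 7) == 3)
		printf("swap succeeded, v = %lu\n", v);
	return 0;
}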

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/include/asm-i386/system.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-i386/system.h	2004-10-10 19:56:39.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-i386/system.h	2004-10-14 11:32:35.000000000 -0700
@@ -240,7 +240,24 @@
  */

 #ifdef CONFIG_X86_CMPXCHG
+
 #define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable of running on an 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU.
+ */
+
+extern unsigned long cmpxchg_386(volatile void *, unsigned long, unsigned long, int);
+
+#define cmpxchg(ptr,o,n)\
+	((__typeof__(*(ptr)))cmpxchg_386((ptr),(unsigned long)(o),\
+					(unsigned long)(n),sizeof(*(ptr))))
 #endif

 static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +287,32 @@
 	return old;
 }

-#define cmpxchg(ptr,o,n)\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
-					(unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	__asm__ __volatile__(
+	LOCK_PREFIX "cmpxchg8b %4\n"
+	: "=A" (prev)
+	: "0" (old), "c" ((unsigned long)(newv >> 32)),
+		"b" ((unsigned long)(newv & 0xffffffffLL)), "m" (*ptr)
+	: "memory");
+	return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on the 80486 and 80386,
+ * neither of which supports cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(unsigned long long *,
+				unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
 #ifdef __KERNEL__
 struct alt_instr {
 	__u8 *instr; 		/* original instruction */
Index: linux-2.6.9-rc4/arch/i386/Kconfig
===================================================================
--- linux-2.6.9-rc4.orig/arch/i386/Kconfig	2004-10-10 19:57:06.000000000 -0700
+++ linux-2.6.9-rc4/arch/i386/Kconfig	2004-10-14 11:32:35.000000000 -0700
@@ -345,6 +345,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
Index: linux-2.6.9-rc4/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9-rc4.orig/arch/i386/kernel/cpu/intel.c	2004-10-10 19:57:16.000000000 -0700
+++ linux-2.6.9-rc4/arch/i386/kernel/cpu/intel.c	2004-10-14 11:32:35.000000000 -0700
@@ -415,5 +415,65 @@
 	return 0;
 }

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+				      unsigned long new, int size)
+{
+	unsigned long prev;
+	unsigned long flags;
+	/*
+	 * Check if the kernel was compiled for an old cpu but the
+	 * currently running cpu can do cmpxchg after all.
+	 * All CPUs except the 386 support CMPXCHG.
+	 */
+	if (cpu_data->x86 > 3) return __cmpxchg(ptr, old, new, size);
+
+	/* Poor man's cmpxchg for 386. Unsuitable for SMP */
+	local_irq_save(flags);
+	switch (size) {
+	case 1:
+		prev = * (u8 *)ptr;
+		if (prev == old) *(u8 *)ptr = new;
+		break;
+	case 2:
+		prev = * (u16 *)ptr;
+		if (prev == old) *(u16 *)ptr = new;
+		break;
+	case 4:
+		prev = *(u32 *)ptr;
+		if (prev == old) *(u32 *)ptr = new;
+		break;
+	}
+	local_irq_restore(flags);
+	return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(unsigned long long *ptr,
+	       unsigned long long old, unsigned long long newv)
+{
+	unsigned long long prev;
+	unsigned long flags;
+
+	/*
+	 * Check if the kernel was compiled for an old cpu but
+	 * we are really running on a cpu capable of cmpxchg8b
+	 */
+
+	if (cpu_has(cpu_data, X86_FEATURE_CX8)) return __cmpxchg8b(ptr, old, newv);
+
+	/* Poor man's cmpxchg8b for 386 and 486. Not suitable for SMP */
+	local_irq_save(flags);
+	prev = *ptr;
+	if (prev == old) *ptr = newv;
+	local_irq_restore(flags);
+	return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
 // arch_initcall(intel_cpu_init);


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [5/7] i386 atomic pte operations
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
                                                                                     ` (3 preceding siblings ...)
  2004-10-15 19:06                                                                   ` page fault scalability patch V10: [4/7] cmpxchg for 386 and 486 Christoph Lameter
@ 2004-10-15 19:06                                                                   ` Christoph Lameter
  2004-10-15 19:07                                                                   ` page fault scalability patch V10: [6/7] x86_64 " Christoph Lameter
  2004-10-15 19:08                                                                   ` page fault scalability patch V10: [7/7] s/390 " Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:06 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Atomic pte operations for i386
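
For PAE the interesting detail is that a pte is two 32-bit words, so the
question raised in the pgtable-3level.h comment below is how to replace
both halves atomically. Here is a userspace sketch of the single 8-byte
compare-and-swap alternative (C11 atomics on a uint64_t; the pte contents
and the make_pte helper are made up for illustration):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* A PAE pte is two 32-bit words; treating it as one 64-bit quantity
 * lets a single compare-and-swap replace low and high together. */
static uint64_t make_pte(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

int main(void)
{
	_Atomic uint64_t pte = make_pte(0x067, 0x1);	/* made-up contents */
	uint64_t expected = make_pte(0x067, 0x1);
	uint64_t newval = make_pte(0x1067, 0x2);

	if (atomic_compare_exchange_strong(&pte, &expected, newval))
		printf("both halves replaced in one step: %#llx\n",
		       (unsigned long long)atomic_load(&pte));
	return 0;
}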

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-i386/pgtable.h	2004-10-10 19:58:24.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-i386/pgtable.h	2004-10-14 11:32:37.000000000 -0700
@@ -412,6 +412,7 @@
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
 #include <asm-generic/pgtable.h>

 #endif /* _I386_PGTABLE_H */
Index: linux-2.6.9-rc4/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-i386/pgtable-3level.h	2004-10-10 19:58:41.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-i386/pgtable-3level.h	2004-10-14 11:32:37.000000000 -0700
@@ -6,7 +6,8 @@
  * tables on PPro+ CPUs.
  *
  * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg and ptep_xchg <christoph@lameter.com>
+*/

 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -141,4 +142,26 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+static inline pte_t ptep_xchg(struct mm_struct *mm, pte_t *ptep, pte_t newval)
+{
+	pte_t res;
+
+	/* xchg acts as a barrier before the setting of the high bits.
+	 * (But we also have a cmpxchg8b. Why not use that? (cl))
+	 */
+	res.pte_low = xchg(&ptep->pte_low, newval.pte_low);
+	res.pte_high = ptep->pte_high;
+	ptep->pte_high = newval.pte_high;
+
+	return res;
+}
+
+
+static inline int ptep_cmpxchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9-rc4/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-i386/pgtable-2level.h	2004-10-10 19:58:05.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-i386/pgtable-2level.h	2004-10-14 11:32:37.000000000 -0700
@@ -82,4 +82,8 @@
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg(mm,xp,a)       __pte(xchg(&(xp)->pte_low, (a).pte_low))
+#define ptep_cmpxchg(mm,a,xp,oldpte,newpte) (cmpxchg(&(xp)->pte_low, (oldpte).pte_low, (newpte).pte_low)==(oldpte).pte_low)
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9-rc4/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-i386/pgalloc.h	2004-10-10 19:57:02.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-i386/pgalloc.h	2004-10-14 11:32:37.000000000 -0700
@@ -7,6 +7,8 @@
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */

+#define PMD_NONE 0L
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +18,19 @@
 		((unsigned long long)page_to_pfn(pte) <<
 			(unsigned long long) PAGE_SHIFT)));
 }
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+	return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+		((unsigned long long)page_to_pfn(pte) <<
+			(unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+	return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
 /*
  * Allocate and free page tables.
  */
@@ -49,6 +64,7 @@
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#define pgd_test_and_populate(mm, pmd, pte)	({ BUG(); 1; })

 #define check_pgt_cache()	do { } while (0)


^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [6/7] x86_64 atomic pte operations
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
                                                                                     ` (4 preceding siblings ...)
  2004-10-15 19:06                                                                   ` page fault scalability patch V10: [5/7] i386 atomic pte operations Christoph Lameter
@ 2004-10-15 19:07                                                                   ` Christoph Lameter
  2004-10-15 19:08                                                                   ` page fault scalability patch V10: [7/7] s/390 " Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-x86_64/pgalloc.h	2004-10-10 19:57:59.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-x86_64/pgalloc.h	2004-10-15 11:20:36.000000000 -0700
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg((unsigned long *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg((unsigned long *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9-rc4/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-x86_64/pgtable.h	2004-10-10 19:58:23.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-x86_64/pgtable.h	2004-10-15 11:21:27.000000000 -0700
@@ -436,6 +436,11 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_xchg(addr,xp,newval)	__pte(xchg(&(xp)->pte, pte_val(newval)))
+#define ptep_cmpxchg(mm,addr,xp,oldval,newval) (cmpxchg(&(xp)->pte, pte_val(oldval), pte_val(newval)) == pte_val(oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* page fault scalability patch V10: [7/7] s/390 atomic pte operations
  2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
                                                                                     ` (5 preceding siblings ...)
  2004-10-15 19:07                                                                   ` page fault scalability patch V10: [6/7] x86_64 " Christoph Lameter
@ 2004-10-15 19:08                                                                   ` Christoph Lameter
  6 siblings, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-15 19:08 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-ia64

Changelog
	* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9-rc4/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-s390/pgtable.h	2004-10-10 19:58:24.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-s390/pgtable.h	2004-10-14 12:22:14.000000000 -0700
@@ -567,6 +567,17 @@
 	return pte;
 }

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)            \
+({                                                                     \
+	struct mm_struct *__mm = __vma->vm_mm;                          \
+	pte_t __pte;                                                    \
+	spin_lock(&__mm->page_table_lock);                              \
+	__pte = ptep_clear_flush(__vma, __address, __ptep);             \
+	set_pte(__ptep, __pteval);                                      \
+	spin_unlock(&__mm->page_table_lock);                            \
+	__pte;                                                          \
+})
+
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
 	pte_t old_pte = *ptep;
@@ -778,6 +789,19 @@

 #define kern_addr_valid(addr)   (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline pte_t ptep_xchg(struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t pteval)
+{
+	return __pte(xchg(ptep, pte_val(pteval)));
+}
+
+static inline int ptep_cmpxchg (struct mm_struct *mm, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 /*
  * No page table caches to initialise
  */
@@ -791,6 +815,7 @@
 #define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.9-rc4/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9-rc4.orig/include/asm-s390/pgalloc.h	2004-10-10 19:58:06.000000000 -0700
+++ linux-2.6.9-rc4/include/asm-s390/pgalloc.h	2004-10-14 12:22:14.000000000 -0700
@@ -97,6 +97,10 @@
 	pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
 }

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+	return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
 #endif /* __s390x__ */

 static inline void
@@ -119,6 +123,18 @@
 	pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
 }

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+	int rc;
+	spin_lock(&mm->page_table_lock);
+
+	rc = pte_same(*pmd, _PAGE_INVALID_EMPTY);
+	if (rc) pmd_populate(mm, pmd, page);
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
+
 /*
  * page table entry allocation/free routines.
  */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V10: [2/7] defer/omit taking page_table_lock
  2004-10-15 19:04                                                                   ` page fault scalability patch V10: [2/7] defer/omit taking page_table_lock Christoph Lameter
@ 2004-10-15 20:00                                                                     ` Marcelo Tosatti
  2004-10-18 15:59                                                                       ` Christoph Lameter
  2004-10-19  5:25                                                                       ` [revised] " Christoph Lameter
  0 siblings, 2 replies; 106+ messages in thread
From: Marcelo Tosatti @ 2004-10-15 20:00 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-ia64


Hi Christoph,

Nice work! 

On Fri, Oct 15, 2004 at 12:04:53PM -0700, Christoph Lameter wrote:
> Changelog
> 	* Increase parallelism in SMP configurations by deferring
> 	  the acquisition of page_table_lock in handle_mm_fault
> 	* Anonymous memory page faults bypass the page_table_lock
> 	  through the use of atomic page table operations
> 	* Swapper does not set pte to empty in transition to swap
> 	* Simulate atomic page table operations using the
> 	  page_table_lock if an arch does not define
> 	  __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
> 	  a performance benefit since the page_table_lock
> 	  is held for shorter periods of time.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.9-rc4/mm/memory.c
> ===================================================================
> --- linux-2.6.9-rc4.orig/mm/memory.c	2004-10-14 12:22:14.000000000 -0700
> +++ linux-2.6.9-rc4/mm/memory.c	2004-10-14 12:22:14.000000000 -0700
> @@ -1314,8 +1314,7 @@
>  }
> 
>  /*
> - * We hold the mm semaphore and the page_table_lock on entry and
> - * should release the pagetable lock on exit..
> + * We hold the mm semaphore
>   */
>  static int do_swap_page(struct mm_struct * mm,
>  	struct vm_area_struct * vma, unsigned long address,
> @@ -1327,15 +1326,13 @@
>  	int ret = VM_FAULT_MINOR;
> 
>  	pte_unmap(page_table);
> -	spin_unlock(&mm->page_table_lock);
>  	page = lookup_swap_cache(entry);
>  	if (!page) {
>   		swapin_readahead(entry, address, vma);
>   		page = read_swap_cache_async(entry, vma, address);
>  		if (!page) {
>  			/*
> -			 * Back out if somebody else faulted in this pte while
> -			 * we released the page table lock.
> +			 * Back out if somebody else faulted in this pte
>  			 */
>  			spin_lock(&mm->page_table_lock);
>  			page_table = pte_offset_map(pmd, address);

The comment above, which appears again a few lines further down in do_swap_page(),
is now bogus (the "while we released the page table lock").

        /*
         * Back out if somebody else faulted in this pte while we
         * released the page table lock.
         */
        spin_lock(&mm->page_table_lock);
        page_table = pte_offset_map(pmd, address);
        if (unlikely(!pte_same(*page_table, orig_pte))) {

> @@ -1406,14 +1403,12 @@
>  }
> 
>  /*
> - * We are called with the MM semaphore and page_table_lock
> - * spinlock held to protect against concurrent faults in
> - * multithreaded programs.
> + * We are called with the MM semaphore held.
>   */
>  static int
>  do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pte_t *page_table, pmd_t *pmd, int write_access,
> -		unsigned long addr)
> +		unsigned long addr, pte_t orig_entry)
>  {
>  	pte_t entry;
>  	struct page * page = ZERO_PAGE(addr);
> @@ -1425,7 +1420,6 @@
>  	if (write_access) {
>  		/* Allocate our own private page. */
>  		pte_unmap(page_table);
> -		spin_unlock(&mm->page_table_lock);
> 
>  		if (unlikely(anon_vma_prepare(vma)))
>  			goto no_mem;
> @@ -1434,30 +1428,39 @@
>  			goto no_mem;
>  		clear_user_highpage(page, addr);
> 
> -		spin_lock(&mm->page_table_lock);
> +		lock_page(page);

Question: Why do you need to hold the pagelock now?

I can't seem to figure that out myself.

>  		page_table = pte_offset_map(pmd, addr);
> 
> -		if (!pte_none(*page_table)) {
> -			pte_unmap(page_table);
> -			page_cache_release(page);
> -			spin_unlock(&mm->page_table_lock);
> -			goto out;
> -		}
> -		atomic_inc(&mm->mm_rss);
>  		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
>  							 vma->vm_page_prot)),
>  				      vma);
> -		lru_cache_add_active(page);
>  		mark_page_accessed(page);
> -		page_add_anon_rmap(page, vma, addr);
>  	}
> 
> -	set_pte(page_table, entry);
> +	/* update the entry */
> +	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
> +		if (write_access) {
> +			pte_unmap(page_table);
> +			unlock_page(page);
> +			page_cache_release(page);
> +		}
> +		goto out;
> +	}
> +	if (write_access) {
> +		/*
> +		 * The following two functions are safe to use without
> +		 * the page_table_lock but do they need to come before
> +		 * the cmpxchg?
> +		 */

They do need to come after AFAICS - from the point the page is in the reverse map
and on the LRU, try_to_unmap() can come in and try to
unmap the pte (now that we don't hold the page_table_lock anymore).

> +		lru_cache_add_active(page);
> +		page_add_anon_rmap(page, vma, addr);
> +		atomic_inc(&mm->mm_rss);
> +		unlock_page(page);
> +	}
>  	pte_unmap(page_table);
> 
>  	/* No need to invalidate - it was non-present before */
>  	update_mmu_cache(vma, addr, entry);
> -	spin_unlock(&mm->page_table_lock);
>  out:
>  	return VM_FAULT_MINOR;
>  no_mem:
> @@ -1473,12 +1476,12 @@
>   * As this is called only for pages that do not currently exist, we
>   * do not need to flush old virtual caches or the TLB.
>   *
> - * This is called with the MM semaphore held and the page table
> - * spinlock held. Exit with the spinlock released.
> + * This is called with the MM semaphore held.
>   */
>  static int
>  do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
> +	unsigned long address, int write_access, pte_t *page_table,
> +        pmd_t *pmd, pte_t orig_entry)
>  {
>  	struct page * new_page;
>  	struct address_space *mapping = NULL;
> @@ -1489,9 +1492,8 @@
> 
>  	if (!vma->vm_ops || !vma->vm_ops->nopage)
>  		return do_anonymous_page(mm, vma, page_table,
> -					pmd, write_access, address);
> +					pmd, write_access, address, orig_entry);
>  	pte_unmap(page_table);
> -	spin_unlock(&mm->page_table_lock);
> 
>  	if (vma->vm_file) {
>  		mapping = vma->vm_file->f_mapping;
> @@ -1589,7 +1591,7 @@
>   * nonlinear vmas.
>   */
>  static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
> -	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
> +	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
>  {
>  	unsigned long pgoff;
>  	int err;
> @@ -1602,13 +1604,12 @@
>  	if (!vma->vm_ops || !vma->vm_ops->populate ||
>  			(write_access && !(vma->vm_flags & VM_SHARED))) {
>  		pte_clear(pte);
> -		return do_no_page(mm, vma, address, write_access, pte, pmd);
> +		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
>  	}
> 
>  	pgoff = pte_to_pgoff(*pte);
> 
>  	pte_unmap(pte);
> -	spin_unlock(&mm->page_table_lock);
> 
>  	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
>  	if (err == -ENOMEM)
> @@ -1627,49 +1628,49 @@
>   * with external mmu caches can use to update those (ie the Sparc or
>   * PowerPC hashed page tables that act as extended TLBs).
>   *
> - * Note the "page_table_lock". It is to protect against kswapd removing
> - * pages from under us. Note that kswapd only ever _removes_ pages, never
> - * adds them. As such, once we have noticed that the page is not present,
> - * we can drop the lock early.
> - *
> + * Note that kswapd only ever _removes_ pages, never adds them.
> + * We need to insure to handle that case properly.
> + *
>   * The adding of pages is protected by the MM semaphore (which we hold),
>   * so we don't need to worry about a page being suddenly been added into
>   * our VM.
> - *
> - * We enter with the pagetable spinlock held, we are supposed to
> - * release it when done.
>   */
>  static inline int handle_pte_fault(struct mm_struct *mm,
>  	struct vm_area_struct * vma, unsigned long address,
>  	int write_access, pte_t *pte, pmd_t *pmd)
>  {
>  	pte_t entry;
> +	pte_t new_entry;
> 
>  	entry = *pte;
>  	if (!pte_present(entry)) {
>  		/*
>  		 * If it truly wasn't present, we know that kswapd
>  		 * and the PTE updates will not touch it later. So
> -		 * drop the lock.
> +		 * no need to acquire the page_table_lock.
>  		 */
>  		if (pte_none(entry))
> -			return do_no_page(mm, vma, address, write_access, pte, pmd);
> +			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
>  		if (pte_file(entry))
> -			return do_file_page(mm, vma, address, write_access, pte, pmd);
> +			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
>  		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
>  	}

I wonder what happens if kswapd, through try_to_unmap_one(), unmaps the
pte right here?

Aren't we going to proceed with the "pte_mkyoung(entry)" of a potentially
now unmapped pte? Isn't that case possible now?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault scalability patch V10: [2/7] defer/omit taking page_table_lock
  2004-10-15 20:00                                                                     ` Marcelo Tosatti
@ 2004-10-18 15:59                                                                       ` Christoph Lameter
  2004-10-19  5:25                                                                       ` [revised] " Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-18 15:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: akpm, linux-kernel, linux-ia64

On Fri, 15 Oct 2004, Marcelo Tosatti wrote:

> The comment above, which is a few lines down on do_swap_page() is now
> bogus (the "while we released the page table lock").

Ok. I also modified the second occurrence of that comment.

> >  	if (write_access) {
> >  		/* Allocate our own private page. */
> >  		pte_unmap(page_table);
> > -		spin_unlock(&mm->page_table_lock);
> >
> >  		if (unlikely(anon_vma_prepare(vma)))
> >  			goto no_mem;
> > @@ -1434,30 +1428,39 @@
> >  			goto no_mem;
> >  		clear_user_highpage(page, addr);
> >
> > -		spin_lock(&mm->page_table_lock);
> > +		lock_page(page);
>
> Question: Why do you need to hold the pagelock now?
>
> I can't seem to figure that out myself.

Hmm.. I cannot see a good reason for it either. The page is new
and thus no references can exist yet. Removed.

> > +	if (write_access) {
> > +		/*
> > +		 * The following two functions are safe to use without
> > +		 * the page_table_lock but do they need to come before
> > +		 * the cmpxchg?
> > +		 */
>
> They do need to come after AFAICS - from the point they are in the reverse map
> and the page is on the LRU try_to_unmap() can come in and try to
> unmap the pte (now that we dont hold page_table_lock anymore).

Ahh. Thanks.

> > -			return do_no_page(mm, vma, address, write_access, pte, pmd);
> > +			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
> >  		if (pte_file(entry))
> > -			return do_file_page(mm, vma, address, write_access, pte, pmd);
> > +			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
> >  		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
> >  	}
>
> I wonder what happens if kswapd, through try_to_unmap_one(), unmap's the
> pte right here ?
>
> Aren't we going to proceed with the "pte_mkyoung(entry)" of a potentially
> now unmapped pte? Isnt that case possible now?

The pte value is saved in entry. If the pte is unmapped then its value is
changed. Thus the cmpxchg will fail and the page fault handler will return
without doing anything.

try_to_unmap_one was modified to handle the ptes in an atomic way using
ptep_xchg.
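
A toy example of that failure path (plain C11 with invented values, not the
kernel API): if the unmap side swaps the entry after the fault side took its
snapshot, the fault side's compare-and-swap sees a different value and
simply fails.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	_Atomic uint64_t pte = 0x2000 | 1;	/* made-up mapped entry */
	uint64_t snapshot = atomic_load(&pte);	/* fault handler reads it */

	/* the unmap side wins the race and replaces the entry */
	atomic_exchange(&pte, 0x9e000000ULL);	/* made-up swap entry */

	/* the update against the stale snapshot fails, so we back out */
	uint64_t newval = snapshot | (1ULL << 5);	/* e.g. a young bit */
	if (!atomic_compare_exchange_strong(&pte, &snapshot, newval))
		printf("cmpxchg failed; the fault handler just returns\n");
	return 0;
}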


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [revised] page fault scalability patch V10: [2/7] defer/omit taking page_table_lock
  2004-10-15 20:00                                                                     ` Marcelo Tosatti
  2004-10-18 15:59                                                                       ` Christoph Lameter
@ 2004-10-19  5:25                                                                       ` Christoph Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christoph Lameter @ 2004-10-19  5:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: akpm, linux-kernel, linux-ia64

Here is an updated version following up on Marcelo's feedback....

Index: linux-2.6.9-final/mm/memory.c
===================================================================
--- linux-2.6.9-final.orig/mm/memory.c	2004-10-18 08:43:49.000000000 -0700
+++ linux-2.6.9-final/mm/memory.c	2004-10-18 09:00:10.000000000 -0700
@@ -1314,8 +1314,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1327,15 +1326,13 @@
 	int ret = VM_FAULT_MINOR;

 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1358,8 +1355,7 @@
 	lock_page(page);

 	/*
-	 * Back out if somebody else faulted in this pte while we
-	 * released the page table lock.
+	 * Back out if somebody else faulted in this pte
 	 */
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
@@ -1406,14 +1402,12 @@
 }

 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
 	struct page * page = ZERO_PAGE(addr);
@@ -1425,7 +1419,6 @@
 	if (write_access) {
 		/* Allocate our own private page. */
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
@@ -1434,30 +1427,36 @@
 			goto no_mem;
 		clear_user_highpage(page, addr);

-		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);

-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		atomic_inc(&mm->mm_rss);
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
-		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, vma, addr);
 	}

-	set_pte(page_table, entry);
+	/* update the entry */
+	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+		if (write_access) {
+			pte_unmap(page_table);
+			page_cache_release(page);
+		}
+		goto out;
+	}
+	if (write_access) {
+		/*
+		 * These two functions must come after the cmpxchg
+		 * because if the page is on the LRU then try_to_unmap may come
+		 * in and unmap the pte.
+		 */
+		lru_cache_add_active(page);
+		page_add_anon_rmap(page, vma, addr);
+		atomic_inc(&mm->mm_rss);
+	}
 	pte_unmap(page_table);

 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
-	spin_unlock(&mm->page_table_lock);
 out:
 	return VM_FAULT_MINOR;
 no_mem:
@@ -1473,12 +1472,12 @@
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1489,9 +1488,8 @@

 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);

 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1589,7 +1587,7 @@
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1602,13 +1600,12 @@
 	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}

 	pgoff = pte_to_pgoff(*pte);

 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);

 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1627,49 +1624,49 @@
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to ensure that case is handled properly.
+ *
  * The adding of pages is protected by the MM semaphore (which we hold),
  * so we don't need to worry about a page being suddenly been added into
  * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;

 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
 		 * If it truly wasn't present, we know that kswapd
 		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * no need to acquire the page_table_lock.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}

+	/*
+	 * This is the case in which we may only update some bits in the pte.
+	 */
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			/* do_wp_page expects us to hold the page_table_lock */
+			spin_lock(&mm->page_table_lock);
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
-		entry = pte_mkdirty(entry);
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
+	if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+		update_mmu_cache(vma, address, new_entry);
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
 }

@@ -1687,22 +1684,42 @@

 	inc_page_state(pgfault);

-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */

 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd.
 	 */
-	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	if (unlikely(pgd_none(*pgd))) {
+		pmd_t *new = pmd_alloc_one(mm, address);
+		if (!new) return VM_FAULT_OOM;
+
+		/* Ensure that the update is done in an atomic way */
+		if (!pgd_test_and_populate(mm, pgd, new)) pmd_free(new);
+	}
+
+	pmd = pmd_offset(pgd, address);
+
+	if (likely(pmd)) {
+		pte_t *pte;
+
+		if (!pmd_present(*pmd)) {
+			struct page *new;

-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
+			new = pte_alloc_one(mm, address);
+			if (!new) return VM_FAULT_OOM;
+
+			if (!pmd_test_and_populate(mm, pmd, new))
+				pte_free(new);
+			else
+				inc_page_state(nr_page_table_pages);
+		}
+
+		pte = pte_offset_map(pmd, address);
+		if (likely(pte))
 			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 	}
-	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }
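
As an aside (illustration only, not part of the patch; all names below are
invented for the sketch): the allocation path above can be modelled in user
space with a compare-and-swap. Every faulting thread may allocate a page
table page, but only the thread whose compare-and-swap still finds the slot
empty installs it; the losers free their allocation again, so no
page_table_lock is needed and nothing is leaked.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *pgd_slot;	/* models an empty pgd entry, NULL == pgd_none() */

static void *fault_thread(void *arg)
{
	if (pgd_slot == NULL) {			/* racy pgd_none() check */
		void *new = malloc(64);		/* stands in for pmd_alloc_one() */

		/* pgd_test_and_populate(): install only if the slot is still empty */
		if (!__sync_bool_compare_and_swap(&pgd_slot, NULL, new))
			free(new);		/* lost the race, someone else populated it */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[8];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, fault_thread, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(t[i], NULL);
	printf("populated slot: %p\n", pgd_slot);
	return 0;
}

The kernel code above differs only in that the pmd case also bumps
nr_page_table_pages when the populate succeeds, and in using the
architecture's atomic operations on the real page table entries.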

Index: linux-2.6.9-final/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9-final.orig/include/asm-generic/pgtable.h	2004-10-15 20:02:39.000000000 -0700
+++ linux-2.6.9-final/include/asm-generic/pgtable.h	2004-10-18 08:43:56.000000000 -0700
@@ -134,4 +134,75 @@
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to ensure some form of locking.
+ * Note though that low-level operations as well as the
+ * page table handling done by the CPU may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte;							\
+	spin_lock(&__vma->vm_mm->page_table_lock);			\
+	__pte = *(__ptep);						\
+	set_pte(__ptep, __pteval);					\
+	flush_tlb_page(__vma, __address);				\
+	spin_unlock(&__vma->vm_mm->page_table_lock);			\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval)		\
+({									\
+	int __rc;							\
+	spin_lock(&__vma->vm_mm->page_table_lock);			\
+	__rc = pte_same(*(__ptep), __oldval);				\
+	if (__rc) set_pte(__ptep, __newval);				\
+	spin_unlock(&__vma->vm_mm->page_table_lock);			\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&__mm->page_table_lock);				\
+	__rc = !pgd_present(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pmd);			\
+	spin_unlock(&__mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&__mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&__mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#else
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+
+#endif
+
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
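
Again as an aside (illustration only, not part of the patch; names invented):
a user-space sketch of the ptep_cmpxchg() contract that the fallback above
emulates with the page_table_lock. The fault path reads the pte, computes a
new value with the accessed/dirty bits set, and installs it only if the
entry has not changed in the meantime; if it has, the update is simply
dropped and the fault can be retried.

#include <stdio.h>

#define FAKE_PTE_ACCESSED	(1UL << 5)
#define FAKE_PTE_DIRTY		(1UL << 6)

static unsigned long fake_pte = 0x1000;		/* pretend pfn plus flags */

/* same contract as ptep_cmpxchg(): non-zero if the new value was installed */
static int fake_ptep_cmpxchg(unsigned long *ptep, unsigned long old, unsigned long new)
{
	return __sync_bool_compare_and_swap(ptep, old, new);
}

int main(void)
{
	unsigned long entry = fake_pte;		/* entry = *pte */
	unsigned long new_entry = entry | FAKE_PTE_ACCESSED | FAKE_PTE_DIRTY;

	fake_pte = 0x2000;	/* simulate another CPU changing the pte meanwhile */

	if (fake_ptep_cmpxchg(&fake_pte, entry, new_entry))
		printf("update applied: %#lx\n", fake_pte);
	else		/* as in handle_pte_fault(): drop the update */
		printf("pte changed under us, update dropped: %#lx\n", fake_pte);
	return 0;
}
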
Index: linux-2.6.9-final/mm/rmap.c
===================================================================
--- linux-2.6.9-final.orig/mm/rmap.c	2004-10-18 08:43:49.000000000 -0700
+++ linux-2.6.9-final/mm/rmap.c	2004-10-18 08:43:56.000000000 -0700
@@ -420,7 +420,10 @@
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if the page
+ * points to something that is already known to the VM.
+ * The lock does not need to be held if the page points
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -562,11 +565,6 @@

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -576,10 +574,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

+	/* Move the dirty bit to the physical page now that the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	atomic_dec(&mm->mm_rss);
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -666,15 +668,24 @@
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);
+		/*
+		 * There would be a race here with the handle_mm_fault code,
+		 * which bypasses the page_table_lock to allow fast creation
+		 * of ptes, if we were to zap the pte before putting
+		 * something into it. On the other hand, we must not lose
+		 * a dirty bit that is set in the pte when we replace it.
+		 * The dirty bit may be set by the processor at any time,
+		 * so we had better use an atomic operation here.
+		 */

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_get_and_clear(pte);

-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
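
One more user-space sketch (illustration only, not part of the patch; names
invented): why the pte is replaced with an atomic exchange here instead of
being read and then overwritten with set_pte(). The exchange hands back the
old value in the same atomic step, so a dirty bit set just before the
replacement cannot fall into the window between the read and the write.

#include <stdio.h>

#define FAKE_PTE_DIRTY	(1UL << 6)

static unsigned long fake_pte = 0x1000;		/* mapped and clean */

int main(void)
{
	unsigned long swap_entry = 0xdead0000;	/* stands in for swp_entry_to_pte() */
	unsigned long old;

	/* the CPU may set the dirty bit right up to the moment of replacement */
	__sync_fetch_and_or(&fake_pte, FAKE_PTE_DIRTY);

	/*
	 * Analogue of ptep_xchg_flush(): install the new value and obtain
	 * the old one in a single atomic step.  A plain read followed by
	 * set_pte() could miss a dirty bit set in between.
	 */
	old = __atomic_exchange_n(&fake_pte, swap_entry, __ATOMIC_SEQ_CST);

	if (old & FAKE_PTE_DIRTY)
		printf("dirty bit preserved, set_page_dirty() would be called\n");
	printf("old pte %#lx, new pte %#lx\n", old, fake_pte);
	return 0;
}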


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 23:55       ` Christoph Lameter
@ 2004-08-16  0:12         ` Andi Kleen
  0 siblings, 0 replies; 106+ messages in thread
From: Andi Kleen @ 2004-08-16  0:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, linux-ia64, linux-kernel

On Sun, Aug 15, 2004 at 04:55:57PM -0700, Christoph Lameter wrote:
> On Mon, 16 Aug 2004, Andi Kleen wrote:
> 
> > Christoph Lameter <clameter@sgi.com> writes:
> >
> > > On Sun, 15 Aug 2004, David S. Miller wrote:
> > >
> > >>
> > >> Is the read lock in the VMA semaphore enough to let you do
> > >> the pgd/pmd walking without the page_table_lock?
> > >> I think it is, but just checking.
> > >
> > > That would be great.... May I change the page_table lock to
> > > be a read write spinlock instead?
> >
> > That's probably not a good idea. r/w locks are extremely slow on
> > some architectures. Including ia64.
> 
> I was thinking about a read-write spinlock, not a read-write
> semaphore. Look at include/asm-ia64/spinlock.h.

I was also talking about rw spinlocks.

> The implementations are almost the same. Are you sure
> about this?

Yes. Try the cat /proc/net/tcp test. It will take >100k read locks
for the TCP listen hash table, and on bigger ppc64 and ia64 machines this
can take nearly a second of system time.

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
  2004-08-15 23:53     ` Andi Kleen
@ 2004-08-15 23:55       ` Christoph Lameter
  2004-08-16  0:12         ` Andi Kleen
  0 siblings, 1 reply; 106+ messages in thread
From: Christoph Lameter @ 2004-08-15 23:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, linux-ia64, linux-kernel

On Mon, 16 Aug 2004, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
>
> > On Sun, 15 Aug 2004, David S. Miller wrote:
> >
> >>
> >> Is the read lock in the VMA semaphore enough to let you do
> >> the pgd/pmd walking without the page_table_lock?
> >> I think it is, but just checking.
> >
> > That would be great.... May I change the page_table lock to
> > be a read write spinlock instead?
>
> That's probably not a good idea. r/w locks are extremely slow on
> some architectures. Including ia64.

I was thinking about a read-write spinlock, not a read-write
semaphore. Look at include/asm-ia64/spinlock.h.
The implementations are almost the same. Are you sure
about this?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: page fault fastpath: Increasing SMP scalability by introducing  pte locks?
       [not found]   ` <2tCiw-8pK-1@gated-at.bofh.it>
@ 2004-08-15 23:53     ` Andi Kleen
  2004-08-15 23:55       ` Christoph Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Andi Kleen @ 2004-08-15 23:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-ia64, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> On Sun, 15 Aug 2004, David S. Miller wrote:
>
>>
>> Is the read lock in the VMA semaphore enough to let you do
>> the pgd/pmd walking without the page_table_lock?
>> I think it is, but just checking.
>
> That would be great.... May I change the page_table lock to
> be a read write spinlock instead?

That's probably not a good idea. r/w locks are extremely slow on
some architectures. Including ia64.

Just profile cat /proc/net/tcp on a machine with a lot of memory
and you'll notice.

-Andi


^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2004-10-19  5:27 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-15 13:50 page fault fastpath: Increasing SMP scalability by introducing pte locks? Christoph Lameter
2004-08-15 20:09 ` David S. Miller
2004-08-15 22:58   ` Christoph Lameter
2004-08-15 23:58     ` David S. Miller
2004-08-16  0:11       ` Christoph Lameter
2004-08-16  1:56         ` David S. Miller
2004-08-16  3:29           ` Christoph Lameter
2004-08-16  7:00             ` Ray Bryant
2004-08-16 15:18               ` Christoph Lameter
2004-08-16 16:18                 ` William Lee Irwin III
2004-08-16 14:39             ` William Lee Irwin III
2004-08-17 15:28               ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
2004-08-17 15:37                 ` Christoph Hellwig
2004-08-17 15:51                 ` William Lee Irwin III
2004-08-18 17:55                 ` Hugh Dickins
2004-08-18 20:20                   ` William Lee Irwin III
2004-08-19  1:19                   ` Christoph Lameter
     [not found]               ` <B6E8046E1E28D34EB815A11AC8CA3129027B679F@mtv-atc-605e--n.corp.sgi.com>
2004-08-24  4:43                 ` page fault scalability patch v3: use cmpxchg, make rss atomic Christoph Lameter
2004-08-24  5:49                   ` Christoph Lameter
2004-08-24 12:34                     ` Matthew Wilcox
2004-08-24 14:47                       ` Christoph Lameter
     [not found]                 ` <B6E8046E1E28D34EB815A11AC8CA3129027B67A9@mtv-atc-605e--n.corp.sgi.com>
2004-08-26 15:20                   ` page fault scalability patch v4: reduce page_table_lock use, atomic pmd,pgd handlin Christoph Lameter
     [not found]                   ` <B6E8046E1E28D34EB815A11AC8CA3129027B67B4@mtv-atc-605e--n.corp.sgi.com>
2004-08-27 23:20                     ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
2004-08-27 23:36                       ` Andi Kleen
2004-08-27 23:43                         ` David S. Miller
2004-08-28  0:19                         ` Christoph Lameter
2004-08-28  0:23                           ` David S. Miller
2004-08-28  0:36                             ` Andrew Morton
2004-08-28  0:40                               ` David S. Miller
2004-08-28  1:05                                 ` Andi Kleen
2004-08-28  1:11                                   ` David S. Miller
2004-08-28  1:17                                     ` Andi Kleen
2004-08-28  1:02                               ` Andi Kleen
2004-08-28  1:39                                 ` Andrew Morton
2004-08-28  2:08                                   ` Paul Mackerras
2004-08-28  3:32                                     ` Christoph Lameter
2004-08-28  3:42                                       ` Andrew Morton
2004-08-28  4:24                                         ` Christoph Lameter
2004-08-28  5:39                                           ` Andrew Morton
2004-08-28  5:58                                             ` Christoph Lameter
2004-08-28  6:03                                               ` William Lee Irwin III
2004-08-28  6:06                                               ` Andrew Morton
2004-08-30 17:02                                                 ` Herbert Poetzl
2004-08-30 17:05                                                   ` Andi Kleen
2004-08-28 13:19                                             ` Andi Kleen
2004-08-28 15:48                                             ` Matt Mackall
2004-09-01  4:13                                             ` Benjamin Herrenschmidt
2004-09-02 21:26                                               ` Andi Kleen
2004-09-02 21:55                                                 ` David S. Miller
2004-09-01 18:03                                             ` Matthew Wilcox
2004-09-01 18:19                                               ` Andrew Morton
2004-09-01 19:06                                                 ` William Lee Irwin III
2004-08-28 21:41                                 ` Daniel Phillips
2004-09-01  4:24                       ` Benjamin Herrenschmidt
2004-09-01  5:22                         ` David S. Miller
2004-09-01 16:43                         ` Christoph Lameter
2004-09-01 23:09                           ` Benjamin Herrenschmidt
     [not found]                             ` <Pine.LNX.4.58.0409012140440.23186@schroedinger.engr.sgi.com>
     [not found]                               ` <20040901215741.3538bbf4.davem@davemloft.net>
2004-09-02  5:18                                 ` William Lee Irwin III
2004-09-09 15:38                                   ` page fault scalability patch: V7 (+fallback for atomic page table ops) Christoph Lameter
2004-09-02 16:24                                 ` page fault scalability patch final : i386 tested, x86_64 support added Christoph Lameter
2004-09-02 20:10                                   ` David S. Miller
2004-09-02 21:02                                     ` Christoph Lameter
2004-09-02 21:07                                       ` David S. Miller
2004-09-18 23:23                                         ` page fault scalability patch V8: [0/7] Description Christoph Lameter
     [not found]                                         ` <B6E8046E1E28D34EB815A11AC8CA312902CD3243@mtv-atc-605e--n.corp.sgi.com>
2004-09-18 23:24                                           ` page fault scalability patch V8: [1/7] make mm->rss atomic Christoph Lameter
2004-09-18 23:26                                           ` page fault scalability patch V8: [2/7] avoid page_table_lock in handle_mm_fault Christoph Lameter
2004-09-19  9:04                                             ` Christoph Hellwig
2004-09-18 23:27                                           ` page fault scalability patch V8: [3/7] atomic pte operations for ia64 Christoph Lameter
2004-09-18 23:28                                           ` page fault scalability patch V8: [4/7] universally available cmpxchg on i386 Christoph Lameter
     [not found]                                             ` <200409191430.37444.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-19 12:11                                               ` Andi Kleen
2004-09-20 15:45                                               ` Christoph Lameter
     [not found]                                                 ` <200409202043.00580.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-20 20:49                                                   ` Christoph Lameter
2004-09-20 20:57                                                     ` Andi Kleen
     [not found]                                                       ` <200409211841.25507.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-21 15:45                                                         ` Andi Kleen
     [not found]                                                           ` <200409212306.38800.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-21 20:14                                                             ` Andi Kleen
2004-09-23  7:17                                                           ` Andy Lutomirski
2004-09-23  9:03                                                             ` Andi Kleen
2004-09-27 19:06                                                               ` page fault scalability patch V9: [0/7] overview Christoph Lameter
2004-10-15 19:02                                                                 ` page fault scalability patch V10: " Christoph Lameter
2004-10-15 19:03                                                                   ` page fault scalability patch V10: [1/7] make rss atomic Christoph Lameter
2004-10-15 19:04                                                                   ` page fault scalability patch V10: [2/7] defer/omit taking page_table_lock Christoph Lameter
2004-10-15 20:00                                                                     ` Marcelo Tosatti
2004-10-18 15:59                                                                       ` Christoph Lameter
2004-10-19  5:25                                                                       ` [revised] " Christoph Lameter
2004-10-15 19:05                                                                   ` page fault scalability patch V10: [3/7] IA64 atomic pte operations Christoph Lameter
2004-10-15 19:06                                                                   ` page fault scalability patch V10: [4/7] cmpxchg for 386 and 486 Christoph Lameter
2004-10-15 19:06                                                                   ` page fault scalability patch V10: [5/7] i386 atomic pte operations Christoph Lameter
2004-10-15 19:07                                                                   ` page fault scalability patch V10: [6/7] x86_64 " Christoph Lameter
2004-10-15 19:08                                                                   ` page fault scalability patch V10: [7/7] s/390 " Christoph Lameter
     [not found]                                                               ` <B6E8046E1E28D34EB815A11AC8CA312902CD3282@mtv-atc-605e--n.corp.sgi.com>
2004-09-27 19:07                                                                 ` page fault scalability patch V9: [1/7] make mm->rss atomic Christoph Lameter
2004-09-27 19:08                                                                 ` page fault scalability patch V9: [2/7] defer/remove page_table_lock Christoph Lameter
2004-09-27 19:10                                                                 ` page fault scalability patch V9: [3/7] atomic pte operatios for ia64 Christoph Lameter
2004-09-27 19:10                                                                 ` page fault scalability patch V9: [4/7] generally available cmpxchg on i386 Christoph Lameter
2004-09-27 19:11                                                                 ` page fault scalability patch V9: [5/7] atomic pte operations for i386 Christoph Lameter
2004-09-27 19:12                                                                 ` page fault scalability patch V9: [6/7] atomic pte operations for x86_64 Christoph Lameter
2004-09-27 19:13                                                                 ` page fault scalability patch V9: [7/7] atomic pte operatiosn for s390 Christoph Lameter
2004-09-18 23:29                                           ` page fault scalability patch V8: [5/7] atomic pte operations for i386 Christoph Lameter
2004-09-18 23:30                                           ` page fault scalability patch V8: [6/7] atomic pte operations for x86_64 Christoph Lameter
2004-09-18 23:31                                           ` page fault scalability patch V8: [7/7] atomic pte operations for s390 Christoph Lameter
     [not found]                                             ` <200409191435.09445.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-20 15:44                                               ` Christoph Lameter
2004-08-15 22:38 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Benjamin Herrenschmidt
2004-08-16 17:28   ` Christoph Lameter
2004-08-17  8:01     ` Benjamin Herrenschmidt
     [not found] <2ttIr-2e4-17@gated-at.bofh.it>
     [not found] ` <2tzE4-6sw-25@gated-at.bofh.it>
     [not found]   ` <2tCiw-8pK-1@gated-at.bofh.it>
2004-08-15 23:53     ` Andi Kleen
2004-08-15 23:55       ` Christoph Lameter
2004-08-16  0:12         ` Andi Kleen
