* page fault scalability patch V11 [0/7]: overview
[not found] ` <1100848068.25520.49.camel@gaston>
@ 2004-11-19 19:42 ` Christoph Lameter
  2004-11-19 19:43 ` page fault scalability patch V11 [1/7]: sloppy rss Christoph Lameter
  ` (9 more replies)
0 siblings, 10 replies; 286+ messages in thread
From: Christoph Lameter @ 2004-11-19 19:42 UTC (permalink / raw)
To: torvalds, akpm, Benjamin Herrenschmidt
Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Changes from V10->V11 of this patch:
- cmpxchg_i386: Optimize code generated after feedback from Linus.
  Various fixes.
- drop make_rss_atomic in favor of rss_sloppy
- generic: adapt to new changes in Linus tree, some fixes to fallback
  functions. Add generic ptep_xchg_flush based on xchg.
- S390: remove use of page_table_lock from ptep_xchg_flush (deadlock)
- x86_64: remove ptep_xchg
- i386: integrated Nick Piggin's changes for PAE mode. Create
  ptep_xchg_flush and various fixes.
- ia64: if necessary flush icache before ptep_cmpxchg. Remove ptep_xchg

This is a series of patches that increases the scalability of the page
fault handler for SMP. Here are some performance results on a machine
with 32 processors allocating 32 GB with an increasing number of cpus.
Without the patches:
 Gb Rep Threads   User      System     Wall   flt/cpu/s fault/wsec
 32  10    1    3.966s    366.840s 370.082s  56556.456  56553.456
 32  10    2    3.604s    319.004s 172.058s  65006.086 121511.453
 32  10    4    3.705s    341.550s 106.007s  60741.936 197704.486
 32  10    8    3.597s    809.711s 119.021s  25785.427 175917.674
 32  10   16    5.886s   2238.122s 163.084s   9345.560 127998.973
 32  10   32   21.748s   5458.983s 201.062s   3826.409 104011.521

With the patches:
 Gb Rep Threads   User      System     Wall   flt/cpu/s fault/wsec
 32  10    1    3.772s    330.629s 334.042s  62713.587  62708.706
 32  10    2    3.767s    352.252s 185.077s  58905.502 112886.222
 32  10    4    3.549s    255.683s  77.000s  80898.177 272326.496
 32  10    8    3.522s    263.879s  52.030s  78427.083 400965.857
 32  10   16    5.193s    384.813s  42.076s  53772.158 490378.852
 32  10   32   15.806s    996.890s  54.077s  20708.587 382879.208

With a high number of CPUs the page fault rate improves more than
twofold and may reach 500000 faults/sec between 16 and 512 cpus. The
fault rate drops if a process is running on all processors, as in the
32 cpu case here. Note that the measurements were done on a NUMA system
and this test uses off-node memory. Variations may exist due to
allocations in memory areas at varying distances from the local cpu.
The slight drop for 2 cpus is probably due to that effect.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the swapper
code has been changed to never set a pte to be empty until the page has
been evicted. The population of an empty pte is frequent if a process
touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).
These modifications are done by the CPU or by low level handlers on
various platforms, also bypassing the page_table_lock. So this seems to
be safe too.

One essential change in the VM is the use of pte_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we do
similar things now with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. These operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch uses sloppy rss handling. mm->rss is incremented without
proper locking because locking would introduce too much overhead. Rss
is not essential for vm operations (3 uses of rss in rmap.c were not
necessary and were removed). The difference in rss values has been
found to be less than 1% in our tests (see also the separate email to
linux-mm and linux-ia64 on the subject of "sloppy rss"). The move away
from using atomic operations for rss in earlier versions of this patch
also increases the performance of the page fault handler in the single
thread case over an unpatched kernel.

Note that I have posted two other approaches to dealing with the rss
problem:

A. make_rss_atomic. The earlier releases contained that patch, but then
another variable (such as anon_rss) was introduced that would have
required additional atomic operations. Atomic rss operations also cause
slowdowns on machines with a high number of cpus due to memory
contention.

B. remove_rss.
Replace rss with a periodic scan over the vm to determine rss and
additional numbers. This was also discussed on linux-mm and linux-ia64.
The scans while displaying /proc data were undesirable.

The patchset is composed of 7 patches:

1/7: Sloppy rss

Removes mm->rss usage from mm/rmap.c and ensures that negative rss
values are not displayed.

2/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.

If only the first two patches are applied then the time that the
page_table_lock is held is simply reduced. The lock may then be
acquired multiple times during a page fault. The remaining patches
introduce the necessary atomic pte operations to avoid the
page_table_lock.

3/7: Atomic pte operations for ia64

4/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg
instructions. This patch adds emulations for cmpxchg and cmpxchg8b for
old 80386 and 80486 cpus. The emulations are only included if a kernel
is built for these old cpus, and are skipped in favor of the real
cmpxchg instructions if a kernel built for a 386 or 486 is then run on
a more recent cpu. This patch may be used independently of the other
patches.

5/7: Atomic pte operations for i386

A generally available cmpxchg (last patch) must be available for this
patch to preserve the ability to build kernels for 386 and 486.

6/7: Atomic pte operations for x86_64

7/7: Atomic pte operations for s390

^ permalink raw reply [flat|nested] 286+ messages in thread
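The lock-free population of an empty pte that the overview describes (case 1 above) can be illustrated with a small userspace sketch. This is not the kernel code: the types and the helper name are invented, and GCC's __sync builtin stands in for the kernel's cmpxchg-based ptep_cmpxchg.

```c
#include <stdint.h>

typedef uint64_t pte_t;              /* stand-in for a real pte */
#define PTE_NONE ((pte_t)0)

/* Install a new pte only if the slot is still empty, the way the
 * patch's anonymous-fault path uses ptep_cmpxchg: of two racing
 * faults, exactly one wins and the loser sees the winner's mapping.
 * Returns 1 on success, 0 if someone else populated the slot first. */
static int pte_populate_atomic(pte_t *ptep, pte_t newval)
{
	return __sync_bool_compare_and_swap(ptep, PTE_NONE, newval);
}
```

The point of the construction is that no lock is ever taken: correctness relies solely on the swapper never clearing a pte that is still in use, which is what patch 2/7 changes.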
* page fault scalability patch V11 [1/7]: sloppy rss 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter @ 2004-11-19 19:43 ` Christoph Lameter 2004-11-19 20:50 ` Hugh Dickins 2004-11-19 19:44 ` page fault scalability patch V11 [2/7]: page fault handler optimizations Christoph Lameter ` (8 subsequent siblings) 9 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:43 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Enable the sloppy use of mm->rss and mm->anon_rss atomic without locking * Insure that negative rss values are not given out by the /proc filesystem * remove 3 checks of rss in mm/rmap.c * Prerequisite for page table scalability patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-11-18 13:04:30.000000000 -0800 @@ -216,7 +216,7 @@ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; - spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + spinlock_t page_table_lock; /* Protects page tables */ struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung * together off init_mm.mmlist, and are protected @@ -252,6 +252,21 @@ struct kioctx default_kioctx; }; +/* + * rss and anon_rss are incremented and decremented in some locations without + * proper locking. This function insures that these values do not become negative. 
+ */ +static long inline get_rss(struct mm_struct *mm) +{ + long rss = mm->rss; + + if (rss < 0) + mm->rss = rss = 0; + if ((long)mm->anon_rss < 0) + mm->anon_rss = 0; + return rss; +} + struct sighand_struct { atomic_t count; struct k_sigaction action[_NSIG]; Index: linux-2.6.9/fs/proc/task_mmu.c =================================================================== --- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-18 12:56:26.000000000 -0800 @@ -22,7 +22,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << (PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + get_rss(mm) << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -37,7 +37,9 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + *shared = get_rss(mm) - mm->anon_rss; + if (*shared <0) + *shared = 0; *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; Index: linux-2.6.9/fs/proc/array.c =================================================================== --- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/fs/proc/array.c 2004-11-18 12:53:16.000000000 -0800 @@ -420,7 +420,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? 
mm->end_code : 0, Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-18 12:26:45.000000000 -0800 @@ -263,8 +263,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -504,8 +502,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -788,8 +784,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; ^ permalink raw reply [flat|nested] 286+ messages in thread
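The clamping that the patch's get_rss() performs can be exercised in isolation. The sketch below is a userspace rendering with an invented struct name, not the kernel code itself; it mirrors the logic of the inline function added to sched.h above.

```c
struct mm_counters {
	long rss;       /* updated without locking, may transiently go negative */
	long anon_rss;
};

/* Mirror of the patch's get_rss(): never hand a negative value to
 * /proc readers, and repair the stored counters while we are at it. */
static long get_rss_clamped(struct mm_counters *mm)
{
	long rss = mm->rss;

	if (rss < 0)
		mm->rss = rss = 0;
	if (mm->anon_rss < 0)
		mm->anon_rss = 0;
	return rss;
}
```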
* Re: page fault scalability patch V11 [1/7]: sloppy rss 2004-11-19 19:43 ` page fault scalability patch V11 [1/7]: sloppy rss Christoph Lameter @ 2004-11-19 20:50 ` Hugh Dickins 2004-11-20 1:29 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-11-19 20:50 UTC (permalink / raw) To: Christoph Lameter Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Sorry, against what tree do these patches apply? Apparently not linux-2.6.9, nor latest -bk, nor -mm? Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [1/7]: sloppy rss 2004-11-19 20:50 ` Hugh Dickins @ 2004-11-20 1:29 ` Christoph Lameter 2004-11-22 15:00 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-20 1:29 UTC (permalink / raw) To: Hugh Dickins Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel 2.6.10-rc2-bk3 On Fri, 19 Nov 2004, Hugh Dickins wrote: > Sorry, against what tree do these patches apply? > Apparently not linux-2.6.9, nor latest -bk, nor -mm? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [1/7]: sloppy rss 2004-11-20 1:29 ` Christoph Lameter @ 2004-11-22 15:00 ` Hugh Dickins 2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-11-22 15:00 UTC (permalink / raw) To: Christoph Lameter Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Fri, 19 Nov 2004, Christoph Lameter wrote: > On Fri, 19 Nov 2004, Hugh Dickins wrote: > > > Sorry, against what tree do these patches apply? > > Apparently not linux-2.6.9, nor latest -bk, nor -mm? > > 2.6.10-rc2-bk3 Ah, thanks - got it patched now, but your mailer (or something else) is eating trailing spaces. Better than adding them, but we have to apply this patch before your set: --- 2.6.10-rc2-bk3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000 +++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000 @@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo #define cmpxchg(ptr,o,n)\ ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ (unsigned long)(n),sizeof(*(ptr)))) - + #ifdef __KERNEL__ -struct alt_instr { +struct alt_instr { __u8 *instr; /* original instruction */ __u8 *replacement; __u8 cpuid; /* cpuid bit set for replacement */ --- 2.6.10-rc2-bk3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100 +++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000 @@ -99,7 +99,7 @@ static inline void pgd_populate(struct m #endif /* __s390x__ */ -static inline void +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) { #ifndef __s390x__ --- 2.6.10-rc2-bk3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000 +++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000 @@ -1424,7 +1424,7 @@ out: /* * We are called with the MM semaphore and page_table_lock * spinlock held to protect against concurrent faults in - * multithreaded programs. + * multithreaded programs. 
*/ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, @@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct * Fall back to the linear mapping if the fs does not support * ->populate: */ - if (!vma->vm_ops || !vma->vm_ops->populate || + if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); return do_no_page(mm, vma, address, write_access, pte, pmd); ^ permalink raw reply [flat|nested] 286+ messages in thread
* deferred rss update instead of sloppy rss
2004-11-22 15:00 ` Hugh Dickins
@ 2004-11-22 21:50 ` Christoph Lameter
  2004-11-22 22:11 ` Andrew Morton
  2004-11-22 22:22 ` Linus Torvalds
0 siblings, 2 replies; 286+ messages in thread
From: Christoph Lameter @ 2004-11-22 21:50 UTC (permalink / raw)
To: Hugh Dickins
Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel

One way to solve the rss issues is--as discussed--to put rss into the
task structure and then have the page fault increment that rss. The
problem is then that the proc filesystem must do an extensive scan over
all threads to find the users of a certain mm_struct.

The following patch puts the rss into task_struct. The page fault
handler then increments current->rss if the page_table_lock is not held.

The timer interrupt checks if task->rss is nonzero when doing
stime/utime updates (rss is defined near those fields, so it is
hopefully on the same cacheline and has minimal impact).

If rss is nonzero and the page_table_lock and the mmap_sem can be
taken, then mm->rss is updated by the value of task->rss and task->rss
is zeroed. This avoids all proc issues. The only disadvantage is that
rss may be inaccurate for a couple of clock ticks.
This also adds some performance (sorry only a 4p system): sloppy rss: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565 4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558 4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945 deferred rss: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580 4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659 4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196 Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-11-22 13:18:58.000000000 -0800 @@ -584,6 +584,10 @@ unsigned long it_real_incr, it_prof_incr, it_virt_incr; struct timer_list real_timer; unsigned long utime, stime; + long rss; /* rss counter when mm->rss is not usable. mm->page_table_lock + * not held but mm->mmap_sem must be held for sync with + * the timer interrupt which clears rss and updates mm->rss. 
+ */ unsigned long nvcsw, nivcsw; /* context switch counts */ struct timespec start_time; /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */ Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-22 09:51:58.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-22 11:16:02.000000000 -0800 @@ -263,8 +263,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -507,8 +505,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -791,8 +787,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.9/kernel/fork.c =================================================================== --- linux-2.6.9.orig/kernel/fork.c 2004-11-22 09:51:58.000000000 -0800 +++ linux-2.6.9/kernel/fork.c 2004-11-22 11:16:02.000000000 -0800 @@ -876,6 +876,7 @@ p->io_context = NULL; p->io_wait = NULL; p->audit_context = NULL; + p->rss = 0; #ifdef CONFIG_NUMA p->mempolicy = mpol_copy(p->mempolicy); if (IS_ERR(p->mempolicy)) { Index: linux-2.6.9/kernel/exit.c =================================================================== --- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800 +++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800 @@ -501,6 +501,9 @@ /* more a memory barrier than a real lock */ task_lock(tsk); tsk->mm = NULL; + /* only holding mmap_sem here maybe get page_table_lock too? 
*/ + mm->rss += tsk->rss; + tsk->rss = 0; up_read(&mm->mmap_sem); enter_lazy_tlb(mm, current); task_unlock(tsk); Index: linux-2.6.9/kernel/timer.c =================================================================== --- linux-2.6.9.orig/kernel/timer.c 2004-11-22 09:51:58.000000000 -0800 +++ linux-2.6.9/kernel/timer.c 2004-11-22 11:42:12.000000000 -0800 @@ -815,6 +815,15 @@ if (psecs / HZ >= p->signal->rlim[RLIMIT_CPU].rlim_max) send_sig(SIGKILL, p, 1); } + /* Update mm->rss if necessary */ + if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) { + if (spin_trylock(&p->mm->page_table_lock)) { + p->mm->rss += p->rss; + p->rss = 0; + spin_unlock(&p->mm->page_table_lock); + } + up_write(&p->mm->mmap_sem); + } } static inline void do_it_virt(struct task_struct * p, unsigned long ticks) ^ permalink raw reply [flat|nested] 286+ messages in thread
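The fault/tick split in the deferred-rss patch above can be mimicked in userspace. A minimal sketch, with invented struct names and a pthread mutex standing in for the mmap_sem/page_table_lock pair that the timer tick trylocks:

```c
#include <pthread.h>

struct mm   { long rss; pthread_mutex_t lock; };
struct task { long rss; struct mm *mm; };

/* Fault path: bump only the per-task counter; no shared lock taken. */
static void fault_one_page(struct task *t)
{
	t->rss++;
}

/* Timer-tick path: opportunistically fold task->rss into mm->rss,
 * as the patch does with down_write_trylock()/spin_trylock().  If the
 * lock is contended we simply try again on a later tick. */
static void tick_spill(struct task *t)
{
	if (t->rss && pthread_mutex_trylock(&t->mm->lock) == 0) {
		t->mm->rss += t->rss;
		t->rss = 0;
		pthread_mutex_unlock(&t->mm->lock);
	}
}
```

Between spills mm->rss lags by whatever the task has accumulated since its last tick, which is exactly the inaccuracy the mail describes.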
* Re: deferred rss update instead of sloppy rss 2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter @ 2004-11-22 22:11 ` Andrew Morton 2004-11-22 22:13 ` Christoph Lameter 2004-11-22 22:22 ` Linus Torvalds 1 sibling, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-11-22 22:11 UTC (permalink / raw) To: Christoph Lameter Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter <clameter@sgi.com> wrote: > > One way to solve the rss issues is--as discussed--to put rss into the > task structure and then have the page fault increment that rss. > > The problem is then that the proc filesystem must do an extensive scan > over all threads to find users of a certain mm_struct. > > The following patch does put the rss into task_struct. The page fault > handler is then incrementing current->rss if the page_table_lock is not > held. > > The timer interrupt checks if task->rss is non zero (when doing > stime/utime updates. rss is defined near those so its hopefully on the > same cacheline and has a minimal impact). > > If rss is non zero and the page_table_lock and the mmap_sem can be taken > then the mm->rss will be updated by the value of the task->rss and > task->rss will be zeroed. This avoids all proc issues. The only > disadvantage is that rss may be inaccurate for a couple of clock ticks. > > This also adds some performance (sorry only a 4p system): > > sloppy rss: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565 > 4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558 > 4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945 > > deferred rss: > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580 > 4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659 > 4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196 hrm. I cannot see anywhere in this patch where you update task_struct.rss. 
> Index: linux-2.6.9/kernel/exit.c > =================================================================== > --- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800 > +++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800 > @@ -501,6 +501,9 @@ > /* more a memory barrier than a real lock */ > task_lock(tsk); > tsk->mm = NULL; > + /* only holding mmap_sem here maybe get page_table_lock too? */ > + mm->rss += tsk->rss; > + tsk->rss = 0; > up_read(&mm->mmap_sem); mmap_sem needs to be held for writing, surely? > + /* Update mm->rss if necessary */ > + if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) { > + if (spin_trylock(&p->mm->page_table_lock)) { > + p->mm->rss += p->rss; > + p->rss = 0; > + spin_unlock(&p->mm->page_table_lock); > + } > + up_write(&p->mm->mmap_sem); > + } > } I'd also suggest that you do: tsk->rss++; if (tsk->rss > 16) { spin_lock(&mm->page_table_lock); mm->rss += tsk->rss; spin_unlock(&mm->page_table_lock); tsk->rss = 0; } just to prevent transient gross inaccuracies. For some value of "16". ^ permalink raw reply [flat|nested] 286+ messages in thread
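Andrew's threshold suggestion bounds the error by a constant regardless of scheduling. A userspace sketch of that idea (struct names invented, and the page_table_lock the real code would take around the shared update is elided):

```c
struct mm2   { long rss; };            /* shared counter */
struct task2 { long rss; struct mm2 *mm; };

#define RSS_SPILL_THRESHOLD 16

/* Spill the private counter once it exceeds the threshold, so the
 * shared mm->rss is never more than RSS_SPILL_THRESHOLD behind --
 * even for a task that pagefaults in lock-step with the timer. */
static void account_fault(struct task2 *t)
{
	if (++t->rss > RSS_SPILL_THRESHOLD) {
		t->mm->rss += t->rss;
		t->rss = 0;
	}
}
```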
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:11 ` Andrew Morton @ 2004-11-22 22:13 ` Christoph Lameter 2004-11-22 22:17 ` Benjamin Herrenschmidt 2004-11-22 22:45 ` Andrew Morton 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 22:13 UTC (permalink / raw) To: Andrew Morton Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Andrew Morton wrote: > hrm. I cannot see anywhere in this patch where you update task_struct.rss. This is just the piece around it dealing with rss. The updating of rss happens in the generic code. The change to that is trivial. I can repost the whole shebang if you want. > > + /* only holding mmap_sem here maybe get page_table_lock too? */ > > + mm->rss += tsk->rss; > > + tsk->rss = 0; > > up_read(&mm->mmap_sem); > > mmap_sem needs to be held for writing, surely? If there are no page faults occurring anymore then we would not need to get the lock. Q: Is it safe to assume that no faults occur anymore at this point? > just to prevent transient gross inaccuracies. For some value of "16". The page fault code only increments rss. For larger transactions that increase / decrease rss significantly the page_table_lock is taken and mm->rss is updated directly. So no gross inaccuracies can result. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:13 ` Christoph Lameter @ 2004-11-22 22:17 ` Benjamin Herrenschmidt 2004-11-22 22:45 ` Andrew Morton 1 sibling, 0 replies; 286+ messages in thread From: Benjamin Herrenschmidt @ 2004-11-22 22:17 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, hugh, Linus Torvalds, nickpiggin, linux-mm, linux-ia64, Linux Kernel list On Mon, 2004-11-22 at 14:13 -0800, Christoph Lameter wrote: > On Mon, 22 Nov 2004, Andrew Morton wrote: > > > hrm. I cannot see anywhere in this patch where you update task_struct.rss. > > This is just the piece around it dealing with rss. The updating of rss > happens in the generic code. The change to that is trivial. I can repost > the whole shebang if you want. > > > > + /* only holding mmap_sem here maybe get page_table_lock too? */ > > > + mm->rss += tsk->rss; > > > + tsk->rss = 0; > > > up_read(&mm->mmap_sem); > > > > mmap_sem needs to be held for writing, surely? > > If there are no page faults occurring anymore then we would not need to > get the lock. Q: Is it safe to assume that no faults occur > anymore at this point? Why wouldn't the mm take faults on other CPUs ? (other threads) > > just to prevent transient gross inaccuracies. For some value of "16". > > The page fault code only increments rss. For larger transactions that > increase / decrease rss significantly the page_table_lock is taken and > mm->rss is updated directly. So no > gross inaccuracies can result. > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> -- Benjamin Herrenschmidt <benh@kernel.crashing.org> ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:13 ` Christoph Lameter 2004-11-22 22:17 ` Benjamin Herrenschmidt @ 2004-11-22 22:45 ` Andrew Morton 2004-11-22 22:48 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-11-22 22:45 UTC (permalink / raw) To: Christoph Lameter Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter <clameter@sgi.com> wrote: > > > just to prevent transient gross inaccuracies. For some value of "16". > > The page fault code only increments rss. For larger transactions that > increase / decrease rss significantly the page_table_lock is taken and > mm->rss is updated directly. So no > gross inaccuracies can result. Sure. Take a million successive pagefaults and mm->rss is grossly inaccurate. Hence my suggestion that it be spilled into mm->rss periodically. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:45 ` Andrew Morton @ 2004-11-22 22:48 ` Christoph Lameter 2004-11-22 23:09 ` Nick Piggin 2004-11-22 23:16 ` Andrew Morton 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 22:48 UTC (permalink / raw) To: Andrew Morton Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Andrew Morton wrote: > > The page fault code only increments rss. For larger transactions that > > increase / decrease rss significantly the page_table_lock is taken and > > mm->rss is updated directly. So no > > gross inaccuracies can result. > > Sure. Take a million successive pagefaults and mm->rss is grossly > inaccurate. Hence my suggestion that it be spilled into mm->rss > periodically. It is spilled into mm->rss periodically. That is the whole point of the patch. The timer tick occurs every 1 ms. The maximum pagefault frequency that I have seen is 500000 faults /second. The max deviation is therefore less than 500 (could be greater if page table lock / mmap_sem always held when the tick occurs). ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:48 ` Christoph Lameter @ 2004-11-22 23:09 ` Nick Piggin 2004-11-22 23:13 ` Christoph Lameter 2004-11-22 23:16 ` Andrew Morton 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-22 23:09 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, hugh, torvalds, benh, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > On Mon, 22 Nov 2004, Andrew Morton wrote: > > >>>The page fault code only increments rss. For larger transactions that >>>increase / decrease rss significantly the page_table_lock is taken and >>>mm->rss is updated directly. So no >>>gross inaccuracies can result. >> >>Sure. Take a million successive pagefaults and mm->rss is grossly >>inaccurate. Hence my suggestion that it be spilled into mm->rss >>periodically. > > > It is spilled into mm->rss periodically. That is the whole point of the > patch. > > The timer tick occurs every 1 ms. The maximum pagefault frequency that I > have seen is 500000 faults /second. The max deviation is therefore > less than 500 (could be greater if page table lock / mmap_sem always held > when the tick occurs). You could imagine a situation where something pagefaults and sleeps in lock-step with the timer though. Theoretical problem only? I think that by the time you get the spilling code in, the mm-list method will be looking positively elegant! ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 23:09 ` Nick Piggin @ 2004-11-22 23:13 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 23:13 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, hugh, torvalds, benh, linux-mm, linux-ia64, linux-kernel On Tue, 23 Nov 2004, Nick Piggin wrote: > > The timer tick occurs every 1 ms. The maximum pagefault frequency that I > > have seen is 500000 faults /second. The max deviation is therefore > > less than 500 (could be greater if page table lock / mmap_sem always held > > when the tick occurs). > I think that by the time you get the spilling code in, the mm-list method > will be looking positively elegant! I do not care what gets in as long as something goes in to address the performance issues. So far everyone seems to have their pet ideas. By all means do the mm-list method and post it. But we have already seen objections by other against loops in proc. So that will also cause additional controversy. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:48 ` Christoph Lameter 2004-11-22 23:09 ` Nick Piggin @ 2004-11-22 23:16 ` Andrew Morton 2004-11-22 23:19 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-11-22 23:16 UTC (permalink / raw) To: Christoph Lameter Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter <clameter@sgi.com> wrote: > > On Mon, 22 Nov 2004, Andrew Morton wrote: > > > > The page fault code only increments rss. For larger transactions that > > > increase / decrease rss significantly the page_table_lock is taken and > > > mm->rss is updated directly. So no > > > gross inaccuracies can result. > > > > Sure. Take a million successive pagefaults and mm->rss is grossly > > inaccurate. Hence my suggestion that it be spilled into mm->rss > > periodically. > > It is spilled into mm->rss periodically. That is the whole point of the > patch. > > The timer tick occurs every 1 ms. That only works if the task happens to have the CPU when the timer tick occurs. There remains no theoretical upper bound to the error in mm->rss, and that's very easy to fix. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 23:16 ` Andrew Morton @ 2004-11-22 23:19 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 23:19 UTC (permalink / raw) To: Andrew Morton Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Andrew Morton wrote: > > The timer tick occurs every 1 ms. > > That only works if the task happens to have the CPU when the timer tick > occurs. There remains no theoretical upper bound to the error in mm->rss, > and that's very easy to fix. Page fault intensive programs typically hog the cpu. But you are in principle right. The "easy fix" that you propose is to add additional logic to some very hot code paths. Then we are probably better off with another approach. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter 2004-11-22 22:11 ` Andrew Morton @ 2004-11-22 22:22 ` Linus Torvalds 2004-11-22 22:27 ` Christoph Lameter 2004-11-22 22:32 ` deferred rss update instead of sloppy rss Nick Piggin 1 sibling, 2 replies; 286+ messages in thread From: Linus Torvalds @ 2004-11-22 22:22 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Christoph Lameter wrote: > > The problem is then that the proc filesystem must do an extensive scan > over all threads to find users of a certain mm_struct. The alternative is to just add a simple list into the task_struct and the head of it into mm_struct. Then, at fork, you just finish the fork() with list_add(p->mm_list, p->mm->thread_list); and do the proper list_del() in exit_mm() or wherever. You'll still loop in /proc, but you'll do the minimal loop necessary. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:22 ` Linus Torvalds @ 2004-11-22 22:27 ` Christoph Lameter 2004-11-22 22:40 ` Linus Torvalds 2004-11-22 22:32 ` deferred rss update instead of sloppy rss Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 22:27 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Linus Torvalds wrote: > The alternative is to just add a simple list into the task_struct and the > head of it into mm_struct. Then, at fork, you just finish the fork() with > > list_add(p->mm_list, p->mm->thread_list); > > and do the proper list_del() in exit_mm() or wherever. > > You'll still loop in /proc, but you'll do the minimal loop necessary. I think the approach that I posted is simpler unless there are other benefits to be gained if it would be easy to figure out which tasks use an mm. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:27 ` Christoph Lameter @ 2004-11-22 22:40 ` Linus Torvalds 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Linus Torvalds @ 2004-11-22 22:40 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, Christoph Lameter wrote: > > I think the approach that I posted is simpler unless there are other > benefits to be gained if it would be easy to figure out which tasks use an > mm. I'm just worried that your timer tick thing won't catch things in a timely manner. That said, if that isn't an issue, and people don't have problems with it. On the other hand, if /proc literally is the only real user, then I guess it really can't matter. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [0/7]: Overview and performance tests 2004-11-22 22:40 ` Linus Torvalds @ 2004-12-01 23:41 ` Christoph Lameter 2004-12-01 23:42 ` page fault scalability patch V12 [1/7]: Reduce use of thepage_table_lock Christoph Lameter ` (9 more replies) 0 siblings, 10 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:41 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changes from V11->V12 of this patch: - dump sloppy_rss in favor of list_rss (Linus' proposal) - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) This is a series of patches that increases the scalability of the page fault handler for SMP. Here are some performance results on a machine with 512 processors allocating 32 GB with an increasing number of threads (that are assigned a processor each). Without the patches: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498 32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646 32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239 32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950 32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358 32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928 32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112 32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599 32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918 32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686 With the patches: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 32 3 1 1.451s 140.151s 141.060s 44430.367 44428.115 32 3 2 1.399s 136.349s 73.041s 45673.303 85699.793 32 3 4 1.321s 129.760s 39.027s 47996.303 160197.217 32 3 8 1.279s 100.648s 20.039s 61724.641 308454.557 32 3 16 1.414s 153.975s 15.090s 40488.236 395681.716 32 3 32 2.534s 337.021s 17.016s 18528.487 366445.400 32 3 64 4.271s 709.872s 18.057s 8809.787 338656.440 32 3 128 18.734s 1805.094s 21.084s 
3449.586 288005.644 32 3 256 14.698s 963.787s 11.078s 6429.787 534077.540 32 3 512 15.299s 453.990s 5.098s 13406.321 1050416.414 For more than 8 cpus the page fault rate increases by orders of magnitude. For more than 64 cpus the improvement in performance is roughly tenfold. The performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem!) through new atomic operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's (pgd_test_and_populate, pmd_test_and_populate). The page table lock can be avoided in the following situations: 1. An empty pte or pmd entry is populated This is safe since the swapper may only depopulate them and the swapper code has been changed to never set a pte to be empty until the page has been evicted. The population of an empty pte is frequent if a process touches newly allocated memory. 2. Modifications of flags in a pte entry (write/accessed). These modifications are done by the CPU or by low level handlers on various platforms also bypassing the page_table_lock. So this seems to be safe too. One essential change in the VM is the use of ptep_cmpxchg (or its generic emulation) on page table entries before doing an update_mmu_cache without holding the page table lock. However, we do similar things now with other atomic pte operations such as ptep_get_and_clear and ptep_test_and_clear_dirty. These operations clear a pte *after* doing an operation on it. The ptep_cmpxchg as used in this patch operates on a *cleared* pte and replaces it with a pte pointing to valid memory. The effect of this change on various architectures has to be thought through. Local definitions of ptep_cmpxchg and ptep_xchg may be necessary. For IA64 an icache coherency issue may arise that potentially requires flushing the icache (as done via update_mmu_cache on IA64) prior to the use of ptep_cmpxchg. Similar issues may arise on other platforms. 
The patch introduces a split counter for rss handling to avoid the atomic operations and locks currently necessary for rss modifications. In addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to be in the same cache line as tsk->mm (which is already used by the fault handler) and thus tsk->rss can be incremented quickly without locks. The cache line does not need to be shared between processors in the page table handler. A task list is maintained for each mm (RCU based). Values in that list are added up to calculate the rss or anon_rss values. The patchset is composed of 7 patches: 1/7: Avoid page_table_lock in handle_mm_fault This patch defers the acquisition of the page_table_lock as much as possible and uses atomic operations for allocating anonymous memory. These atomic operations are simulated by acquiring the page_table_lock for very small time frames if an architecture does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a pte will not be set to empty if a page is in transition to swap. If only the first two patches are applied then the time that the page_table_lock is held is simply reduced. The lock may then be acquired multiple times during a page fault. 2/7: Atomic pte operations for ia64 3/7: Make cmpxchg generally available on i386 The atomic operations on the page table rely heavily on cmpxchg instructions. This patch adds emulations of cmpxchg and cmpxchg8b for old 80386 and 80486 cpus. The emulations are only included if a kernel is built for these old cpus, and the real cmpxchg instructions are used instead if a kernel built for a 386 or 486 is then run on a more recent cpu. This patch may be used independently of the other patches. 4/7: Atomic pte operations for i386 A generally available cmpxchg (the previous patch) is required for this patch to preserve the ability to build kernels for the 386 and 486. 
5/7: Atomic pte operations for x86_64 6/7: Atomic pte operations for s390 7/7: Split counter implementation for rss Add tsk->rss and tsk->anon_rss. Add the per-mm task list and the logic to calculate rss from it. There are some additional outstanding performance enhancements that are not available yet but which require this patch. Those modifications push the maximum page fault rate from ~1 million faults per second, as shown above, to more than 3 million faults per second. The last editions of the sloppy rss, atomic rss and deferred rss patches will be posted to linux-ia64 for archival purposes. Signed-off-by: Christoph Lameter <clameter@sgi.com> ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [1/7]: Reduce use of thepage_table_lock 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter @ 2004-12-01 23:42 ` Christoph Lameter 2004-12-01 23:42 ` page fault scalability patch V12 [2/7]: atomic pte operations for ia64 Christoph Lameter ` (8 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:42 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Increase parallelism in SMP configurations by deferring the acquisition of page_table_lock in handle_mm_fault * Anonymous memory page faults bypass the page_table_lock through the use of atomic page table operations * Swapper does not set pte to empty in transition to swap * Simulate atomic page table operations using the page_table_lock if an arch does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides a performance benefit since the page_table_lock is held for shorter periods of time. Signed-off-by: Christoph Lameter <clameter@sgi.com Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-11-23 10:06:03.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-11-23 10:07:55.000000000 -0800 @@ -1330,8 +1330,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. 
+ * We hold the mm semaphore */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1343,15 +1342,13 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1374,8 +1371,7 @@ lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1422,14 +1418,12 @@ } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held. */ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; struct page * page = ZERO_PAGE(addr); @@ -1441,7 +1435,6 @@ if (write_access) { /* Allocate our own private page. 
*/ pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_mem; @@ -1450,30 +1443,37 @@ goto no_mem; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { - pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; - } - mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma); - lru_cache_add_active(page); mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - set_pte(page_table, entry); + /* update the entry */ + if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { + if (write_access) { + pte_unmap(page_table); + page_cache_release(page); + } + goto out; + } + if (write_access) { + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. + */ + lru_cache_add_active(page); + page_add_anon_rmap(page, vma, addr); + mm->rss++; + + } pte_unmap(page_table); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); out: return VM_FAULT_MINOR; no_mem: @@ -1489,12 +1489,12 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held. 
*/ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1505,9 +1505,8 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1605,7 +1604,7 @@ * nonlinear vmas. */ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1618,13 +1617,12 @@ if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } pgoff = pte_to_pgoff(*pte); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1643,49 +1641,40 @@ * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. - * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. 
- * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; entry = *pte; if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + /* + * This is the case in which we only update some bits in the pte. + */ + new_entry = pte_mkyoung(entry); if (write_access) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + /* do_wp_page expects us to hold the page_table_lock */ + spin_lock(&mm->page_table_lock); return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + } + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + if (ptep_cmpxchg(vma, address, pte, entry, new_entry)) + update_mmu_cache(vma, address, new_entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -1703,22 +1692,45 @@ inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. 
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd */ - spin_lock(&mm->page_table_lock); - pmd = pmd_alloc(mm, pgd, address); + if (unlikely(pgd_none(*pgd))) { + pmd_t *new = pmd_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + /* Insure that the update is done in an atomic way */ + if (!pgd_test_and_populate(mm, pgd, new)) + pmd_free(new); + } + + pmd = pmd_offset(pgd, address); + + if (likely(pmd)) { + pte_t *pte; + + if (!pmd_present(*pmd)) { + struct page *new; - if (pmd) { - pte_t * pte = pte_alloc_map(mm, pmd, address); - if (pte) + new = pte_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else + inc_page_state(nr_page_table_pages); + } + + pte = pte_offset_map(pmd, address); + if (likely(pte)) return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } - spin_unlock(&mm->page_table_lock); return VM_FAULT_OOM; } Index: linux-2.6.9/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700 +++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-23 10:06:12.000000000 -0800 @@ -134,4 +134,60 @@ #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) #endif +#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS +/* + * If atomic page table operations are not available then use + * the page_table_lock to insure some form of locking. + * Note thought that low level operations as well as the + * page_table_handling of the cpu may bypass all locking. 
+ */ + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \ +({ \ + int __rc; \ + spin_lock(&__vma->vm_mm->page_table_lock); \ + __rc = pte_same(*(__ptep), __oldval); \ + if (__rc) set_pte(__ptep, __newval); \ + spin_unlock(&__vma->vm_mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pmd) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pgd_present(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pmd); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\ + flush_tlb_page(__vma, __address); \ + __p; \ +}) + +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-23 10:06:03.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-23 10:06:12.000000000 -0800 @@ -424,7 +424,10 @@ * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped * - * The caller needs to hold the mm->page_table_lock. + * The caller needs to hold the mm->page_table_lock if page + * is pointing to something that is known by the vm. + * The lock does not need to be held if page is pointing + * to a newly allocated page. */ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) @@ -568,11 +571,6 @@ /* Nuke the page table entry. 
*/ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -587,11 +585,15 @@ list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; page_remove_rmap(page); page_cache_release(page); @@ -678,15 +680,21 @@ if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); + /* + * There would be a race here with handle_mm_fault and do_anonymous_page + * which bypasses the page_table_lock if we would zap the pte before + * putting something into it. On the other hand we need to + * have the dirty flag setting at the time we replaced the value. + */ /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) - set_pte(pte, pgoff_to_pte(page->index)); + pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index)); + else + pteval = ptep_get_and_clear(pte); - /* Move the dirty bit to the physical page now the pte is gone. */ + /* Move the dirty bit to the physical page now that the pte is gone. */ if (pte_dirty(pteval)) set_page_dirty(page); ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [2/7]: atomic pte operations for ia64 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter 2004-12-01 23:42 ` page fault scalability patch V12 [1/7]: Reduce use of thepage_table_lock Christoph Lameter @ 2004-12-01 23:42 ` Christoph Lameter 2004-12-01 23:43 ` page fault scalability patch V12 [3/7]: universal cmpxchg for i386 Christoph Lameter ` (7 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:42 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for ia64 * Enhanced parallelism in page fault handler if applied together with the generic patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PGD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -78,12 +82,19 @@ preempt_enable(); } + static inline void pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) { pgd_val(*pgd_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE; +} static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) @@ -132,6 +143,13 @@ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* Atomic populate */ +static inline int +pmd_test_and_populate 
(struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + static inline void pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { Index: linux-2.6.9/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800 @@ -30,6 +30,8 @@ #define _PAGE_P_BIT 0 #define _PAGE_A_BIT 5 #define _PAGE_D_BIT 6 +#define _PAGE_IG_BITS 53 +#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */ #define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */ #define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */ @@ -58,6 +60,7 @@ #define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL) #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63) +#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT) /* Valid only for a PTE with the present bit cleared: */ #define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ @@ -270,6 +273,8 @@ #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) #define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) +#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0) + /* * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the * access rights: @@ -281,8 +286,15 @@ #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A)) #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) #define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D)) +#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK)) /* + * Lock functions for pte's + */ +#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep) +#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); 
smp_mb__after_clear_bit(); } +#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val)) +/* * Macro to a page protection value as "uncacheable". Note that "protection" is really a * misnomer here as the protection value contains the memory attribute bits, dirty bits, * and various other bits as well. @@ -342,7 +354,6 @@ #define pte_unmap_nested(pte) do { } while (0) /* atomic versions of the some PTE manipulations: */ - static inline int ptep_test_and_clear_young (pte_t *ptep) { @@ -414,6 +425,26 @@ #endif } +/* + * IA-64 doesn't have any external MMU info: the page tables contain all the necessary + * information. However, we use this routine to take care of any (delayed) i-cache + * flushing that may be necessary. + */ +extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); + +static inline int +ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval) +{ + /* + * IA64 defers icache flushes. If the new pte is executable we may + * have to flush the icache to insure cache coherency immediately + * after the cmpxchg. + */ + if (pte_exec(newval)) + update_mmu_cache(vma, addr, newval); + return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte; +} + static inline int pte_same (pte_t a, pte_t b) { @@ -476,13 +507,6 @@ struct vm_area_struct * prev, unsigned long start, unsigned long end); #endif -/* - * IA-64 doesn't have any external MMU info: the page tables contain all the necessary - * information. However, we use this routine to take care of any (delayed) i-cache - * flushing that may be necessary. 
- */ -extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); - #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Update PTEP with ENTRY, which is guaranteed to be a less @@ -560,6 +584,8 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE +#define __HAVE_ARCH_ATOMIC_TABLE_OPS +#define __HAVE_ARCH_LOCK_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _ASM_IA64_PGTABLE_H */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [3/7]: universal cmpxchg for i386 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter 2004-12-01 23:42 ` page fault scalability patch V12 [1/7]: Reduce use of thepage_table_lock Christoph Lameter 2004-12-01 23:42 ` page fault scalability patch V12 [2/7]: atomic pte operations for ia64 Christoph Lameter @ 2004-12-01 23:43 ` Christoph Lameter 2004-12-01 23:43 ` page fault scalability patch V12 [4/7]: atomic pte operations " Christoph Lameter ` (6 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:43 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Make cmpxchg and cmpxchg8b generally available on the i386 platform. * Provide emulation of cmpxchg suitable for uniprocessor systems if built and run on a 386. * Provide emulation of cmpxchg8b suitable for uniprocessor systems if built and run on a 386 or 486. 
* Provide an inline function to atomically get a 64 bit value via cmpxchg8b in an SMP system (courtesy of Nick Piggin) (important for i386 PAE mode and other places where atomic 64 bit operations are useful) Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/arch/i386/Kconfig =================================================================== --- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800 +++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800 @@ -351,6 +351,11 @@ depends on !M386 default y +config X86_CMPXCHG8B + bool + depends on !M386 && !M486 + default y + config X86_XADD bool depends on !M386 Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c =================================================================== --- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800 +++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800 @@ -6,6 +6,7 @@ #include <linux/bitops.h> #include <linux/smp.h> #include <linux/thread_info.h> +#include <linux/module.h> #include <asm/processor.h> #include <asm/msr.h> @@ -287,5 +288,103 @@ return 0; } +#ifndef CONFIG_X86_CMPXCHG +unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new) +{ + u8 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u8)); + + /* Poor man's cmpxchg for 386. 
Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u8 *)ptr; + if (prev == old) + *(u8 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u8); + +unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new) +{ + u16 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u16)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u16 *)ptr; + if (prev == old) + *(u16 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u16); + +unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) +{ + u32 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u32)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u32 *)ptr; + if (prev == old) + *(u32 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u32); +#endif + +#ifndef CONFIG_X86_CMPXCHG8B +unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + unsigned long flags; + + /* + * Check if the kernel was compiled for an old cpu but + * we are running really on a cpu capable of cmpxchg8b + */ + + if (cpu_has(cpu_data, X86_FEATURE_CX8)) + return __cmpxchg8b(ptr, old, newv); + + /* Poor mans cmpxchg8b for 386 and 486. 
Not suitable for SMP */ + local_irq_save(flags); + prev = *ptr; + if (prev == old) + *ptr = newv; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg8b_486); +#endif + // arch_initcall(intel_cpu_init); Index: linux-2.6.9/include/asm-i386/system.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800 @@ -149,6 +149,9 @@ #define __xg(x) ((struct __xchg_dummy *)(x)) +#define ll_low(x) *(((unsigned int*)&(x))+0) +#define ll_high(x) *(((unsigned int*)&(x))+1) + /* * The semantics of XCHGCMP8B are a bit strange, this is why * there is a loop and the loading of %%eax and %%edx has to @@ -184,8 +187,6 @@ { __set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL)); } -#define ll_low(x) *(((unsigned int*)&(x))+0) -#define ll_high(x) *(((unsigned int*)&(x))+1) static inline void __set_64bit_var (unsigned long long *ptr, unsigned long long value) @@ -203,6 +204,26 @@ __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \ __set_64bit(ptr, ll_low(value), ll_high(value)) ) +static inline unsigned long long __get_64bit(unsigned long long * ptr) +{ + unsigned long long ret; + __asm__ __volatile__ ( + "\n1:\t" + "movl (%1), %%eax\n\t" + "movl 4(%1), %%edx\n\t" + "movl %%eax, %%ebx\n\t" + "movl %%edx, %%ecx\n\t" + LOCK_PREFIX "cmpxchg8b (%1)\n\t" + "jnz 1b" + : "=A"(ret) + : "D"(ptr) + : "ebx", "ecx", "memory"); + return ret; +} + +#define get_64bit(ptr) __get_64bit(ptr) + + /* * Note: no "lock" prefix even on SMP: xchg always implies lock anyway * Note 2: xchg has side effect, so that attribute volatile is necessary, @@ -240,7 +261,41 @@ */ #ifdef CONFIG_X86_CMPXCHG + #define __HAVE_ARCH_CMPXCHG 1 +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) + +#else + +/* + * Building a 
kernel capable running on 80386. It may be necessary to + * simulate the cmpxchg on the 80386 CPU. For that purpose we define + * a function for each of the sizes we support. + */ + +extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8); +extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16); +extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32); + +static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, + unsigned long new, int size) +{ + switch (size) { + case 1: + return cmpxchg_386_u8(ptr, old, new); + case 2: + return cmpxchg_386_u16(ptr, old, new); + case 4: + return cmpxchg_386_u32(ptr, old, new); + } + return old; +} + +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) #endif static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, @@ -270,10 +325,32 @@ return old; } -#define cmpxchg(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ - (unsigned long)(n),sizeof(*(ptr)))) - +static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + __asm__ __volatile__( + LOCK_PREFIX "cmpxchg8b (%4)" + : "=A" (prev) + : "0" (old), "c" ((unsigned long)(newv >> 32)), + "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr) + : "memory"); + return prev; +} + +#ifdef CONFIG_X86_CMPXCHG8B +#define cmpxchg8b __cmpxchg8b +#else +/* + * Building a kernel capable of running on 80486 and 80386. Both + * do not support cmpxchg8b. Call a function that emulates the + * instruction if necessary. + */ +extern unsigned long long cmpxchg8b_486(volatile unsigned long long *, + unsigned long long, unsigned long long); +#define cmpxchg8b cmpxchg8b_486 +#endif + #ifdef __KERNEL__ struct alt_instr { __u8 *instr; /* original instruction */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [4/7]: atomic pte operations for i386 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (2 preceding siblings ...) 2004-12-01 23:43 ` page fault scalability patch V12 [3/7]: universal cmpxchg for i386 Christoph Lameter @ 2004-12-01 23:43 ` Christoph Lameter 2004-12-01 23:44 ` page fault scalability patch V12 [5/7]: atomic pte operations for x86_64 Christoph Lameter ` (5 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:43 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Atomic pte operations for i386 in regular and PAE modes Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800 @@ -413,6 +413,7 @@ #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME +#define __HAVE_ARCH_ATOMIC_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _I386_PGTABLE_H */ Index: linux-2.6.9/include/asm-i386/pgtable-3level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800 @@ -6,7 +6,8 @@ * tables on PPro+ CPUs. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ + * August 26, 2004 added ptep_cmpxchg <christoph@lameter.com> +*/ #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low) @@ -42,26 +43,15 @@ return pte_x(pte); } -/* Rules for using set_pte: the pte being assigned *must* be - * either not present or in a state where the hardware will - * not attempt to update the pte. In places where this is - * not possible, use pte_get_and_clear to obtain the old pte - * value and then use set_pte to update it. -ben - */ -static inline void set_pte(pte_t *ptep, pte_t pte) -{ - ptep->pte_high = pte.pte_high; - smp_wmb(); - ptep->pte_low = pte.pte_low; -} -#define __HAVE_ARCH_SET_PTE_ATOMIC -#define set_pte_atomic(pteptr,pteval) \ +#define set_pte(pteptr,pteval) \ set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pgd(pgdptr,pgdval) \ set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval)) +#define set_pte_atomic set_pte + /* * Pentium-II erratum A13: in PAE mode we explicitly have to flush * the TLB via cr3 if the top-level pgd is changed... @@ -142,4 +132,23 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val }) +/* Atomic PTE operations */ +#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \ +({ pte_t __r; \ + /* xchg acts as a barrier before the setting of the high bits. 
*/\ + __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \ + __r.pte_high = (__ptep)->pte_high; \ + (__ptep)->pte_high = (__newval).pte_high; \ + flush_tlb_page(__vma, __addr); \ + (__r); \ +}) + +#define __HAVE_ARCH_PTEP_XCHG_FLUSH + +static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + + #endif /* _I386_PGTABLE_3LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800 @@ -82,4 +82,7 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) +/* Atomic PTE operations */ +#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low) + #endif /* _I386_PGTABLE_2LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800 @@ -4,9 +4,12 @@ #include <linux/config.h> #include <asm/processor.h> #include <asm/fixmap.h> +#include <asm/system.h> #include <linux/threads.h> #include <linux/mm.h> /* for struct page */ +#define PMD_NONE 0L + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) @@ -16,6 +19,19 @@ ((unsigned long long)page_to_pfn(pte) << (unsigned long long) PAGE_SHIFT))); } + +/* Atomic version */ +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ +#ifdef CONFIG_X86_PAE + return cmpxchg8b( 
((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE + + ((unsigned long long)page_to_pfn(pte) << + (unsigned long long) PAGE_SHIFT) ) == PMD_NONE; +#else + return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +#endif +} + /* * Allocate and free page tables. */ @@ -49,6 +65,7 @@ #define pmd_free(x) do { } while (0) #define __pmd_free_tlb(tlb,x) do { } while (0) #define pgd_populate(mm, pmd, pte) BUG() +#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; }) #define check_pgt_cache() do { } while (0) ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [5/7]: atomic pte operations for x86_64 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (3 preceding siblings ...) 2004-12-01 23:43 ` page fault scalability patch V12 [4/7]: atomic pte operations " Christoph Lameter @ 2004-12-01 23:44 ` Christoph Lameter 2004-12-01 23:45 ` page fault scalability patch V12 [6/7]: atomic pte operations for s390 Christoph Lameter ` (4 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:44 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for x86_64 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700 +++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-23 10:59:01.000000000 -0800 @@ -7,16 +7,26 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pgd_populate(mm, pgd, pmd) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd))) +#define pgd_test_and_populate(mm, pgd, pmd) \ + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE) static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +} + extern __inline__ pmd_t *get_pmd(void) { return (pmd_t *)get_zeroed_page(GFP_KERNEL); Index: linux-2.6.9/include/asm-x86_64/pgtable.h 
=================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-22 15:08:43.000000000 -0800 +++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-23 10:59:01.000000000 -0800 @@ -437,6 +437,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [6/7]: atomic pte operations for s390 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (4 preceding siblings ...) 2004-12-01 23:44 ` page fault scalability patch V12 [5/7]: atomic pte operations for x86_64 Christoph Lameter @ 2004-12-01 23:45 ` Christoph Lameter 2004-12-01 23:45 ` page fault scalability patch V12 [7/7]: Split counter for rss Christoph Lameter ` (3 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:45 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for s390 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-s390/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700 +++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800 @@ -567,6 +567,15 @@ return pte; } +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + struct mm_struct *__mm = __vma->vm_mm; \ + pte_t __pte; \ + __pte = ptep_clear_flush(__vma, __address, __ptep); \ + set_pte(__ptep, __pteval); \ + __pte; \ +}) + static inline void ptep_set_wrprotect(pte_t *ptep) { pte_t old_pte = *ptep; @@ -778,6 +787,14 @@ #define kern_addr_valid(addr) (1) +/* Atomic PTE operations */ +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + +static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + /* * No page table caches to initialise */ @@ -791,6 +808,7 @@ #define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH #define __HAVE_ARCH_PTEP_GET_AND_CLEAR #define __HAVE_ARCH_PTEP_CLEAR_FLUSH +#define __HAVE_ARCH_PTEP_XCHG_FLUSH #define 
__HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME Index: linux-2.6.9/include/asm-s390/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700 +++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800 @@ -97,6 +97,10 @@ pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd); } +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd) +{ + return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV; +} #endif /* __s390x__ */ static inline void @@ -119,6 +123,18 @@ pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT)); } +static inline int +pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page) +{ + int rc; + spin_lock(&mm->page_table_lock); + + rc=pte_same(*pmd, _PAGE_INVALID_EMPTY); + if (rc) pmd_populate(mm, pmd, page); + spin_unlock(&mm->page_table_lock); + return rc; +} + /* * page table entry allocation/free routines. */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V12 [7/7]: Split counter for rss 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (5 preceding siblings ...) 2004-12-01 23:45 ` page fault scalability patch V12 [6/7]: atomic pte operations for s390 Christoph Lameter @ 2004-12-01 23:45 ` Christoph Lameter 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter 2004-12-02 0:10 ` page fault scalability patch V12 [0/7]: Overview and performance tests Linus Torvalds ` (2 subsequent siblings) 9 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-12-01 23:45 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Split rss counter into the task structure * remove 3 checks of rss in mm/rmap.c * Prerequisite for page table scalability patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-11-30 20:33:31.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-11-30 20:33:50.000000000 -0800 @@ -30,6 +30,7 @@ #include <linux/pid.h> #include <linux/percpu.h> #include <linux/topology.h> +#include <linux/rcupdate.h> struct exec_domain; @@ -217,6 +218,7 @@ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + long rss, anon_rss; struct list_head mmlist; /* List of maybe swapped mm's. 
These are globally strung * together off init_mm.mmlist, and are protected @@ -226,7 +228,7 @@ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ @@ -236,6 +238,8 @@ /* Architecture-specific MM context */ mm_context_t context; + struct list_head task_list; /* Tasks using this mm */ + struct rcu_head rcu_head; /* For freeing mm via rcu */ /* Token based thrashing protection. */ unsigned long swap_token_time; @@ -545,6 +549,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Split counters from mm */ + long rss; + long anon_rss; /* task state */ struct linux_binfmt *binfmt; @@ -578,6 +585,9 @@ struct completion *vfork_done; /* for vfork() */ int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ + + /* List of other tasks using the same mm */ + struct list_head mm_tasks; unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; @@ -1111,6 +1121,14 @@ #endif +unsigned long get_rss(struct mm_struct *mm); +unsigned long get_anon_rss(struct mm_struct *mm); +unsigned long get_shared(struct mm_struct *mm); + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk); +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif + Index: linux-2.6.9/fs/proc/task_mmu.c =================================================================== --- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-30 20:33:26.000000000 -0800 +++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-30 20:33:50.000000000 -0800 @@ -22,7 +22,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << 
(PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + get_rss(mm) << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -37,7 +37,7 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + *shared = get_shared(mm); *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; Index: linux-2.6.9/fs/proc/array.c =================================================================== --- linux-2.6.9.orig/fs/proc/array.c 2004-11-30 20:33:26.000000000 -0800 +++ linux-2.6.9/fs/proc/array.c 2004-11-30 20:33:50.000000000 -0800 @@ -420,7 +420,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? mm->end_code : 0, Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-30 20:33:46.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-30 20:33:50.000000000 -0800 @@ -263,8 +263,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -438,7 +436,7 @@ BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; + current->anon_rss++; anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; @@ -510,8 +508,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -799,8 +795,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { 
try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.9/kernel/fork.c =================================================================== --- linux-2.6.9.orig/kernel/fork.c 2004-11-30 20:33:42.000000000 -0800 +++ linux-2.6.9/kernel/fork.c 2004-11-30 20:33:50.000000000 -0800 @@ -151,6 +151,7 @@ *tsk = *orig; tsk->thread_info = ti; ti->task = tsk; + tsk->rss = 0; /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); @@ -292,6 +293,7 @@ atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); + INIT_LIST_HEAD(&mm->task_list); mm->core_waiters = 0; mm->nr_ptes = 0; spin_lock_init(&mm->page_table_lock); @@ -323,6 +325,13 @@ return mm; } +static void rcu_free_mm(struct rcu_head *head) +{ + struct mm_struct *mm = container_of(head ,struct mm_struct, rcu_head); + + free_mm(mm); +} + /* * Called when the last reference to the mm * is dropped: either by a lazy thread or by @@ -333,7 +342,7 @@ BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); - free_mm(mm); + call_rcu(&mm->rcu_head, rcu_free_mm); } /* @@ -400,6 +409,8 @@ /* Get rid of any cached register state */ deactivate_mm(tsk, mm); + if (mm) + mm_remove_thread(mm, tsk); /* notify parent sleeping on vfork() */ if (vfork_done) { @@ -447,8 +458,8 @@ * new threads start up in user mode using an mm, which * allows optimizing out ipis; the tlb_gather_mmu code * is an example. + * (mm_add_thread does use the ptl .... 
) */ - spin_unlock_wait(&oldmm->page_table_lock); goto good_mm; } @@ -470,6 +481,7 @@ goto free_pt; good_mm: + mm_add_thread(mm, tsk); tsk->mm = mm; tsk->active_mm = mm; return 0; Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-11-30 20:33:46.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-11-30 20:33:50.000000000 -0800 @@ -1467,7 +1467,7 @@ */ lru_cache_add_active(page); page_add_anon_rmap(page, vma, addr); - mm->rss++; + current->rss++; } pte_unmap(page_table); @@ -1859,3 +1859,87 @@ } #endif + +unsigned long get_rss(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +unsigned long get_anon_rss(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->anon_rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +unsigned long get_shared(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->rss - mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->rss - t->anon_rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + if (!mm) + return; + + spin_lock(&mm->page_table_lock); + mm->rss += tsk->rss; + mm->anon_rss += tsk->anon_rss; + list_del_rcu(&tsk->mm_tasks); + spin_unlock(&mm->page_table_lock); +} + +void mm_add_thread(struct mm_struct *mm, struct 
task_struct *tsk) +{ + spin_lock(&mm->page_table_lock); + list_add_rcu(&tsk->mm_tasks, &mm->task_list); + spin_unlock(&mm->page_table_lock); +} + + Index: linux-2.6.9/include/linux/init_task.h =================================================================== --- linux-2.6.9.orig/include/linux/init_task.h 2004-11-30 20:33:30.000000000 -0800 +++ linux-2.6.9/include/linux/init_task.h 2004-11-30 20:33:50.000000000 -0800 @@ -42,6 +42,7 @@ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ + .task_list = LIST_HEAD_INIT(name.task_list), \ } #define INIT_SIGNALS(sig) { \ @@ -112,6 +113,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \ } Index: linux-2.6.9/fs/exec.c =================================================================== --- linux-2.6.9.orig/fs/exec.c 2004-11-30 20:33:41.000000000 -0800 +++ linux-2.6.9/fs/exec.c 2004-11-30 20:33:50.000000000 -0800 @@ -543,6 +543,7 @@ active_mm = tsk->active_mm; tsk->mm = mm; tsk->active_mm = mm; + mm_add_thread(mm, current); activate_mm(active_mm, mm); task_unlock(tsk); arch_pick_mmap_layout(mm); ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V14 [0/7]: Overview 2004-12-01 23:45 ` page fault scalability patch V12 [7/7]: Split counter for rss Christoph Lameter @ 2005-01-04 19:35 ` Christoph Lameter 2005-01-04 19:35 ` page fault scalability patch V14 [1/7]: Avoid taking page_table_lock Christoph Lameter ` (6 more replies) 0 siblings, 7 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:35 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changes from V13->V14 of this patch: - 4level page support - Tested on ia64, i386 and i386 in PAE mode This is a series of patches that increases the scalability of the page fault handler for SMP. The performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem) through new atomic operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's (pgd_test_and_populate, pmd_test_and_populate). The page table lock can be avoided in the following situations: 1. An empty pte or pmd entry is populated This is safe since the swapper may only depopulate them and the swapper code has been changed to never set a pte to be empty until the page has been evicted. The population of an empty pte is frequent if a process touches newly allocated memory. 2. Modifications of flags in a pte entry (write/accessed). These modifications are done by the CPU or by low-level handlers on various platforms also bypassing the page_table_lock. So this seems to be safe too. One essential change in the VM is the use of pte_cmpxchg (or its generic emulation) on page table entries before doing an update_mmu_cache without holding the page table lock. However, we do similar things now with other atomic pte operations such as ptep_get_and_clear and ptep_test_and_clear_dirty. These operations clear a pte *after* doing an operation on it. 
The ptep_cmpxchg as used in this patch operates on a *cleared* pte and replaces it with a pte pointing to valid memory. The effect of this change on various architectures has to be thought through. Local definitions of ptep_cmpxchg and ptep_xchg may be necessary. For IA64 an icache coherency issue may arise that potentially requires the flushing of the icache (as done via update_mmu_cache on IA64) prior to the use of ptep_cmpxchg. Similar issues may arise on other platforms. The patch introduces a split counter for rss handling to avoid atomic operations and locks currently necessary for rss modifications. In addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to be in the same cache line as tsk->mm (which is already used by the fault handler) and thus tsk->rss can be incremented without locks in a fast way. The cache line does not need to be shared between processors for the page table handler. A task list is generated for each mm (RCU-based). Values in that list are added up to calculate rss or anon_rss values. The patchset is composed of 7 patches (and was tested against 2.6.10-bk6): 1/7: Avoid page_table_lock in handle_mm_fault This patch defers the acquisition of the page_table_lock as much as possible and uses atomic operations for allocating anonymous memory. These atomic operations are simulated by acquiring the page_table_lock for very small time frames if an architecture does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes kswapd so that a pte will not be set to empty if a page is in transition to swap. If only the first two patches are applied then the time that the page_table_lock is held is simply reduced. The lock may then be acquired multiple times during a page fault. 2/7: Atomic pte operations for ia64 3/7: Make cmpxchg generally available on i386 The atomic operations on the page table rely heavily on cmpxchg instructions. This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486 cpus. 
The emulations are only included if a kernel is built for these old cpus, and are skipped in favor of the real cmpxchg instructions if a kernel built for a 386 or 486 is then run on a more recent cpu. This patch may be used independently of the other patches. 4/7: Atomic pte operations for i386 A generally available cmpxchg (last patch) must be available for this patch to preserve the ability to build kernels for 386 and 486. 5/7: Atomic pte operations for x86_64 6/7: Atomic pte operations for s390 7/7: Split counter implementation for rss Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic to calculate rss from tasklist. Signed-off-by: Christoph Lameter <clameter@sgi.com> ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V14 [1/7]: Avoid taking page_table_lock 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter @ 2005-01-04 19:35 ` Christoph Lameter 2005-01-04 19:36 ` page fault scalability patch V14 [2/7]: ia64 atomic pte operations Christoph Lameter ` (5 subsequent siblings) 6 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:35 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Increase parallelism in SMP configurations by deferring the acquisition of page_table_lock in handle_mm_fault * Anonymous memory page faults bypass the page_table_lock through the use of atomic page table operations * Swapper does not set pte to empty in transition to swap * Simulate atomic page table operations using the page_table_lock if an arch does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides a performance benefit since the page_table_lock is held for shorter periods of time. Signed-off-by: Christoph Lameter <clameter@sgi.com Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-03 15:48:34.000000000 -0800 @@ -1537,8 +1537,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. + * We hold the mm semaphore */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1550,15 +1549,13 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. 
+ * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1581,8 +1578,7 @@ lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1629,14 +1625,12 @@ } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held. */ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; struct page * page = ZERO_PAGE(addr); @@ -1648,7 +1642,6 @@ if (write_access) { /* Allocate our own private page. */ pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_mem; @@ -1657,30 +1650,34 @@ goto no_mem; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { - pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; - } - mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma); - lru_cache_add_active(page); - mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - set_pte(page_table, entry); + /* update the entry */ + if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { + if (write_access) { + pte_unmap(page_table); + page_cache_release(page); + } + goto out; + } + if (write_access) { + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. 
+ */ + page_add_anon_rmap(page, vma, addr); + lru_cache_add_active(page); + mm->rss++; + + } pte_unmap(page_table); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); out: return VM_FAULT_MINOR; no_mem: @@ -1696,12 +1693,12 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held. */ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1712,9 +1709,8 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1812,7 +1808,7 @@ * nonlinear vmas. 
*/ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1825,13 +1821,12 @@ if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } - pgoff = pte_to_pgoff(*pte); + pgoff = pte_to_pgoff(entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1850,49 +1845,46 @@ * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. - * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. - * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; - entry = *pte; + /* + * This must be a atomic operation since the page_table_lock is + * not held. If a pte_t larger than the word size is used an + * incorrect value could be read because another processor is + * concurrently updating the multi-word pte. 
The i386 PAE mode + * is raising its ugly head here. + */ + entry = get_pte_atomic(pte); if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + /* + * This is the case in which we only update some bits in the pte. + */ + new_entry = pte_mkyoung(entry); if (write_access) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + /* do_wp_page expects us to hold the page_table_lock */ + spin_lock(&mm->page_table_lock); return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + } + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + ptep_cmpxchg(vma, address, pte, entry, new_entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -1911,33 +1903,54 @@ inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. + * We rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd. We can avoid the overhead + * of the p??_alloc functions through atomic operations so + * we duplicate the functionality of pmd_alloc, pud_alloc and + * pte_alloc_map here. 
*/ pgd = pgd_offset(mm, address); - spin_lock(&mm->page_table_lock); + if (unlikely(pgd_none(*pgd))) { + pud_t *new = pud_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; + if (!pgd_test_and_populate(mm, pgd, new)); + pud_free(new); + } - pud = pud_alloc(mm, pgd, address); - if (!pud) - goto oom; - - pmd = pmd_alloc(mm, pud, address); - if (!pmd) - goto oom; - - pte = pte_alloc_map(mm, pmd, address); - if (!pte) - goto oom; + pud = pud_offset(pgd, address); + if (unlikely(pud_none(*pud))) { + pmd_t *new = pmd_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; + + if (!pud_test_and_populate(mm, pud, new)) + pmd_free(new); + } + + pmd = pmd_offset(pud, address); + if (unlikely(!pmd_present(*pmd))) { + struct page *new = pte_alloc_one(mm, address); - return handle_pte_fault(mm, vma, address, write_access, pte, pmd); + if (!new) + return VM_FAULT_OOM; - oom: - spin_unlock(&mm->page_table_lock); - return VM_FAULT_OOM; + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else { + inc_page_state(nr_page_table_pages); + mm->nr_ptes++; + } + } + + pte = pte_offset_map(pmd, address); + return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } #ifndef __ARCH_HAS_4LEVEL_HACK Index: linux-2.6.10/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-03 15:48:34.000000000 -0800 @@ -28,6 +28,11 @@ #endif /* __HAVE_ARCH_SET_PTE_ATOMIC */ #endif +/* Get a pte entry without the page table lock */ +#ifndef __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__x) *(__x) +#endif + #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Largely same as above, but only sets the access flags (dirty, @@ -134,4 +139,61 @@ #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) #endif +#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS +/* + * If atomic page table operations are not 
available then use + * the page_table_lock to insure some form of locking. + * Note thought that low level operations as well as the + * page_table_handling of the cpu may bypass all locking. + */ + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \ +({ \ + int __rc; \ + spin_lock(&__vma->vm_mm->page_table_lock); \ + __rc = pte_same(*(__ptep), __oldval); \ + if (__rc) { set_pte(__ptep, __newval); \ + update_mmu_cache(__vma, __addr, __newval); } \ + spin_unlock(&__vma->vm_mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pmd) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pgd_present(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pmd); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\ + flush_tlb_page(__vma, __address); \ + __p; \ +}) + +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-03 15:48:34.000000000 -0800 @@ -432,7 +432,10 @@ * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped * - * The caller needs to hold the mm->page_table_lock. + * The caller needs to hold the mm->page_table_lock if page + * is pointing to something that is known by the vm. 
+ * The lock does not need to be held if page is pointing + * to a newly allocated page. */ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) @@ -581,11 +584,6 @@ /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -600,11 +598,15 @@ list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; page_remove_rmap(page); page_cache_release(page); @@ -696,15 +698,21 @@ if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); + /* + * There would be a race here with handle_mm_fault and do_anonymous_page + * which bypasses the page_table_lock if we would zap the pte before + * putting something into it. On the other hand we need to + * have the dirty flag setting at the time we replaced the value. + */ /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) - set_pte(pte, pgoff_to_pte(page->index)); + pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index)); + else + pteval = ptep_get_and_clear(pte); - /* Move the dirty bit to the physical page now the pte is gone. */ + /* Move the dirty bit to the physical page now that the pte is gone. 
*/ if (pte_dirty(pteval)) set_page_dirty(page); Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-03 15:48:34.000000000 -0800 @@ -25,8 +25,9 @@ static inline int pgd_present(pgd_t pgd) { return 1; } static inline void pgd_clear(pgd_t *pgd) { } #define pud_ERROR(pud) (pgd_ERROR((pud).pgd)) - #define pgd_populate(mm, pgd, pud) do { } while (0) +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) { return 1; } + /* * (puds are folded into pgds so this doesn't get actually called, * but the define is needed for a generic inline function.) Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-03 15:49:12.000000000 -0800 @@ -29,6 +29,7 @@ #define pmd_ERROR(pmd) (pud_ERROR((pmd).pud)) #define pud_populate(mm, pmd, pte) do { } while (0) +static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) { return 1; } /* * (pmds are folded into puds so this doesn't get actually called, ^ permalink raw reply [flat|nested] 286+ messages in thread
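The allocate / test-and-populate / free-on-failure sequence that the new handle_mm_fault performs for each page table level can be sketched in user space. The names below are hypothetical: a uintptr_t slot stands in for a pgd/pud/pmd entry, 0 for the empty state, and calloc for the page table allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t pt_slot_t;
#define SLOT_NONE ((uintptr_t)0)

/* Publish 'new_table' into an empty slot atomically.  Returns 1 if we
 * installed it, 0 if another thread populated the slot first. */
static int slot_test_and_populate(pt_slot_t *slot, void *new_table)
{
    uintptr_t expected = SLOT_NONE;
    return __atomic_compare_exchange_n(slot, &expected,
                                       (uintptr_t)new_table, 0,
                                       __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);
}

/* The handle_mm_fault pattern: allocate outside any lock, try to
 * populate atomically, and free our private copy if we lost the race,
 * then continue with whatever table the slot now holds. */
static void *get_or_alloc_level(pt_slot_t *slot, size_t size)
{
    if (__atomic_load_n(slot, __ATOMIC_ACQUIRE) == SLOT_NONE) {
        void *table = calloc(1, size);
        if (table && !slot_test_and_populate(slot, table))
            free(table);        /* lost the race: use the winner's table */
    }
    return (void *)__atomic_load_n(slot, __ATOMIC_ACQUIRE);
}
```

The losing thread's only cost is a wasted allocation; no lock is held across the allocation, which is what lets concurrent faults on different addresses proceed in parallel.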
* page fault scalability patch V14 [2/7]: ia64 atomic pte operations 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter 2005-01-04 19:35 ` page fault scalability patch V14 [1/7]: Avoid taking page_table_lock Christoph Lameter @ 2005-01-04 19:36 ` Christoph Lameter 2005-01-04 19:37 ` page fault scalability patch V14 [3/7]: i386 universal cmpxchg Christoph Lameter ` (4 subsequent siblings) 6 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:36 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for ia64 * Enhanced parallelism in page fault handler if applied together with the generic patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-03 12:37:44.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PUD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -84,6 +88,13 @@ pud_val(*pud_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE; +} + static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) { @@ -131,6 +142,13 @@ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* Atomic populate */ +static inline int +pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + static inline 
void pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { Index: linux-2.6.10/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-03 12:37:44.000000000 -0800 @@ -30,6 +30,8 @@ #define _PAGE_P_BIT 0 #define _PAGE_A_BIT 5 #define _PAGE_D_BIT 6 +#define _PAGE_IG_BITS 53 +#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */ #define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */ #define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */ @@ -58,6 +60,7 @@ #define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL) #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63) +#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT) /* Valid only for a PTE with the present bit cleared: */ #define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ @@ -271,6 +274,8 @@ #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) #define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) +#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0) + /* * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the * access rights: @@ -282,8 +287,15 @@ #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A)) #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) #define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D)) +#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK)) /* + * Lock functions for pte's + */ +#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep) +#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); } +#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val)) +/* * Macro to a page protection value as "uncacheable". 
Note that "protection" is really a * misnomer here as the protection value contains the memory attribute bits, dirty bits, * and various other bits as well. @@ -343,7 +355,6 @@ #define pte_unmap_nested(pte) do { } while (0) /* atomic versions of the some PTE manipulations: */ - static inline int ptep_test_and_clear_young (pte_t *ptep) { @@ -415,6 +426,26 @@ #endif } +/* + * IA-64 doesn't have any external MMU info: the page tables contain all the necessary + * information. However, we use this routine to take care of any (delayed) i-cache + * flushing that may be necessary. + */ +extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); + +static inline int +ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval) +{ + /* + * IA64 defers icache flushes. If the new pte is executable we may + * have to flush the icache to insure cache coherency immediately + * after the cmpxchg. + */ + if (pte_exec(newval)) + update_mmu_cache(vma, addr, newval); + return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte; +} + static inline int pte_same (pte_t a, pte_t b) { @@ -477,13 +508,6 @@ struct vm_area_struct * prev, unsigned long start, unsigned long end); #endif -/* - * IA-64 doesn't have any external MMU info: the page tables contain all the necessary - * information. However, we use this routine to take care of any (delayed) i-cache - * flushing that may be necessary. 
- */ -extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); - #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Update PTEP with ENTRY, which is guaranteed to be a less @@ -561,6 +585,8 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE +#define __HAVE_ARCH_ATOMIC_TABLE_OPS +#define __HAVE_ARCH_LOCK_TABLE_OPS #include <asm-generic/pgtable.h> #include <asm-generic/pgtable-nopud.h> ^ permalink raw reply [flat|nested] 286+ messages in thread
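The _PAGE_LOCK bit this patch defines sits in the ignored bits of the ia64 pte (bit 56), so a per-pte lock can be taken with ordinary atomic bit operations without disturbing any hardware-interpreted bits. A user-space model of the ptep_lock/ptep_unlock idea (illustrative only, using GCC `__atomic` builtins in place of test_and_set_bit/clear_bit):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_LOCK_BIT 56
#define PAGE_LOCK ((uint64_t)1 << PAGE_LOCK_BIT)

/* Try to take the per-pte lock; returns 1 if we acquired it. */
static int pte_trylock(uint64_t *ptep)
{
    uint64_t old = __atomic_fetch_or(ptep, PAGE_LOCK, __ATOMIC_ACQUIRE);
    return !(old & PAGE_LOCK);   /* we own the lock iff the bit was clear */
}

static void pte_unlock(uint64_t *ptep)
{
    __atomic_fetch_and(ptep, ~PAGE_LOCK, __ATOMIC_RELEASE);
}
```

Because the lock lives inside the pte word itself, no auxiliary lock table is needed and the granularity is a single pte rather than a whole mm.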
* page fault scalability patch V14 [3/7]: i386 universal cmpxchg 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter 2005-01-04 19:35 ` page fault scalability patch V14 [1/7]: Avoid taking page_table_lock Christoph Lameter 2005-01-04 19:36 ` page fault scalability patch V14 [2/7]: ia64 atomic pte operations Christoph Lameter @ 2005-01-04 19:37 ` Christoph Lameter 2005-01-05 11:51 ` Roman Zippel 2005-01-04 19:37 ` page fault scalability patch V14 [4/7]: i386 atomic pte operations Christoph Lameter ` (3 subsequent siblings) 6 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:37 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Make cmpxchg and cmpxchg8b generally available on the i386 platform. * Provide emulation of cmpxchg suitable for uniprocessor if build and run on 386. * Provide emulation of cmpxchg8b suitable for uniprocessor systems if build and run on 386 or 486. 
* Provide an inline function to atomically get a 64 bit value via cmpxchg8b in an SMP system (courtesy of Nick Piggin) (important for i386 PAE mode and other places where atomic 64 bit operations are useful) Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/arch/i386/Kconfig =================================================================== --- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800 +++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800 @@ -351,6 +351,11 @@ depends on !M386 default y +config X86_CMPXCHG8B + bool + depends on !M386 && !M486 + default y + config X86_XADD bool depends on !M386 Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c =================================================================== --- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800 +++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800 @@ -6,6 +6,7 @@ #include <linux/bitops.h> #include <linux/smp.h> #include <linux/thread_info.h> +#include <linux/module.h> #include <asm/processor.h> #include <asm/msr.h> @@ -287,5 +288,103 @@ return 0; } +#ifndef CONFIG_X86_CMPXCHG +unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new) +{ + u8 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u8)); + + /* Poor man's cmpxchg for 386. 
Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u8 *)ptr; + if (prev == old) + *(u8 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u8); + +unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new) +{ + u16 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u16)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u16 *)ptr; + if (prev == old) + *(u16 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u16); + +unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) +{ + u32 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u32)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u32 *)ptr; + if (prev == old) + *(u32 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u32); +#endif + +#ifndef CONFIG_X86_CMPXCHG8B +unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + unsigned long flags; + + /* + * Check if the kernel was compiled for an old cpu but + * we are running really on a cpu capable of cmpxchg8b + */ + + if (cpu_has(cpu_data, X86_FEATURE_CX8)) + return __cmpxchg8b(ptr, old, newv); + + /* Poor mans cmpxchg8b for 386 and 486. 
Not suitable for SMP */ + local_irq_save(flags); + prev = *ptr; + if (prev == old) + *ptr = newv; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg8b_486); +#endif + // arch_initcall(intel_cpu_init); Index: linux-2.6.9/include/asm-i386/system.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800 @@ -149,6 +149,9 @@ #define __xg(x) ((struct __xchg_dummy *)(x)) +#define ll_low(x) *(((unsigned int*)&(x))+0) +#define ll_high(x) *(((unsigned int*)&(x))+1) + /* * The semantics of XCHGCMP8B are a bit strange, this is why * there is a loop and the loading of %%eax and %%edx has to @@ -184,8 +187,6 @@ { __set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL)); } -#define ll_low(x) *(((unsigned int*)&(x))+0) -#define ll_high(x) *(((unsigned int*)&(x))+1) static inline void __set_64bit_var (unsigned long long *ptr, unsigned long long value) @@ -203,6 +204,26 @@ __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \ __set_64bit(ptr, ll_low(value), ll_high(value)) ) +static inline unsigned long long __get_64bit(unsigned long long * ptr) +{ + unsigned long long ret; + __asm__ __volatile__ ( + "\n1:\t" + "movl (%1), %%eax\n\t" + "movl 4(%1), %%edx\n\t" + "movl %%eax, %%ebx\n\t" + "movl %%edx, %%ecx\n\t" + LOCK_PREFIX "cmpxchg8b (%1)\n\t" + "jnz 1b" + : "=A"(ret) + : "D"(ptr) + : "ebx", "ecx", "memory"); + return ret; +} + +#define get_64bit(ptr) __get_64bit(ptr) + + /* * Note: no "lock" prefix even on SMP: xchg always implies lock anyway * Note 2: xchg has side effect, so that attribute volatile is necessary, @@ -240,7 +261,41 @@ */ #ifdef CONFIG_X86_CMPXCHG + #define __HAVE_ARCH_CMPXCHG 1 +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) + +#else + +/* + * Building a 
kernel capable running on 80386. It may be necessary to + * simulate the cmpxchg on the 80386 CPU. For that purpose we define + * a function for each of the sizes we support. + */ + +extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8); +extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16); +extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32); + +static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, + unsigned long new, int size) +{ + switch (size) { + case 1: + return cmpxchg_386_u8(ptr, old, new); + case 2: + return cmpxchg_386_u16(ptr, old, new); + case 4: + return cmpxchg_386_u32(ptr, old, new); + } + return old; +} + +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) #endif static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, @@ -270,12 +325,34 @@ return old; } -#define cmpxchg(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ - (unsigned long)(n),sizeof(*(ptr)))) - +static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + __asm__ __volatile__( + LOCK_PREFIX "cmpxchg8b (%4)" + : "=A" (prev) + : "0" (old), "c" ((unsigned long)(newv >> 32)), + "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr) + : "memory"); + return prev; +} + +#ifdef CONFIG_X86_CMPXCHG8B +#define cmpxchg8b __cmpxchg8b +#else +/* + * Building a kernel capable of running on 80486 and 80386. Both + * do not support cmpxchg8b. Call a function that emulates the + * instruction if necessary. 
+ */ +extern unsigned long long cmpxchg8b_486(volatile unsigned long long *, + unsigned long long, unsigned long long); +#define cmpxchg8b cmpxchg8b_486 +#endif + #ifdef __KERNEL__ -struct alt_instr { +struct alt_instr { __u8 *instr; /* original instruction */ __u8 *replacement; __u8 cpuid; /* cpuid bit set for replacement */ ^ permalink raw reply [flat|nested] 286+ messages in thread
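The get_64bit() helper in this patch works by using cmpxchg8b as an atomic 64-bit load: compare-and-exchange the value with itself until the operation succeeds. A portable user-space model of the same trick, with a GCC `__atomic` builtin in place of the inline assembly:

```c
#include <assert.h>
#include <stdint.h>

/* Atomic 64-bit snapshot on a machine where a plain two-word load
 * could tear (the i386 PAE case).  CAS with new == old never modifies
 * memory; on failure the builtin refreshes 'snap' with the value it
 * actually observed, which is atomic by definition, so the next
 * iteration succeeds barring a concurrent update. */
static uint64_t get_64bit_model(uint64_t *ptr)
{
    uint64_t snap = *ptr;        /* possibly torn first guess */
    while (!__atomic_compare_exchange_n(ptr, &snap, snap, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        ;
    return snap;
}
```

This matters for the PAE fault path: handle_pte_fault reads the pte without the page_table_lock, so a 64-bit pte must be fetched in one atomic operation rather than as two 32-bit halves.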
* Re: page fault scalability patch V14 [3/7]: i386 universal cmpxchg 2005-01-04 19:37 ` page fault scalability patch V14 [3/7]: i386 universal cmpxchg Christoph Lameter @ 2005-01-05 11:51 ` Roman Zippel 0 siblings, 0 replies; 286+ messages in thread From: Roman Zippel @ 2005-01-05 11:51 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Hi, On Tuesday 04 January 2005 20:37, Christoph Lameter wrote: > * Provide emulation of cmpxchg suitable for uniprocessor if > build and run on 386. > * Provide emulation of cmpxchg8b suitable for uniprocessor > systems if build and run on 386 or 486. I'm not sure that's such a good idea. This emulation is more expensive as it has to disable interrupts and you already have emulation functions using spinlocks anyway, so why not use them? This way your patch would not just scale up, but also still scale down. bye, Roman ^ permalink raw reply [flat|nested] 286+ messages in thread
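The lock-based fallback Roman refers to looks roughly like this in user space: a mutex stands in for the spinlock (a real implementation might hash the address to one of several locks to reduce contention), and the semantics match cmpxchg — return the previous value, store the new one only on a match. An illustrative sketch, not the actual patch code:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t cmpxchg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulate cmpxchg under a lock.  Unlike the irq-disabling 386
 * emulation, this also works on SMP, at the price of taking a lock
 * that the native instruction avoids. */
static unsigned long cmpxchg_emulated(volatile unsigned long *ptr,
                                      unsigned long old, unsigned long new)
{
    unsigned long prev;

    pthread_mutex_lock(&cmpxchg_lock);
    prev = *ptr;
    if (prev == old)
        *ptr = new;
    pthread_mutex_unlock(&cmpxchg_lock);
    return prev;                 /* caller checks prev == old for success */
}
```

The trade-off under discussion is exactly this: the generic __HAVE_ARCH_ATOMIC_TABLE_OPS fallback already simulates the atomic operations with the page_table_lock, so a separate interrupt-disabling emulation only pays off on genuine uniprocessor 386/486 hardware.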
* page fault scalability patch V14 [4/7]: i386 atomic pte operations 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter ` (2 preceding siblings ...) 2005-01-04 19:37 ` page fault scalability patch V14 [3/7]: i386 universal cmpxchg Christoph Lameter @ 2005-01-04 19:37 ` Christoph Lameter 2005-01-04 19:38 ` page fault scalability patch V14 [5/7]: x86_64 " Christoph Lameter ` (2 subsequent siblings) 6 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:37 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Atomic pte operations for i386 in regular and PAE modes Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable.h 2005-01-03 12:08:35.000000000 -0800 @@ -407,6 +407,7 @@ #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME +#define __HAVE_ARCH_ATOMIC_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _I386_PGTABLE_H */ Index: linux-2.6.10/include/asm-i386/pgtable-3level.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable-3level.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable-3level.h 2005-01-03 12:11:59.000000000 -0800 @@ -8,7 +8,8 @@ * tables on PPro+ CPUs. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ + * August 26, 2004 added ptep_cmpxchg <christoph@lameter.com> +*/ #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low) @@ -44,21 +45,11 @@ return pte_x(pte); } -/* Rules for using set_pte: the pte being assigned *must* be - * either not present or in a state where the hardware will - * not attempt to update the pte. In places where this is - * not possible, use pte_get_and_clear to obtain the old pte - * value and then use set_pte to update it. -ben - */ -static inline void set_pte(pte_t *ptep, pte_t pte) -{ - ptep->pte_high = pte.pte_high; - smp_wmb(); - ptep->pte_low = pte.pte_low; -} #define __HAVE_ARCH_SET_PTE_ATOMIC #define set_pte_atomic(pteptr,pteval) \ set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) +#define set_pte(pteptr,pteval) \ + *(unsigned long long *)(pteptr) = pte_val(pteval) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pud(pudptr,pudval) \ @@ -155,4 +146,25 @@ #define __pmd_free_tlb(tlb, x) do { } while (0) +/* Atomic PTE operations */ +#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \ +({ pte_t __r; \ + /* xchg acts as a barrier before the setting of the high bits. 
*/\ + __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \ + __r.pte_high = (__ptep)->pte_high; \ + (__ptep)->pte_high = (__newval).pte_high; \ + flush_tlb_page(__vma, __addr); \ + (__r); \ +}) + +#define __HAVE_ARCH_PTEP_XCHG_FLUSH + +static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + +#define __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep))) + #endif /* _I386_PGTABLE_3LEVEL_H */ Index: linux-2.6.10/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable-2level.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable-2level.h 2005-01-03 12:08:35.000000000 -0800 @@ -65,4 +65,7 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) +/* Atomic PTE operations */ +#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low) + #endif /* _I386_PGTABLE_2LEVEL_H */ Index: linux-2.6.10/include/asm-i386/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgalloc.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgalloc.h 2005-01-03 12:11:23.000000000 -0800 @@ -4,9 +4,12 @@ #include <linux/config.h> #include <asm/processor.h> #include <asm/fixmap.h> +#include <asm/system.h> #include <linux/threads.h> #include <linux/mm.h> /* for struct page */ +#define PMD_NONE 0L + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) @@ -14,6 +17,18 @@ set_pmd(pmd, __pmd(_PAGE_TABLE + \ ((unsigned long long)page_to_pfn(pte) << \ (unsigned long long) PAGE_SHIFT))) +/* 
Atomic version */ +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ +#ifdef CONFIG_X86_PAE + return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE + + ((unsigned long long)page_to_pfn(pte) << + (unsigned long long) PAGE_SHIFT) ) == PMD_NONE; +#else + return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +#endif +} + /* * Allocate and free page tables. */ @@ -44,6 +59,7 @@ #define pmd_free(x) do { } while (0) #define __pmd_free_tlb(tlb,x) do { } while (0) #define pud_populate(mm, pmd, pte) BUG() +#define pud_test_and_populate(mm, pmd, pte) ({ BUG(); 1; }) #endif #define check_pgt_cache() do { } while (0) ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter ` (3 preceding siblings ...) 2005-01-04 19:37 ` page fault scalability patch V14 [4/7]: i386 atomic pte operations Christoph Lameter @ 2005-01-04 19:38 ` Christoph Lameter 2005-01-04 19:46 ` Andi Kleen 2005-01-04 21:21 ` page fault scalability patch V14 [5/7]: x86_64 atomic pte operations Brian Gerst 2005-01-04 19:38 ` page fault scalability patch V14 [6/7]: s390 atomic pte operationsw Christoph Lameter 2005-01-04 19:39 ` page fault scalability patch V14 [7/7]: Split RSS counters Christoph Lameter 6 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:38 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for x86_64 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-03 12:21:28.000000000 -0800 @@ -7,6 +7,10 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PUD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pud_populate(mm, pud, pmd) \ @@ -14,11 +18,24 @@ #define pgd_populate(mm, pgd, pud) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))) +#define pmd_test_and_populate(mm, pmd, pte) \ + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) +#define pud_test_and_populate(mm, pud, pmd) \ + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) +#define pgd_test_and_populate(mm, pgd, pud) \ + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) + + static inline void 
pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +} + extern __inline__ pmd_t *get_pmd(void) { return (pmd_t *)get_zeroed_page(GFP_KERNEL); Index: linux-2.6.10/include/asm-x86_64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-03 12:13:17.000000000 -0800 @@ -413,6 +413,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 19:38 ` page fault scalability patch V14 [5/7]: x86_64 " Christoph Lameter @ 2005-01-04 19:46 ` Andi Kleen 2005-01-04 19:58 ` Christoph Lameter 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter 2005-01-04 21:21 ` page fault scalability patch V14 [5/7]: x86_64 atomic pte operations Brian Gerst 1 sibling, 2 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-04 19:46 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter <clameter@sgi.com> writes: I bet this has been never tested. > #define pmd_populate_kernel(mm, pmd, pte) \ > set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) > #define pud_populate(mm, pud, pmd) \ > @@ -14,11 +18,24 @@ > #define pgd_populate(mm, pgd, pud) \ > set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))) > > +#define pmd_test_and_populate(mm, pmd, pte) \ > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) > +#define pud_test_and_populate(mm, pud, pmd) \ > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) > +#define pgd_test_and_populate(mm, pgd, pud) \ > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) > + Shouldn't this all be (long *)pmd ? page table entries on x86-64 are 64bit. Also why do you cast at all? i think the macro should handle an arbitary pointer. > + > static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) > { > set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); > } > > +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) > +{ > + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; Same. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 19:46 ` Andi Kleen @ 2005-01-04 19:58 ` Christoph Lameter 2005-01-04 20:21 ` Andi Kleen 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:58 UTC (permalink / raw) To: Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Tue, 4 Jan 2005, Andi Kleen wrote: > Christoph Lameter <clameter@sgi.com> writes: > > I bet this has been never tested. I tested this back in October and it worked fine. Would you be able to test your proposed modifications and send me a patch? > > +#define pmd_test_and_populate(mm, pmd, pte) \ > > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) > > +#define pud_test_and_populate(mm, pud, pmd) \ > > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) > > +#define pgd_test_and_populate(mm, pgd, pud) \ > > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) > > + > > Shouldn't this all be (long *)pmd ? page table entries on x86-64 are 64bit. > Also why do you cast at all? i think the macro should handle an arbitary > pointer. The macro checks for the size of the pointer and then generates the appropriate cmpxchg instruction. pgd_t is a struct which may be problematic for the cmpxchg macros. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 19:58 ` Christoph Lameter @ 2005-01-04 20:21 ` Andi Kleen 2005-01-04 20:32 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-04 20:21 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Tue, Jan 04, 2005 at 11:58:13AM -0800, Christoph Lameter wrote: > On Tue, 4 Jan 2005, Andi Kleen wrote: > > > Christoph Lameter <clameter@sgi.com> writes: > > > > I bet this has been never tested. > > I tested this back in October and it worked fine. Would you be able to > test your proposed modifications and send me a patch? Hmm, I don't think it could have worked this way, except if you only tested page faults < 4GB. > > > > +#define pmd_test_and_populate(mm, pmd, pte) \ > > > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) > > > +#define pud_test_and_populate(mm, pud, pmd) \ > > > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) > > > +#define pgd_test_and_populate(mm, pgd, pud) \ > > > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) > > > + > > > > Shouldn't this all be (long *)pmd ? page table entries on x86-64 are 64bit. > > Also why do you cast at all? i think the macro should handle an arbitary > > pointer. > > The macro checks for the size of the pointer and then generates the > appropriate cmpxchg instruction. pgd_t is a struct which may be > problematic for the cmpxchg macros. It just checks sizeof, that should be fine. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 20:21 ` Andi Kleen @ 2005-01-04 20:32 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 20:32 UTC (permalink / raw) To: Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Tue, 4 Jan 2005, Andi Kleen wrote: > > The macro checks for the size of the pointer and then generates the > > appropriate cmpxchg instruction. pgd_t is a struct which may be > > problematic for the cmpxchg macros. > > It just checks sizeof, that should be fine. Index: linux-2.6.10/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-04 12:31:14.000000000 -0800 @@ -7,6 +7,10 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PUD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pud_populate(mm, pud, pmd) \ @@ -14,11 +18,24 @@ #define pgd_populate(mm, pgd, pud) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))) +#define pmd_test_and_populate(mm, pmd, pte) \ + (cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) +#define pud_test_and_populate(mm, pud, pmd) \ + (cmpxchg(pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) +#define pgd_test_and_populate(mm, pgd, pud) \ + (cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) + + static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +} + extern __inline__ pmd_t *get_pmd(void) { return (pmd_t 
*)get_zeroed_page(GFP_KERNEL); Index: linux-2.6.10/include/asm-x86_64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-03 15:02:01.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-04 12:29:25.000000000 -0800 @@ -413,6 +413,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [0/7]: overview 2005-01-04 19:46 ` Andi Kleen 2005-01-04 19:58 ` Christoph Lameter @ 2005-01-11 17:39 ` Christoph Lameter 2005-01-11 17:40 ` page table lock patch V15 [1/7]: Reduce use of page table lock Christoph Lameter ` (7 more replies) 1 sibling, 8 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:39 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changes from V14->V15 of this patch: - Remove misplaced semicolon in handle_mm_fault (caused x86_64 troubles) - Fixed up and tested x86_64 arch specific patch - Redone against 2.6.10-bk14 This is a series of patches that increases the scalability of the page fault handler for SMP. The performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem) through new atomic operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd, pud and pgd's (pgd_test_and_populate, pud_test_and_populate, pmd_test_and_populate). The page table lock can be avoided in the following situations: 1. An empty pte or pmd entry is populated This is safe since the swapper may only depopulate them and the swapper code has been changed to never set a pte to be empty until the page has been evicted. The population of an empty pte is frequent if a process touches newly allocated memory. 2. Modifications of flags in a pte entry (write/accessed). These modifications are done by the CPU or by low level handlers on various platforms also bypassing the page_table_lock. So this seems to be safe too. One essential change in the VM is the use of pte_cmpxchg (or its generic emulation) on page table entries before doing an update_mmu_change without holding the page table lock. However, we do similar things now with other atomic pte operations such as ptep_get_and_clear and ptep_test_and_clear_dirty. These operations clear a pte *after* doing an operation on it. 
The ptep_cmpxchg as used in this patch operates on a *cleared* pte and replaces it with a pte pointing to valid memory. The effect of this change on various architectures has to be thought through. Local definitions of ptep_cmpxchg and ptep_xchg may be necessary. For ia64 an icache coherency issue may arise that potentially requires the flushing of the icache (as done via update_mmu_cache on ia64) prior to the use of ptep_cmpxchg. Similar issues may arise on other platforms. The patch introduces a split counter for rss handling to avoid atomic operations and locks currently necessary for rss modifications. In addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to be in the same cache line as tsk->mm (which is already used by the fault handler) and thus tsk->rss can be incremented without locks in a fast way. The cache line does not need to be shared between processors for the page table handler. A tasklist is generated for each mm (rcu based). Values in that list are added up to calculate rss or anon_rss values. The patchset is composed of 7 patches (and was tested against 2.6.10-bk6): 1/7: Avoid page_table_lock in handle_mm_fault This patch defers the acquisition of the page_table_lock as much as possible and uses atomic operations for allocating anonymous memory. These atomic operations are simulated by acquiring the page_table_lock for very small time frames if an architecture does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes kswapd so that a pte will not be set to empty if a page is in transition to swap. If only the first two patches are applied then the time that the page_table_lock is held is simply reduced. The lock may then be acquired multiple times during a page fault. 2/7: Atomic pte operations for ia64 3/7: Make cmpxchg generally available on i386 The atomic operations on the page table rely heavily on cmpxchg instructions. This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486 cpus. 
The emulations are only included if a kernel is built for these old cpus and are skipped in favor of the real cmpxchg instructions if the kernel that is built for 386 or 486 is then run on a more recent cpu. This patch may be used independently of the other patches. 4/7: Atomic pte operations for i386 A generally available cmpxchg (last patch) must be available for this patch to preserve the ability to build kernels for 386 and 486. 5/7: Atomic pte operations for x86_64 6/7: Atomic pte operations for s390 7/7: Split counter implementation for rss Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic to calculate rss from tasklist. Signed-off-by: Christoph Lameter <clameter@sgi.com> ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [1/7]: Reduce use of page table lock 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter @ 2005-01-11 17:40 ` Christoph Lameter 2005-01-11 17:41 ` page table lock patch V15 [2/7]: ia64 atomic pte operations Christoph Lameter ` (6 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:40 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Increase parallelism in SMP configurations by deferring the acquisition of page_table_lock in handle_mm_fault * Anonymous memory page faults bypass the page_table_lock through the use of atomic page table operations * Swapper does not set pte to empty in transition to swap * Simulate atomic page table operations using the page_table_lock if an arch does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides a performance benefit since the page_table_lock is held for shorter periods of time. Signed-off-by: Christoph Lameter <clameter@sgi.com Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-11 09:16:30.000000000 -0800 @@ -36,6 +36,8 @@ * (Gerhard.Wichert@pdb.siemens.de) * * Aug/Sep 2004 Changed to four level page tables (Andi Kleen) + * Jan 2005 Scalability improvement by reducing the use and the length of time + * the page table lock is held (Christoph Lameter) */ #include <linux/kernel_stat.h> @@ -1677,8 +1679,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. 
+ * We hold the mm semaphore */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1690,15 +1691,13 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1721,8 +1720,7 @@ lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1772,14 +1770,12 @@ } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held. */ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; struct page * page = ZERO_PAGE(addr); @@ -1789,47 +1785,44 @@ /* ..except if it's a write access */ if (write_access) { - /* Allocate our own private page. */ - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + /* Allocate our own private page. 
*/ if (unlikely(anon_vma_prepare(vma))) - goto no_mem; + return VM_FAULT_OOM; + page = alloc_page_vma(GFP_HIGHUSER, vma, addr); if (!page) - goto no_mem; + return VM_FAULT_OOM; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, + vma->vm_page_prot)), + vma); + } - if (!pte_none(*page_table)) { + if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { + if (write_access) { pte_unmap(page_table); page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; } + return VM_FAULT_MINOR; + } + if (write_access) { + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. + */ + page_add_anon_rmap(page, vma, addr); + lru_cache_add_active(page); mm->rss++; acct_update_integrals(); update_mem_hiwater(); - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - SetPageReferenced(page); - page_add_anon_rmap(page, vma, addr); - } - set_pte(page_table, entry); + } pte_unmap(page_table); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); -out: return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* @@ -1841,12 +1834,12 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held. 
*/ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1857,9 +1850,8 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1959,7 +1951,7 @@ * nonlinear vmas. */ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1972,13 +1964,12 @@ if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } - pgoff = pte_to_pgoff(*pte); + pgoff = pte_to_pgoff(entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1997,49 +1988,46 @@ * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. - * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. 
- * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; - entry = *pte; + /* + * This must be an atomic operation since the page_table_lock is + * not held. If a pte_t larger than the word size is used then an + * incorrect value could be read because another processor is + * concurrently updating the multi-word pte. The i386 PAE mode + * is raising its ugly head here. + */ + entry = get_pte_atomic(pte); if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + /* + * This is the case in which we only update some bits in the pte. 
+ */ + new_entry = pte_mkyoung(entry); if (write_access) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + /* do_wp_page expects us to hold the page_table_lock */ + spin_lock(&mm->page_table_lock); return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + } + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + ptep_cmpxchg(vma, address, pte, entry, new_entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -2058,33 +2046,55 @@ inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. + * We rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd. We can avoid the overhead + * of the p??_alloc functions through atomic operations so + * we duplicate the functionality of pmd_alloc, pud_alloc and + * pte_alloc_map here. 
*/ pgd = pgd_offset(mm, address); - spin_lock(&mm->page_table_lock); + if (unlikely(pgd_none(*pgd))) { + pud_t *new = pud_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; - pud = pud_alloc(mm, pgd, address); - if (!pud) - goto oom; - - pmd = pmd_alloc(mm, pud, address); - if (!pmd) - goto oom; - - pte = pte_alloc_map(mm, pmd, address); - if (!pte) - goto oom; + if (!pgd_test_and_populate(mm, pgd, new)) + pud_free(new); + } + + pud = pud_offset(pgd, address); + if (unlikely(pud_none(*pud))) { + pmd_t *new = pmd_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; + + if (!pud_test_and_populate(mm, pud, new)) + pmd_free(new); + } + + pmd = pmd_offset(pud, address); + if (unlikely(!pmd_present(*pmd))) { + struct page *new = pte_alloc_one(mm, address); - return handle_pte_fault(mm, vma, address, write_access, pte, pmd); + if (!new) + return VM_FAULT_OOM; - oom: - spin_unlock(&mm->page_table_lock); - return VM_FAULT_OOM; + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else { + inc_page_state(nr_page_table_pages); + mm->nr_ptes++; + } + } + + pte = pte_offset_map(pmd, address); + return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } #ifndef __ARCH_HAS_4LEVEL_HACK Index: linux-2.6.10/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-11 08:48:30.000000000 -0800 @@ -28,6 +28,11 @@ #endif /* __HAVE_ARCH_SET_PTE_ATOMIC */ #endif +/* Get a pte entry without the page table lock */ +#ifndef __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__x) *(__x) +#endif + #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Largely same as above, but only sets the access flags (dirty, @@ -134,4 +139,73 @@ #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) #endif +#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS +/* + * If atomic page table operations are not 
available then use + * the page_table_lock to insure some form of locking. + * Note thought that low level operations as well as the + * page_table_handling of the cpu may bypass all locking. + */ + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \ +({ \ + int __rc; \ + spin_lock(&__vma->vm_mm->page_table_lock); \ + __rc = pte_same(*(__ptep), __oldval); \ + if (__rc) { set_pte(__ptep, __newval); \ + update_mmu_cache(__vma, __addr, __newval); } \ + spin_unlock(&__vma->vm_mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pud) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pgd_present(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pud); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE +#define pud_test_and_populate(__mm, __pud, __pmd) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pud_present(*(__pud)); \ + if (__rc) pud_populate(__mm, __pud, __pmd); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\ + flush_tlb_page(__vma, __address); \ + __p; \ +}) + +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-11 08:48:30.000000000 -0800 @@ 
-426,7 +426,10 @@ * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped * - * The caller needs to hold the mm->page_table_lock. + * The caller needs to hold the mm->page_table_lock if page + * is pointing to something that is known by the vm. + * The lock does not need to be held if page is pointing + * to a newly allocated page. */ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) @@ -575,11 +578,6 @@ /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -594,11 +592,15 @@ list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; acct_update_integrals(); page_remove_rmap(page); @@ -691,15 +693,21 @@ if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); + /* + * There would be a race here with handle_mm_fault and do_anonymous_page + * which bypasses the page_table_lock if we would zap the pte before + * putting something into it. On the other hand we need to + * have the dirty flag setting at the time we replaced the value. + */ /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) - set_pte(pte, pgoff_to_pte(page->index)); + pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index)); + else + pteval = ptep_get_and_clear(pte); - /* Move the dirty bit to the physical page now the pte is gone. */ + /* Move the dirty bit to the physical page now that the pte is gone. */ if (pte_dirty(pteval)) set_page_dirty(page); Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-11 08:48:30.000000000 -0800 @@ -25,8 +25,13 @@ static inline int pgd_present(pgd_t pgd) { return 1; } static inline void pgd_clear(pgd_t *pgd) { } #define pud_ERROR(pud) (pgd_ERROR((pud).pgd)) - #define pgd_populate(mm, pgd, pud) do { } while (0) + +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) +{ + return 1; +} + /* * (puds are folded into pgds so this doesn't get actually called, * but the define is needed for a generic inline function.) 
Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-11 08:48:30.000000000 -0800 @@ -29,6 +29,7 @@ #define pmd_ERROR(pmd) (pud_ERROR((pmd).pud)) #define pud_populate(mm, pmd, pte) do { } while (0) +static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) { return 1; } /* * (pmds are folded into puds so this doesn't get actually called, Index: linux-2.6.10/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-11 08:48:30.000000000 -0800 @@ -561,7 +561,7 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE -#include <asm-generic/pgtable.h> #include <asm-generic/pgtable-nopud.h> +#include <asm-generic/pgtable.h> #endif /* _ASM_IA64_PGTABLE_H */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [2/7]: ia64 atomic pte operations 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter 2005-01-11 17:40 ` page table lock patch V15 [1/7]: Reduce use of page table lock Christoph Lameter @ 2005-01-11 17:41 ` Christoph Lameter 2005-01-11 17:41 ` page table lock patch V15 [3/7]: i386 universal cmpxchg Christoph Lameter ` (5 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:41 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for ia64 * Enhanced parallelism in page fault handler if applied together with the generic patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-10 16:31:56.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-10 16:41:00.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PUD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -82,6 +86,13 @@ pud_val(*pud_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE; +} + static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) { @@ -127,6 +138,13 @@ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* Atomic populate */ +static inline int +pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + static inline void 
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { Index: linux-2.6.10/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-10 16:32:35.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-10 16:41:00.000000000 -0800 @@ -30,6 +30,8 @@ #define _PAGE_P_BIT 0 #define _PAGE_A_BIT 5 #define _PAGE_D_BIT 6 +#define _PAGE_IG_BITS 53 +#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */ #define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */ #define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */ @@ -58,6 +60,7 @@ #define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL) #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63) +#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT) /* Valid only for a PTE with the present bit cleared: */ #define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ @@ -271,6 +274,8 @@ #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) #define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) +#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0) + /* * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the * access rights: @@ -282,8 +287,15 @@ #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A)) #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) #define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D)) +#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK)) /* + * Lock functions for pte's + */ +#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep) +#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); } +#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val)) +/* * Macro to a page protection value as "uncacheable". 
Note that "protection" is really a * misnomer here as the protection value contains the memory attribute bits, dirty bits, * and various other bits as well. @@ -343,7 +355,6 @@ #define pte_unmap_nested(pte) do { } while (0) /* atomic versions of the some PTE manipulations: */ - static inline int ptep_test_and_clear_young (pte_t *ptep) { @@ -415,6 +426,26 @@ #endif } +/* + * IA-64 doesn't have any external MMU info: the page tables contain all the necessary + * information. However, we use this routine to take care of any (delayed) i-cache + * flushing that may be necessary. + */ +extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); + +static inline int +ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval) +{ + /* + * IA64 defers icache flushes. If the new pte is executable we may + * have to flush the icache to insure cache coherency immediately + * after the cmpxchg. + */ + if (pte_exec(newval)) + update_mmu_cache(vma, addr, newval); + return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte; +} + static inline int pte_same (pte_t a, pte_t b) { @@ -477,13 +508,6 @@ struct vm_area_struct * prev, unsigned long start, unsigned long end); #endif -/* - * IA-64 doesn't have any external MMU info: the page tables contain all the necessary - * information. However, we use this routine to take care of any (delayed) i-cache - * flushing that may be necessary. 
- */ -extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); - #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Update PTEP with ENTRY, which is guaranteed to be a less @@ -561,6 +585,8 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE +#define __HAVE_ARCH_ATOMIC_TABLE_OPS +#define __HAVE_ARCH_LOCK_TABLE_OPS #include <asm-generic/pgtable-nopud.h> #include <asm-generic/pgtable.h> ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [3/7]: i386 universal cmpxchg 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter 2005-01-11 17:40 ` page table lock patch V15 [1/7]: Reduce use of page table lock Christoph Lameter 2005-01-11 17:41 ` page table lock patch V15 [2/7]: ia64 atomic pte operations Christoph Lameter @ 2005-01-11 17:41 ` Christoph Lameter 2005-01-11 17:42 ` page table lock patch V15 [4/7]: i386 atomic pte operations Christoph Lameter ` (4 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:41 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Make cmpxchg and cmpxchg8b generally available on the i386 platform. * Provide emulation of cmpxchg suitable for uniprocessor if built and run on 386. * Provide emulation of cmpxchg8b suitable for uniprocessor systems if built and run on 386 or 486. * Provide an inline function to atomically get a 64 bit value via cmpxchg8b in an SMP system (courtesy of Nick Piggin) (important for i386 PAE mode and other places where atomic 64 bit operations are useful) Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/arch/i386/Kconfig =================================================================== --- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800 +++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800 @@ -351,6 +351,11 @@ depends on !M386 default y +config X86_CMPXCHG8B + bool + depends on !M386 && !M486 + default y + config X86_XADD bool depends on !M386 Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c =================================================================== --- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800 +++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800 @@ -6,6 +6,7 @@ #include <linux/bitops.h> #include <linux/smp.h> #include
<linux/thread_info.h> +#include <linux/module.h> #include <asm/processor.h> #include <asm/msr.h> @@ -287,5 +288,103 @@ return 0; } +#ifndef CONFIG_X86_CMPXCHG +unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new) +{ + u8 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u8)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u8 *)ptr; + if (prev == old) + *(u8 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u8); + +unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new) +{ + u16 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u16)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u16 *)ptr; + if (prev == old) + *(u16 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u16); + +unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) +{ + u32 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u32)); + + /* Poor man's cmpxchg for 386. 
Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u32 *)ptr; + if (prev == old) + *(u32 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u32); +#endif + +#ifndef CONFIG_X86_CMPXCHG8B +unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + unsigned long flags; + + /* + * Check if the kernel was compiled for an old cpu but + * we are running really on a cpu capable of cmpxchg8b + */ + + if (cpu_has(cpu_data, X86_FEATURE_CX8)) + return __cmpxchg8b(ptr, old, newv); + + /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */ + local_irq_save(flags); + prev = *ptr; + if (prev == old) + *ptr = newv; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg8b_486); +#endif + // arch_initcall(intel_cpu_init); Index: linux-2.6.9/include/asm-i386/system.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800 @@ -149,6 +149,9 @@ #define __xg(x) ((struct __xchg_dummy *)(x)) +#define ll_low(x) *(((unsigned int*)&(x))+0) +#define ll_high(x) *(((unsigned int*)&(x))+1) + /* * The semantics of XCHGCMP8B are a bit strange, this is why * there is a loop and the loading of %%eax and %%edx has to @@ -184,8 +187,6 @@ { __set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL)); } -#define ll_low(x) *(((unsigned int*)&(x))+0) -#define ll_high(x) *(((unsigned int*)&(x))+1) static inline void __set_64bit_var (unsigned long long *ptr, unsigned long long value) @@ -203,6 +204,26 @@ __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \ __set_64bit(ptr, ll_low(value), ll_high(value)) ) +static inline unsigned long long __get_64bit(unsigned long long * ptr) +{ + unsigned long long ret; + __asm__ __volatile__ ( + "\n1:\t" + "movl 
(%1), %%eax\n\t" + "movl 4(%1), %%edx\n\t" + "movl %%eax, %%ebx\n\t" + "movl %%edx, %%ecx\n\t" + LOCK_PREFIX "cmpxchg8b (%1)\n\t" + "jnz 1b" + : "=A"(ret) + : "D"(ptr) + : "ebx", "ecx", "memory"); + return ret; +} + +#define get_64bit(ptr) __get_64bit(ptr) + + /* * Note: no "lock" prefix even on SMP: xchg always implies lock anyway * Note 2: xchg has side effect, so that attribute volatile is necessary, @@ -240,7 +261,41 @@ */ #ifdef CONFIG_X86_CMPXCHG + #define __HAVE_ARCH_CMPXCHG 1 +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) + +#else + +/* + * Building a kernel capable running on 80386. It may be necessary to + * simulate the cmpxchg on the 80386 CPU. For that purpose we define + * a function for each of the sizes we support. + */ + +extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8); +extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16); +extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32); + +static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, + unsigned long new, int size) +{ + switch (size) { + case 1: + return cmpxchg_386_u8(ptr, old, new); + case 2: + return cmpxchg_386_u16(ptr, old, new); + case 4: + return cmpxchg_386_u32(ptr, old, new); + } + return old; +} + +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) #endif static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, @@ -270,12 +325,34 @@ return old; } -#define cmpxchg(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ - (unsigned long)(n),sizeof(*(ptr)))) - +static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + __asm__ __volatile__( + LOCK_PREFIX "cmpxchg8b (%4)" + : "=A" (prev) + : "0" (old), "c" ((unsigned long)(newv >> 32)), + "b" 
((unsigned long)(newv & 0xffffffffULL)), "D" (ptr) + : "memory"); + return prev; +} + +#ifdef CONFIG_X86_CMPXCHG8B +#define cmpxchg8b __cmpxchg8b +#else +/* + * Building a kernel capable of running on 80486 and 80386. Both + * do not support cmpxchg8b. Call a function that emulates the + * instruction if necessary. + */ +extern unsigned long long cmpxchg8b_486(volatile unsigned long long *, + unsigned long long, unsigned long long); +#define cmpxchg8b cmpxchg8b_486 +#endif + #ifdef __KERNEL__ -struct alt_instr { +struct alt_instr { __u8 *instr; /* original instruction */ __u8 *replacement; __u8 cpuid; /* cpuid bit set for replacement */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [4/7]: i386 atomic pte operations 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter ` (2 preceding siblings ...) 2005-01-11 17:41 ` page table lock patch V15 [3/7]: i386 universal cmpxchg Christoph Lameter @ 2005-01-11 17:42 ` Christoph Lameter 2005-01-11 17:43 ` page table lock patch V15 [5/7]: x86_64 " Christoph Lameter ` (3 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:42 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Atomic pte operations for i386 in regular and PAE modes Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable.h 2005-01-07 09:48:57.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable.h 2005-01-07 09:51:09.000000000 -0800 @@ -409,6 +409,7 @@ #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME +#define __HAVE_ARCH_ATOMIC_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _I386_PGTABLE_H */ Index: linux-2.6.10/include/asm-i386/pgtable-3level.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable-3level.h 2005-01-07 09:48:57.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable-3level.h 2005-01-07 09:51:09.000000000 -0800 @@ -8,7 +8,8 @@ * tables on PPro+ CPUs. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ + * August 26, 2004 added ptep_cmpxchg <christoph@lameter.com> +*/ #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low) @@ -44,21 +45,11 @@ return pte_x(pte); } -/* Rules for using set_pte: the pte being assigned *must* be - * either not present or in a state where the hardware will - * not attempt to update the pte. In places where this is - * not possible, use pte_get_and_clear to obtain the old pte - * value and then use set_pte to update it. -ben - */ -static inline void set_pte(pte_t *ptep, pte_t pte) -{ - ptep->pte_high = pte.pte_high; - smp_wmb(); - ptep->pte_low = pte.pte_low; -} #define __HAVE_ARCH_SET_PTE_ATOMIC #define set_pte_atomic(pteptr,pteval) \ set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) +#define set_pte(pteptr,pteval) \ + *(unsigned long long *)(pteptr) = pte_val(pteval) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pud(pudptr,pudval) \ @@ -155,4 +146,25 @@ #define __pmd_free_tlb(tlb, x) do { } while (0) +/* Atomic PTE operations */ +#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \ +({ pte_t __r; \ + /* xchg acts as a barrier before the setting of the high bits. 
*/\ + __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \ + __r.pte_high = (__ptep)->pte_high; \ + (__ptep)->pte_high = (__newval).pte_high; \ + flush_tlb_page(__vma, __addr); \ + (__r); \ +}) + +#define __HAVE_ARCH_PTEP_XCHG_FLUSH + +static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + +#define __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep))) + #endif /* _I386_PGTABLE_3LEVEL_H */ Index: linux-2.6.10/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgtable-2level.h 2005-01-07 09:48:57.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgtable-2level.h 2005-01-07 09:51:09.000000000 -0800 @@ -65,4 +65,7 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) +/* Atomic PTE operations */ +#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low) + #endif /* _I386_PGTABLE_2LEVEL_H */ Index: linux-2.6.10/include/asm-i386/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-i386/pgalloc.h 2005-01-07 09:48:57.000000000 -0800 +++ linux-2.6.10/include/asm-i386/pgalloc.h 2005-01-07 11:15:55.000000000 -0800 @@ -4,9 +4,12 @@ #include <linux/config.h> #include <asm/processor.h> #include <asm/fixmap.h> +#include <asm/system.h> #include <linux/threads.h> #include <linux/mm.h> /* for struct page */ +#define PMD_NONE 0L + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) @@ -14,6 +17,18 @@ set_pmd(pmd, __pmd(_PAGE_TABLE + \ ((unsigned long long)page_to_pfn(pte) << \ (unsigned long long) PAGE_SHIFT))) +/* 
Atomic version */ +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ +#ifdef CONFIG_X86_PAE + return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE + + ((unsigned long long)page_to_pfn(pte) << + (unsigned long long) PAGE_SHIFT) ) == PMD_NONE; +#else + return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +#endif +} + /* * Allocate and free page tables. */ @@ -44,6 +59,7 @@ #define pmd_free(x) do { } while (0) #define __pmd_free_tlb(tlb,x) do { } while (0) #define pud_populate(mm, pmd, pte) BUG() +#define pud_test_and_populate(mm, pmd, pte) ({ BUG(); 1; }) #endif #define check_pgt_cache() do { } while (0) ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [5/7]: x86_64 atomic pte operations 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter ` (3 preceding siblings ...) 2005-01-11 17:42 ` page table lock patch V15 [4/7]: i386 atomic pte operations Christoph Lameter @ 2005-01-11 17:43 ` Christoph Lameter 2005-01-11 17:43 ` page table lock patch V15 [6/7]: s390 " Christoph Lameter ` (2 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:43 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for x86_64 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-10 16:31:56.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-10 16:41:24.000000000 -0800 @@ -7,6 +7,10 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PUD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pud_populate(mm, pud, pmd) \ @@ -14,9 +18,20 @@ #define pgd_populate(mm, pgd, pud) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))) +#define pud_test_and_populate(mm, pud, pmd) \ + (cmpxchg((unsigned long *)pud, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) +#define pgd_test_and_populate(mm, pgd, pud) \ + (cmpxchg((unsigned long *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) + + static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { - set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); + set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); +} + +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg((unsigned long *)pmd, 
PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; } extern __inline__ pmd_t *get_pmd(void) Index: linux-2.6.10/include/asm-x86_64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-x86_64/pgtable.h 2005-01-10 16:31:56.000000000 -0800 +++ linux-2.6.10/include/asm-x86_64/pgtable.h 2005-01-10 16:41:24.000000000 -0800 @@ -414,6 +414,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [6/7]: s390 atomic pte operations 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter ` (4 preceding siblings ...) 2005-01-11 17:43 ` page table lock patch V15 [5/7]: x86_64 " Christoph Lameter @ 2005-01-11 17:43 ` Christoph Lameter 2005-01-11 17:44 ` page table lock patch V15 [7/7]: Split RSS counter Christoph Lameter 2005-01-12 5:59 ` page table lock patch V15 [0/7]: overview Nick Piggin 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:43 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for s390 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/asm-s390/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-s390/pgtable.h 2005-01-10 16:31:56.000000000 -0800 +++ linux-2.6.10/include/asm-s390/pgtable.h 2005-01-10 16:41:07.000000000 -0800 @@ -577,6 +577,15 @@ return pte; } +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + struct mm_struct *__mm = __vma->vm_mm; \ + pte_t __pte; \ + __pte = ptep_clear_flush(__vma, __address, __ptep); \ + set_pte(__ptep, __pteval); \ + __pte; \ +}) + static inline void ptep_set_wrprotect(pte_t *ptep) { pte_t old_pte = *ptep; @@ -788,6 +797,14 @@ #define kern_addr_valid(addr) (1) +/* Atomic PTE operations */ +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + +static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + /* * No page table caches to initialise */ @@ -801,6 +818,7 @@ #define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH #define __HAVE_ARCH_PTEP_GET_AND_CLEAR #define __HAVE_ARCH_PTEP_CLEAR_FLUSH +#define __HAVE_ARCH_PTEP_XCHG_FLUSH #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define 
__HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME Index: linux-2.6.10/include/asm-s390/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-s390/pgalloc.h 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/include/asm-s390/pgalloc.h 2005-01-10 16:41:07.000000000 -0800 @@ -97,6 +97,10 @@ pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd); } +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd) +{ + return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV; +} #endif /* __s390x__ */ static inline void @@ -119,6 +123,18 @@ pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT)); } +static inline int +pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page) +{ + int rc; + spin_lock(&mm->page_table_lock); + + rc = pte_same(*pmd, _PAGE_INVALID_EMPTY); + if (rc) pmd_populate(mm, pmd, page); + spin_unlock(&mm->page_table_lock); + return rc; +} + /* * page table entry allocation/free routines. */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page table lock patch V15 [7/7]: Split RSS counter 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter ` (5 preceding siblings ...) 2005-01-11 17:43 ` page table lock patch V15 [6/7]: s390 " Christoph Lameter @ 2005-01-11 17:44 ` Christoph Lameter 2005-01-12 5:59 ` page table lock patch V15 [0/7]: overview Nick Piggin 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-11 17:44 UTC (permalink / raw) To: torvalds, Andi Kleen Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Split rss counter into the task structure * remove 3 checks of rss in mm/rmap.c * increment current->rss instead of mm->rss in the page fault handler * move incrementing of anon_rss out of page_add_anon_rmap to group the increments more tightly and allow a better cache utilization Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/linux/sched.h =================================================================== --- linux-2.6.10.orig/include/linux/sched.h 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/include/linux/sched.h 2005-01-11 08:56:45.000000000 -0800 @@ -31,6 +31,7 @@ #include <linux/pid.h> #include <linux/percpu.h> #include <linux/topology.h> +#include <linux/rcupdate.h> struct exec_domain; @@ -216,6 +217,7 @@ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + long rss, anon_rss; struct list_head mmlist; /* List of maybe swapped mm's. 
These are globally strung * together off init_mm.mmlist, and are protected @@ -225,7 +227,7 @@ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ @@ -235,6 +237,7 @@ /* Architecture-specific MM context */ mm_context_t context; + struct list_head task_list; /* Tasks using this mm */ /* Token based thrashing protection. */ unsigned long swap_token_time; @@ -555,6 +558,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Split counters from mm */ + long rss; + long anon_rss; /* task state */ struct linux_binfmt *binfmt; @@ -587,6 +593,10 @@ struct completion *vfork_done; /* for vfork() */ int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ + + /* List of other tasks using the same mm */ + struct list_head mm_tasks; + struct rcu_head rcu_head; /* For freeing the task via rcu */ unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; @@ -1184,6 +1194,11 @@ return 0; } #endif /* CONFIG_PM */ + +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss); +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk); +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif Index: linux-2.6.10/fs/proc/task_mmu.c =================================================================== --- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-11 08:56:45.000000000 -0800 @@ -8,8 +8,9 @@ char *task_mem(struct mm_struct *mm, char *buffer) { - unsigned long data, text, lib; + unsigned long data, text, lib, rss, 
anon_rss; + get_rss(mm, &rss, &anon_rss); data = mm->total_vm - mm->shared_vm - mm->stack_vm; text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; @@ -24,7 +25,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << (PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + rss << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -39,11 +40,14 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + unsigned long rss, anon_rss; + + get_rss(mm, &rss, &anon_rss); + *shared = rss - anon_rss; *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; - *resident = mm->rss; + *resident = rss; return mm->total_vm; } Index: linux-2.6.10/fs/proc/array.c =================================================================== --- linux-2.6.10.orig/fs/proc/array.c 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/fs/proc/array.c 2005-01-11 08:56:45.000000000 -0800 @@ -303,7 +303,7 @@ static int do_task_stat(struct task_struct *task, char * buffer, int whole) { - unsigned long vsize, eip, esp, wchan = ~0UL; + unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL; long priority, nice; int tty_pgrp = -1, tty_nr = 0; sigset_t sigign, sigcatch; @@ -326,6 +326,7 @@ vsize = task_vsize(mm); eip = KSTK_EIP(task); esp = KSTK_ESP(task); + get_rss(mm, &rss, &anon_rss); } get_task_comm(tcomm, task); @@ -421,7 +422,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? rss : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? 
mm->end_code : 0, Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-11 08:48:30.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-11 08:56:45.000000000 -0800 @@ -258,8 +258,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -440,8 +438,6 @@ BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; - anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; index += vma->vm_pgoff; @@ -513,8 +509,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -813,8 +807,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.10/kernel/fork.c =================================================================== --- linux-2.6.10.orig/kernel/fork.c 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/kernel/fork.c 2005-01-11 08:56:45.000000000 -0800 @@ -79,10 +79,16 @@ static kmem_cache_t *task_struct_cachep; #endif +static void rcu_free_task(struct rcu_head *head) +{ + struct task_struct *tsk = container_of(head ,struct task_struct, rcu_head); + free_task_struct(tsk); +} + void free_task(struct task_struct *tsk) { free_thread_info(tsk->thread_info); - free_task_struct(tsk); + call_rcu(&tsk->rcu_head, rcu_free_task); } EXPORT_SYMBOL(free_task); @@ -99,7 +105,7 @@ put_group_info(tsk->group_info); if (!profile_handoff_task(tsk)) - free_task(tsk); + call_rcu(&tsk->rcu_head, rcu_free_task); } void __init fork_init(unsigned long mempages) @@ -152,6 +158,7 @@ *tsk = *orig; tsk->thread_info = ti; ti->task = tsk; + 
tsk->rss = 0; /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); @@ -294,6 +301,7 @@ atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); + INIT_LIST_HEAD(&mm->task_list); mm->core_waiters = 0; mm->nr_ptes = 0; spin_lock_init(&mm->page_table_lock); @@ -402,6 +410,8 @@ /* Get rid of any cached register state */ deactivate_mm(tsk, mm); + if (mm) + mm_remove_thread(mm, tsk); /* notify parent sleeping on vfork() */ if (vfork_done) { @@ -449,8 +459,8 @@ * new threads start up in user mode using an mm, which * allows optimizing out ipis; the tlb_gather_mmu code * is an example. + * (mm_add_thread does use the ptl .... ) */ - spin_unlock_wait(&oldmm->page_table_lock); goto good_mm; } @@ -475,6 +485,7 @@ mm->hiwater_vm = mm->total_vm; good_mm: + mm_add_thread(mm, tsk); tsk->mm = mm; tsk->active_mm = mm; return 0; @@ -1079,7 +1090,7 @@ atomic_dec(&p->user->processes); free_uid(p->user); bad_fork_free: - free_task(p); + call_rcu(&p->rcu_head, rcu_free_task); goto fork_out; } Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-11 08:56:37.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-11 08:56:45.000000000 -0800 @@ -935,6 +935,7 @@ cond_resched_lock(&mm->page_table_lock); while (!(map = follow_page(mm, start, lookup_write))) { + unsigned long rss, anon_rss; /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -947,6 +948,17 @@ map = ZERO_PAGE(start); break; } + if (mm != current->mm) { + /* + * handle_mm_fault uses the current pointer + * for a split rss counter. 
The current pointer + * is not correct if we are using a different mm + */ + rss = current->rss; + anon_rss = current->anon_rss; + current->rss = 0; + current->anon_rss = 0; + } spin_unlock(&mm->page_table_lock); switch (handle_mm_fault(mm,vma,start,write)) { case VM_FAULT_MINOR: @@ -971,6 +983,12 @@ */ lookup_write = write && !force; spin_lock(&mm->page_table_lock); + if (mm != current->mm) { + mm->rss += current->rss; + mm->anon_rss += current->anon_rss; + current->rss = rss; + current->anon_rss = anon_rss; + } } if (pages) { pages[i] = get_page_map(map); @@ -1353,6 +1371,7 @@ break_cow(vma, new_page, address, page_table); lru_cache_add_active(new_page); page_add_anon_rmap(new_page, vma, address); + mm->anon_rss++; /* Free the old page.. */ new_page = old_page; @@ -1753,6 +1772,7 @@ flush_icache_page(vma, page); set_pte(page_table, pte); page_add_anon_rmap(page, vma, address); + mm->anon_rss++; if (write_access) { if (do_wp_page(mm, vma, address, @@ -1815,6 +1835,7 @@ page_add_anon_rmap(page, vma, addr); lru_cache_add_active(page); mm->rss++; + mm->anon_rss++; acct_update_integrals(); update_mem_hiwater(); @@ -1922,6 +1943,7 @@ if (anon) { lru_cache_add_active(new_page); page_add_anon_rmap(new_page, vma, address); + mm->anon_rss++; } else page_add_file_rmap(new_page); pte_unmap(page_table); @@ -2250,6 +2272,49 @@ EXPORT_SYMBOL(vmalloc_to_pfn); +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss) +{ + struct list_head *y; + struct task_struct *t; + long rss_sum, anon_rss_sum; + + rcu_read_lock(); + rss_sum = mm->rss; + anon_rss_sum = mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss_sum += t->rss; + anon_rss_sum += t->anon_rss; + } + if (rss_sum < 0) + rss_sum = 0; + if (anon_rss_sum < 0) + anon_rss_sum = 0; + rcu_read_unlock(); + *rss = rss_sum; + *anon_rss = anon_rss_sum; +} + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + if (!mm) + return; + + 
spin_lock(&mm->page_table_lock); + mm->rss += tsk->rss; + mm->anon_rss += tsk->anon_rss; + list_del_rcu(&tsk->mm_tasks); + spin_unlock(&mm->page_table_lock); +} + +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + spin_lock(&mm->page_table_lock); + tsk->rss = 0; + tsk->anon_rss = 0; + list_add_rcu(&tsk->mm_tasks, &mm->task_list); + spin_unlock(&mm->page_table_lock); +} /* * update_mem_hiwater * - update per process rss and vm high water data Index: linux-2.6.10/include/linux/init_task.h =================================================================== --- linux-2.6.10.orig/include/linux/init_task.h 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/include/linux/init_task.h 2005-01-11 08:56:45.000000000 -0800 @@ -42,6 +42,7 @@ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ + .task_list = LIST_HEAD_INIT(name.task_list), \ } #define INIT_SIGNALS(sig) { \ @@ -112,6 +113,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \ } Index: linux-2.6.10/fs/exec.c =================================================================== --- linux-2.6.10.orig/fs/exec.c 2005-01-11 08:46:15.000000000 -0800 +++ linux-2.6.10/fs/exec.c 2005-01-11 08:56:45.000000000 -0800 @@ -556,6 +556,7 @@ tsk->active_mm = mm; activate_mm(active_mm, mm); task_unlock(tsk); + mm_add_thread(mm, current); arch_pick_mmap_layout(mm); if (old_mm) { if (active_mm != old_mm) BUG(); Index: linux-2.6.10/fs/aio.c =================================================================== --- linux-2.6.10.orig/fs/aio.c 2004-12-24 13:34:44.000000000 -0800 +++ linux-2.6.10/fs/aio.c 2005-01-11 08:56:45.000000000 -0800 @@ -577,6 +577,7 @@ tsk->active_mm = mm; activate_mm(active_mm, mm); task_unlock(tsk); + mm_add_thread(mm, tsk); mmdrop(active_mm); } @@ -596,6 +597,7 @@ { struct task_struct *tsk = current; + 
mm_remove_thread(mm,tsk); task_lock(tsk); tsk->flags &= ~PF_BORROWED_MM; tsk->mm = NULL; Index: linux-2.6.10/mm/swapfile.c =================================================================== --- linux-2.6.10.orig/mm/swapfile.c 2005-01-11 08:46:16.000000000 -0800 +++ linux-2.6.10/mm/swapfile.c 2005-01-11 08:56:45.000000000 -0800 @@ -433,6 +433,7 @@ swp_entry_t entry, struct page *page) { vma->vm_mm->rss++; + vma->vm_mm->anon_rss++; get_page(page); set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot))); page_add_anon_rmap(page, vma, address); ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-11 17:39 ` page table lock patch V15 [0/7]: overview Christoph Lameter ` (6 preceding siblings ...) 2005-01-11 17:44 ` page table lock patch V15 [7/7]: Split RSS counter Christoph Lameter @ 2005-01-12 5:59 ` Nick Piggin 2005-01-12 9:42 ` Andrew Morton 7 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-12 5:59 UTC (permalink / raw) To: Christoph Lameter Cc: torvalds, Andi Kleen, Hugh Dickins, akpm, linux-mm, linux-ia64, linux-kernel, Benjamin Herrenschmidt Christoph Lameter wrote: > Changes from V14->V15 of this patch: Hi, I wonder what everyone thinks about moving forward with these patches? Has it been decided that they'll be merged soon? Christoph has been working fairly hard on them, but there hasn't been a lot of feedback. And for those few people who have looked at my patches for page table lock removal, is there any preference for one implementation or the other? It is probably fair to say that my patches are more comprehensive (in terms of ptl removal, ie. the complete removal**), and can allow architectures to be more flexible in their page table synchronisation methods. However, Christoph's are simpler and probably more widely tested and reviewed at this stage, and more polished. Christoph's implementation probably also covers the most pressing performance cases. On the other hand, my patches *do* allow for the use of a spin-locked synchronisation implementation, which is probably closer to the current code than Christoph's spin-locked pte_cmpxchg fallback in terms of changes to locking semantics. [** Aside, I didn't see a very significant improvement in mm/rmap.c functions from ptl removal. Mostly I think due to contention on mapping->i_mmap_lock (I didn't test anonymous pages, they may have a better yield)] ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 5:59 ` page table lock patch V15 [0/7]: overview Nick Piggin @ 2005-01-12 9:42 ` Andrew Morton 2005-01-12 12:29 ` Marcelo Tosatti ` (2 more replies) 0 siblings, 3 replies; 286+ messages in thread From: Andrew Morton @ 2005-01-12 9:42 UTC (permalink / raw) To: Nick Piggin Cc: clameter, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Christoph Lameter wrote: > > Changes from V14->V15 of this patch: > > Hi, > > I wonder what everyone thinks about moving forward with these patches? I was waiting for them to settle down before paying more attention. My general take is that these patches address a single workload on exceedingly rare and expensive machines. If they adversely affect common and cheap machines via code complexity, memory footprint or via runtime impact then it would be pretty hard to justify their inclusion. Do we have measurements of the negative and/or positive impact on smaller machines? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 9:42 ` Andrew Morton @ 2005-01-12 12:29 ` Marcelo Tosatti 2005-01-12 12:43 ` Hugh Dickins 2005-01-12 16:39 ` Christoph Lameter 2 siblings, 0 replies; 286+ messages in thread From: Marcelo Tosatti @ 2005-01-12 12:29 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, clameter, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, Jan 12, 2005 at 01:42:35AM -0800, Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > Christoph Lameter wrote: > > > Changes from V14->V15 of this patch: > > > > Hi, > > > > I wonder what everyone thinks about moving forward with these patches? > > I was waiting for them to settle down before paying more attention. > > My general take is that these patches address a single workload on > exceedingly rare and expensive machines. If they adversely affect common > and cheap machines via code complexity, memory footprint or via runtime > impact then it would be pretty hard to justify their inclusion. > > Do we have measurements of the negative and/or positive impact on smaller > machines? I haven't seen wide performance numbers of this patch yet. Hint: STP is really easy. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 9:42 ` Andrew Morton 2005-01-12 12:29 ` Marcelo Tosatti @ 2005-01-12 12:43 ` Hugh Dickins 2005-01-12 21:22 ` Hugh Dickins 2005-01-12 16:39 ` Christoph Lameter 2 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2005-01-12 12:43 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, clameter, torvalds, ak, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > Christoph Lameter wrote: > > > Changes from V14->V15 of this patch: > > > > I wonder what everyone thinks about moving forward with these patches? > > I was waiting for them to settle down before paying more attention. They seem to have settled down, without advancing to anything satisfactory. 7/7 is particularly amusing at the moment (added complexity with no payoff). > My general take is that these patches address a single workload on > exceedingly rare and expensive machines. Well put. Christoph's patches stubbornly remain a _good_ hack for one very specific initial workload (multi-parallel faulting of anon memory) on one architecture (ia64, perhaps a few more) important to SGI. I don't see why the mainline kernel should want them. > If they adversely affect common > and cheap machines via code complexity, memory footprint or via runtime > impact then it would be pretty hard to justify their inclusion. Aside from 7/7 (and some good asm primitives within headers), the code itself is not complex; but it is more complex to think about, and so less obviously correct. > Do we have measurements of the negative and/or positive impact on smaller > machines? I don't think so. But my main worry remains the detriment to other architectures, which still remains unaddressed. Nick's patches (I've not seen for some while) are a different case: on the minus side, considerably more complex; on the plus side, more general and more aware of the range of architectures. 
I'll write at greater length to support these accusations later on. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 12:43 ` Hugh Dickins @ 2005-01-12 21:22 ` Hugh Dickins 2005-01-12 23:52 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2005-01-12 21:22 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Christoph Lameter, Jay Lan, Linus Torvalds, Andi Kleen, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Hugh Dickins wrote: > On Wed, 12 Jan 2005, Andrew Morton wrote: > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > Christoph Lameter wrote: > > > > Changes from V14->V15 of this patch: > > > I wonder what everyone thinks about moving forward with these patches? > > I was waiting for them to settle down before paying more attention. > They seem to have settled down, without advancing to anything satisfactory. Well, I studied the patches a bit more, and wrote "That remark looks a bit unfair to me now I've looked closer." Sorry. But I do still think it remains unsatisfactory." Then I studied it a bit more, and I think my hostility melted away once I thought about the other-arch-defaults: I'd been supposing that taking and dropping the page_table_lock within each primitive was adding up to an unhealthy flurry of takes and drops on the non-target architectures. But that doesn't look like the case to me now (except in those rarer paths where a page table has to be allocated: of course, not a problem). I owe Christoph an apology. It's not quite satisfactory yet, but it does look a lot better than an ia64 hack for one special case. Might I save face by suggesting that it would be a lot clearer and better if 1/1 got split into two? 
The first entirely concerned with removing the spin_lock(&mm->page_table_lock) from handle_mm_fault, and dealing with the consequences of that - moving the locking into the allocating blocks, atomic getting of pud and pmd and pte, passing the atomically-gotten orig_pte down to subfunctions (which no longer expect page_table_lock held on entry) etc. If there's a slight increase in the number of atomic operations in each i386 PAE page fault, well, I think the superiority of x86_64 makes that now an acceptable tradeoff. That would be quite a decent patch, wouldn't it? that could go into -mm for a few days and be measured, before any more. Then one using something called ptep_cmpxchg to encapsulate the page_table_lock'ed checking of pte_same and set_pte in do_anonymous page. Then ones to implement ptep_cmpxchg per selected arches without page_table_lock. Dismiss those suggestions if they'd just waste everyone's time. Christoph has made some strides in correcting for other architectures e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock (probably correct but I can't be sure myself), and get_pte_atomic to get even i386 PAE pte correctly without page_table_lock; and reverted the pessimization of set_pte being always atomic on i386 PAE (but now I've forgotten and can't find the case where it needed to be atomic). Unless it's just been fixed in this latest version, the well-intentioned get_pte_atomic doesn't actually work on i386 PAE: once you get swapping, the swap entries look like pte_nones and all collapses. Presumably just #define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep))) doesn't quite do what it's trying to do, and needs a slight adjustment. But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level entries - I thought we'd agreed they were also necessary on some arches? > 7/7 is particularly amusing at the moment (added complexity with no payoff). 
I still dislike 7/7, despite seeing the sense of keeping stats in the task struct. It's at the very end anyway, and I'd be glad for it to be delayed (in the hope that time somehow magically makes it nicer). In its present state it is absurd: partly because Christoph seems to have forgotten the point of it, so after all the per-thread infrastructure, has ended up with do_anonymous_page saying mm->rss++, mm->anon_rss++. And partly because others at SGI have been working in the opposite direction, adding mysterious and tasteless acct_update_integrals and update_mem_hiwater calls. I say mysterious because there's nothing in the tree which actually uses the accumulated statistics, or shows how they might be used (when many threads share the mm), - so Adrian/Arjan/HCH might remove them any day. But looking at December mails suggests there's lse-tech agreement that all kinds of addons would find them useful. I say tasteless because they don't even take "mm" arguments (what happens when ptrace or AIO daemon faults something? perhaps it's okay but there's no use of the stats to judge by), and the places where you'd want to update hiwater_rss are almost entirely disjoint from the places where you'd want to update hiwater_vm (expand_stack the exception). If those new stats stay, and the per-task-rss idea stays, then I suppose those new stats need to be split per task too. > I'll write at greater length to support these accusations later on. I rather failed to do so! And perhaps tomorrow I'll have to be apologizing to Jay for my uncomprehending attack on hiwater etc. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 21:22 ` Hugh Dickins @ 2005-01-12 23:52 ` Christoph Lameter 2005-01-13 2:52 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 23:52 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Nick Piggin, Jay Lan, Linus Torvalds, Andi Kleen, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Hugh Dickins wrote: > Well, I studied the patches a bit more, and wrote > "That remark looks a bit unfair to me now I've looked closer." > Sorry. But I do still think it remains unsatisfactory." Well then thanks for not ccing me on the initial rant but a whole bunch of other people instead that you then did not send the following email to. Is this standard behavior on linux-mm? > Might I save face by suggesting that it would be a lot clearer and > better if 1/1 got split into two? The first entirely concerned with > removing the spin_lock(&mm->page_table_lock) from handle_mm_fault, > and dealing with the consequences of that - moving the locking into > the allocating blocks, atomic getting of pud and pmd and pte, > passing the atomically-gotten orig_pte down to subfunctions > (which no longer expect page_table_lock held on entry) etc. That won't do any good since the ptes are not always updated in an atomic way. One would have to change set_pte to always be atomic. The reason that I added get_pte_atomic was that you told me that this would fix the PAE mode. I did not think too much about this but simply added it according to your wish and it seemed to run fine. If you have any complaints, complain to yourself. > If there's a slight increase in the number of atomic operations > in each i386 PAE page fault, well, I think the superiority of > x86_64 makes that now an acceptable tradeoff. Could we have PAE mode drop back to using the page_table_lock? > Dismiss those suggestions if they'd just waste everyone's time. They don't fix the PAE mode issue.
> Christoph has made some strides in correcting for other architectures > e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock > (probably correct but I can't be sure myself), and get_pte_atomic to > get even i386 PAE pte correctly without page_table_lock; and reverted > the pessimization of set_pte being always atomic on i386 PAE (but now > I've forgotten and can't find the case where it needed to be atomic). Well this was another suggestion of yours that I followed. Turns out that the set_pte must be atomic for this to work! Look, I am no expert on the i386 PAE mode and I rely on others to check up on it. And you were the expert. > But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level > entries - I thought we'd agreed they were also necessary on some arches? I did not hear about that. Maybe you also sent that email to other people instead? > In its present state it is absurd: partly because Christoph seems to > have forgotten the point of it, so after all the per-thread infrastructure, > has ended up with do_anonymous_page saying mm->rss++, mm->anon_rss++. Sorry, that seems to have dropped out of the patch somehow. Here is the fix: Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-11 09:16:34.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-12 15:49:45.000000000 -0800 @@ -1835,8 +1835,8 @@ do_anonymous_page(struct mm_struct *mm, */ page_add_anon_rmap(page, vma, addr); lru_cache_add_active(page); - mm->rss++; - mm->anon_rss++; + current->rss++; + current->anon_rss++; acct_update_integrals(); update_mem_hiwater(); > And partly because others at SGI have been working in the opposite > direction, adding mysterious and tasteless acct_update_integrals > and update_mem_hiwater calls. I say mysterious because there's Yea. I posted a patch to move that stuff out of the vm. No good deed goes unpunished.
^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:52 ` Christoph Lameter @ 2005-01-13 2:52 ` Hugh Dickins 2005-01-13 17:05 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2005-01-13 2:52 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Nick Piggin, Jay Lan, Linus Torvalds, Andi Kleen, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Christoph Lameter wrote: > On Wed, 12 Jan 2005, Hugh Dickins wrote: > > > Well, I studied the patches a bit more, and wrote > > "That remark looks a bit unfair to me now I've looked closer." > > Sorry. But I do still think it remains unsatisfactory." > > Well then thanks for not ccing me on the initial rant but a whole bunch of > other people instead that you then did not send the following email too. > Is this standard behavior on linux-mm? I did cc you. What whole bunch of other people? The list of recipients was the same, except (for obvious reasons) I added Jay the second time (and having more time, spelt out most names in full). Perhaps we've a misunderstanding: when I say "and wrote..." above, I'm not quoting from some mail I sent others not you, I'm referring to an earlier draft of the mail I'm then sending. Or perhaps SGI has a spam filter which chose to gobble it up. I'll try forwarding it to you again. > > Might I save face by suggesting that it would be a lot clearer and > > better if 1/1 got split into two? The first entirely concerned with > > removing the spin_lock(&mm->page_table_lock) from handle_mm_fault, > > and dealing with the consequences of that - moving the locking into > > the allocating blocks, atomic getting of pud and pmd and pte, > > passing the atomically-gotten orig_pte down to subfunctions > > (which no longer expect page_table_lock held on entry) etc. > > That wont do any good since the pte's are not always updated in an atomic > way. One would have to change set_pte to always be atomic. 
You did have set_pte always atomic at one point, to the detriment of (PAE) set_page_range. You rightly reverted that, but you've reminded me of what I confessed to forgetting, where you do need set_pte_atomic in various places, mainly (only?) the fault handlers in mm/memory.c. And yes, I think you're right, that needs to be in this first patch. > The reason > that I added get_pte_atomic was that you told me that this would fix the > PAE mode. I did not think too much about this but simply added it > according to your wish and it seemed to run fine. Please don't leave the thinking to me or anyone else. > If you have any complaints, complain to yourself. I'd better omit my response to that. > > If there's a slight increase in the number of atomic operations > > in each i386 PAE page fault, well, I think the superiority of > > x86_64 makes that now an acceptable tradeoff. > > Could we have PAE mode drop back to using the page_table_lock? That sounds a simple and sensible alternative (to more atomics): haven't really thought it through, but if the default arch code is right, and not overhead, then why not use it for the PAE case instead of cluttering up with cleverness. Yes, I think that's a good idea: anyone see why not? > > Dismiss those suggestions if they'd just waste everyone's time. > > They dont fix the PAE mode issue. > > > Christoph has made some strides in correcting for other architectures > > e.g. update_mmu_cache within default ptep_cmpxchg's page_table_lock > > (probably correct but I can't be sure myself), and get_pte_atomic to > > get even i386 PAE pte correctly without page_table_lock; and reverted > > the pessimization of set_pte being always atomic on i386 PAE (but now > > I've forgotten and can't find the case where it needed to be atomic). > > Well this was another suggestion of yours that I followed. Turns out that > the set_pte must be atomic for this to work! 
I didn't say you never needed an atomic set_pte, I said that making set_pte always atomic (in the PAE case) unnecessarily slowed down copy_page_range and zap_pte_range. Probably a misunderstanding. > Look I am no expert on the > i386 PAE mode and I rely on other for this to check up on it. And you were > the expert. Expert? I was trying to help, but you seem to resent that. > > But no sign of get_pmd(atomic) or get_pud(atomic) to get the higher level > > entries - I thought we'd agreed they were also necessary on some arches? > > I did not hear about that. Maybe you also sent that email to other people > instead? No, you were cc'ed on that one too (Sun, 12 Dec to Nick Piggin). The spam filter again. Not that I have total recall of every exchange about these patches either. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 2:52 ` Hugh Dickins @ 2005-01-13 17:05 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-13 17:05 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Nick Piggin, Jay Lan, Linus Torvalds, Andi Kleen, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Hugh Dickins wrote: > I did cc you. What whole bunch of other people? The list of recipients > was the same, except (for obvious reasons) I added Jay the second time > (and having more time, spelt out most names in full). > > Or perhaps SGI has a spam filter which chose to gobble it up. > I'll try forwarding it to you again. Yes sorry it was the spam filter. I got my copy from linux-mm just fine on another account. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 9:42 ` Andrew Morton 2005-01-12 12:29 ` Marcelo Tosatti 2005-01-12 12:43 ` Hugh Dickins @ 2005-01-12 16:39 ` Christoph Lameter 2005-01-12 16:49 ` Christoph Hellwig 2005-01-12 18:43 ` Andrew Morton 2 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 16:39 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Andrew Morton wrote: > My general take is that these patches address a single workload on > exceedingly rare and expensive machines. If they adversely affect common > and cheap machines via code complexity, memory footprint or via runtime > impact then it would be pretty hard to justify their inclusion. The future is in higher and higher SMP counts since the chase for the higher clock frequency has ended. We will increasingly see multi-core cpus etc. Machines with higher CPU counts are becoming common in business. Of course SGI uses much higher CPU counts and our supercomputer applications would benefit most from this patch. I thought this patch was already approved by Linus? > Do we have measurements of the negative and/or positive impact on smaller > machines? Here is a measurement of a 256M allocation on a 2-way SMP machine (2x PIII-500MHz):

Gb Rep Threads  User    System  Wall    flt/cpu/s fault/wsec
 0  10  1       0.005s  0.016s  0.002s  54357.280 52261.895
 0  10  2       0.008s  0.019s  0.002s  43112.368 42463.566

With patch:

Gb Rep Threads  User    System  Wall    flt/cpu/s fault/wsec
 0  10  1       0.005s  0.016s  0.002s  54357.280 53439.357
 0  10  2       0.008s  0.018s  0.002s  44650.831 44202.412

So only very minor improvements for old machines (this one from ~1998). ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 16:39 ` Christoph Lameter @ 2005-01-12 16:49 ` Christoph Hellwig 2005-01-12 17:37 ` Christoph Lameter 2005-01-12 18:43 ` Andrew Morton 1 sibling, 1 reply; 286+ messages in thread From: Christoph Hellwig @ 2005-01-12 16:49 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, Jan 12, 2005 at 08:39:21AM -0800, Christoph Lameter wrote: > The future is in higher and higher SMP counts since the chase for the > higher clock frequency has ended. We will increasingly see multi-core > cpus etc. Machines with higher CPU counts are becoming common in business. And they still are absolutely in the minority. In fact, with multicore cpus it becomes more and more important to be fast for SMP systems with a _small_ number of CPUs, while really large CPU counts will remain a small niche for the foreseeable future. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 16:49 ` Christoph Hellwig @ 2005-01-12 17:37 ` Christoph Lameter 2005-01-12 17:41 ` Christoph Hellwig 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 17:37 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Christoph Hellwig wrote: > On Wed, Jan 12, 2005 at 08:39:21AM -0800, Christoph Lameter wrote: > > The future is in higher and higher SMP counts since the chase for the > > higher clock frequency has ended. We will increasingly see multi-core > > cpus etc. Machines with higher CPU counts are becoming common in business. > > An they still are absolutely in the minority. In fact with multicore > cpus it becomes more and more important to be fast for SMP systtems with > a _small_ number of CPUs, while really larget CPUs will remain a small > nische for the forseeable future. The benefits start to be significant pretty fast with even a few cpus on modern architectures: Altix no patch: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 1 10 1 0.107s 6.444s 6.055s100028.084 100006.622 1 10 2 0.121s 9.048s 4.082s 71468.414 135904.412 1 10 4 0.129s 10.185s 3.011s 63531.985 210146.600 w/patch Gb Rep Threads User System Wall flt/cpu/s fault/wsec 1 10 1 0.094s 6.116s 6.021s105517.039 105517.574 1 10 2 0.134s 6.998s 3.087s 91879.573 169079.712 1 10 4 0.095s 7.658s 2.043s 84519.939 268955.165 There is even a small benefit to the single thread case. It's not the case that this patch only benefits systems with a large number of CPUs, although it is on large systems that the benefits turn into performance gains of orders of magnitude. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 17:37 ` Christoph Lameter @ 2005-01-12 17:41 ` Christoph Hellwig 2005-01-12 17:52 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Christoph Hellwig @ 2005-01-12 17:41 UTC (permalink / raw) To: Christoph Lameter Cc: Christoph Hellwig, Andrew Morton, Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, Jan 12, 2005 at 09:37:27AM -0800, Christoph Lameter wrote: > > The benefits start to be significant pretty fast with even a few cpus > on modern architectures: > > Altix no patch: > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 1 10 1 0.107s 6.444s 6.055s100028.084 100006.622 > 1 10 2 0.121s 9.048s 4.082s 71468.414 135904.412 > 1 10 4 0.129s 10.185s 3.011s 63531.985 210146.600 > > w/patch > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 1 10 1 0.094s 6.116s 6.021s105517.039 105517.574 > 1 10 2 0.134s 6.998s 3.087s 91879.573 169079.712 > 1 10 4 0.095s 7.658s 2.043s 84519.939 268955.165 These smaller systems are more likely x86/x86_64 machines ;-) ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 17:41 ` Christoph Hellwig @ 2005-01-12 17:52 ` Christoph Lameter 2005-01-12 18:04 ` Christoph Hellwig 2005-01-12 18:20 ` Andrew Walrond 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 17:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Christoph Hellwig wrote: > These smaller systems are more likely x86/x86_64 machines ;-) But they will not have been built in 1998 either, like the machine I used for the i386 tests. Could you do some tests on contemporary x86/x86_64 SMP systems with large memory? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 17:52 ` Christoph Lameter @ 2005-01-12 18:04 ` Christoph Hellwig 2005-01-12 18:20 ` Andrew Walrond 1 sibling, 0 replies; 286+ messages in thread From: Christoph Hellwig @ 2005-01-12 18:04 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Nick Piggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, Jan 12, 2005 at 09:52:53AM -0800, Christoph Lameter wrote: > On Wed, 12 Jan 2005, Christoph Hellwig wrote: > > > These smaller systems are more likely x86/x86_64 machines ;-) > > But they will not have been build in 1998 either like the machine I used > for the i386 tests. Could you do some tests on contemporary x86/x86_64 > SMP systems with large memory? I don't have such systems. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 17:52 ` Christoph Lameter 2005-01-12 18:04 ` Christoph Hellwig @ 2005-01-12 18:20 ` Andrew Walrond 1 sibling, 0 replies; 286+ messages in thread From: Andrew Walrond @ 2005-01-12 18:20 UTC (permalink / raw) To: linux-kernel On Wednesday 12 January 2005 17:52, Christoph Lameter wrote: > On Wed, 12 Jan 2005, Christoph Hellwig wrote: > > These smaller systems are more likely x86/x86_64 machines ;-) > > But they will not have been build in 1998 either like the machine I used > for the i386 tests. Could you do some tests on contemporary x86/x86_64 > SMP systems with large memory? I have various dual x86_64 systems with 1-4Gb ram. What tests do you want run? Andrew Walrond ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 16:39 ` Christoph Lameter 2005-01-12 16:49 ` Christoph Hellwig @ 2005-01-12 18:43 ` Andrew Morton 2005-01-12 19:06 ` Christoph Lameter 2005-01-12 23:16 ` Nick Piggin 1 sibling, 2 replies; 286+ messages in thread From: Andrew Morton @ 2005-01-12 18:43 UTC (permalink / raw) To: Christoph Lameter Cc: nickpiggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Christoph Lameter <clameter@sgi.com> wrote: > > > Do we have measurements of the negative and/or positive impact on smaller > > machines? > > Here is a measurement of 256M allocation on a 2 way SMP machine 2x > PIII-500Mhz: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 0 10 1 0.005s 0.016s 0.002s 54357.280 52261.895 > 0 10 2 0.008s 0.019s 0.002s 43112.368 42463.566 > > With patch: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 0 10 1 0.005s 0.016s 0.002s 54357.280 53439.357 > 0 10 2 0.008s 0.018s 0.002s 44650.831 44202.412 > > So only a very minor improvements for old machines (this one from ~ 98). OK. But have you written a test to demonstrate any performance regressions? From, say, the use of atomic ops on ptes? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 18:43 ` Andrew Morton @ 2005-01-12 19:06 ` Christoph Lameter 2005-01-14 3:39 ` Roman Zippel 2005-01-12 23:16 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 19:06 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 12 Jan 2005, Andrew Morton wrote: > > So only a very minor improvements for old machines (this one from ~ 98). > > OK. But have you written a test to demonstrate any performance > regressions? From, say, the use of atomic ops on ptes? If I knew of any regressions, I would certainly try to deal with them. The test is written to check for concurrent page fault performance and it has repeatedly helped me to find problems with page faults. I have used it for a couple of other patchsets too. If the patch were available in -mm then it would certainly get more exposure and it may become clear that there are some regressions. The introduction of cmpxchg is one atomic operation that replaces the two spinlock ops typically necessary in an unpatched kernel. Obtaining the spinlock requires an atomic operation and then the release involves a barrier. So there is a net win for all SMP cases as far as I can see. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 19:06 ` Christoph Lameter @ 2005-01-14 3:39 ` Roman Zippel 2005-01-14 4:14 ` Andi Kleen 0 siblings, 1 reply; 286+ messages in thread From: Roman Zippel @ 2005-01-14 3:39 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, nickpiggin, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Hi, Christoph Lameter wrote: > Introduction of the cmpxchg is one atomic operations that replaces the two > spinlock ops typically necessary in an unpatched kernel. Obtaining the > spinlock requires an spinlock (which is an atomic operation) and then the > release involves a barrier. So there is a net win for all SMP cases as far > as I can see. But there might be a loss in the UP case. Spinlocks are optimized away, but your cmpxchg emulation enables/disables interrupts with every access. bye, Roman ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 3:39 ` Roman Zippel @ 2005-01-14 4:14 ` Andi Kleen 2005-01-14 12:02 ` Roman Zippel 0 siblings, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-14 4:14 UTC (permalink / raw) To: Roman Zippel Cc: Christoph Lameter, Andrew Morton, nickpiggin, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, Jan 14, 2005 at 04:39:16AM +0100, Roman Zippel wrote: > Hi, > > Christoph Lameter wrote: > > >Introduction of the cmpxchg is one atomic operations that replaces the two > >spinlock ops typically necessary in an unpatched kernel. Obtaining the > >spinlock requires an spinlock (which is an atomic operation) and then the > >release involves a barrier. So there is a net win for all SMP cases as far > >as I can see. > > But there might be a loss in the UP case. Spinlocks are optimized away, > but your cmpxchg emulation enables/disables interrupts with every access. Only for 386s, and STI/CLI is quite cheap there. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 4:14 ` Andi Kleen @ 2005-01-14 12:02 ` Roman Zippel 0 siblings, 0 replies; 286+ messages in thread From: Roman Zippel @ 2005-01-14 12:02 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Andrew Morton, nickpiggin, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Hi, On Fri, 14 Jan 2005, Andi Kleen wrote: > > But there might be a loss in the UP case. Spinlocks are optimized away, > > but your cmpxchg emulation enables/disables interrupts with every access. > > Only for 386s and STI/CLI is quite cheap there. But it's still not free and what about other archs? Why not just check __HAVE_ARCH_CMPXCHG and provide a replacement, which is guaranteed cheaper if no interrupt synchronisation is needed. bye, Roman ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 18:43 ` Andrew Morton 2005-01-12 19:06 ` Christoph Lameter @ 2005-01-12 23:16 ` Nick Piggin 2005-01-12 23:30 ` Andrew Morton 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-12 23:16 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Andrew Morton wrote: > Christoph Lameter <clameter@sgi.com> wrote: > >>>Do we have measurements of the negative and/or positive impact on smaller >> >> > machines? >> >> Here is a measurement of 256M allocation on a 2 way SMP machine 2x >> PIII-500Mhz: >> >> Gb Rep Threads User System Wall flt/cpu/s fault/wsec >> 0 10 1 0.005s 0.016s 0.002s 54357.280 52261.895 >> 0 10 2 0.008s 0.019s 0.002s 43112.368 42463.566 >> >> With patch: >> >> Gb Rep Threads User System Wall flt/cpu/s fault/wsec >> 0 10 1 0.005s 0.016s 0.002s 54357.280 53439.357 >> 0 10 2 0.008s 0.018s 0.002s 44650.831 44202.412 >> >> So only a very minor improvements for old machines (this one from ~ 98). > > > OK. But have you written a test to demonstrate any performance > regressions? From, say, the use of atomic ops on ptes? > Performance-wise, Christoph's never had as much of a problem as my patches because it isn't doing extra atomic operations in copy_page_range. However, it looks like it should be. For the same reason there needs to be an atomic read in handle_mm_fault. And it probably needs atomic ops in other places too, I think. So my patches cost about 7% in lmbench fork benchmark... however, I've been thinking we could take the mmap_sem for writing before doing the copy_page_range, which could reduce the need for atomic ops. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:16 ` Nick Piggin @ 2005-01-12 23:30 ` Andrew Morton 2005-01-12 23:50 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Andrew Morton @ 2005-01-12 23:30 UTC (permalink / raw) To: Nick Piggin Cc: clameter, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > So my patches cost about 7% in lmbench fork benchmark. OK, well that's the sort of thing we need to understand fully. What sort of CPU was that on? Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we agree to take. But we need to fully understand all the costs and benefits. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:30 ` Andrew Morton @ 2005-01-12 23:50 ` Nick Piggin 2005-01-12 23:54 ` Christoph Lameter 2005-01-13 3:09 ` page table lock patch V15 [0/7]: overview Hugh Dickins 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2005-01-12 23:50 UTC (permalink / raw) To: Andrew Morton Cc: clameter, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>So my patches cost about 7% in lmbench fork benchmark. > > > OK, well that's the sort of thing we need to understand fully. What sort > of CPU was that on? > That was on a P4, although I've seen pretty similar results on ia64 and other x86 CPUs. Note that this was with my ptl removal patches. I can't see why Christoph's would have _any_ extra overhead as they are, but it looks to me like they're lacking in atomic ops. So I'd expect something similar for Christoph's when they're properly atomic. > Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we > agree to take. But we need to fully understand all the costs and benefits. > I think copy_page_range is the one to keep an eye on. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:50 ` Nick Piggin @ 2005-01-12 23:54 ` Christoph Lameter 2005-01-13 0:10 ` Nick Piggin 2005-01-13 3:09 ` page table lock patch V15 [0/7]: overview Hugh Dickins 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-12 23:54 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Nick Piggin wrote: > Note that this was with my ptl removal patches. I can't see why Christoph's > would have _any_ extra overhead as they are, but it looks to me like they're > lacking in atomic ops. So I'd expect something similar for Christoph's when > they're properly atomic. Pointer operations and word size operations are atomic. So this is mostly okay. The issue arises on architectures that have a larger pte size than the word size. This is only on i386 PAE mode and S/390. S/390 falls back to the page table lock for these operations. PAE mode should do the same and not use atomic ops if they cannot be made to work in a reasonable manner. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:54 ` Christoph Lameter @ 2005-01-13 0:10 ` Nick Piggin 2005-01-13 0:16 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-13 0:10 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Christoph Lameter wrote: > On Thu, 13 Jan 2005, Nick Piggin wrote: > > >>Note that this was with my ptl removal patches. I can't see why Christoph's >>would have _any_ extra overhead as they are, but it looks to me like they're >>lacking in atomic ops. So I'd expect something similar for Christoph's when >>they're properly atomic. > > > Pointer operations and word size operations are atomic. So this is mostly > okay. > > The issue arises on architectures that have a large pte size than the > wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to > the page table lock for these operations. PAE mode should do the same and > not use atomic ops if they cannot be made to work in a reasonable manner. > Yep well you should be OK then. Your implementation has the advantage that it only instantiates previously clear ptes... hmm, no I'm wrong, your ptep_set_access_flags path modifies an existing pte. I think this can cause subtle races in copy_page_range, and maybe other places, can't it? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 0:10 ` Nick Piggin @ 2005-01-13 0:16 ` Christoph Lameter 2005-01-13 0:42 ` Nick Piggin 2005-01-13 3:18 ` Andi Kleen 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-13 0:16 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Nick Piggin wrote: > > Pointer operations and word size operations are atomic. So this is mostly > > okay. > > > > The issue arises on architectures that have a large pte size than the > > wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to > > the page table lock for these operations. PAE mode should do the same and > > not use atomic ops if they cannot be made to work in a reasonable manner. > > > > Yep well you should be OK then. Your implementation has the advantage > that it only instantiates previously clear ptes... hmm, no I'm wrong, > your ptep_set_access_flags path modifies an existing pte. I think this > can cause subtle races in copy_page_range, and maybe other places, > can't it? ptep_set_access_flags is only used after acquiring the page_table_lock and does not clear a pte. That is safe. The only critical thing is if a pte would be cleared while holding the page_table_lock. That used to occur in the swapper code but we modified that. There is still an issue as Hugh rightly observed. One cannot rely on a read of a pte/pud/pmd being atomic if the pte is > word size. This occurs for all higher levels in handle_mm_fault. Thus we would need to either acquire the page_table_lock for some architectures or provide primitives get_pgd, get_pud etc that take the page_table_lock in PAE mode. ARGH. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 0:16 ` Christoph Lameter @ 2005-01-13 0:42 ` Nick Piggin 2005-01-13 22:19 ` Peter Chubb 2005-01-13 3:18 ` Andi Kleen 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-13 0:42 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, torvalds, ak, hugh, linux-mm, linux-ia64, linux-kernel, benh Christoph Lameter wrote: > On Thu, 13 Jan 2005, Nick Piggin wrote: > > >>>Pointer operations and word size operations are atomic. So this is mostly >>>okay. >>> >>>The issue arises on architectures that have a large pte size than the >>>wordsize. This is only on i386 PAE mode and S/390. S/390 falls back to >>>the page table lock for these operations. PAE mode should do the same and >>>not use atomic ops if they cannot be made to work in a reasonable manner. >>> >> >>Yep well you should be OK then. Your implementation has the advantage >>that it only instantiates previously clear ptes... hmm, no I'm wrong, >>your ptep_set_access_flags path modifies an existing pte. I think this >>can cause subtle races in copy_page_range, and maybe other places, >>can't it? > > > ptep_set_access_flags is only used after acquiring the page_table_lock and > does not clear a pte. That is safe. The only critical thing is if a pte > would be cleared while holding the page_table_lock. That used to occur in > the swapper code but we modified that. > I mean what used to be the ptep_set_access_flags path. Where you are now modifying a pte without the ptl. However after a second look, it seems like that won't be a problem. > There is still an issue as Hugh rightly observed. One cannot rely on a > read of a pte/pud/pmd being atomic if the pte is > word size. This occurs > for all higher levels in handle_mm_fault. Thus we would need to either > acuire the page_table_lock for some architectures or provide primitives > get_pgd, get_pud etc that take the page_table_lock on PAE mode. ARGH. > Yes I know. 
I would say that having arch-definable accessors for the page tables wouldn't be a bad idea anyway, and the flexibility may come in handy for other things. It would be a big, annoying patch though :( ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 0:42 ` Nick Piggin @ 2005-01-13 22:19 ` Peter Chubb 0 siblings, 0 replies; 286+ messages in thread From: Peter Chubb @ 2005-01-13 22:19 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: Nick> I would say that having arch-definable accessors for Nick> the page tables wouldn't be a bad idea anyway, and the Nick> flexibility may come in handy for other things. Nick> It would be a big, annoying patch though :( We're currently working in a slightly different direction, to try to hide page-table implementation details from anything outside the page table implementation. Our goal is to be able to try out other page tables (e.g., Liedtke's guarded page table) instead of the 2/3/4 level fixed hierarchy. We're currently working on a 2.6.10 snapshot; obviously we'll have to roll up to 2.6.11 before releasing (and there are lots of changes there because of the recent 4-layer page table implementation). -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 0:16 ` Christoph Lameter 2005-01-13 0:42 ` Nick Piggin @ 2005-01-13 3:18 ` Andi Kleen 2005-01-13 17:11 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-13 3:18 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh > There is still an issue as Hugh rightly observed. One cannot rely on a > read of a pte/pud/pmd being atomic if the pte is > word size. This occurs > for all higher levels in handle_mm_fault. Thus we would need to either > acuire the page_table_lock for some architectures or provide primitives > get_pgd, get_pud etc that take the page_table_lock on PAE mode. ARGH. > Alternatively you can use a lazy load, checking for changes. (untested) pte_t read_pte(volatile pte_t *pte) { pte_t n; do { n.pte_low = pte->pte_low; rmb(); n.pte_high = pte->pte_high; rmb(); } while (n.pte_low != pte->pte_low); return pte; } No atomic operations, I bet it's actually faster than the cmpxchg8. There is a small risk for livelock, but not much worse than with an ordinary spinlock. Not that I get what you want it for exactly - the content of the pte could change any time when you don't hold page_table_lock, right? -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 3:18 ` Andi Kleen @ 2005-01-13 17:11 ` Christoph Lameter 2005-01-13 17:25 ` Linus Torvalds 2005-01-13 18:02 ` Andi Kleen 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-13 17:11 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 13 Jan 2005, Andi Kleen wrote: > Alternatively you can use a lazy load, checking for changes. > (untested) > > pte_t read_pte(volatile pte_t *pte) > { > pte_t n; > do { > n.pte_low = pte->pte_low; > rmb(); > n.pte_high = pte->pte_high; > rmb(); > } while (n.pte_low != pte->pte_low); > return pte; > } > > No atomic operations, I bet it's actually faster than the cmpxchg8. > There is a small risk for livelock, but not much worse than with an > ordinary spinlock. Hmm.... This may replace the get of a 64 bit value. But there could still be another process that is setting the pte in a non-atomic way. > Not that I get it what you want it for exactly - the content > of the pte could change any time when you don't hold page_table_lock, right? The content of the pte can change anytime the page_table_lock is held and it may change from cleared to a value through a cmpxchg while the lock is not held. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 17:11 ` Christoph Lameter @ 2005-01-13 17:25 ` Linus Torvalds 2005-01-13 18:02 ` Andi Kleen 0 siblings, 0 replies; 286+ messages in thread From: Linus Torvalds @ 2005-01-13 17:25 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Nick Piggin, Andrew Morton, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Christoph Lameter wrote: > > On Wed, 13 Jan 2005, Andi Kleen wrote: > > > > > Alternatively you can use a lazy load, checking for changes. > > > (untested) > > > > > > pte_t read_pte(volatile pte_t *pte) > > > { > > > pte_t n; > > > do { > > > n.pte_low = pte->pte_low; > > > rmb(); > > > n.pte_high = pte->pte_high; > > > rmb(); > > > } while (n.pte_low != pte->pte_low); > > > return pte; > > > } > > > > > > No atomic operations, I bet it's actually faster than the cmpxchg8. > > > There is a small risk for livelock, but not much worse than with an > > > ordinary spinlock. > > > > Hmm.... This may replace the get of a 64 bit value. But here could still > > be another process that is setting the pte in a non-atomic way. There's a nice standard way of doing that, namely sequence numbers. However, most of the time it isn't actually faster than just getting the lock. There are two real costs in getting a lock: serialization and cache bouncing. The ordering often requires _more_ serialization than a lock/unlock sequence, so sequences like the above are often slower than the trivial lock is, at least in the absence of lock contention. So sequence numbers (or multiple reads) only tend to make sense where there are a _lot_ more reads than writes, and where you get lots of lock contention. If there are lots of writes, my gut feel (but hey, all locking optimization should be backed up by real numbers) is that it's better to have a lock close to the data, since you'll get the cacheline bounces _anyway_, and locking often has lower serialization costs. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 17:11 ` Christoph Lameter 2005-01-13 17:25 ` Linus Torvalds @ 2005-01-13 18:02 ` Andi Kleen 2005-01-13 18:16 ` Christoph Lameter 2005-01-14 1:09 ` Christoph Lameter 1 sibling, 2 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-13 18:02 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote: > On Wed, 13 Jan 2005, Andi Kleen wrote: > > > Alternatively you can use a lazy load, checking for changes. > > (untested) > > > > pte_t read_pte(volatile pte_t *pte) > > { > > pte_t n; > > do { > > n.pte_low = pte->pte_low; > > rmb(); > > n.pte_high = pte->pte_high; > > rmb(); > > } while (n.pte_low != pte->pte_low); > > return pte; It should be return n; here of course. > > } > > > > No atomic operations, I bet it's actually faster than the cmpxchg8. > > There is a small risk for livelock, but not much worse than with an > > ordinary spinlock. > > Hmm.... This may replace the get of a 64 bit value. But here could still > be another process that is setting the pte in a non-atomic way. The rule in i386/x86-64 is that you cannot set the PTE in a non atomic way when its present bit is set (because the hardware could asynchronously change bits in the PTE that would get lost). Atomic way means clearing first and then replacing in an atomic operation. This helps you because you shouldn't be looking at the pte anyways when pte_present is false. When it is not false it is always updated atomically. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 18:02 ` Andi Kleen @ 2005-01-13 18:16 ` Christoph Lameter 2005-01-13 20:17 ` Andi Kleen 2005-01-14 1:09 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-13 18:16 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Andi Kleen wrote: > The rule in i386/x86-64 is that you cannot set the PTE in a non atomic way > when its present bit is set (because the hardware could asynchronously > change bits in the PTE that would get lost). Atomic way means clearing > first and then replacing in an atomic operation. Hmm. I replaced that portion in the swapper with an xchg operation and inspect the result later. Clearing a pte and then setting it to something would open a window for the page fault handler to set up a new pte there since it does not take the page_table_lock. That xchg must be atomic for PAE mode to work then. > This helps you because you shouldn't be looking at the pte anyways > when pte_present is false. When it is not false it is always updated > atomically. so pmd_present, pud_none and pgd_none could be considered atomic even if the pm/u/gd_t is a multi-word entity? In that case the current approach would work for higher level entities and in particular S/390 would be in the clear. But then the issues of replacing multi-word ptes on i386 PAE remain. If no write lock is held on mmap_sem then all writes to pte's must be atomic in order for the get_pte_atomic operation to work reliably. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 18:16 ` Christoph Lameter @ 2005-01-13 20:17 ` Andi Kleen 0 siblings, 0 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-13 20:17 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, Jan 13, 2005 at 10:16:58AM -0800, Christoph Lameter wrote: > On Thu, 13 Jan 2005, Andi Kleen wrote: > > > The rule in i386/x86-64 is that you cannot set the PTE in a non atomic way > > when its present bit is set (because the hardware could asynchronously > > change bits in the PTE that would get lost). Atomic way means clearing > > first and then replacing in an atomic operation. > > Hmm. I replaced that portion in the swapper with an xchg operation > and inspect the result later. Clearing a pte and then setting it to > something would open a window for the page fault handler to set up a new Yes, it usually assumes the page table lock is held. > pte there since it does not take the page_table_lock. That xchg must be > atomic for PAE mode to work then. You can always use cmpxchg8 for that if you want. Just to make it really atomic you may need a LOCK prefix, and with that the cost is not much lower than a real spinlock. > > > This helps you because you shouldn't be looking at the pte anyways > > when pte_present is false. When it is not false it is always updated > > atomically. > > so pmd_present, pud_none and pgd_none could be considered atomic even if > the pm/u/gd_t is a multi-word entity? In that case the current approach The optimistic read function I posted would do this. But you have to read multiple entries anyways, which could be non-atomic, no? (e.g. to do something on a PTE you always need to read PGD/PUD/PMD) In theory you could do this lazily with retries too, but it would probably be somewhat costly and complicated. > would work for higher level entities and in particular S/390 would be in > the clear.
> > But then the issues of replacing multi-word ptes on i386 PAE remain. If no > write lock is held on mmap_sem then all writes to pte's must be atomic in mmap_sem is only for VMAs. The page tables itself are protected by page table lock. > order for the get_pte_atomic operation to work reliably. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 18:02 ` Andi Kleen 2005-01-13 18:16 ` Christoph Lameter @ 2005-01-14 1:09 ` Christoph Lameter 2005-01-14 4:39 ` Andi Kleen 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-14 1:09 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Andi Kleen wrote: > On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote: > > On Wed, 13 Jan 2005, Andi Kleen wrote: > > > > > Alternatively you can use a lazy load, checking for changes. > > > (untested) > > > > > > pte_t read_pte(volatile pte_t *pte) > > > { > > > pte_t n; > > > do { > > > n.pte_low = pte->pte_low; > > > rmb(); > > > n.pte_high = pte->pte_high; > > > rmb(); > > > } while (n.pte_low != pte->pte_low); > > > return pte; I think this is not necessary. Most IA32 processors do 64 bit operations in an atomic way in the same way as IA64. We can cut out all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if we just convince the compiler to use 64 bit fetches and stores. 486 cpus and earlier are the only ones unable to do 64 bit atomic ops but those won't be able to use PAE mode anyhow. Page 231 of Volume 3 of the Intel IA32 manual states regarding atomicity of operations: 7.1.1. 
Guaranteed Atomic Operations

The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors guarantee that the following basic memory operations will always be carried out atomically:
o reading or writing a byte
o reading or writing a word aligned on a 16-bit boundary
o reading or writing a doubleword aligned on a 32-bit boundary

The Pentium 4, Intel Xeon, and P6 family, and Pentium processors guarantee that the following additional memory operations will always be carried out atomically:
o reading or writing a quadword aligned on a 64-bit boundary
o 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors guarantee that the following additional memory operation will always be carried out atomically:
o unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a 32-byte cache ....

off to look for 64bit store and load instructions in the intel manuals. I feel much better about keeping the existing approach. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 1:09 ` Christoph Lameter @ 2005-01-14 4:39 ` Andi Kleen 2005-01-14 4:52 ` page table lock patch V15 [0/7]: overview II Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-14 4:39 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, Jan 13, 2005 at 05:09:04PM -0800, Christoph Lameter wrote: > On Thu, 13 Jan 2005, Andi Kleen wrote: > > > On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote: > > > On Wed, 13 Jan 2005, Andi Kleen wrote: > > > > > > > Alternatively you can use a lazy load, checking for changes. > > > > (untested) > > > > > > > > pte_t read_pte(volatile pte_t *pte) > > > > { > > > > pte_t n; > > > > do { > > > > n.pte_low = pte->pte_low; > > > > rmb(); > > > > n.pte_high = pte->pte_high; > > > > rmb(); > > > > } while (n.pte_low != pte->pte_low); > > > > return pte; > > I think this is not necessary. Most IA32 processors do 64 > bit operations in an atomic way in the same way as IA64. We can cut out > all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if > we just use convince the compiler to use 64 bit fetches and stores. 486 That would mean either cmpxchg8 (slow) or using MMX/SSE (even slower because you would need to save FPU state and disable exceptions). I think FPU is far too slow and complicated. I benchmarked lazy read and cmpxchg8:

Athlon64:
readpte hot 42
readpte cold 426
readpte_cmp hot 33
readpte_cmp cold 2693

Nocona:
readpte hot 140
readpte cold 960
readpte_cmp hot 48
readpte_cmp cold 2668

As you can see cmpxchg is slightly faster for the cache hot case, but incredibly slow for cache cold (probably because it does something nasty on the bus). This is pretty consistent between Intel and AMD CPUs. Given that page tables are likely more often cache cold than hot I would use the lazy variant. 
-Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 4:39 ` Andi Kleen @ 2005-01-14 4:52 ` Andi Kleen 2005-01-14 4:59 ` Nick Piggin 2005-01-14 4:54 ` page table lock patch V15 [0/7]: overview Nick Piggin 2005-01-14 16:52 ` Christoph Lameter 2 siblings, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-14 4:52 UTC (permalink / raw) To: clameter Cc: Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh, nickpiggin Andi Kleen <ak@muc.de> writes: > As you can see cmpxchg is slightly faster for the cache hot case, > but incredibly slow for cache cold (probably because it does something > nasty on the bus). This is pretty consistent to Intel and AMD CPUs. > Given that page tables are likely more often cache cold than hot > I would use the lazy variant. Sorry, my benchmark program actually had a bug (first loop included page faults). Here are updated numbers. They are somewhat different:

Athlon 64:
readpte hot 25
readpte cold 171
readpte_cmp hot 18
readpte_cmp cold 162

Nocona:
readpte hot 118
readpte cold 443
readpte_cmp hot 22
readpte_cmp cold 224

The difference is much smaller here. Assuming cache cold, cmpxchg8b is better, at least on the Intel CPUs which have a slow rmb(). -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
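For readers puzzling over how a compare-and-exchange can serve as a *read*, the trick behind a readpte_cmp-style load can be sketched in user-space C. This is a hedged illustration using GCC's `__atomic` builtins (which compile down to `lock cmpxchg8b` for 64-bit types on 32-bit x86); the function name is invented here and is not taken from the benchmark code:

```c
#include <stdint.h>

/* Read a 64-bit value atomically on hardware that only guarantees an
 * atomic cmpxchg, not a plain 64-bit load. Compare against 0: if the
 * word is nonzero the exchange fails and the current value is written
 * into 'expected'; if the word really is zero the exchange "succeeds"
 * by storing the same zero back, which is harmless. Either way
 * 'expected' ends up holding the current value. */
static uint64_t read64_cmpxchg(uint64_t *p)
{
    uint64_t expected = 0;
    __atomic_compare_exchange_n(p, &expected, (uint64_t)0,
                                0 /* strong */,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

Note the cost model discussed above: the locked cycle this generates is what makes the cmpxchg variant expensive when the line is not already in cache.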
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 4:52 ` page table lock patch V15 [0/7]: overview II Andi Kleen @ 2005-01-14 4:59 ` Nick Piggin 2005-01-14 10:47 ` Andi Kleen 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-14 4:59 UTC (permalink / raw) To: Andi Kleen Cc: clameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, 2005-01-14 at 05:52 +0100, Andi Kleen wrote: > Andi Kleen <ak@muc.de> writes: > > As you can see cmpxchg is slightly faster for the cache hot case, > > but incredibly slow for cache cold (probably because it does something > > nasty on the bus). This is pretty consistent to Intel and AMD CPUs. > > Given that page tables are likely more often cache cold than hot > > I would use the lazy variant. > > Sorry, my benchmark program actually had a bug (first loop included > page faults). Here are updated numbers. They are somewhat different: > > Athlon 64: > readpte hot 25 > readpte cold 171 > readpte_cmp hot 18 > readpte_cmp cold 162 > > Nocona: > readpte hot 118 > readpte cold 443 > readpte_cmp hot 22 > readpte_cmp cold 224 > > The difference is much smaller here. Assuming cache cold cmpxchg8b is > better, at least on the Intel CPUs which have a slow rmb(). > I have a question for the x86 gurus. We're currently using the lock prefix for set_64bit. This will lock the bus for the RMW cycle, but is it a prerequisite for the atomic 64-bit store? Even on UP? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 4:59 ` Nick Piggin @ 2005-01-14 10:47 ` Andi Kleen 2005-01-14 10:57 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-14 10:47 UTC (permalink / raw) To: Nick Piggin Cc: clameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh > I have a question for the x86 gurus. We're currently using the lock > prefix for set_64bit. This will lock the bus for the RMW cycle, but > is it a prerequisite for the atomic 64-bit store? Even on UP? An atomic 64bit store doesn't need a lock prefix. A cmpxchg will need to though. Note that UP kernels define LOCK to nothing. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 10:47 ` Andi Kleen @ 2005-01-14 10:57 ` Nick Piggin 2005-01-14 11:11 ` Andi Kleen 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-14 10:57 UTC (permalink / raw) To: Andi Kleen Cc: clameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Andi Kleen wrote: >>I have a question for the x86 gurus. We're currently using the lock >>prefix for set_64bit. This will lock the bus for the RMW cycle, but >>is it a prerequisite for the atomic 64-bit store? Even on UP? > > > An atomic 64bit store doesn't need a lock prefix. A cmpxchg will > need to though. Are you sure the cmpxchg8b need a lock prefix? Sure it does to get the proper "atomic cmpxchg" semantics, but what about a simple 64-bit store... If it boils down to 8 byte load, 8 byte store on the memory bus, and that store is atomic, then maybe a lock isn't needed at all? I think when emulating a *load*, then the lock is needed, because otherwise the subsequent store may overwrite some value that has just been stored by another processor.... but for a store I'm not so sure. > Note that UP kernels define LOCK to nothing. > Yes. In this case (include/asm-i386/system.h:__set_64bit), it is using lowercase lock, which I think is not defined away, right? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 10:57 ` Nick Piggin @ 2005-01-14 11:11 ` Andi Kleen 2005-01-14 16:57 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Andi Kleen @ 2005-01-14 11:11 UTC (permalink / raw) To: Nick Piggin Cc: clameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, Jan 14, 2005 at 09:57:16PM +1100, Nick Piggin wrote: > Andi Kleen wrote: > >>I have a question for the x86 gurus. We're currently using the lock > >>prefix for set_64bit. This will lock the bus for the RMW cycle, but > >>is it a prerequisite for the atomic 64-bit store? Even on UP? > > > > > >An atomic 64bit store doesn't need a lock prefix. A cmpxchg will > >need to though. > > Are you sure the cmpxchg8b need a lock prefix? Sure it does to If you want it to be atomic on SMP then yes. > get the proper "atomic cmpxchg" semantics, but what about a > simple 64-bit store... If it boils down to 8 byte load, 8 byte A 64bit store with a 64bit store instruction is atomic. But to do that on 32bit x86 you need SSE/MMX (not an option in the kernel) or cmpxchg8 > store on the memory bus, and that store is atomic, then maybe > a lock isn't needed at all? More complex operations than store or load are not atomic without LOCK (and not all operations can have a lock prefix). There are a few instructions with implicit lock. If you want the gory details read chapter 7 in the IA32 Software Developer's Manual Volume 3. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview II 2005-01-14 11:11 ` Andi Kleen @ 2005-01-14 16:57 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-14 16:57 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, 14 Jan 2005, Andi Kleen wrote: > > Are you sure the cmpxchg8b need a lock prefix? Sure it does to > > If you want it to be atomic on SMP then yes. > > > get the proper "atomic cmpxchg" semantics, but what about a > > simple 64-bit store... If it boils down to 8 byte load, 8 byte > > A 64bit store with a 64bit store instruction is atomic. But > to do that on 32bit x86 you need SSE/MMX (not an option in the kernel) > or cmpxchg8 > > > store on the memory bus, and that store is atomic, then maybe > > a lock isn't needed at all? > > More complex operations than store or load are not atomic without > LOCK (and not all operations can have a lock prefix). There are a few > instructions with implicit lock. If you want the gory details read > chapter 7 in the IA32 Software Developer's Manual Volume 3. It needs a lock prefix. Volume 2 of the IA32 manual states on page 150 regarding cmpxchg (Note that the atomicity mentioned here seems to apply to the complete instruction, not the 64 bit fetches and stores): This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.) ^ permalink raw reply [flat|nested] 286+ messages in thread
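Putting the manual text into practice, the way an atomic 64-bit store can be emulated on top of a locked compare-and-exchange (the approach i386's set_64bit takes with `lock cmpxchg8b`) can be sketched in user space. A hedged illustration with GCC `__atomic` builtins; the function name is invented here:

```c
#include <stdint.h>

/* Emulate an atomic 64-bit store using only compare-and-exchange:
 * snapshot the current value, then retry the exchange until no other
 * writer has raced in between. The non-atomic initial read is fine
 * because the cmpxchg itself verifies it before committing. */
static void store64_cmpxchg(uint64_t *p, uint64_t val)
{
    uint64_t old = *p;
    while (!__atomic_compare_exchange_n(p, &old, val,
                                        0 /* strong */,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        ; /* on failure 'old' was refreshed with the current value; retry */
}
```

As the quoted manual text notes, the destination always receives a locked write cycle whether or not the comparison succeeds, which is why this emulation is safe but not cheap.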
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 4:39 ` Andi Kleen 2005-01-14 4:52 ` page table lock patch V15 [0/7]: overview II Andi Kleen @ 2005-01-14 4:54 ` Nick Piggin 2005-01-14 10:46 ` Andi Kleen 2005-01-14 16:52 ` Christoph Lameter 2 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-14 4:54 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, 2005-01-14 at 05:39 +0100, Andi Kleen wrote: > As you can see cmpxchg is slightly faster for the cache hot case, > but incredibly slow for cache cold (probably because it does something > nasty on the bus). This is pretty consistent to Intel and AMD CPUs. > Given that page tables are likely more often cache cold than hot > I would use the lazy variant. > I have a question about your trickery with the read_pte function ;)

pte_t read_pte(volatile pte_t *pte)
{
	pte_t n;
	do {
		n.pte_low = pte->pte_low;
		rmb();
		n.pte_high = pte->pte_high;
		rmb();
	} while (n.pte_low != pte->pte_low);
	return pte;
}

Versus the existing set_pte function. Presumably the order here can't be changed otherwise you could set the present bit before the high bit, and race with the hardware MMU?

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;
	smp_wmb();
	ptep->pte_low = pte.pte_low;
}

Now take the following interleaving:

  CPU0 read_pte                          CPU1 set_pte
  n.pte_low = pte->pte_low;
  rmb();
                                         ptep->pte_high = pte.pte_high;
                                         smp_wmb();
  n.pte_high = pte->pte_high;
  rmb();
  while (n.pte_low != pte->pte_low);
  return pte;
                                         ptep->pte_low = pte.pte_low;

So I think you can get a non atomic result. Are you relying on assumptions about the value of pte_low not causing any problems in the page fault handler? Or am I missing something? ^ permalink raw reply [flat|nested] 286+ messages in thread
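The two functions under discussion can be modelled in user space to make the ordering constraints concrete. This is a hedged, single-writer illustration using GCC builtins; the type and function names are invented here, and the original sketch's `return pte;` is corrected to return the copied value. It does not settle the interleaving question raised in the thread, which needs a concurrent writer to exercise:

```c
#include <stdint.h>

/* User-space model of a split 64-bit pte: two 32-bit halves. */
typedef struct { volatile uint32_t lo, hi; } pte64_t;

/* Writer order mirrors set_pte: high word first, then the low word
 * (which holds the present bit), so the entry never looks present
 * while its high half is still stale. */
static void pte_write(pte64_t *p, uint64_t v)
{
    p->hi = (uint32_t)(v >> 32);
    __atomic_thread_fence(__ATOMIC_RELEASE); /* models smp_wmb() */
    p->lo = (uint32_t)v;
}

/* Lazy read: retry until the low word is observed unchanged across
 * the read of the high word. */
static uint64_t pte_read(pte64_t *p)
{
    uint32_t lo, hi;
    do {
        lo = p->lo;
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* models rmb() */
        hi = p->hi;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while (lo != p->lo);
    return ((uint64_t)hi << 32) | lo;
}
```

As the message above observes, this retry only catches a concurrent writer when the low word actually changes; a write that alters only the high half between the two reads can still yield a mixed result.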
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 4:54 ` page table lock patch V15 [0/7]: overview Nick Piggin @ 2005-01-14 10:46 ` Andi Kleen 0 siblings, 0 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-14 10:46 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, Jan 14, 2005 at 03:54:59PM +1100, Nick Piggin wrote: > On Fri, 2005-01-14 at 05:39 +0100, Andi Kleen wrote: > > > As you can see cmpxchg is slightly faster for the cache hot case, > > but incredibly slow for cache cold (probably because it does something > > nasty on the bus). This is pretty consistent to Intel and AMD CPUs. > > Given that page tables are likely more often cache cold than hot > > I would use the lazy variant. > > > > I have a question about your trickery with the read_pte function ;) > > pte_t read_pte(volatile pte_t *pte) > { > pte_t n; > do { > n.pte_low = pte->pte_low; > rmb(); > n.pte_high = pte->pte_high; > rmb(); > } while (n.pte_low != pte->pte_low); > return pte; > } > > Versus the existing set_pte function. Presumably the order here > can't be changed otherwise you could set the present bit before > the high bit, and race with the hardware MMU? The hardware MMU only ever adds some bits (D etc.). Never changes the address. It won't clear P bits. The page fault handler also doesn't clear them, only the swapper does. With that knowledge you could probably do some optimizations. > So I think you can get a non atomic result. Are you relying on > assumptions about the value of pte_low not causing any problems > in the page fault handler? I don't know. You have to ask Christopher L. I only commented on one subthread where he asked about atomic pte reading, but haven't studied his patches in detail. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 4:39 ` Andi Kleen 2005-01-14 4:52 ` page table lock patch V15 [0/7]: overview II Andi Kleen 2005-01-14 4:54 ` page table lock patch V15 [0/7]: overview Nick Piggin @ 2005-01-14 16:52 ` Christoph Lameter 2005-01-14 17:01 ` Andi Kleen 2 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-14 16:52 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Thu, 14 Jan 2005, Andi Kleen wrote: > > I think this is not necessary. Most IA32 processors do 64 > > bit operations in an atomic way in the same way as IA64. We can cut out > > all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if > > we just use convince the compiler to use 64 bit fetches and stores. 486 > > That would mean either cmpxchg8 (slow) or using MMX/SSE (even slower > because you would need to save FPU stable and disable > exceptions). It's strange that the instruction set does not contain some simple 64bit store or load and the FPU state seems to be complex to manage...sigh. Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt context but the rest of the prep for mmx only saves the fpu state if it's in use. So that code would only be used rarely. The mmx 64 bit instructions seem to be quite fast according to the manual. Double the cycles of the 32 bit instructions on a Pentium M (somewhat higher on Pentium 4). One could simply do a movq. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 16:52 ` Christoph Lameter @ 2005-01-14 17:01 ` Andi Kleen 2005-01-14 17:08 ` Christoph Lameter ` (2 more replies) 0 siblings, 3 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-14 17:01 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh > Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt > context but the rest of the prep for mmx only saves the fpu state if its > in use. So that code would only be used rarely. The mmx 64 bit > instructions seem to be quite fast according to the manual. Double the > cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4). With all the other overhead (disabling exceptions, saving registers etc.) it will likely be slower. Also you would need fallback paths for CPUs without MMX but with PAE (like Ppro). You can benchmark it if you want, but I wouldn't be very optimistic. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 17:01 ` Andi Kleen @ 2005-01-14 17:08 ` Christoph Lameter 2005-01-14 17:11 ` Andi Kleen 2005-01-14 17:43 ` Linus Torvalds 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter 2 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-01-14 17:08 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, 14 Jan 2005, Andi Kleen wrote: > > Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt > > context but the rest of the prep for mmx only saves the fpu state if its > > in use. So that code would only be used rarely. The mmx 64 bit > > instructions seem to be quite fast according to the manual. Double the > > cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4). > > With all the other overhead (disabling exceptions, saving register etc.) > will be likely slower. Also you would need fallback paths for CPUs > without MMX but with PAE (like Ppro). You can benchmark > it if you want, but I wouldn't be very optimistic. So the PentiumPro is a cpu with atomic 64 bit operations in a cmpxchg but no instruction to do an atomic 64 bit store or load although the architecture conceptually supports 64bit atomic stores and loads? Wild. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 17:08 ` Christoph Lameter @ 2005-01-14 17:11 ` Andi Kleen 0 siblings, 0 replies; 286+ messages in thread From: Andi Kleen @ 2005-01-14 17:11 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, Jan 14, 2005 at 09:08:54AM -0800, Christoph Lameter wrote: > On Fri, 14 Jan 2005, Andi Kleen wrote: > > > > Looked at arch/i386/lib/mmx.c. It avoids the mmx ops in an interrupt > > > context but the rest of the prep for mmx only saves the fpu state if its > > > in use. So that code would only be used rarely. The mmx 64 bit > > > instructions seem to be quite fast according to the manual. Double the > > > cycles than the 32 bit instructions on Pentium M (somewhat higher on Pentium 4). > > > > With all the other overhead (disabling exceptions, saving register etc.) > > will be likely slower. Also you would need fallback paths for CPUs > > without MMX but with PAE (like Ppro). You can benchmark > > it if you want, but I wouldn't be very optimistic. > > So the PentiumPro is a cpu with atomic 64 bit operations in a cmpxchg but > no instruction to do an atomic 64 bit store or load although the > architecture conceptually supports 64bit atomic stores and loads? Wild. It can do 64bit x87 FP loads/stores. But I doubt that is what you're looking for. -Andi ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-14 17:01 ` Andi Kleen 2005-01-14 17:08 ` Christoph Lameter @ 2005-01-14 17:43 ` Linus Torvalds 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter 2 siblings, 0 replies; 286+ messages in thread From: Linus Torvalds @ 2005-01-14 17:43 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Nick Piggin, Andrew Morton, hugh, linux-mm, linux-ia64, linux-kernel, benh On Fri, 14 Jan 2005, Andi Kleen wrote: > > With all the other overhead (disabling exceptions, saving register etc.) > will be likely slower. Also you would need fallback paths for CPUs > without MMX but with PAE (like Ppro). You can benchmark > it if you want, but I wouldn't be very optimistic. We could just say that PAE requires MMX. Quite frankly, if you have a PPro, you probably don't need PAE anyway - I don't see a whole lot of people that spent huge amounts of money on memory and CPU (a PPro that had more than 4GB in it was _quite_ expensive at the time) who haven't upgraded to a PII by now.. IOW, the overlap of "really needs PAE" and "doesn't have MMX" is probably effectively zero. That said, you're probably right in that it probably _is_ expensive enough that it doesn't help. Even if the process doesn't use FP/MMX (so that you can avoid the overhead of state save/restore), you need to - disable preemption - clear "TS" (pretty expensive in itself, since it touches CR0) - .. do any operations .. - set "TS" (again, CR0) - enable preemption so it's likely a thousand cycles minimum on a P4 (I'm just assuming that the P4 will serialize on CR0 accesses, which implies that it's damn expensive), and possibly a hundred on other x86 implementations. That's in the noise for something that does a full page table copy, but it likely makes using MMX for single page table entries a total loss. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V16 [0/4]: redesign overview 2005-01-14 17:01 ` Andi Kleen 2005-01-14 17:08 ` Christoph Lameter 2005-01-14 17:43 ` Linus Torvalds @ 2005-01-28 20:35 ` Christoph Lameter 2005-01-28 20:36 ` page fault scalability patch V16 [1/4]: avoid intermittent clearing of ptes Christoph Lameter ` (3 more replies) 2 siblings, 4 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-28 20:35 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Changes from V15->V16 of this patch: Complete Redesign. An introduction to what this patch does and a patch archive can be found on http://oss.sgi.com/projects/page_fault_performance. The archive also has a combined patch. The basic approach in this patchset is the same as used in SGI's 2.4.X based kernels which have been in production use in ProPack 3 for a long time. The patchset is composed of 4 patches (and was tested against 2.6.11-rc2-bk6 on ia64, i386 and x86_64): 1/4: ptep_cmpxchg and ptep_xchg to avoid intermittent zeroing of ptes The current way of synchronizing with the CPU or arch specific interrupts updating page table entries is to first set a pte to zero before writing a new value. This patch uses ptep_xchg and ptep_cmpxchg to avoid writing the zero for certain configurations. The patch introduces CONFIG_ATOMIC_TABLE_OPS that may be enabled as an experimental feature during kernel configuration if the hardware is able to support atomic operations and if an SMP kernel is being configured. A Kconfig update for i386, x86_64 and ia64 has been provided. On i386 this option is restricted to CPUs better than a 486 and non PAE mode (that way all the cmpxchg issues on old i386 CPUs and the problems with 64bit atomic operations on recent i386 CPUs are avoided). If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and ptep_cmpxchg are realized by falling back to clearing a pte before updating it. 
The patch does not change the use of mm->page_table_lock and the only performance improvement is the replacement of xchg-with-zero-and-then-write-new-pte-value with an xchg with the new value for SMP on some architectures if CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything major to VM operations. 2/4: Macros for mm counter manipulation There are various approaches to handling mm counters if the page_table_lock is no longer acquired. This patch defines macros in include/linux/sched.h to handle these counters and makes sure that these macros are used throughout the kernel to access and manipulate rss and anon_rss. There should be no change to the generated code as a result of this patch. 3/4: Drop the first use of the page_table_lock in handle_mm_fault The patch introduces two new functions: page_table_atomic_start(mm), page_table_atomic_stop(mm) that fall back to the use of the page_table_lock if CONFIG_ATOMIC_TABLE_OPS is not defined. If CONFIG_ATOMIC_TABLE_OPS is defined those functions may be used to prep the CPU for atomic table ops (i386 in PAE mode may f.e. get the MMX register ready for 64bit atomic ops) but are simply empty by default. Two operations may then be performed on the page table without acquiring the page table lock: a) updating access bits in pte b) anonymous read faults install a mapping to the zero page. All counters are still protected with the page_table_lock thus avoiding any issues there. Some additional counters are added to /proc/meminfo. Also counts spurious faults with no effect. There is a surprisingly high number of those on ia64 (used to populate the cpu caches with the pte??) 4/4: Drop the use of the page_table_lock in do_anonymous_page The second acquisition of the page_table_lock is removed from do_anonymous_page, allowing the anonymous write fault to be handled without the page_table_lock. 
The macros for manipulating rss and anon_rss in include/linux/sched.h are changed if CONFIG_ATOMIC_TABLE_OPS is set to use atomic operations for rss and anon_rss (safest solution for now, other solutions may easily be implemented by changing those macros). This patch typically yields significant increases in page fault performance for threaded applications on SMP systems. I have an additional patch that drops the page_table_lock for COW but that raises a lot of other issues. I will post that patch separately and only to linux-mm. ^ permalink raw reply [flat|nested] 286+ messages in thread
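The semantics of the ptep_xchg/ptep_cmpxchg operations that patch 1/4 introduces, and of the clear-then-set fallback used when CONFIG_ATOMIC_TABLE_OPS is off, can be modelled with a toy pte type. A hedged sketch only: `pte_t` here is a plain integer, there is no locking or TLB flushing, and the function names mirror but are not the kernel's:

```c
#include <stdint.h>

typedef uint64_t pte_t;

/* Toy model of the non-atomic fallback: momentarily zero the entry
 * (the "not present" window other CPUs may fault on), then install
 * the new value. The real kernel does this under mm->page_table_lock. */
static pte_t ptep_xchg_fallback(pte_t *ptep, pte_t newval)
{
    pte_t old = *ptep;
    *ptep = 0;        /* window in which the pte is clear */
    *ptep = newval;
    return old;
}

/* Toy model of ptep_cmpxchg semantics: install newval only if the
 * entry still holds oldval; report whether the update happened. */
static int ptep_cmpxchg_fallback(pte_t *ptep, pte_t oldval, pte_t newval)
{
    if (*ptep != oldval)
        return 0;
    *ptep = newval;
    return 1;
}
```

The point of the patchset is to replace the first function's clear-then-set window with a single atomic exchange, so a racing fault handler never observes the transient zero.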
* page fault scalability patch V16 [1/4]: avoid intermittent clearing of ptes 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter @ 2005-01-28 20:36 ` Christoph Lameter 2005-01-28 20:36 ` page fault scalability patch V16 [2/4]: mm counter macros Christoph Lameter ` (2 subsequent siblings) 3 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-28 20:36 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh The current way of updating ptes in the Linux vm includes first clearing a pte before setting it to another value. The clearing is performed while holding the page_table_lock to insure that the entry has not been modified by the CPU directly, by an arch specific interrupt handler or another page fault handler running on another CPU. This approach is necessary for some architectures that cannot perform atomic updates of page table entries to set the page table entry to not present for the MMU logic. If a page table entry is cleared then a second CPU may generate a page fault for that entry. The fault handler on the second CPU will then attempt to acquire the page_table_lock and wait until the first CPU has completed updating the page table entry. The fault handler on the second CPU will then discover that everything is ok and simply do nothing (apart from incrementing the counters for a minor fault and marking the page again as accessed). However, most architectures actually support atomic operations on page table entries. The use of atomic operations on page table entries would allow the update of a page table entry in a single atomic operation instead of writing to the page table entry twice. There would also be no danger of generating a spurious page fault on other CPUs. The following patch introduces two new atomic operations ptep_xchg and ptep_cmpxchg that may be provided by an architecture. 
The fallback in include/asm-generic/pgtable.h is to simulate both operations through the existing ptep_get_and_clear function. So there is essentially no change if atomic operations on ptes have not been defined. Architectures that do not support atomic operations on ptes may continue to use the clearing of a pte for locking type purposes. Atomic operations may be enabled in the kernel configuration on i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode. Generic atomic definitions for ptep_xchg and ptep_cmpxchg have been provided based on the existing xchg() and cmpxchg() functions that already work atomically on many platforms. It is very easy to implement this for any architecture by adding the appropriate definitions to arch/xx/Kconfig. The provided generic atomic functions may be overridden as usual by defining the appropriate __HAVE_ARCH_xxx constant and providing an implementation. My aim of reducing the use of the page_table_lock in the page fault handler relies on a pte never being clear if the pte is in use even when the page_table_lock is not held. Clearing a pte before setting it to another value could result in a situation in which a fault generated by another cpu could install a pte which is then immediately overwritten by the first CPU setting the pte to a valid value again. This patch is important for future work on reducing the use of spinlocks in the vm. Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-27 16:27:40.000000000 -0800 @@ -575,11 +575,6 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. 
*/ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -594,11 +589,15 @@ static int try_to_unmap_one(struct page list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now that the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; acct_update_integrals(); page_remove_rmap(page); @@ -691,15 +690,15 @@ static void try_to_unmap_cluster(unsigne if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) - set_pte(pte, pgoff_to_pte(page->index)); + pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index)); + else + pteval = ptep_clear_flush(vma, address, pte); - /* Move the dirty bit to the physical page now the pte is gone. */ + /* Move the dirty bit to the physical page now that the pte is gone. 
*/ if (pte_dirty(pteval)) set_page_dirty(page); Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-27 14:52:11.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-27 16:27:40.000000000 -0800 @@ -513,14 +513,18 @@ static void zap_pte_range(struct mmu_gat page->index > details->last_index)) continue; } - pte = ptep_get_and_clear(ptep); - tlb_remove_tlb_entry(tlb, ptep, address+offset); - if (unlikely(!page)) + if (unlikely(!page)) { + pte = ptep_get_and_clear(ptep); + tlb_remove_tlb_entry(tlb, ptep, address+offset); continue; + } if (unlikely(details) && details->nonlinear_vma && linear_page_index(details->nonlinear_vma, address+offset) != page->index) - set_pte(ptep, pgoff_to_pte(page->index)); + pte = ptep_xchg(ptep, pgoff_to_pte(page->index)); + else + pte = ptep_get_and_clear(ptep); + tlb_remove_tlb_entry(tlb, ptep, address+offset); if (pte_dirty(pte)) set_page_dirty(page); if (PageAnon(page)) Index: linux-2.6.10/mm/mprotect.c =================================================================== --- linux-2.6.10.orig/mm/mprotect.c 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/mm/mprotect.c 2005-01-27 16:27:40.000000000 -0800 @@ -48,12 +48,16 @@ change_pte_range(pmd_t *pmd, unsigned lo if (pte_present(*pte)) { pte_t entry; - /* Avoid an SMP race with hardware updated dirty/clean - * bits by wiping the pte and then setting the new pte - * into place. - */ - entry = ptep_get_and_clear(pte); - set_pte(pte, pte_modify(entry, newprot)); + /* Deal with a potential SMP race with hardware/arch + * interrupt updating dirty/clean bits through the use + * of ptep_cmpxchg. 
+ */ + do { + entry = *pte; + } while (!ptep_cmpxchg(pte, + entry, + pte_modify(entry, newprot) + )); } address += PAGE_SIZE; pte++; Index: linux-2.6.10/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-27 16:27:40.000000000 -0800 @@ -102,6 +102,92 @@ static inline pte_t ptep_get_and_clear(p }) #endif +#ifdef CONFIG_ATOMIC_TABLE_OPS + +/* + * The architecture does support atomic table operations. + * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using + * cmpxchg and xchg. + */ +#ifndef __HAVE_ARCH_PTEP_XCHG +#define ptep_xchg(__ptep, __pteval) \ + __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval))) +#endif + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__ptep,__oldval,__newval) \ + (cmpxchg(&pte_val(*(__ptep)), \ + pte_val(__oldval), \ + pte_val(__newval) \ + ) == pte_val(__oldval) \ + ) +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __pte = ptep_xchg(__ptep, __pteval); \ + flush_tlb_page(__vma, __address); \ + __pte; \ +}) +#endif + +#else + +/* + * No support for atomic operations on the page table. + * Exchanging of pte values is done by first swapping zeros into + * a pte and then putting new content into the pte entry. + * However, these functions will generate an empty pte for a + * short time frame. This means that the page_table_lock must be held + * to avoid a page fault that would install a new entry. 
+ */ +#ifndef __HAVE_ARCH_PTEP_XCHG +#define ptep_xchg(__ptep, __pteval) \ +({ \ + pte_t __pte = ptep_get_and_clear(__ptep); \ + set_pte(__ptep, __pteval); \ + __pte; \ +}) +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#ifndef __HAVE_ARCH_PTEP_XCHG +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \ + set_pte(__ptep, __pteval); \ + __pte; \ +}) +#else +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __pte = ptep_xchg(__ptep, __pteval); \ + flush_tlb_page(__vma, __address); \ + __pte; \ +}) +#endif +#endif + +/* + * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg + * since cmpxchg may not be available on certain architectures. Instead + * the clearing of a pte is used as a form of locking mechanism. + * This approach will only work if the page_table_lock is held to insure + * that the pte is not populated by a page fault generated on another + * CPU. + */ +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__ptep, __old, __new) \ +({ \ + pte_t prev = ptep_get_and_clear(__ptep); \ + int r = pte_val(prev) == pte_val(__old); \ + set_pte(__ptep, r ? (__new) : prev); \ + r; \ +}) +#endif +#endif + #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT static inline void ptep_set_wrprotect(pte_t *ptep) { Index: linux-2.6.10/arch/ia64/Kconfig =================================================================== --- linux-2.6.10.orig/arch/ia64/Kconfig 2005-01-27 14:47:14.000000000 -0800 +++ linux-2.6.10/arch/ia64/Kconfig 2005-01-27 16:36:56.000000000 -0800 @@ -280,6 +280,17 @@ config PREEMPT Say Y here if you are building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. 
+config ATOMIC_TABLE_OPS + bool "Atomic Page Table Operations (EXPERIMENTAL)" + depends on SMP && EXPERIMENTAL + help + Atomic Page table operations allow page faults + without the use (or with reduce use of) spinlocks + and allow greater concurrency for a task with multiple + threads in the page fault handler. This is in particular + useful for high CPU counts and processes that use + large amounts of memory. + config HAVE_DEC_LOCK bool depends on (SMP || PREEMPT) Index: linux-2.6.10/arch/i386/Kconfig =================================================================== --- linux-2.6.10.orig/arch/i386/Kconfig 2005-01-27 14:47:14.000000000 -0800 +++ linux-2.6.10/arch/i386/Kconfig 2005-01-27 16:37:05.000000000 -0800 @@ -868,6 +868,17 @@ config HAVE_DEC_LOCK depends on (SMP || PREEMPT) && X86_CMPXCHG default y +config ATOMIC_TABLE_OPS + bool "Atomic Page Table Operations (EXPERIMENTAL)" + depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE + help + Atomic Page table operations allow page faults + without the use (or with reduce use of) spinlocks + and allow greater concurrency for a task with multiple + threads in the page fault handler. This is in particular + useful for high CPU counts and processes that use + large amounts of memory. + # turning this on wastes a bunch of space. # Summit needs it only when NUMA is on config BOOT_IOREMAP Index: linux-2.6.10/arch/x86_64/Kconfig =================================================================== --- linux-2.6.10.orig/arch/x86_64/Kconfig 2005-01-27 14:52:10.000000000 -0800 +++ linux-2.6.10/arch/x86_64/Kconfig 2005-01-27 16:37:15.000000000 -0800 @@ -240,6 +240,17 @@ config PREEMPT Say Y here if you are feeling brave and building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. 
+config ATOMIC_TABLE_OPS + bool "Atomic Page Table Operations (EXPERIMENTAL)" + depends on SMP && EXPERIMENTAL + help + Atomic Page table operations allow page faults + without the use (or with reduce use of) spinlocks + and allow greater concurrency for a task with multiple + threads in the page fault handler. This is in particular + useful for high CPU counts and processes that use + large amounts of memory. + config PREEMPT_BKL bool "Preempt The Big Kernel Lock" depends on PREEMPT ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V16 [2/4]: mm counter macros 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter 2005-01-28 20:36 ` page fault scalability patch V16 [1/4]: avoid intermittent clearing of ptes Christoph Lameter @ 2005-01-28 20:36 ` Christoph Lameter 2005-01-28 20:37 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter 2005-01-28 20:38 ` page fault scalability patch V16 [4/4]: Drop page_table_lock in do_anonymous_page Christoph Lameter 3 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-28 20:36 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh

This patch extracts all the interesting pieces for handling rss and anon_rss into definitions in include/linux/sched.h. All rss operations are performed through the following three macros:

get_mm_counter(mm, member)           -> Obtain the value of a counter
set_mm_counter(mm, member, value)    -> Set the value of a counter
update_mm_counter(mm, member, value) -> Add a value to a counter

The simple definitions provided in this patch should result in no change to the generated code. With this patch it becomes easier to add new counters, and it becomes possible to redefine the method of counter handling (e.g. the page fault scalability patches may want to use atomic operations or a split rss). 
Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/linux/sched.h =================================================================== --- linux-2.6.10.orig/include/linux/sched.h 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/include/linux/sched.h 2005-01-28 11:02:00.000000000 -0800 @@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi extern void arch_unmap_area(struct vm_area_struct *area); extern void arch_unmap_area_topdown(struct vm_area_struct *area); +#define set_mm_counter(mm, member, value) (mm)->member = (value) +#define get_mm_counter(mm, member) ((mm)->member) +#define update_mm_counter(mm, member, value) (mm)->member += (value) +#define MM_COUNTER_T unsigned long struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ @@ -219,7 +223,7 @@ struct mm_struct { atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; - spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + spinlock_t page_table_lock; /* Protects page tables and some counters */ struct list_head mmlist; /* List of maybe swapped mm's. 
These are globally strung * together off init_mm.mmlist, and are protected @@ -229,9 +233,13 @@ struct mm_struct { unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; + /* Special counters protected by the page_table_lock */ + MM_COUNTER_T rss; + MM_COUNTER_T anon_rss; + unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ unsigned dumpable:1; Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-28 11:01:58.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-28 11:02:00.000000000 -0800 @@ -324,9 +324,9 @@ copy_one_pte(struct mm_struct *dst_mm, pte = pte_mkclean(pte); pte = pte_mkold(pte); get_page(page); - dst_mm->rss++; + update_mm_counter(dst_mm, rss, 1); if (PageAnon(page)) - dst_mm->anon_rss++; + update_mm_counter(dst_mm, anon_rss, 1); set_pte(dst_pte, pte); page_dup_rmap(page); } @@ -528,7 +528,7 @@ static void zap_pte_range(struct mmu_gat if (pte_dirty(pte)) set_page_dirty(page); if (PageAnon(page)) - tlb->mm->anon_rss--; + update_mm_counter(tlb->mm, anon_rss, -1); else if (pte_young(pte)) mark_page_accessed(page); tlb->freed++; @@ -1345,13 +1345,14 @@ static int do_wp_page(struct mm_struct * spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { - if (PageAnon(old_page)) - mm->anon_rss--; + if (PageAnon(old_page)) + update_mm_counter(mm, anon_rss, -1); if (PageReserved(old_page)) { - ++mm->rss; + update_mm_counter(mm, rss, 1); acct_update_integrals(); update_mem_hiwater(); } else + page_remove_rmap(old_page); break_cow(vma, new_page, address, page_table); lru_cache_add_active(new_page); @@ -1755,7 +1756,7 @@ static int 
do_swap_page(struct mm_struct if (vm_swap_full()) remove_exclusive_swap_page(page); - mm->rss++; + update_mm_counter(mm, rss, 1); acct_update_integrals(); update_mem_hiwater(); @@ -1823,7 +1824,7 @@ do_anonymous_page(struct mm_struct *mm, spin_unlock(&mm->page_table_lock); goto out; } - mm->rss++; + update_mm_counter(mm, rss, 1); acct_update_integrals(); update_mem_hiwater(); entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, @@ -1941,7 +1942,7 @@ retry: /* Only go through if we didn't race with anybody else... */ if (pte_none(*page_table)) { if (!PageReserved(new_page)) - ++mm->rss; + update_mm_counter(mm, rss, 1); acct_update_integrals(); update_mem_hiwater(); @@ -2272,8 +2273,10 @@ void update_mem_hiwater(void) struct task_struct *tsk = current; if (tsk->mm) { - if (tsk->mm->hiwater_rss < tsk->mm->rss) - tsk->mm->hiwater_rss = tsk->mm->rss; + unsigned long rss = get_mm_counter(tsk->mm, rss); + + if (tsk->mm->hiwater_rss < rss) + tsk->mm->hiwater_rss = rss; if (tsk->mm->hiwater_vm < tsk->mm->total_vm) tsk->mm->hiwater_vm = tsk->mm->total_vm; } Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-28 11:01:58.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-28 11:02:00.000000000 -0800 @@ -258,7 +258,7 @@ static int page_referenced_one(struct pa pte_t *pte; int referenced = 0; - if (!mm->rss) + if (!get_mm_counter(mm, rss)) goto out; address = vma_address(page, vma); if (address == -EFAULT) @@ -437,7 +437,7 @@ void page_add_anon_rmap(struct page *pag BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; + update_mm_counter(vma->vm_mm, anon_rss, 1); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; @@ -510,7 +510,7 @@ static int try_to_unmap_one(struct page pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) + if (!get_mm_counter(mm, rss)) goto out; address = vma_address(page, vma); if (address == -EFAULT) @@ -591,14 
+591,14 @@ static int try_to_unmap_one(struct page } pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); - mm->anon_rss--; + update_mm_counter(mm, anon_rss, -1); } else pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now that the pte is gone. */ if (pte_dirty(pteval)) set_page_dirty(page); - mm->rss--; + update_mm_counter(mm, rss, -1); acct_update_integrals(); page_remove_rmap(page); page_cache_release(page); @@ -705,7 +705,7 @@ static void try_to_unmap_cluster(unsigne page_remove_rmap(page); page_cache_release(page); acct_update_integrals(); - mm->rss--; + update_mm_counter(mm, rss, -1); (*mapcount)--; } @@ -804,7 +804,7 @@ static int try_to_unmap_file(struct page if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && + while (get_mm_counter(vma->vm_mm, rss) && cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); Index: linux-2.6.10/fs/proc/task_mmu.c =================================================================== --- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-28 11:02:00.000000000 -0800 @@ -24,7 +24,7 @@ char *task_mem(struct mm_struct *mm, cha "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << (PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + get_mm_counter(mm, rss) << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -39,11 +39,13 @@ unsigned long task_vsize(struct mm_struc int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + int rss = get_mm_counter(mm, rss); + + *shared = rss - get_mm_counter(mm, anon_rss); *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) 
>> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; - *resident = mm->rss; + *resident = rss; return mm->total_vm; } Index: linux-2.6.10/mm/mmap.c =================================================================== --- linux-2.6.10.orig/mm/mmap.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/mm/mmap.c 2005-01-28 11:02:00.000000000 -0800 @@ -2003,7 +2003,7 @@ void exit_mmap(struct mm_struct *mm) vma = mm->mmap; mm->mmap = mm->mmap_cache = NULL; mm->mm_rb = RB_ROOT; - mm->rss = 0; + set_mm_counter(mm, rss, 0); mm->total_vm = 0; mm->locked_vm = 0; Index: linux-2.6.10/kernel/fork.c =================================================================== --- linux-2.6.10.orig/kernel/fork.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/kernel/fork.c 2005-01-28 11:02:00.000000000 -0800 @@ -174,8 +174,8 @@ static inline int dup_mmap(struct mm_str mm->mmap_cache = NULL; mm->free_area_cache = oldmm->mmap_base; mm->map_count = 0; - mm->rss = 0; - mm->anon_rss = 0; + set_mm_counter(mm, rss, 0); + set_mm_counter(mm, anon_rss, 0); cpus_clear(mm->cpu_vm_mask); mm->mm_rb = RB_ROOT; rb_link = &mm->mm_rb.rb_node; @@ -471,7 +471,7 @@ static int copy_mm(unsigned long clone_f if (retval) goto free_pt; - mm->hiwater_rss = mm->rss; + mm->hiwater_rss = get_mm_counter(mm,rss); mm->hiwater_vm = mm->total_vm; good_mm: Index: linux-2.6.10/include/asm-generic/tlb.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/tlb.h 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/include/asm-generic/tlb.h 2005-01-28 11:02:00.000000000 -0800 @@ -88,11 +88,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u { int freed = tlb->freed; struct mm_struct *mm = tlb->mm; - int rss = mm->rss; + int rss = get_mm_counter(mm, rss); if (rss < freed) freed = rss; - mm->rss = rss - freed; + update_mm_counter(mm, rss, -freed); tlb_flush_mmu(tlb, start, end); /* keep the page table cache within bounds */ Index: linux-2.6.10/fs/binfmt_flat.c 
=================================================================== --- linux-2.6.10.orig/fs/binfmt_flat.c 2004-12-24 13:33:47.000000000 -0800 +++ linux-2.6.10/fs/binfmt_flat.c 2005-01-28 11:02:00.000000000 -0800 @@ -650,7 +650,7 @@ static int load_flat_file(struct linux_b current->mm->start_brk = datapos + data_len + bss_len; current->mm->brk = (current->mm->start_brk + 3) & ~3; current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len; - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); } if (flags & FLAT_FLAG_KTRACE) Index: linux-2.6.10/fs/exec.c =================================================================== --- linux-2.6.10.orig/fs/exec.c 2005-01-28 11:01:50.000000000 -0800 +++ linux-2.6.10/fs/exec.c 2005-01-28 11:02:00.000000000 -0800 @@ -326,7 +326,7 @@ void install_arg_page(struct vm_area_str pte_unmap(pte); goto out; } - mm->rss++; + update_mm_counter(mm, rss, 1); lru_cache_add_active(page); set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte( page, vma->vm_page_prot)))); Index: linux-2.6.10/fs/binfmt_som.c =================================================================== --- linux-2.6.10.orig/fs/binfmt_som.c 2005-01-28 11:01:50.000000000 -0800 +++ linux-2.6.10/fs/binfmt_som.c 2005-01-28 11:02:00.000000000 -0800 @@ -259,7 +259,7 @@ load_som_binary(struct linux_binprm * bp create_som_tables(bprm); current->mm->start_stack = bprm->p; - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); #if 0 printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk); Index: linux-2.6.10/mm/fremap.c =================================================================== --- linux-2.6.10.orig/mm/fremap.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/mm/fremap.c 2005-01-28 11:02:00.000000000 -0800 @@ -39,7 +39,7 @@ static inline void zap_pte(struct mm_str set_page_dirty(page); page_remove_rmap(page); page_cache_release(page); - mm->rss--; + update_mm_counter(mm, rss, -1); } } } else { @@ -92,7 +92,7 @@ int install_page(struct 
mm_struct *mm, s zap_pte(mm, vma, addr, pte); - mm->rss++; + update_mm_counter(mm,rss, 1); flush_icache_page(vma, page); set_pte(pte, mk_pte(page, prot)); page_add_file_rmap(page); Index: linux-2.6.10/mm/swapfile.c =================================================================== --- linux-2.6.10.orig/mm/swapfile.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/mm/swapfile.c 2005-01-28 11:02:00.000000000 -0800 @@ -432,7 +432,7 @@ static void unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir, swp_entry_t entry, struct page *page) { - vma->vm_mm->rss++; + update_mm_counter(vma->vm_mm, rss, 1); get_page(page); set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot))); page_add_anon_rmap(page, vma, address); Index: linux-2.6.10/fs/binfmt_aout.c =================================================================== --- linux-2.6.10.orig/fs/binfmt_aout.c 2005-01-28 11:01:50.000000000 -0800 +++ linux-2.6.10/fs/binfmt_aout.c 2005-01-28 11:02:00.000000000 -0800 @@ -317,7 +317,7 @@ static int load_aout_binary(struct linux (current->mm->start_brk = N_BSSADDR(ex)); current->mm->free_area_cache = current->mm->mmap_base; - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); current->mm->mmap = NULL; compute_creds(bprm); current->flags &= ~PF_FORKNOEXEC; Index: linux-2.6.10/arch/ia64/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/ia64/mm/hugetlbpage.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/ia64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800 @@ -73,7 +73,7 @@ set_huge_pte (struct mm_struct *mm, stru { pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) { entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); @@ -116,7 +116,7 @@ int copy_hugetlb_page_range(struct mm_st ptepage = pte_page(entry); get_page(ptepage); set_pte(dst_pte, entry); - dst->rss += (HPAGE_SIZE / 
PAGE_SIZE); + update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); addr += HPAGE_SIZE; } return 0; @@ -246,7 +246,7 @@ void unmap_hugepage_range(struct vm_area put_page(page); pte_clear(pte); } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, - ((end - start) >> PAGE_SHIFT)); flush_tlb_range(vma, start, end); } Index: linux-2.6.10/fs/binfmt_elf.c =================================================================== --- linux-2.6.10.orig/fs/binfmt_elf.c 2005-01-28 11:01:55.000000000 -0800 +++ linux-2.6.10/fs/binfmt_elf.c 2005-01-28 11:02:00.000000000 -0800 @@ -764,7 +764,7 @@ static int load_elf_binary(struct linux_ /* Do this so that we can load the interpreter, if need be. We will change some of these later */ - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); current->mm->free_area_cache = current->mm->mmap_base; retval = setup_arg_pages(bprm, STACK_TOP, executable_stack); if (retval < 0) { Index: linux-2.6.10/include/asm-ia64/tlb.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/tlb.h 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/tlb.h 2005-01-28 11:02:00.000000000 -0800 @@ -161,11 +161,11 @@ tlb_finish_mmu (struct mmu_gather *tlb, { unsigned long freed = tlb->freed; struct mm_struct *mm = tlb->mm; - unsigned long rss = mm->rss; + unsigned long rss = get_mm_counter(mm, rss); if (rss < freed) freed = rss; - mm->rss = rss - freed; + update_mm_counter(mm, rss, -freed); /* * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and * tlb->end_addr. 
Index: linux-2.6.10/include/asm-arm/tlb.h =================================================================== --- linux-2.6.10.orig/include/asm-arm/tlb.h 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/include/asm-arm/tlb.h 2005-01-28 11:02:00.000000000 -0800 @@ -54,11 +54,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u { struct mm_struct *mm = tlb->mm; unsigned long freed = tlb->freed; - int rss = mm->rss; + int rss = get_mm_counter(mm, rss); if (rss < freed) freed = rss; - mm->rss = rss - freed; + update_mm_counter(mm, rss, -freed); if (freed) { flush_tlb_mm(mm); Index: linux-2.6.10/include/asm-arm26/tlb.h =================================================================== --- linux-2.6.10.orig/include/asm-arm26/tlb.h 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/include/asm-arm26/tlb.h 2005-01-28 11:02:00.000000000 -0800 @@ -37,11 +37,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u { struct mm_struct *mm = tlb->mm; unsigned long freed = tlb->freed; - int rss = mm->rss; + int rss = get_mm_counter(mm, rss); if (rss < freed) freed = rss; - mm->rss = rss - freed; + update_mm_counter(mm, rss, -freed); if (freed) { flush_tlb_mm(mm); Index: linux-2.6.10/include/asm-sparc64/tlb.h =================================================================== --- linux-2.6.10.orig/include/asm-sparc64/tlb.h 2004-12-24 13:35:23.000000000 -0800 +++ linux-2.6.10/include/asm-sparc64/tlb.h 2005-01-28 11:02:00.000000000 -0800 @@ -80,11 +80,11 @@ static inline void tlb_finish_mmu(struct { unsigned long freed = mp->freed; struct mm_struct *mm = mp->mm; - unsigned long rss = mm->rss; + unsigned long rss = get_mm_counter(mm, rss); if (rss < freed) freed = rss; - mm->rss = rss - freed; + update_mm_counter(mm, rss, -freed); tlb_flush_mmu(mp); Index: linux-2.6.10/arch/sh/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/sh/mm/hugetlbpage.c 2004-12-24 13:34:58.000000000 -0800 +++ linux-2.6.10/arch/sh/mm/hugetlbpage.c 
2005-01-28 11:02:00.000000000 -0800 @@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc unsigned long i; pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) entry = pte_mkwrite(pte_mkdirty(mk_pte(page, @@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st pte_val(entry) += PAGE_SIZE; dst_pte++; } - dst->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); addr += HPAGE_SIZE; } return 0; @@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area pte++; } } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT)); flush_tlb_range(vma, start, end); } Index: linux-2.6.10/arch/x86_64/ia32/ia32_aout.c =================================================================== --- linux-2.6.10.orig/arch/x86_64/ia32/ia32_aout.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/x86_64/ia32/ia32_aout.c 2005-01-28 11:02:00.000000000 -0800 @@ -313,7 +313,7 @@ static int load_aout_binary(struct linux (current->mm->start_brk = N_BSSADDR(ex)); current->mm->free_area_cache = TASK_UNMAPPED_BASE; - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); current->mm->mmap = NULL; compute_creds(bprm); current->flags &= ~PF_FORKNOEXEC; Index: linux-2.6.10/arch/ppc64/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/ppc64/mm/hugetlbpage.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/ppc64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800 @@ -153,7 +153,7 @@ static void set_huge_pte(struct mm_struc { pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) { entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); @@ -311,7 +311,7 @@ int copy_hugetlb_page_range(struct mm_st ptepage = pte_page(entry); get_page(ptepage); - dst->rss += (HPAGE_SIZE / PAGE_SIZE); + 
update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); set_pte(dst_pte, entry); addr += HPAGE_SIZE; @@ -421,7 +421,7 @@ void unmap_hugepage_range(struct vm_area put_page(page); } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT)); flush_tlb_pending(); } Index: linux-2.6.10/arch/sh64/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/sh64/mm/hugetlbpage.c 2004-12-24 13:34:30.000000000 -0800 +++ linux-2.6.10/arch/sh64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800 @@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc unsigned long i; pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) entry = pte_mkwrite(pte_mkdirty(mk_pte(page, @@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st pte_val(entry) += PAGE_SIZE; dst_pte++; } - dst->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); addr += HPAGE_SIZE; } return 0; @@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area pte++; } } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT)); flush_tlb_range(vma, start, end); } Index: linux-2.6.10/arch/sparc64/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/mm/hugetlbpage.c 2004-12-24 13:35:01.000000000 -0800 +++ linux-2.6.10/arch/sparc64/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800 @@ -59,7 +59,7 @@ static void set_huge_pte(struct mm_struc unsigned long i; pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) entry = pte_mkwrite(pte_mkdirty(mk_pte(page, @@ -112,7 +112,7 @@ int copy_hugetlb_page_range(struct mm_st pte_val(entry) += PAGE_SIZE; dst_pte++; } - dst->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / 
PAGE_SIZE); addr += HPAGE_SIZE; } return 0; @@ -203,7 +203,7 @@ void unmap_hugepage_range(struct vm_area pte++; } } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT)); flush_tlb_range(vma, start, end); } Index: linux-2.6.10/arch/mips/kernel/irixelf.c =================================================================== --- linux-2.6.10.orig/arch/mips/kernel/irixelf.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/mips/kernel/irixelf.c 2005-01-28 11:02:00.000000000 -0800 @@ -690,7 +690,7 @@ static int load_irix_binary(struct linux /* Do this so that we can load the interpreter, if need be. We will * change some of these later. */ - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT); current->mm->start_stack = bprm->p; Index: linux-2.6.10/arch/m68k/atari/stram.c =================================================================== --- linux-2.6.10.orig/arch/m68k/atari/stram.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/m68k/atari/stram.c 2005-01-28 11:02:00.000000000 -0800 @@ -635,7 +635,7 @@ static inline void unswap_pte(struct vm_ set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot))); swap_free(entry); get_page(page); - ++vma->vm_mm->rss; + update_mm_counter(vma->vm_mm, rss, 1); } static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir, Index: linux-2.6.10/arch/i386/mm/hugetlbpage.c =================================================================== --- linux-2.6.10.orig/arch/i386/mm/hugetlbpage.c 2005-01-28 11:01:47.000000000 -0800 +++ linux-2.6.10/arch/i386/mm/hugetlbpage.c 2005-01-28 11:02:00.000000000 -0800 @@ -46,7 +46,7 @@ static void set_huge_pte(struct mm_struc { pte_t entry; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); if (write_access) { entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); @@ -86,7 +86,7 @@ int copy_hugetlb_page_range(struct 
mm_st ptepage = pte_page(entry); get_page(ptepage); set_pte(dst_pte, entry); - dst->rss += (HPAGE_SIZE / PAGE_SIZE); + update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); addr += HPAGE_SIZE; } return 0; @@ -222,7 +222,7 @@ void unmap_hugepage_range(struct vm_area page = pte_page(pte); put_page(page); } - mm->rss -= (end - start) >> PAGE_SHIFT; + update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT)); flush_tlb_range(vma, start, end); } Index: linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/kernel/binfmt_aout32.c 2005-01-28 11:01:48.000000000 -0800 +++ linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c 2005-01-28 11:02:00.000000000 -0800 @@ -241,7 +241,7 @@ static int load_aout32_binary(struct lin current->mm->brk = ex.a_bss + (current->mm->start_brk = N_BSSADDR(ex)); - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); current->mm->mmap = NULL; compute_creds(bprm); current->flags &= ~PF_FORKNOEXEC; Index: linux-2.6.10/fs/proc/array.c =================================================================== --- linux-2.6.10.orig/fs/proc/array.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/fs/proc/array.c 2005-01-28 11:02:00.000000000 -0800 @@ -423,7 +423,7 @@ static int do_task_stat(struct task_stru jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ?
mm->end_code : 0, Index: linux-2.6.10/fs/binfmt_elf_fdpic.c =================================================================== --- linux-2.6.10.orig/fs/binfmt_elf_fdpic.c 2005-01-28 11:01:50.000000000 -0800 +++ linux-2.6.10/fs/binfmt_elf_fdpic.c 2005-01-28 10:53:07.000000000 -0800 @@ -299,7 +299,7 @@ static int load_elf_fdpic_binary(struct /* do this so that we can load the interpreter, if need be * - we will change some of these later */ - current->mm->rss = 0; + set_mm_counter(current->mm, rss, 0); #ifdef CONFIG_MMU retval = setup_arg_pages(bprm, current->mm->start_stack, executable_stack); Index: linux-2.6.10/mm/nommu.c =================================================================== --- linux-2.6.10.orig/mm/nommu.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/mm/nommu.c 2005-01-28 11:04:33.000000000 -0800 @@ -958,10 +958,11 @@ void arch_unmap_area(struct vm_area_stru void update_mem_hiwater(void) { struct task_struct *tsk = current; + unsigned long rss = get_mm_counter(tsk->mm, rss); if (likely(tsk->mm)) { - if (tsk->mm->hiwater_rss < tsk->mm->rss) - tsk->mm->hiwater_rss = tsk->mm->rss; + if (tsk->mm->hiwater_rss < rss) + tsk->mm->hiwater_rss = rss; if (tsk->mm->hiwater_vm < tsk->mm->total_vm) tsk->mm->hiwater_vm = tsk->mm->total_vm; } Index: linux-2.6.10/kernel/acct.c =================================================================== --- linux-2.6.10.orig/kernel/acct.c 2005-01-28 11:01:51.000000000 -0800 +++ linux-2.6.10/kernel/acct.c 2005-01-28 11:03:13.000000000 -0800 @@ -544,7 +544,7 @@ void acct_update_integrals(void) if (delta == 0) return; tsk->acct_stimexpd = tsk->stime; - tsk->acct_rss_mem1 += delta * tsk->mm->rss; + tsk->acct_rss_mem1 += delta * get_mm_counter(tsk->mm, rss); tsk->acct_vm_mem1 += delta * tsk->mm->total_vm; } } ^ permalink raw reply [flat|nested] 286+ messages in thread
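The counter macros substituted throughout the hunks above (set_mm_counter, update_mm_counter, get_mm_counter) abstract the rss field so that, once faults run without the page_table_lock, the counter can be made atomic. A rough userspace sketch of what an atomic implementation looks like — the struct and function names here are invented for illustration and are not the kernel macros:

```c
#include <stdatomic.h>

/* Illustrative stand-in for the rss counter in mm_struct. */
struct mm_sketch {
	atomic_long rss;	/* resident pages */
};

/* set_mm_counter(mm, rss, v): used when no concurrency is possible,
 * e.g. while setting up a fresh mm during exec. */
static void set_rss(struct mm_sketch *mm, long v)
{
	atomic_store_explicit(&mm->rss, v, memory_order_relaxed);
}

/* update_mm_counter(mm, rss, delta): once faults no longer hold the
 * page_table_lock, concurrent callers race, so the read-modify-write
 * must be a single atomic operation rather than "mm->rss += delta". */
static void update_rss(struct mm_sketch *mm, long delta)
{
	atomic_fetch_add_explicit(&mm->rss, delta, memory_order_relaxed);
}

/* get_mm_counter(mm, rss) */
static long get_rss(struct mm_sketch *mm)
{
	return atomic_load_explicit(&mm->rss, memory_order_relaxed);
}
```

Relaxed ordering suffices here because rss is a statistic, not something other state is synchronized against.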
* page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter 2005-01-28 20:36 ` page fault scalability patch V16 [1/4]: avoid intermittent clearing of ptes Christoph Lameter 2005-01-28 20:36 ` page fault scalability patch V16 [2/4]: mm counter macros Christoph Lameter @ 2005-01-28 20:37 ` Christoph Lameter 2005-02-01 4:08 ` Nick Piggin 2005-02-01 4:16 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Nick Piggin 2005-01-28 20:38 ` page fault scalability patch V16 [4/4]: Drop page_table_lock in do_anonymous_page Christoph Lameter 3 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-28 20:37 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh The page fault handler attempts to hold the page_table_lock only for short periods: it repeatedly drops and reacquires the lock, and each time the lock is reacquired it checks whether the underlying pte has changed before replacing the pte value. These locations are a good fit for the use of ptep_cmpxchg. The following patch removes the first acquisition of the page_table_lock and uses atomic operations on the page table instead. A section using atomic pte operations begins with page_table_atomic_start(struct mm_struct *) and ends with page_table_atomic_stop(struct mm_struct *). Both of these become spin_lock(page_table_lock) and spin_unlock(page_table_lock) if atomic page table operations are not configured (CONFIG_ATOMIC_TABLE_OPS undefined). Atomic operations with ptep_xchg and ptep_cmpxchg only work for the lowest layer of the page table. Higher layers may also be populated atomically by defining pmd_test_and_populate() etc.
The generic versions of these functions fall back to the page_table_lock (populating higher level page table entries is rare and therefore this is not likely to be performance critical). For ia64 the definitions for higher level atomic operations are included; these may easily be added for other architectures. This patch depends on the pte_cmpxchg patch to be applied first and will only remove the first use of the page_table_lock in the page fault handler. This will allow the following page table operations without acquiring the page_table_lock: 1. Updating of access bits (handle_mm_fault) 2. Anonymous read faults (do_anonymous_page) The page_table_lock is still acquired for creating a new pte for an anonymous write fault and therefore the problems with rss that were addressed by splitting rss into the task structure do not yet occur. The patch also adds some diagnostic features by counting the number of cmpxchg failures (useful for verifying that the patch works correctly) and the number of faults received that led to no change in the page table. These statistics may be viewed via /proc/meminfo. Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-27 16:27:59.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-27 16:28:54.000000000 -0800 @@ -36,6 +36,8 @@ * (Gerhard.Wichert@pdb.siemens.de) * * Aug/Sep 2004 Changed to four level page tables (Andi Kleen) + * Jan 2005 Scalability improvement by reducing the use and the length of time + * the page table lock is held (Christoph Lameter) */ #include <linux/kernel_stat.h> @@ -1285,8 +1287,8 @@ static inline void break_cow(struct vm_a * change only once the write actually happens. This avoids a few races, * and potentially makes it more efficient. * - * We hold the mm semaphore and the page_table_lock on entry and exit - * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations, + * exit with pte ops completed. */ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte) @@ -1304,7 +1306,7 @@ static int do_wp_page(struct mm_struct * pte_unmap(page_table); printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n", address); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); return VM_FAULT_OOM; } old_page = pfn_to_page(pfn); @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct * flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_set_access_flags(vma, address, page_table, entry, 1); - update_mmu_cache(vma, address, entry); + /* + * If the bits are not updated then another fault + * will be generated with another chance of updating. + */ + if (ptep_cmpxchg(page_table, pte, entry)) + update_mmu_cache(vma, address, entry); + else + inc_page_state(cmpxchg_fail_flag_reuse); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); return VM_FAULT_MINOR; } } pte_unmap(page_table); + page_table_atomic_stop(mm); /* * Ok, we need to copy. Oh, well.. */ if (!PageReserved(old_page)) page_cache_get(old_page); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_new_page; @@ -1340,7 +1348,8 @@ static int do_wp_page(struct mm_struct * copy_cow_page(old_page,new_page,address); /* - * Re-check the pte - we dropped the lock + * Re-check the pte - so far we may not have acquired the + * page_table_lock */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1692,8 +1701,7 @@ void swapin_readahead(swp_entry_t entry, } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. 
+ * We hold the mm semaphore and have started atomic pte operations */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1705,15 +1713,14 @@ static int do_swap_page(struct mm_struct int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1732,12 +1739,11 @@ static int do_swap_page(struct mm_struct grab_swap_token(); } - mark_page_accessed(page); + SetPageReferenced(page); lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1771,80 +1777,94 @@ static int do_swap_page(struct mm_struct set_pte(page_table, pte); page_add_anon_rmap(page, vma, address); + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, address, pte); + pte_unmap(page_table); + spin_unlock(&mm->page_table_lock); + if (write_access) { + page_table_atomic_start(mm); if (do_wp_page(mm, vma, address, page_table, pmd, pte) == VM_FAULT_OOM) ret = VM_FAULT_OOM; - goto out; } - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, address, pte); - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); out: return ret; } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held and atomic pte operations started. 
*/ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; - struct page * page = ZERO_PAGE(addr); + struct page * page; - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + if (unlikely(!write_access)) { - /* ..except if it's a write access */ - if (write_access) { - /* Allocate our own private page. */ + /* Read-only mapping of ZERO_PAGE. */ + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + + /* + * If the cmpxchg fails then another fault may be + * generated that may then be successful + */ + + if (ptep_cmpxchg(page_table, orig_entry, entry)) + update_mmu_cache(vma, addr, entry); + else + inc_page_state(cmpxchg_fail_anon_read); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); - if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_MINOR; + } - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); + page_table_atomic_stop(mm); - if (!pte_none(*page_table)) { - pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; - } - update_mm_counter(mm, rss, 1); - acct_update_integrals(); - update_mem_hiwater(); - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - SetPageReferenced(page); - page_add_anon_rmap(page, vma, addr); + /* Allocate our own private page. 
*/ + if (unlikely(anon_vma_prepare(vma))) + return VM_FAULT_OOM; + + page = alloc_page_vma(GFP_HIGHUSER, vma, addr); + if (!page) + return VM_FAULT_OOM; + clear_user_highpage(page, addr); + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, + vma->vm_page_prot)), + vma); + + spin_lock(&mm->page_table_lock); + + if (!ptep_cmpxchg(page_table, orig_entry, entry)) { + pte_unmap(page_table); + page_cache_release(page); + spin_unlock(&mm->page_table_lock); + inc_page_state(cmpxchg_fail_anon_write); + return VM_FAULT_MINOR; } - set_pte(page_table, entry); - pte_unmap(page_table); + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. + */ + page_add_anon_rmap(page, vma, addr); + lru_cache_add_active(page); + update_mm_counter(mm, rss, 1); + acct_update_integrals(); + update_mem_hiwater(); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); + update_mmu_cache(vma, addr, entry); + pte_unmap(page_table); spin_unlock(&mm->page_table_lock); -out: + return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* @@ -1856,12 +1876,12 @@ no_mem: * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held and atomic pte operations started. 
*/ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1872,9 +1892,9 @@ do_no_page(struct mm_struct *mm, struct if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1982,7 +2002,7 @@ oom: * nonlinear vmas. */ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1995,13 +2015,13 @@ static int do_file_page(struct mm_struct if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } - pgoff = pte_to_pgoff(*pte); + pgoff = pte_to_pgoff(entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -2020,49 +2040,45 @@ static int do_file_page(struct mm_struct * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. 
- * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. - * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; entry = *pte; if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + new_entry = pte_mkyoung(entry); if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + + /* + * If the cmpxchg fails then we will get another fault which + * has another chance of successfully updating the page table entry. 
+ */ + if (ptep_cmpxchg(pte, entry, new_entry)) { + flush_tlb_page(vma, address); + update_mmu_cache(vma, address, entry); + } else + inc_page_state(cmpxchg_fail_flag_update); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); + page_table_atomic_stop(mm); + if (pte_val(new_entry) == pte_val(entry)) + inc_page_state(spurious_page_faults); return VM_FAULT_MINOR; } @@ -2081,33 +2097,73 @@ int handle_mm_fault(struct mm_struct *mm inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. + * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd. However, the arch may fall back + * in page_table_atomic_start to the page table lock. + * + * We may be able to avoid taking and releasing the page_table_lock + * for the p??_alloc functions through atomic operations so we + * duplicate the functionality of pmd_alloc, pud_alloc and + * pte_alloc_map here. 
*/ + page_table_atomic_start(mm); pgd = pgd_offset(mm, address); - spin_lock(&mm->page_table_lock); + if (unlikely(pgd_none(*pgd))) { + pud_t *new; + + page_table_atomic_stop(mm); + new = pud_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; + + page_table_atomic_start(mm); + if (!pgd_test_and_populate(mm, pgd, new)) + pud_free(new); + } + + pud = pud_offset(pgd, address); + if (unlikely(pud_none(*pud))) { + pmd_t *new; + + page_table_atomic_stop(mm); + new = pmd_alloc_one(mm, address); - pud = pud_alloc(mm, pgd, address); - if (!pud) - goto oom; - - pmd = pmd_alloc(mm, pud, address); - if (!pmd) - goto oom; - - pte = pte_alloc_map(mm, pmd, address); - if (!pte) - goto oom; + if (!new) + return VM_FAULT_OOM; - return handle_pte_fault(mm, vma, address, write_access, pte, pmd); + page_table_atomic_start(mm); + + if (!pud_test_and_populate(mm, pud, new)) + pmd_free(new); + } - oom: - spin_unlock(&mm->page_table_lock); - return VM_FAULT_OOM; + pmd = pmd_offset(pud, address); + if (unlikely(!pmd_present(*pmd))) { + struct page *new; + + page_table_atomic_stop(mm); + new = pte_alloc_one(mm, address); + + if (!new) + return VM_FAULT_OOM; + + page_table_atomic_start(mm); + + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else { + inc_page_state(nr_page_table_pages); + mm->nr_ptes++; + } + } + + pte = pte_offset_map(pmd, address); + return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } #ifndef __ARCH_HAS_4LEVEL_HACK Index: linux-2.6.10/include/asm-generic/pgtable-nopud.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopud.h 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopud.h 2005-01-27 16:28:54.000000000 -0800 @@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) { static inline int pgd_present(pgd_t pgd) { return 1; } static inline void pgd_clear(pgd_t *pgd) { } #define pud_ERROR(pud) (pgd_ERROR((pud).pgd)) - #define 
pgd_populate(mm, pgd, pud) do { } while (0) + +#define __HAVE_ARCH_PGD_TEST_AND_POPULATE +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) +{ + return 1; +} + /* * (puds are folded into pgds so this doesn't get actually called, * but the define is needed for a generic inline function.) Index: linux-2.6.10/include/asm-generic/pgtable-nopmd.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable-nopmd.h 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable-nopmd.h 2005-01-27 16:28:54.000000000 -0800 @@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud) #define pmd_ERROR(pmd) (pud_ERROR((pmd).pud)) #define pud_populate(mm, pmd, pte) do { } while (0) +#define __ARCH_HAVE_PUD_TEST_AND_POPULATE +static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) +{ + return 1; +} /* * (pmds are folded into puds so this doesn't get actually called, Index: linux-2.6.10/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-generic/pgtable.h 2005-01-27 16:27:40.000000000 -0800 +++ linux-2.6.10/include/asm-generic/pgtable.h 2005-01-27 16:30:35.000000000 -0800 @@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p #ifdef CONFIG_ATOMIC_TABLE_OPS /* - * The architecture does support atomic table operations. - * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using + * The architecture does support atomic table operations and + * all operations on page table entries must always be atomic. + * + * This means that the kernel will never encounter a partially updated + * page table entry. + * + * Since the architecture does support atomic table operations, we + * may provide generic atomic ptep_xchg and ptep_cmpxchg using * cmpxchg and xchg. 
*/ #ifndef __HAVE_ARCH_PTEP_XCHG @@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p }) #endif +/* + * page_table_atomic_start and page_table_atomic_stop may be used to + * define special measures that an arch needs to guarantee atomic + * operations outside of a spinlock. In the case that an arch does + * not support atomic page table operations we will fall back to the + * page table lock. + */ +#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START +#define page_table_atomic_start(mm) do { } while (0) +#endif + +#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START +#define page_table_atomic_stop(mm) do { } while (0) +#endif + +/* + * Fallback functions for atomic population of higher page table + * structures. These simply acquire the page_table_lock for + * synchronization. An architecture may override these generic + * functions to provide atomic populate functions to make these + * more effective. + */ + +#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pud) \ +({ \ + int __rc; \ + spin_lock(&mm->page_table_lock); \ + __rc = pgd_none(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pud); \ + spin_unlock(&mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE +#define pud_test_and_populate(__mm, __pud, __pmd) \ +({ \ + int __rc; \ + spin_lock(&mm->page_table_lock); \ + __rc = pud_none(*(__pud)); \ + if (__rc) pud_populate(__mm, __pud, __pmd); \ + spin_unlock(&mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&mm->page_table_lock); \ + __rc; \ +}) +#endif + #else /* @@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p * short time frame. 
This means that the page_table_lock must be held * to avoid a page fault that would install a new entry. */ + +/* Fall back to the page table lock to synchronize page table access */ +#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock) +#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock) + #ifndef __HAVE_ARCH_PTEP_XCHG #define ptep_xchg(__ptep, __pteval) \ ({ \ @@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p r; \ }) #endif + +/* + * Fallback functions for atomic population of higher page table + * structures. These rely on the page_table_lock being held. + */ +#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pud) \ +({ \ + int __rc; \ + __rc = pgd_none(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pud); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE +#define pud_test_and_populate(__mm, __pud, __pmd) \ +({ \ + int __rc; \ + __rc = pud_none(*(__pud)); \ + if (__rc) pud_populate(__mm, __pud, __pmd); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + __rc; \ +}) +#endif + #endif #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT Index: linux-2.6.10/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgtable.h 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgtable.h 2005-01-27 16:33:24.000000000 -0800 @@ -554,6 +554,8 @@ do { \ #define FIXADDR_USER_START GATE_ADDR #define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE) +#define __HAVE_ARCH_PUD_TEST_AND_POPULATE +#define __HAVE_ARCH_PMD_TEST_AND_POPULATE #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR @@ -561,7 +563,7 @@ do { \ #define 
__HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE -#include <asm-generic/pgtable.h> #include <asm-generic/pgtable-nopud.h> +#include <asm-generic/pgtable.h> #endif /* _ASM_IA64_PGTABLE_H */ Index: linux-2.6.10/include/linux/page-flags.h =================================================================== --- linux-2.6.10.orig/include/linux/page-flags.h 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/include/linux/page-flags.h 2005-01-27 16:28:54.000000000 -0800 @@ -131,6 +131,17 @@ struct page_state { unsigned long allocstall; /* direct reclaim calls */ unsigned long pgrotated; /* pages rotated to tail of the LRU */ + + /* Low level counters */ + unsigned long spurious_page_faults; /* Faults with no ops */ + unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */ + unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */ + unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */ + unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */ + + /* rss deltas for the current executing thread */ + long rss; + long anon_rss; }; extern void get_page_state(struct page_state *ret); Index: linux-2.6.10/fs/proc/proc_misc.c =================================================================== --- linux-2.6.10.orig/fs/proc/proc_misc.c 2005-01-27 14:47:19.000000000 -0800 +++ linux-2.6.10/fs/proc/proc_misc.c 2005-01-27 16:28:54.000000000 -0800 @@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page, unsigned long allowed; struct vmalloc_info vmi; - get_page_state(&ps); + get_full_page_state(&ps); get_zone_counts(&active, &inactive, &free); /* @@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page, "PageTables: %8lu kB\n" "VmallocTotal: %8lu kB\n" "VmallocUsed: %8lu kB\n" - "VmallocChunk: %8lu kB\n", + "VmallocChunk: %8lu kB\n" + "Spurious page faults : %8lu\n" + "cmpxchg fail flag update: %8lu\n" + "cmpxchg fail COW reuse : %8lu\n" 
+ "cmpxchg fail anon read : %8lu\n" + "cmpxchg fail anon write : %8lu\n", K(i.totalram), K(i.freeram), K(i.bufferram), @@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page, K(ps.nr_page_table_pages), VMALLOC_TOTAL >> 10, vmi.used >> 10, - vmi.largest_chunk >> 10 + vmi.largest_chunk >> 10, + ps.spurious_page_faults, + ps.cmpxchg_fail_flag_update, + ps.cmpxchg_fail_flag_reuse, + ps.cmpxchg_fail_anon_read, + ps.cmpxchg_fail_anon_write ); len += hugetlb_report_meminfo(page + len); Index: linux-2.6.10/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-27 14:47:20.000000000 -0800 +++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-27 16:33:10.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PUD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_ pud_val(*pud_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE; +} + static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) { @@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* Atomic populate */ +static inline int +pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + + static inline void pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { ^ permalink raw reply [flat|nested] 286+ messages in thread
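The recurring pattern in the patch above — take a snapshot of the pte, compute a new value, and install it with ptep_cmpxchg, relying on a repeated fault to retry on failure — can be sketched with userspace C11 atomics. Everything below (the word-sized pte type, the bit positions, the function names) is invented for illustration; it models the idea, not the kernel implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A pte reduced to one word; bit positions are invented for the sketch. */
typedef _Atomic uint64_t pte_sketch_t;

#define PTE_ACCESSED	(UINT64_C(1) << 5)
#define PTE_DIRTY	(UINT64_C(1) << 6)

/* ptep_cmpxchg() analogue: install newval only if the entry still
 * holds oldval.  Failure means another CPU changed the pte first. */
static bool pte_cas(pte_sketch_t *ptep, uint64_t oldval, uint64_t newval)
{
	return atomic_compare_exchange_strong(ptep, &oldval, newval);
}

/* The lockless flag-update path of handle_pte_fault(): compute the new
 * entry from a snapshot, then try to swap it in.  If the CAS fails we
 * simply return; the access faults again and retries with a fresh
 * snapshot (the kernel counts this in cmpxchg_fail_flag_update). */
static bool mark_young(pte_sketch_t *ptep, bool write_access)
{
	uint64_t entry = atomic_load(ptep);
	uint64_t new_entry = entry | PTE_ACCESSED;

	if (write_access)
		new_entry |= PTE_DIRTY;
	return pte_cas(ptep, entry, new_entry);
}
```

The key property is that a failed CAS is harmless: the worst case is an extra fault, which is why no lock is needed around the update.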
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-01-28 20:37 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter @ 2005-02-01 4:08 ` Nick Piggin 2005-02-01 18:47 ` Christoph Lameter 2005-02-01 19:01 ` Christoph Lameter 2005-02-01 4:16 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Nick Piggin 1 sibling, 2 replies; 286+ messages in thread From: Nick Piggin @ 2005-02-01 4:08 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Christoph Lameter wrote: Slightly OT: are you still planning to move the update_mem_hiwater and friends crud out of these fastpaths? It looks like at least that function is unsafe to be lockless. > @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct * > flush_cache_page(vma, address); > entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), > vma); > - ptep_set_access_flags(vma, address, page_table, entry, 1); > - update_mmu_cache(vma, address, entry); > + /* > + * If the bits are not updated then another fault > + * will be generated with another chance of updating. > + */ > + if (ptep_cmpxchg(page_table, pte, entry)) > + update_mmu_cache(vma, address, entry); > + else > + inc_page_state(cmpxchg_fail_flag_reuse); > pte_unmap(page_table); > - spin_unlock(&mm->page_table_lock); > + page_table_atomic_stop(mm); > return VM_FAULT_MINOR; > } > } > pte_unmap(page_table); > + page_table_atomic_stop(mm); > > /* > * Ok, we need to copy. Oh, well.. > */ > if (!PageReserved(old_page)) > page_cache_get(old_page); > - spin_unlock(&mm->page_table_lock); > I don't think you can do this unless you have done something funky that I missed. And that kind of shoots down your lockless COW too, although it looks like you can safely have the second part of do_wp_page without the lock. 
Basically - your lockless COW patch itself seems like it should be OK, but this hunk does not. I would be very interested if you are seeing performance gains with your lockless COW patches, BTW. Basically, getting a reference on a struct page was the only thing I found I wasn't able to do lockless with pte cmpxchg. Because it can race with unmapping in rmap.c and reclaim and reuse, which probably isn't too good. That means: the only operations you are able to do lockless is when there is no backing page (ie. the anonymous unpopulated->populated case). A per-pte lock is sufficient for this case, of course, which is why the pte-locked system is completely free of the page table lock. Although I may have some fact fundamentally wrong? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-01 4:08 ` Nick Piggin @ 2005-02-01 18:47 ` Christoph Lameter 2005-02-01 19:01 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-02-01 18:47 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 1 Feb 2005, Nick Piggin wrote: > Slightly OT: are you still planning to move the update_mem_hiwater and > friends crud out of these fastpaths? It looks like at least that function > is unsafe to be lockless. Yes. I have a patch pending and the author of the CSA patches is a coworker of mine. The patch will be resubmitted once certain aspects of the timer subsystem are stabilized and/or when he gets back from his vacation. The statistics are not critical to system operation. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-01 4:08 ` Nick Piggin 2005-02-01 18:47 ` Christoph Lameter @ 2005-02-01 19:01 ` Christoph Lameter 2005-02-02 0:31 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-02-01 19:01 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 1 Feb 2005, Nick Piggin wrote: > > pte_unmap(page_table); > > + page_table_atomic_stop(mm); > > > > /* > > * Ok, we need to copy. Oh, well.. > > */ > > if (!PageReserved(old_page)) > > page_cache_get(old_page); > > - spin_unlock(&mm->page_table_lock); > > > > I don't think you can do this unless you have done something funky that I > missed. And that kind of shoots down your lockless COW too, although it > looks like you can safely have the second part of do_wp_page without the > lock. Basically - your lockless COW patch itself seems like it should be > OK, but this hunk does not. See my comment at the end of this message. > I would be very interested if you are seeing performance gains with your > lockless COW patches, BTW. So far I have not had time to focus on benchmarking that. > Basically, getting a reference on a struct page was the only thing I found > I wasn't able to do lockless with pte cmpxchg. Because it can race with > unmapping in rmap.c and reclaim and reuse, which probably isn't too good. > That means: the only operations you are able to do lockless is when there > is no backing page (ie. the anonymous unpopulated->populated case). > > A per-pte lock is sufficient for this case, of course, which is why the > pte-locked system is completely free of the page table lock. Introducing pte locking would allow us to go further with parallelizing this but it's another invasive procedure. I think parallelizing COW is only possible to do reliably with some pte locking scheme. 
But then the question is whether pte locking is really faster than obtaining a spinlock. I suspect this may not be the case. > Although I may have some fact fundamentally wrong? The unmapping in rmap.c would change the pte. This would be discovered after acquiring the spinlock later in do_wp_page, which would then lead to the operation being abandoned. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-01 19:01 ` Christoph Lameter @ 2005-02-02 0:31 ` Nick Piggin 2005-02-02 1:20 ` Christoph Lameter 2005-02-17 0:57 ` page fault scalability patchsets update: prezeroing, prefaulting and atomic operations Christoph Lameter 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2005-02-02 0:31 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 2005-02-01 at 11:01 -0800, Christoph Lameter wrote: > On Tue, 1 Feb 2005, Nick Piggin wrote: > > A per-pte lock is sufficient for this case, of course, which is why the > > pte-locked system is completely free of the page table lock. > > Introducing pte locking would allow us to go further with parallelizing > this but its another invasive procedure. I think parallelizing COW is only > possible to do reliable with some pte locking scheme. But then the > question is if the pte locking is really faster than obtaining a spinlock. > I suspect this may not be the case. > Well most likely not although I haven't been able to detect much difference. But in your case you would probably be happy to live with that if it meant better parallelising of an important function... but we'll leave future discussion to another thread ;) > > Although I may have some fact fundamentally wrong? > > The unmapping in rmap.c would change the pte. This would be discovered > after acquiring the spinlock later in do_wp_page. Which would then lead to > the operation being abandoned. > Oh yes, but suppose your page_cache_get is happening at the same time as free_pages_check, after the page gets freed by the scanner? I can't actually think of anything that would cause a real problem (ie. not a debug check), off the top of my head. But can you say there _isn't_ anything? Regardless, it seems pretty dirty to me. But could possibly be made workable, of course. 
^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-02 0:31 ` Nick Piggin @ 2005-02-02 1:20 ` Christoph Lameter 2005-02-02 1:41 ` Nick Piggin 2005-02-17 0:57 ` page fault scalability patchsets update: prezeroing, prefaulting and atomic operations Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-02-02 1:20 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 2 Feb 2005, Nick Piggin wrote: > > The unmapping in rmap.c would change the pte. This would be discovered > > after acquiring the spinlock later in do_wp_page. Which would then lead to > > the operation being abandoned. > Oh yes, but suppose your page_cache_get is happening at the same time > as free_pages_check, after the page gets freed by the scanner? I can't > actually think of anything that would cause a real problem (ie. not a > debug check), off the top of my head. But can you say there _isn't_ > anything? > > Regardless, it seems pretty dirty to me. But could possibly be made > workable, of course. Would it make you feel better if we would move the spin_unlock back to the prior position? This would ensure that the fallback case is exactly the same. Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-31 08:59:07.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-02-01 10:55:30.000000000 -0800 @@ -1318,7 +1318,6 @@ static int do_wp_page(struct mm_struct * } } pte_unmap(page_table); - page_table_atomic_stop(mm); /* * Ok, we need to copy. Oh, well.. @@ -1326,6 +1325,7 @@ static int do_wp_page(struct mm_struct * if (!PageReserved(old_page)) page_cache_get(old_page); + page_table_atomic_stop(mm); if (unlikely(anon_vma_prepare(vma))) goto no_new_page; if (old_page == ZERO_PAGE(address)) { ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-02 1:20 ` Christoph Lameter @ 2005-02-02 1:41 ` Nick Piggin 2005-02-02 2:49 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-02-02 1:41 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 2005-02-01 at 17:20 -0800, Christoph Lameter wrote: > On Wed, 2 Feb 2005, Nick Piggin wrote: > > > > The unmapping in rmap.c would change the pte. This would be discovered > > > after acquiring the spinlock later in do_wp_page. Which would then lead to > > > the operation being abandoned. > > Oh yes, but suppose your page_cache_get is happening at the same time > > as free_pages_check, after the page gets freed by the scanner? I can't > > actually think of anything that would cause a real problem (ie. not a > > debug check), off the top of my head. But can you say there _isn't_ > > anything? > > > > Regardless, it seems pretty dirty to me. But could possibly be made > > workable, of course. > > Would it make you feel better if we would move the spin_unlock back to the > prior position? This would ensure that the fallback case is exactly the > same. > Well yeah, but the interesting case is when that isn't a lock ;) I'm not saying what you've got is no good. I'm sure it would be fine for testing. And if it happens that we can do the "page_count doesn't mean anything after it has reached zero and been freed. Nor will it necessarily be zero when a new page is allocated" thing without many problems, then this may be a fine way to do it. I was just pointing out this could be a problem without putting a lot of thought into it... Find local movie times and trailers on Yahoo! Movies. http://au.movies.yahoo.com ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-02 1:41 ` Nick Piggin @ 2005-02-02 2:49 ` Christoph Lameter 2005-02-02 3:09 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-02-02 2:49 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 2 Feb 2005, Nick Piggin wrote: > Well yeah, but the interesting case is when that isn't a lock ;) > > I'm not saying what you've got is no good. I'm sure it would be fine > for testing. And if it happens that we can do the "page_count doesn't > mean anything after it has reached zero and been freed. Nor will it > necessarily be zero when a new page is allocated" thing without many > problems, then this may be a fine way to do it. > > I was just pointing out this could be a problem without putting a > lot of thought into it... Surely we need to do this the right way. Do we really need to use page_cache_get()? Is anything relying on page_count == 2 of the old_page? I mean we could just speculatively copy, risk copying crap and discard that later when we find that the pte has changed. This would simplify the function: Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-02-01 18:10:46.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-02-01 18:43:08.000000000 -0800 @@ -1323,9 +1323,6 @@ static int do_wp_page(struct mm_struct * /* * Ok, we need to copy. Oh, well.. 
*/ - if (!PageReserved(old_page)) - page_cache_get(old_page); - if (unlikely(anon_vma_prepare(vma))) goto no_new_page; if (old_page == ZERO_PAGE(address)) { @@ -1336,6 +1333,10 @@ static int do_wp_page(struct mm_struct * new_page = alloc_page_vma(GFP_HIGHUSER, vma, address); if (!new_page) goto no_new_page; + /* + * No page_cache_get so we may copy some crap + * that is later discarded if the pte has changed + */ copy_user_highpage(new_page, old_page, address); } /* @@ -1352,7 +1353,6 @@ static int do_wp_page(struct mm_struct * acct_update_integrals(); update_mem_hiwater(); } else - page_remove_rmap(old_page); break_cow(vma, new_page, address, page_table); lru_cache_add_active(new_page); @@ -1363,7 +1363,6 @@ static int do_wp_page(struct mm_struct * } pte_unmap(page_table); page_cache_release(new_page); - page_cache_release(old_page); spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-02 2:49 ` Christoph Lameter @ 2005-02-02 3:09 ` Nick Piggin 2005-02-04 6:27 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-02-02 3:09 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote: > On Wed, 2 Feb 2005, Nick Piggin wrote: > > > Well yeah, but the interesting case is when that isn't a lock ;) > > > > I'm not saying what you've got is no good. I'm sure it would be fine > > for testing. And if it happens that we can do the "page_count doesn't > > mean anything after it has reached zero and been freed. Nor will it > > necessarily be zero when a new page is allocated" thing without many > > problems, then this may be a fine way to do it. > > > > I was just pointing out this could be a problem without putting a > > lot of thought into it... > > Surely we need to do this the right way. Do we really need to > use page_cache_get()? Is anything relying on page_count == 2 of > the old_page? > > I mean we could just speculatively copy, risk copying crap and > discard that later when we find that the pte has changed. This would > simplify the function: > I think this may be the better approach. Anyone else? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-02 3:09 ` Nick Piggin @ 2005-02-04 6:27 ` Nick Piggin 0 siblings, 0 replies; 286+ messages in thread From: Nick Piggin @ 2005-02-04 6:27 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Wed, 2005-02-02 at 14:09 +1100, Nick Piggin wrote: > On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote: > > On Wed, 2 Feb 2005, Nick Piggin wrote: > > I mean we could just speculatively copy, risk copying crap and > > discard that later when we find that the pte has changed. This would > > simplify the function: > > > > I think this may be the better approach. Anyone else? > Not to say it is perfect either. Normal semantics say not to touch a page if it is not somehow pinned. So this may cause problems in corner cases (DEBUG_PAGEALLOC comes to mind... hopefully nothing else). But I think a plain read of the page when it isn't pinned is less yucky than writing into the non-pinned struct page. ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patchsets update: prezeroing, prefaulting and atomic operations 2005-02-02 0:31 ` Nick Piggin 2005-02-02 1:20 ` Christoph Lameter @ 2005-02-17 0:57 ` Christoph Lameter 2005-02-24 6:04 ` A Proposal for an MMU abstraction layer Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2005-02-17 0:57 UTC (permalink / raw) To: torvalds, Andrew Morton Cc: Nick Piggin, Andi Kleen, hugh, linux-mm, linux-ia64, linux-kernel, benh I thought I'd save myself the constant crossposting of large amounts of patches. Patches, documentation, test results etc. are available at http://oss.sgi.com/projects/page_fault_performance/ Changes: - Performance tests for all patchsets (i386 single processor, Altix 8 processors and Altix 128 processors) - Archives of patches so far - Some docs (still needs work) - Patches against 2.6.11-rc4-bk4 Patch specific: atomic operations for page faults (V17) - Avoid incrementing page count for page in do_wp_page (see discussion with Nick Piggin on last patchset) prezeroing (V7) - set /proc/sys/vm/scrub_load to 1 by default to avoid slight performance loss during kernel compile on i386 - Scrubd needs to be configured in kernel configuration as an experimental feature. - Patch still follows kswapd's method to bind node specific scrubd daemons to each NUMA node. Cannot find any new infrastructure to assign tasks to certain nodes. kthread_bind() binds to a single cpu and not to a NUMA node. Guess other API work would have to be done first to realize Andrew's proposed approach. prefaulting (V5) - Set default for /proc/sys/vm/max_prealloc_order to 1 to avoid overallocating pages which led to a performance loss in some situations. This is a pretty complex thing to manage so please tell me if I missed anything ... ^ permalink raw reply [flat|nested] 286+ messages in thread
* A Proposal for an MMU abstraction layer 2005-02-17 0:57 ` page fault scalability patchsets update: prezeroing, prefaulting and atomic operations Christoph Lameter @ 2005-02-24 6:04 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-02-24 6:04 UTC (permalink / raw) To: torvalds, Andrew Morton Cc: Nick Piggin, Andi Kleen, hugh, linux-mm, linux-ia64, linux-kernel, benh 1. Rationale ============ Currently the Linux kernel implements a hierarchical page table utilizing 4 layers. Architectures that have fewer layers may cause the kernel to not generate code for certain layers. However, there are other means for an mmu to describe page tables to the system. For example, the Itanium (and other CPUs) supports hashed page table structures or linear page tables. IA64 has to simulate the hierarchical layers through its linear page tables and implements the higher layers in software. Moreover, different architectures have different means of implementing huge page table entries. On IA32 this is realized by omitting the lower layer entries and providing a single PMD entry replacing 512/1024 PTE entries. On IA64 a PTE entry is used for that purpose. Other architectures realize huge page table entries through groups of PTE entries. There are hooks for each of these methods in the kernel. Moreover, huge pages are not handled like other pages; they are managed through a file system, and only one size of huge pages is supported. It would be much better if huge pages were handled more like regular pages, and also to have support for multiple page sizes (which may then lead to support for variable page sizes in the VM). It would be best to hide these implementation differences in an mmu abstraction layer. Various architectures could then implement their own way of representing page table entries. We would provide legacy 4 layer, 3 layer and 2 layer implementations that would take care of the existing implementations. 
These generic implementations can then be taken by an architecture and amended to provide the huge page table entries in a way fitting for that architecture. For IA64 and other platforms that allow alternate ways of maintaining translations, we could avoid maintaining a hierarchical table. There are a couple of additional features for page tables that could then also be worked into that abstraction layer: A. Global translation entries. B. Variable page size. C. Use a transactional scheme to allow a variety of synchronization schemes. Early idea for an mmu abstraction layer API =========================================== Three new opaque types: mmu_entry_t mmu_translation_set_t mmu_transaction_t *mmu_entry_t* replaces the existing pte_t and has roughly the same features. However, mmu_entry_t describes a translation of a logical address to a physical address in general. This means that the mmu_entry_t must be able to represent all possible mappings including mappings for huge pages and pages of various sizes if these features are supported by the method of handling page tables. If statistics need to be kept about entries then this entry will also contain a number to indicate what counter to update when inserting or deleting this type of entry [spare bits may be used for this purpose]. *mmu_translation_set_t* represents a virtual address space for a process and is essentially a set of mmu_entry_t's plus additional management information that may be necessary to manage an address space. *mmu_transaction_t* allows transactions to be performed on translation entries and maintains the state of a transaction. The state information allows changes to be undone or committed in a way that must appear to be atomic to any other access in the system. 
Operations on mmu_translation_set_t ----------------------------------- void mmu_new_translation_set(struct mmu_translation_set_t *t); Generates an empty translation set void mmu_dup_translation_set(struct mmu_translation_set_t *dst, struct mmu_translation_set_t *src); Generates a duplicate of a translation set void mmu_remove_translation_set(struct mmu_translation_set_t *t); Removes a translation set void mmu_clear_range(struct mmu_translation_set_t *t, unsigned long start, unsigned long end); Wipe out a range of addresses in the translation set void mmu_copy_range(struct mmu_translation_set_t *dest, struct mmu_translation_set_t *src, unsigned long dest_start, unsigned long src_start, unsigned long length); These functions are not implemented for the period in which old and new schemes are coexisting since this would require a major change to mm_struct. Transactional operations ------------------------ void mmu_transaction(struct mmu_transaction_t *ta, struct mmu_translation_set_t *tr); Begin a transaction For the coexistence period this is implemented as mmu_transaction(struct mmu_transaction_t *ta, struct mm_struct *mm, struct vm_area_struct *vma); void mmu_commit(struct mmu_transaction_t *ta); Commit changes done void mmu_forget(struct mmu_transaction_t *ta); Undo changes made struct mmu_entry_t mmu_find(struct mmu_transaction_t *ta, unsigned long address); Find mmu entry and make this the current entry void mmu_update(struct mmu_transaction_t *ta, mmu_entry_t entry); Update the current entry void mmu_add(struct mmu_transaction_t *ta, mmu_entry_t entry, unsigned long address); Add a new translation entry void mmu_remove(struct mmu_transaction_t *ta); Remove current translation entry Operations on mmu_entry_t ------------------------- The same as for pte_t now. Additional: struct mmu_entry_t mkglobal(struct mmu_entry_t entry) Define an entry to be global (valid for all translation sets) struct mmu_entry_t mksize(struct mmu_entry_t entry, unsigned order) Set the page size in an entry to order. 
struct mmu_entry_t mkcount(struct mmu_entry_t entry, unsigned long counter) Adding and removing this entry must lead to an update of the specified counter. Not for coexistence period. Statistics ---------- void mmu_stats(struct mmu_translation_set_t *t, unsigned long *entries, unsigned long *size_in_pages, unsigned long *counters[]); Not for coexistence period. Scanning through mmu entries ---------------------------- void mmu_scan(struct mmu_translation_set_t *t, unsigned long start, unsigned long end, mmu_entry_t (*func)(mmu_entry_t entry, void *private), void *private); ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-01-28 20:37 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter 2005-02-01 4:08 ` Nick Piggin @ 2005-02-01 4:16 ` Nick Piggin 2005-02-01 8:20 ` Kernel 2.4.21 hangs up baswaraj kasture 2005-02-01 18:44 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter 1 sibling, 2 replies; 286+ messages in thread From: Nick Piggin @ 2005-02-01 4:16 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Christoph Lameter wrote: > The page fault handler attempts to use the page_table_lock only for short > time periods. It repeatedly drops and reacquires the lock. When the lock > is reacquired, checks are made if the underlying pte has changed before > replacing the pte value. These locations are a good fit for the use of > ptep_cmpxchg. > > The following patch allows to remove the first time the page_table_lock is > acquired and uses atomic operations on the page table instead. A section > using atomic pte operations is begun with > > page_table_atomic_start(struct mm_struct *) > > and ends with > > page_table_atomic_stop(struct mm_struct *) > Hmm, this is moving toward the direction my patches take. I think it may be the right way to go if you're lifting the ptl from some core things, because some architectures won't want to audit and stuff, and some may need the lock. Naturally I prefer the complete replacement that is made with my patch - however this of course means one has to move *everything* over to be pte_cmpxchg safe, which runs against your goal of getting the low hanging fruit with as little fuss as possible for the moment. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Kernel 2.4.21 hangs up 2005-02-01 4:16 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Nick Piggin @ 2005-02-01 8:20 ` baswaraj kasture 2005-02-01 8:35 ` Arjan van de Ven ` (2 more replies) 2005-02-01 18:44 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter 1 sibling, 3 replies; 286+ messages in thread From: baswaraj kasture @ 2005-02-01 8:20 UTC (permalink / raw) To: Nick Piggin, Christoph Lameter Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Hi, I compiled kernel 2.4.21 with the Intel compiler. While booting it hangs up. Further, I found that it hangs up due to the call to the "calibrate_delay" routine in "init/main.c". I also found that the loop in the "calibrate_delay" routine goes infinite. When I comment out the call to the "calibrate_delay" routine, it works fine. Even compiling "init/main.c" with "-O0" works fine. I am using IA-64 (Intel Itanium 2) with EL3.0. Any pointers will be a great help. Thanks, -Baswaraj __________________________________ Do you Yahoo!? Yahoo! Mail - 250MB free storage. Do more. Manage less. http://info.mail.yahoo.com/mail_250 ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Kernel 2.4.21 hangs up 2005-02-01 8:20 ` Kernel 2.4.21 hangs up baswaraj kasture @ 2005-02-01 8:35 ` Arjan van de Ven 2005-02-01 9:03 ` Christian Hildner 2005-02-01 17:46 ` Kernel 2.4.21 hangs up David Mosberger 2 siblings, 0 replies; 286+ messages in thread From: Arjan van de Ven @ 2005-02-01 8:35 UTC (permalink / raw) To: baswaraj kasture; +Cc: inux-ia64, linux-kernel [-- Attachment #1: Type: text/plain, Size: 523 bytes --] On Tue, 2005-02-01 at 00:20 -0800, baswaraj kasture wrote: > Hi, > > I compiled kernel 2.4.21 with intel compiler . 2.4.21 isn't supposed to be compilable with the intel compiler... > fine. I am using IA-64 (Intel Itanium 2 ) with EL3.0. ... and the RHEL3 kernel most certainly isn't. I strongly suggest that you stick to gcc for compiling the RHEL3 kernel. Also sticking half the world on the CC is considered rude if those people have nothing to do with the subject at hand, as is the case here. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Kernel 2.4.21 hangs up 2005-02-01 8:20 ` Kernel 2.4.21 hangs up baswaraj kasture 2005-02-01 8:35 ` Arjan van de Ven @ 2005-02-01 9:03 ` Christian Hildner 2005-02-07 6:14 ` Kernel 2.4.21 gives kernel panic at boot time baswaraj kasture 2005-02-01 17:46 ` Kernel 2.4.21 hangs up David Mosberger 2 siblings, 1 reply; 286+ messages in thread From: Christian Hildner @ 2005-02-01 9:03 UTC (permalink / raw) To: baswaraj kasture Cc: Nick Piggin, Christoph Lameter, Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh baswaraj kasture schrieb: >Hi, > >I compiled kernel 2.4.21 with intel compiler . >While booting it hangs-up . further i found that it >hangsup due to call to "calibrate_delay" routine in >"init/main.c". Also found that loop in the >callibrate_delay" routine goes infinite.When i comment >out the call to "callibrate_delay" routine, it works >fine.Even compiling "init/main.c" with "-O0" works >fine. I am using IA-64 (Intel Itanium 2 ) with EL3.0. > >Any pointers will be great help. > - Download ski from http://www.hpl.hp.com/research/linux/ski/download.php - Compile your kernel for the simulator - set simulator breakpoint at calibrate_delay - look at ar.itc and cr.itm (cr.itm must be greater than ar.itc) Or for debugging on hardware: -run into loop, press the TOC button, reboot and analyze the dump with efi shell + errdump init Christian ^ permalink raw reply [flat|nested] 286+ messages in thread
* Kernel 2.4.21 gives kernel panic at boot time 2005-02-01 9:03 ` Christian Hildner @ 2005-02-07 6:14 ` baswaraj kasture 0 siblings, 0 replies; 286+ messages in thread From: baswaraj kasture @ 2005-02-07 6:14 UTC (permalink / raw) To: Christian Hildner Cc: Nick Piggin, Christoph Lameter, Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh Hi, I have compiled kernel 2.4.21. Compilation went well, but I got the following message at boot time. ============================================= . . /lib/mptscsih.o : unresolved symbol mpt_deregister_Rsmp_6fb5ab71 /lib/mptscsih.o : Unresolved symbol mpt_event_register_Rsmp_34ace96b ERROR : /bin/insmod exited abnormally Mounting /proc filesystem Creating block devices VFS : cannot open root device "LABEL=/" or 00:00 Please append a correct "root=" boot option Kernel panic : VFS : Unable to mount root fs on 00:00 =========================================== I have the following lines in my elilo.conf: ------------------------------ #original kernel image=vmlinuz-2.4.21-9.EL label=linux initrd=initrd-2.4.21-9.EL.img read-only append="root=LABEL=/" #icc-O2 image=iccvmlinux label=icc_O2 initrd=iccinitrd-preBasicc.img read-only append="root=LABEL=/" --------------------------- The first one works fine. Any clues why I am getting this error? Is it related to the SCSI driver? Further, "/sbin/mkinitrd -f -v" gave the following message: ====================================== . . . 
Looking for deps of module scsi_mod Looking for deps of module sd_mod Looking for deps of module unknown Looking for deps of module mptbase Looking for deps of module mptscsih mptbase Looking for deps of module mptbase Looking for deps of module ide-disk Looking for deps of module ext3 Using modules: ./kernel/drivers/message/fusion/mptbase.o ./kernel/drivers/message/fusion/mptscsih.o Using loopback device /dev/loop0 /sbin/nash -> /tmp/initrd.EsIvQ9/bin/nash /sbin/insmod.static -> /tmp/initrd.EsIvQ9/bin/insmod `/lib/modules/2.4.21preBasicc/./kernel/drivers/message/fusion/mptbase.o' -> `/tmp/initrd.EsIvQ9/lib/mptbase.o' `/lib/modules/2.4.21preBasicc/./kernel/drivers/message/fusion/mptscsih.o' -> `/tmp/initrd.EsIvQ9/lib/mptscsih.o' Loading module mptbase Loading module mptscsih ======================================= Any clues will be a great help? Thanx, Baswaraj ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Kernel 2.4.21 hangs up 2005-02-01 8:20 ` Kernel 2.4.21 hangs up baswaraj kasture 2005-02-01 8:35 ` Arjan van de Ven 2005-02-01 9:03 ` Christian Hildner @ 2005-02-01 17:46 ` David Mosberger 2005-02-01 17:54 ` Markus Trippelsdorf 2 siblings, 1 reply; 286+ messages in thread From: David Mosberger @ 2005-02-01 17:46 UTC (permalink / raw) To: baswaraj kasture; +Cc: linux-ia64, linux-kernel [I trimmed the cc-list...] >>>>> On Tue, 1 Feb 2005 00:20:01 -0800 (PST), baswaraj kasture <kbaswaraj@yahoo.com> said: Baswaraj> Hi, I compiled kernel 2.4.21 with intel compiler . That's curious. Last time I checked, the changes needed to use the Intel-compiler have not been backported to 2.4. What kernel sources are you working off of? Also, even with 2.6 you need a script from Intel which does some "magic" GCC->ICC option translations to build the kernel with the Intel compiler. AFAIK, this script has not been released by Intel (hint, hint...). Baswaraj> While booting it hangs-up . further i found that it Baswaraj> hangsup due to call to "calibrate_delay" routine in Baswaraj> "init/main.c". Also found that loop in the Baswaraj> callibrate_delay" routine goes infinite. I suspect your kernel was just miscompiled. We have used the Intel-compiler internally on a 2.6 kernel and it worked fine at the time, though I haven't tried recently. --david ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Kernel 2.4.21 hangs up 2005-02-01 17:46 ` Kernel 2.4.21 hangs up David Mosberger @ 2005-02-01 17:54 ` Markus Trippelsdorf 2005-02-01 18:08 ` David Mosberger 0 siblings, 1 reply; 286+ messages in thread From: Markus Trippelsdorf @ 2005-02-01 17:54 UTC (permalink / raw) To: davidm; +Cc: baswaraj kasture, linux-ia64, linux-kernel On Tue, 2005-02-01 at 09:46 -0800, David Mosberger wrote: > Also, even with 2.6 you need a script from Intel which does some > "magic" GCC->ICC option translations to build the kernel with the > Intel compiler. AFAIK, this script has not been released by Intel > (hint, hint...). > They posted it to the LKML some time ago (2004-03-12). (message): http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497 (script): http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497&q=p3 __ Markus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Kernel 2.4.21 hangs up 2005-02-01 17:54 ` Markus Trippelsdorf @ 2005-02-01 18:08 ` David Mosberger 0 siblings, 0 replies; 286+ messages in thread From: David Mosberger @ 2005-02-01 18:08 UTC (permalink / raw) To: Markus Trippelsdorf; +Cc: davidm, baswaraj kasture, linux-ia64, linux-kernel >>>>> On Tue, 01 Feb 2005 18:54:59 +0100, Markus Trippelsdorf <markus@trippelsdorf.de> said: Markus> They posted it to the LKML so time ago Markus> (2004-03-12). (message): Markus> http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497 Markus> (script): Markus> http://marc.theaimsgroup.com/?l=linux-kernel&m=107913092300497&q=p3 That script is for the x86-version of icc only. It doesn't work for ia64, which is the context of this discussion. --david ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault 2005-02-01 4:16 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Nick Piggin 2005-02-01 8:20 ` Kernel 2.4.21 hangs up baswaraj kasture @ 2005-02-01 18:44 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-02-01 18:44 UTC (permalink / raw) To: Nick Piggin Cc: Andi Kleen, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh On Tue, 1 Feb 2005, Nick Piggin wrote: > Hmm, this is moving toward the direction my patches take. You are right. But I am still wary of the transactional idea in your patchset (which is really not comparable with a database transaction after all...). I think moving to cmpxchg and xchg operations will give this more transparency and make it easier to understand and handle. > Naturally I prefer the complete replacement that is made with > my patch - however this of course means one has to move > *everything* over to be pte_cmpxchg safe, which runs against > your goal of getting the low hanging fruit with as little fuss > as possible for the moment. I would also prefer a replacement but there are certain cost-benefit tradeoffs with atomic operations vs. spinlock that may better be tackled on a case by case basis. Also this is pretty much at the core of the Linux VM and thus highly sensitive. Given its history and the danger of breaking things it may be best to preserve it intact as much as possible and move in small steps. ^ permalink raw reply [flat|nested] 286+ messages in thread
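[Editor's note: the cmpxchg approach Christoph describes above — update a pte only if it still holds the value observed when the fault started, and retry on failure — can be illustrated with a small user-space C11 sketch. The names `pte_t`, `ptep_cmpxchg`, and the fault function mirror the patch but are stand-ins; this is not kernel code.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A pte modeled as a plain atomic word; 0 means "empty". */
typedef _Atomic uint64_t pte_t;

/* Succeeds only if *ptep still equals oldval; otherwise another
 * CPU raced us and the caller must back out and retry the fault. */
static bool ptep_cmpxchg(pte_t *ptep, uint64_t oldval, uint64_t newval)
{
	return atomic_compare_exchange_strong(ptep, &oldval, newval);
}

/* Populate an empty pte without taking any lock; returns false
 * when a concurrent fault populated the slot first. */
static bool fault_in_anonymous_page(pte_t *ptep, uint64_t new_entry)
{
	uint64_t orig = atomic_load(ptep);
	if (orig != 0)
		return false;	/* already populated by another thread */
	return ptep_cmpxchg(ptep, orig, new_entry);
}

/* Demo: first fault wins, a racing second fault loses. */
static int demo_fault(void)
{
	pte_t pte = 0;
	if (!fault_in_anonymous_page(&pte, 42))
		return 0;
	if (fault_in_anonymous_page(&pte, 99))
		return 0;
	return atomic_load(&pte) == 42;
}
```

The losing path corresponds to the `cmpxchg_fail_anon_write` case in the patch: the handler frees its freshly allocated page and returns, letting the winning pte stand.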
* page fault scalability patch V16 [4/4]: Drop page_table_lock in do_anonymous_page 2005-01-28 20:35 ` page fault scalability patch V16 [0/4]: redesign overview Christoph Lameter ` (2 preceding siblings ...) 2005-01-28 20:37 ` page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault Christoph Lameter @ 2005-01-28 20:38 ` Christoph Lameter 3 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-28 20:38 UTC (permalink / raw) To: Andi Kleen Cc: Nick Piggin, Andrew Morton, torvalds, hugh, linux-mm, linux-ia64, linux-kernel, benh

Do not use the page_table_lock in do_anonymous_page. This will significantly increase the parallelism in the page fault handler in SMP systems. The patch also modifies the definitions of _mm_counter functions so that rss and anon_rss become atomic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-27 16:39:24.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-27 16:39:24.000000000 -0800
@@ -1839,12 +1839,12 @@ do_anonymous_page(struct mm_struct *mm,
 					vma->vm_page_prot)), vma);

-	spin_lock(&mm->page_table_lock);
+	page_table_atomic_start(mm);
 	if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
 		pte_unmap(page_table);
 		page_cache_release(page);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
 		inc_page_state(cmpxchg_fail_anon_write);
 		return VM_FAULT_MINOR;
 	}
@@ -1862,7 +1862,7 @@ do_anonymous_page(struct mm_struct *mm,
 	update_mmu_cache(vma, addr, entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 	return VM_FAULT_MINOR;
 }
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h	2005-01-27 16:39:24.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-27 16:40:24.000000000 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+*/
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
 #define set_mm_counter(mm, member, value) (mm)->member = (value)
 #define get_mm_counter(mm, member) ((mm)->member)
 #define update_mm_counter(mm, member, value) (mm)->member += (value)
 #define MM_COUNTER_T unsigned long
+#endif

 struct mm_struct {
 	struct vm_area_struct * mmap;	/* list of VMAs */

^ permalink raw reply [flat|nested] 286+ messages in thread
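[Editor's note: the two `*_mm_counter` variants in the sched.h hunk above boil down to one idea — once faults run without the page_table_lock, rss updates from concurrent faults must be atomic read-modify-write operations. A user-space C11 sketch of the same pattern, with hypothetical names (`mm_atomic`, `demo_counter`) that do not appear in the patch:]

```c
#include <stdatomic.h>

/* Stand-in for an mm_struct whose MM_COUNTER_T is atomic_t. */
typedef struct { _Atomic long rss; } mm_atomic;

/* Corresponds to update_mm_counter() -> atomic_add() in the patch:
 * safe even when two faults increment rss at the same time. */
static void update_mm_counter(mm_atomic *mm, long delta)
{
	atomic_fetch_add(&mm->rss, delta);
}

/* Corresponds to get_mm_counter() -> atomic_read(). */
static long get_mm_counter(mm_atomic *mm)
{
	return atomic_load(&mm->rss);
}

/* Demo: two increments and one decrement leave the counter at 1. */
static long demo_counter(void)
{
	mm_atomic mm = { 0 };
	update_mm_counter(&mm, 1);
	update_mm_counter(&mm, 1);
	update_mm_counter(&mm, -1);
	return get_mm_counter(&mm);
}
```

With the lock held (the `#else` branch), a plain `unsigned long` with `+=` suffices, which is why the patch keeps both variants behind `CONFIG_ATOMIC_TABLE_OPS`.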
* Re: page table lock patch V15 [0/7]: overview 2005-01-12 23:50 ` Nick Piggin 2005-01-12 23:54 ` Christoph Lameter @ 2005-01-13 3:09 ` Hugh Dickins 2005-01-13 3:46 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2005-01-13 3:09 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, clameter, torvalds, ak, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Nick Piggin wrote: > Andrew Morton wrote: > > Note that this was with my ptl removal patches. I can't see why Christoph's > would have _any_ extra overhead as they are, but it looks to me like they're > lacking in atomic ops. So I'd expect something similar for Christoph's when > they're properly atomic. > > > Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we > > agree to take. But we need to fully understand all the costs and benefits. > > I think copy_page_range is the one to keep an eye on. Christoph's currently lack set_pte_atomics in the fault handlers, yes. But I don't see why they should need set_pte_atomics in copy_page_range (which is why I persuaded him to drop forcing set_pte to atomic). dup_mmap has down_write of the src mmap_sem, keeping out any faults on that. copy_pte_range has spin_lock of the dst page_table_lock and the src page_table_lock, keeping swapout away from those. Why would atomic set_ptes be needed there? Probably in yours, but not in Christoph's. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 3:09 ` page table lock patch V15 [0/7]: overview Hugh Dickins @ 2005-01-13 3:46 ` Nick Piggin 2005-01-13 17:14 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2005-01-13 3:46 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, clameter, torvalds, ak, linux-mm, linux-ia64, linux-kernel, benh Hugh Dickins wrote: > On Thu, 13 Jan 2005, Nick Piggin wrote: > >>Andrew Morton wrote: >> >>Note that this was with my ptl removal patches. I can't see why Christoph's >>would have _any_ extra overhead as they are, but it looks to me like they're >>lacking in atomic ops. So I'd expect something similar for Christoph's when >>they're properly atomic. >> >> >>>Look, -7% on a 2-way versus +700% on a many-way might well be a tradeoff we >>>agree to take. But we need to fully understand all the costs and benefits. >> >>I think copy_page_range is the one to keep an eye on. > > > Christoph's currently lack set_pte_atomics in the fault handlers, yes. > But I don't see why they should need set_pte_atomics in copy_page_range > (which is why I persuaded him to drop forcing set_pte to atomic). > > dup_mmap has down_write of the src mmap_sem, keeping out any faults on > that. copy_pte_range has spin_lock of the dst page_table_lock and the > src page_table_lock, keeping swapout away from those. Why would atomic > set_ptes be needed there? Probably in yours, but not in Christoph's. > I was more thinking of atomic pte reads there. I had for some reason thought that dup_mmap only had a down_read of the mmap_sem. But even if it did only down_read, a further look showed this wouldn't have been a problem for Christoph anyway. That dim light-bulb probably changes things for my patches too; I may be able to do copy_page_range with fewer atomics. I'm still not too sure that all places read the pte atomically where needed. 
But presently this is not a really big concern because it only would really slow down i386 PAE if anything. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page table lock patch V15 [0/7]: overview 2005-01-13 3:46 ` Nick Piggin @ 2005-01-13 17:14 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-13 17:14 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Andrew Morton, torvalds, ak, linux-mm, linux-ia64, linux-kernel, benh On Thu, 13 Jan 2005, Nick Piggin wrote: > I'm still not too sure that all places read the pte atomically where needed. > But presently this is not a really big concern because it only would > really slow down i386 PAE if anything. S/390 is also affected. And I vaguely recall special issues with sparc too. That is why I dropped the arch support for that a long time ago from the patchset. Then there was some talk a couple of months back to use another addressing mode on IA64 that may also require 128 bit ptes. There are significantly different ways of doing optimal SMP locking for these scenarios. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 19:38 ` page fault scalability patch V14 [5/7]: x86_64 " Christoph Lameter 2005-01-04 19:46 ` Andi Kleen @ 2005-01-04 21:21 ` Brian Gerst 2005-01-04 21:26 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Brian Gerst @ 2005-01-04 21:21 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > Changelog > * Provide atomic pte operations for x86_64 > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > > Index: linux-2.6.10/include/asm-x86_64/pgalloc.h > =================================================================== > --- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h 2005-01-03 10:31:31.000000000 -0800 > +++ linux-2.6.10/include/asm-x86_64/pgalloc.h 2005-01-03 12:21:28.000000000 -0800 > @@ -7,6 +7,10 @@ > #include <linux/threads.h> > #include <linux/mm.h> > > +#define PMD_NONE 0 > +#define PUD_NONE 0 > +#define PGD_NONE 0 > + > #define pmd_populate_kernel(mm, pmd, pte) \ > set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) > #define pud_populate(mm, pud, pmd) \ > @@ -14,11 +18,24 @@ > #define pgd_populate(mm, pgd, pud) \ > set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))) > > +#define pmd_test_and_populate(mm, pmd, pte) \ > + (cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE) > +#define pud_test_and_populate(mm, pud, pmd) \ > + (cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE) ^^^ Shouldn't this be pud? > +#define pgd_test_and_populate(mm, pgd, pud) \ > + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE) > + > + -- Brian Gerst ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V14 [5/7]: x86_64 atomic pte operations 2005-01-04 21:21 ` page fault scalability patch V14 [5/7]: x86_64 atomic pte operations Brian Gerst @ 2005-01-04 21:26 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 21:26 UTC (permalink / raw) To: Brian Gerst Cc: Linus Torvalds, Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel

On Tue, 4 Jan 2005, Brian Gerst wrote:

> > +#define pud_test_and_populate(mm, pud, pmd) \
> > +	(cmpxchg((int *)pgd, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
>                       ^^^
> Shouldn't this be pud?

Correct. Sigh. Could someone test this on x86_64?

Index: linux-2.6.10/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgalloc.h	2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgalloc.h	2005-01-04 12:31:14.000000000 -0800
@@ -7,6 +7,10 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PUD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 	set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pud_populate(mm, pud, pmd) \
@@ -14,11 +18,24 @@
 #define pgd_populate(mm, pgd, pud) \
 	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))

+#define pmd_test_and_populate(mm, pmd, pte) \
+	(cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | __pa(pte)) == PMD_NONE)
+#define pud_test_and_populate(mm, pud, pmd) \
+	(cmpxchg(pud, PUD_NONE, _PAGE_TABLE | __pa(pmd)) == PUD_NONE)
+#define pgd_test_and_populate(mm, pgd, pud) \
+	(cmpxchg(pgd, PGD_NONE, _PAGE_TABLE | __pa(pud)) == PGD_NONE)
+
+
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+	return cmpxchg(pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
 extern __inline__ pmd_t *get_pmd(void)
 {
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.10/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/pgtable.h	2005-01-03 15:02:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/pgtable.h	2005-01-04 12:29:25.000000000 -0800
@@ -413,6 +413,10 @@
 #define kc_offset_to_vaddr(o) \
 	(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR

^ permalink raw reply [flat|nested] 286+ messages in thread
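[Editor's note: all three `*_test_and_populate` macros in the patch above are instances of one pattern — install an entry into a page-table slot only if the slot is still empty, via compare-and-exchange. A user-space C11 sketch with illustrative names (`demo_populate` is not from the patch); incidentally, the copy-paste slip Brian caught (`pgd` where `pud` was meant) is exactly the kind of error this repeated one-liner invites:]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_NONE 0	/* empty slot, as in the patch */

/* A pmd slot modeled as an atomic word. */
typedef _Atomic uint64_t pmd_t;

/* Install `entry` only if the slot is still PMD_NONE; a racing
 * populate on another CPU makes this return false. */
static bool pmd_test_and_populate(pmd_t *pmd, uint64_t entry)
{
	uint64_t expected = PMD_NONE;
	return atomic_compare_exchange_strong(pmd, &expected, entry);
}

/* Demo: first populate succeeds, second sees a full slot and fails,
 * and the originally installed entry survives. */
static int demo_populate(void)
{
	pmd_t pmd = PMD_NONE;
	int first  = pmd_test_and_populate(&pmd, 0x1000);
	int second = pmd_test_and_populate(&pmd, 0x2000);
	return first && !second && atomic_load(&pmd) == 0x1000;
}
```

The loser of the race simply frees the page table it allocated and uses the winner's, which is why no lock is needed around the populate step.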
* page fault scalability patch V14 [6/7]: s390 atomic pte operationsw 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter ` (4 preceding siblings ...) 2005-01-04 19:38 ` page fault scalability patch V14 [5/7]: x86_64 " Christoph Lameter @ 2005-01-04 19:38 ` Christoph Lameter 2005-01-04 19:39 ` page fault scalability patch V14 [7/7]: Split RSS counters Christoph Lameter 6 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:38 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel

Changelog
 * Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgtable.h	2005-01-03 10:31:31.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgtable.h	2005-01-03 12:12:03.000000000 -0800
@@ -569,6 +569,15 @@
 	return pte;
 }

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)	\
+({								\
+	struct mm_struct *__mm = __vma->vm_mm;			\
+	pte_t __pte;						\
+	__pte = ptep_clear_flush(__vma, __address, __ptep);	\
+	set_pte(__ptep, __pteval);				\
+	__pte;							\
+})
+
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
 	pte_t old_pte = *ptep;
@@ -780,6 +789,14 @@
 #define kern_addr_valid(addr)   (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+	return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
 /*
  * No page table caches to initialise
  */
@@ -793,6 +810,7 @@
 #define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.10/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/pgalloc.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-s390/pgalloc.h	2005-01-03 12:12:03.000000000 -0800
@@ -97,6 +97,10 @@
 	pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
 }

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+	return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
 #endif /* __s390x__ */

 static inline void
@@ -119,6 +123,18 @@
 	pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
 }

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+	int rc;
+
+	spin_lock(&mm->page_table_lock);
+	rc = pte_same(*pmd, _PAGE_INVALID_EMPTY);
+	if (rc)
+		pmd_populate(mm, pmd, page);
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
+
 /*
  * page table entry allocation/free routines.
  */

^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V14 [7/7]: Split RSS counters 2005-01-04 19:35 ` page fault scalability patch V14 [0/7]: Overview Christoph Lameter ` (5 preceding siblings ...) 2005-01-04 19:38 ` page fault scalability patch V14 [6/7]: s390 atomic pte operationsw Christoph Lameter @ 2005-01-04 19:39 ` Christoph Lameter 6 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-04 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, akpm, Nick Piggin, linux-mm, linux-ia64, linux-kernel Changelog * Split rss counter into the task structure * remove 3 checks of rss in mm/rmap.c * increment current->rss instead of mm->rss in the page fault handler * move incrementing of anon_rss out of page_add_anon_rmap to group the increments more tightly and allow a better cache utilization Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/linux/sched.h =================================================================== --- linux-2.6.10.orig/include/linux/sched.h 2004-12-24 13:33:59.000000000 -0800 +++ linux-2.6.10/include/linux/sched.h 2005-01-03 12:21:32.000000000 -0800 @@ -30,6 +30,7 @@ #include <linux/pid.h> #include <linux/percpu.h> #include <linux/topology.h> +#include <linux/rcupdate.h> struct exec_domain; @@ -217,6 +218,7 @@ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + long rss, anon_rss; struct list_head mmlist; /* List of maybe swapped mm's. 
These are globally strung * together off init_mm.mmlist, and are protected @@ -226,7 +228,7 @@ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ @@ -236,6 +238,7 @@ /* Architecture-specific MM context */ mm_context_t context; + struct list_head task_list; /* Tasks using this mm */ /* Token based thrashing protection. */ unsigned long swap_token_time; @@ -545,6 +548,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Split counters from mm */ + long rss; + long anon_rss; /* task state */ struct linux_binfmt *binfmt; @@ -578,6 +584,10 @@ struct completion *vfork_done; /* for vfork() */ int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ + + /* List of other tasks using the same mm */ + struct list_head mm_tasks; + struct rcu_head rcu_head; /* For freeing the task via rcu */ unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; @@ -1124,6 +1134,12 @@ #endif +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss); + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk); +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif + Index: linux-2.6.10/fs/proc/task_mmu.c =================================================================== --- linux-2.6.10.orig/fs/proc/task_mmu.c 2004-12-24 13:34:01.000000000 -0800 +++ linux-2.6.10/fs/proc/task_mmu.c 2005-01-03 12:21:32.000000000 -0800 @@ -6,8 +6,9 @@ char *task_mem(struct mm_struct *mm, char *buffer) { - unsigned long data, text, lib; + unsigned long data, text, lib, rss, anon_rss; + get_rss(mm, &rss, 
&anon_rss); data = mm->total_vm - mm->shared_vm - mm->stack_vm; text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; @@ -22,7 +23,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << (PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + rss << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -37,11 +38,14 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + unsigned long rss, anon_rss; + + get_rss(mm, &rss, &anon_rss); + *shared = rss - anon_rss; *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; - *resident = mm->rss; + *resident = rss; return mm->total_vm; } Index: linux-2.6.10/fs/proc/array.c =================================================================== --- linux-2.6.10.orig/fs/proc/array.c 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/fs/proc/array.c 2005-01-03 12:21:32.000000000 -0800 @@ -302,7 +302,7 @@ static int do_task_stat(struct task_struct *task, char * buffer, int whole) { - unsigned long vsize, eip, esp, wchan = ~0UL; + unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL; long priority, nice; int tty_pgrp = -1, tty_nr = 0; sigset_t sigign, sigcatch; @@ -325,6 +325,7 @@ vsize = task_vsize(mm); eip = KSTK_EIP(task); esp = KSTK_ESP(task); + get_rss(mm, &rss, &anon_rss); } get_task_comm(tcomm, task); @@ -420,7 +421,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? rss : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? 
mm->end_code : 0, Index: linux-2.6.10/mm/rmap.c =================================================================== --- linux-2.6.10.orig/mm/rmap.c 2005-01-03 10:31:41.000000000 -0800 +++ linux-2.6.10/mm/rmap.c 2005-01-03 12:21:32.000000000 -0800 @@ -264,8 +264,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -446,8 +444,6 @@ BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; - anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; index += vma->vm_pgoff; @@ -519,8 +515,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -817,8 +811,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.10/kernel/fork.c =================================================================== --- linux-2.6.10.orig/kernel/fork.c 2004-12-24 13:33:59.000000000 -0800 +++ linux-2.6.10/kernel/fork.c 2005-01-03 12:21:32.000000000 -0800 @@ -78,10 +78,16 @@ static kmem_cache_t *task_struct_cachep; #endif +static void rcu_free_task(struct rcu_head *head) +{ + struct task_struct *tsk = container_of(head ,struct task_struct, rcu_head); + free_task_struct(tsk); +} + void free_task(struct task_struct *tsk) { free_thread_info(tsk->thread_info); - free_task_struct(tsk); + call_rcu(&tsk->rcu_head, rcu_free_task); } EXPORT_SYMBOL(free_task); @@ -98,7 +104,7 @@ put_group_info(tsk->group_info); if (!profile_handoff_task(tsk)) - free_task(tsk); + call_rcu(&tsk->rcu_head, rcu_free_task); } void __init fork_init(unsigned long mempages) @@ -151,6 +157,7 @@ *tsk = *orig; tsk->thread_info = ti; ti->task = tsk; + 
tsk->rss = 0; /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); @@ -292,6 +299,7 @@ atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); + INIT_LIST_HEAD(&mm->task_list); mm->core_waiters = 0; mm->nr_ptes = 0; spin_lock_init(&mm->page_table_lock); @@ -400,6 +408,8 @@ /* Get rid of any cached register state */ deactivate_mm(tsk, mm); + if (mm) + mm_remove_thread(mm, tsk); /* notify parent sleeping on vfork() */ if (vfork_done) { @@ -447,8 +457,8 @@ * new threads start up in user mode using an mm, which * allows optimizing out ipis; the tlb_gather_mmu code * is an example. + * (mm_add_thread does use the ptl .... ) */ - spin_unlock_wait(&oldmm->page_table_lock); goto good_mm; } @@ -470,6 +480,7 @@ goto free_pt; good_mm: + mm_add_thread(mm, tsk); tsk->mm = mm; tsk->active_mm = mm; return 0; @@ -1063,7 +1074,7 @@ atomic_dec(&p->user->processes); free_uid(p->user); bad_fork_free: - free_task(p); + call_rcu(&p->rcu_head, rcu_free_task); goto fork_out; } Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-03 11:15:55.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-03 12:21:32.000000000 -0800 @@ -909,6 +909,7 @@ struct page *map; int lookup_write = write; while (!(map = follow_page(mm, start, lookup_write))) { + unsigned long rss, anon_rss; /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -921,6 +922,17 @@ map = ZERO_PAGE(start); break; } + if (mm != current->mm) { + /* + * handle_mm_fault uses the current pointer + * for a split rss counter. 
The current pointer + * is not correct if we are using a different mm + */ + rss = current->rss; + anon_rss = current->anon_rss; + current->rss = 0; + current->anon_rss = 0; + } spin_unlock(&mm->page_table_lock); switch (handle_mm_fault(mm,vma,start,write)) { case VM_FAULT_MINOR: @@ -945,6 +957,12 @@ */ lookup_write = write && !force; spin_lock(&mm->page_table_lock); + if (mm != current->mm) { + mm->rss += current->rss; + mm->anon_rss += current->anon_rss; + current->rss = rss; + current->anon_rss = anon_rss; + } } if (pages) { pages[i] = get_page_map(map); @@ -1325,6 +1343,7 @@ break_cow(vma, new_page, address, page_table); lru_cache_add_active(new_page); page_add_anon_rmap(new_page, vma, address); + mm->anon_rss++; /* Free the old page.. */ new_page = old_page; @@ -1608,6 +1627,7 @@ flush_icache_page(vma, page); set_pte(page_table, pte); page_add_anon_rmap(page, vma, address); + mm->anon_rss++; if (write_access) { if (do_wp_page(mm, vma, address, @@ -1674,6 +1694,7 @@ page_add_anon_rmap(page, vma, addr); lru_cache_add_active(page); mm->rss++; + mm->anon_rss++; } pte_unmap(page_table); @@ -1780,6 +1801,7 @@ if (anon) { lru_cache_add_active(new_page); page_add_anon_rmap(new_page, vma, address); + mm->anon_rss++; } else page_add_file_rmap(new_page); pte_unmap(page_table); @@ -2143,3 +2165,49 @@ } #endif + +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss) +{ + struct list_head *y; + struct task_struct *t; + long rss_sum, anon_rss_sum; + + rcu_read_lock(); + rss_sum = mm->rss; + anon_rss_sum = mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss_sum += t->rss; + anon_rss_sum += t->anon_rss; + } + if (rss_sum < 0) + rss_sum = 0; + if (anon_rss_sum < 0) + anon_rss_sum = 0; + rcu_read_unlock(); + *rss = rss_sum; + *anon_rss = anon_rss_sum; +} + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + if (!mm) + return; + + spin_lock(&mm->page_table_lock); + mm->rss 
+= tsk->rss; + mm->anon_rss += tsk->anon_rss; + list_del_rcu(&tsk->mm_tasks); + spin_unlock(&mm->page_table_lock); +} + +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + spin_lock(&mm->page_table_lock); + tsk->rss = 0; + tsk->anon_rss = 0; + list_add_rcu(&tsk->mm_tasks, &mm->task_list); + spin_unlock(&mm->page_table_lock); +} + + Index: linux-2.6.10/include/linux/init_task.h =================================================================== --- linux-2.6.10.orig/include/linux/init_task.h 2004-12-24 13:33:52.000000000 -0800 +++ linux-2.6.10/include/linux/init_task.h 2005-01-03 12:21:32.000000000 -0800 @@ -42,6 +42,7 @@ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ + .task_list = LIST_HEAD_INIT(name.task_list), \ } #define INIT_SIGNALS(sig) { \ @@ -112,6 +113,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \ } Index: linux-2.6.10/fs/exec.c =================================================================== --- linux-2.6.10.orig/fs/exec.c 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/fs/exec.c 2005-01-03 12:21:32.000000000 -0800 @@ -549,6 +549,7 @@ tsk->active_mm = mm; activate_mm(active_mm, mm); task_unlock(tsk); + mm_add_thread(mm, current); arch_pick_mmap_layout(mm); if (old_mm) { if (active_mm != old_mm) BUG(); Index: linux-2.6.10/fs/aio.c =================================================================== --- linux-2.6.10.orig/fs/aio.c 2004-12-24 13:34:44.000000000 -0800 +++ linux-2.6.10/fs/aio.c 2005-01-03 12:21:32.000000000 -0800 @@ -577,6 +577,7 @@ tsk->active_mm = mm; activate_mm(active_mm, mm); task_unlock(tsk); + mm_add_thread(mm, tsk); mmdrop(active_mm); } @@ -596,6 +597,7 @@ { struct task_struct *tsk = current; + mm_remove_thread(mm,tsk); task_lock(tsk); tsk->flags &= ~PF_BORROWED_MM; tsk->mm = NULL; Index: linux-2.6.10/mm/swapfile.c 
=================================================================== --- linux-2.6.10.orig/mm/swapfile.c 2005-01-03 10:31:31.000000000 -0800 +++ linux-2.6.10/mm/swapfile.c 2005-01-03 12:21:32.000000000 -0800 @@ -432,6 +432,7 @@ swp_entry_t entry, struct page *page) { vma->vm_mm->rss++; + vma->vm_mm->anon_rss++; get_page(page); set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot))); page_add_anon_rmap(page, vma, address); ^ permalink raw reply [flat|nested] 286+ messages in thread
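[Editor's note: the split-RSS patch above keeps a private `rss`/`anon_rss` delta in each task and has `get_rss()` sum the mm-wide base plus every task's delta over an RCU-protected list. A single-threaded user-space sketch of the same bookkeeping, with a plain array standing in for the RCU task list and hypothetical names (`demo_rss`, `NTASKS`) not taken from the patch:]

```c
#define NTASKS 4	/* stand-in for the mm's task list */

struct task { long rss; };			/* per-task delta */
struct mm   { long rss; struct task tasks[NTASKS]; };	/* mm-wide base */

/* Sum the base plus every task's delta; the patch clamps negative
 * sums to zero the same way, since deltas can transiently go negative. */
static long get_rss(const struct mm *mm)
{
	long sum = mm->rss;
	for (int i = 0; i < NTASKS; i++)
		sum += mm->tasks[i].rss;
	return sum < 0 ? 0 : sum;
}

/* On task exit, mm_remove_thread folds the task's delta back into
 * the mm so no counts are lost. */
static void mm_remove_thread(struct mm *mm, struct task *tsk)
{
	mm->rss += tsk->rss;
	tsk->rss = 0;
}

/* Demo: base 10, deltas +5 and -2; folding one task in does not
 * change the total seen by readers. */
static long demo_rss(void)
{
	struct mm mm = { .rss = 10 };
	mm.tasks[0].rss = 5;
	mm.tasks[1].rss = -2;
	mm_remove_thread(&mm, &mm.tasks[0]);
	return get_rss(&mm);
}
```

The point of the split is that fault handlers touch only `current->rss` (cache-hot, uncontended) and the comparatively rare readers in /proc pay the cost of the summation.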
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (6 preceding siblings ...) 2004-12-01 23:45 ` page fault scalability patch V12 [7/7]: Split counter for rss Christoph Lameter @ 2004-12-02 0:10 ` Linus Torvalds 2004-12-02 0:55 ` Andrew Morton 2004-12-02 6:21 ` Jeff Garzik 2004-12-09 8:00 ` Nick Piggin 2004-12-09 18:37 ` Hugh Dickins 9 siblings, 2 replies; 286+ messages in thread From: Linus Torvalds @ 2004-12-02 0:10 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Wed, 1 Dec 2004, Christoph Lameter wrote: > > Changes from V11->V12 of this patch: > - dump sloppy_rss in favor of list_rss (Linus' proposal) > - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) > > This is a series of patches that increases the scalability of > the page fault handler for SMP. Here are some performance results > on a machine with 512 processors allocating 32 GB with an increasing > number of threads (that are assigned a processor each). Ok, consider me convinced. I don't want to apply this before I get 2.6.10 out the door, but I'm happy with it. I assume Andrew has already picked up the previous version. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 0:10 ` page fault scalability patch V12 [0/7]: Overview and performance tests Linus Torvalds @ 2004-12-02 0:55 ` Andrew Morton 2004-12-02 1:46 ` Christoph Lameter 2004-12-02 6:21 ` Jeff Garzik 1 sibling, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-12-02 0:55 UTC (permalink / raw) To: Linus Torvalds Cc: clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Linus Torvalds <torvalds@osdl.org> wrote: > > > > On Wed, 1 Dec 2004, Christoph Lameter wrote: > > > > Changes from V11->V12 of this patch: > > - dump sloppy_rss in favor of list_rss (Linus' proposal) > > - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) > > > > This is a series of patches that increases the scalability of > > the page fault handler for SMP. Here are some performance results > > on a machine with 512 processors allocating 32 GB with an increasing > > number of threads (that are assigned a processor each). > > Ok, consider me convinced. I don't want to apply this before I get 2.6.10 > out the door, but I'm happy with it. There were concerns about some architectures relying upon page_table_lock for exclusivity within their own pte handling functions. Have they all been resolved? > I assume Andrew has already picked up the previous version. Nope. It has major clashes with the 4-level-pagetable work. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 0:55 ` Andrew Morton @ 2004-12-02 1:46 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-02 1:46 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Wed, 1 Dec 2004, Andrew Morton wrote: > > Ok, consider me convinced. I don't want to apply this before I get 2.6.10 > > out the door, but I'm happy with it. > > There were concerns about some architectures relying upon page_table_lock > for exclusivity within their own pte handling functions. Have they all > been resolved? The patch will fall back on the page_table_lock if an architecture cannot provide atomic pte operations. ^ permalink raw reply [flat|nested] 286+ messages in thread
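The fallback Christoph describes can be sketched in userspace C. This is an illustrative model, not the kernel code: where the architecture supplies a native compare-and-swap, the pte is updated locklessly; otherwise the same semantics are emulated under the page_table_lock (modeled here with a pthread mutex):

```c
#include <assert.h>
#include <pthread.h>

typedef unsigned long pte_t;

static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;

#ifdef HAVE_ATOMIC_PTE
/* Architectures with a native cmpxchg update the pte locklessly. */
static int ptep_cmpxchg(pte_t *ptep, pte_t old, pte_t new)
{
	return __sync_bool_compare_and_swap(ptep, old, new);
}
#else
/* Generic fallback: emulate compare-and-exchange under the
 * page_table_lock, which is what the patch series does on
 * architectures that cannot provide atomic pte operations. */
static int ptep_cmpxchg(pte_t *ptep, pte_t old, pte_t new)
{
	int ok;

	pthread_mutex_lock(&page_table_lock);
	ok = (*ptep == old);
	if (ok)
		*ptep = new;
	pthread_mutex_unlock(&page_table_lock);
	return ok;
}
#endif
```

Either way the caller sees the same contract: the pte changes only if it still holds the expected value, which is what lets the fault handler race safely against the swapper.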
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 0:10 ` page fault scalability patch V12 [0/7]: Overview and performance tests Linus Torvalds 2004-12-02 0:55 ` Andrew Morton @ 2004-12-02 6:21 ` Jeff Garzik 2004-12-02 6:34 ` Andrew Morton 1 sibling, 1 reply; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 6:21 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Linus Torvalds wrote: > Ok, consider me convinced. I don't want to apply this before I get 2.6.10 > out the door, but I'm happy with it. I assume Andrew has already picked up > the previous version. Does that mean that 2.6.10 is actually close to the door? /me runs... ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:21 ` Jeff Garzik @ 2004-12-02 6:34 ` Andrew Morton 2004-12-02 6:48 ` Jeff Garzik ` (3 more replies) 0 siblings, 4 replies; 286+ messages in thread From: Andrew Morton @ 2004-12-02 6:34 UTC (permalink / raw) To: Jeff Garzik Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Linus Torvalds wrote:
> > Ok, consider me convinced. I don't want to apply this before I get 2.6.10
> > out the door, but I'm happy with it. I assume Andrew has already picked up
> > the previous version.
>
> Does that mean that 2.6.10 is actually close to the door?
>

We need an -rc3 yet. And I need to do another pass through the regressions-since-2.6.9 list. We've made pretty good progress there recently. Mid to late December is looking like the 2.6.10 date.

We need to be achieving higher-quality major releases than we did in 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer stabilisation periods.

Of course, nobody will test -rc3 and a zillion people will test final 2.6.10, which is when we get lots of useful bug reports. If this keeps on happening then we'll need to get more serious about the 2.6.10.n process. Or start alternating between stable and flakey releases, so 2.6.11 will be a feature release with a 2-month development period and 2.6.12 will be a bugfix-only release, with perhaps a 2-week development period, so people know that the even-numbered releases are better stabilised.

We'll see. It all depends on how many bugs you can fix in the next two weeks ;)

^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:34 ` Andrew Morton @ 2004-12-02 6:48 ` Jeff Garzik 2004-12-02 7:02 ` Andrew Morton 2004-12-02 19:48 ` Diego Calleja 2004-12-02 7:00 ` Jeff Garzik ` (2 subsequent siblings) 3 siblings, 2 replies; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 6:48 UTC (permalink / raw) To: Andrew Morton Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Andrew Morton wrote: > We need to be be achieving higher-quality major releases than we did in > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > stabilisation periods. I'm still hoping that distros (like my employer) and orgs like OSDL will step up, and hook 2.6.x BK snapshots into daily test harnesses. Something like John Cherry's reports to lkml on warnings and errors would be darned useful. His reports are IMO an ideal model: show day-to-day _changes_ in test results. Don't just dump a huge list of testsuite results, results which are often clogged with expected failures and testsuite bug noise. Jeff ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:48 ` Jeff Garzik @ 2004-12-02 7:02 ` Andrew Morton 2004-12-02 7:26 ` Martin J. Bligh ` (2 more replies) 2004-12-02 19:48 ` Diego Calleja 1 sibling, 3 replies; 286+ messages in thread From: Andrew Morton @ 2004-12-02 7:02 UTC (permalink / raw) To: Jeff Garzik Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Jeff Garzik <jgarzik@pobox.com> wrote: > > Andrew Morton wrote: > > We need to be be achieving higher-quality major releases than we did in > > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > > stabilisation periods. > > > I'm still hoping that distros (like my employer) and orgs like OSDL will > step up, and hook 2.6.x BK snapshots into daily test harnesses. I believe that both IBM and OSDL are doing this, or are getting geared up to do this. With both Linus bk and -mm. However I have my doubts about how useful it will end up being. These test suites don't seem to pick up many regressions. I've challenged Gerrit to go back through a release cycle's bugfixes and work out how many of those bugs would have been detected by the test suite. My suspicion is that the answer will be "a very small proportion", and that really is the bottom line. We simply get far better coverage testing by releasing code, because of all the wild, whacky and weird things which people do with their computers. Bless them. > Something like John Cherry's reports to lkml on warnings and errors > would be darned useful. His reports are IMO an ideal model: show > day-to-day _changes_ in test results. Don't just dump a huge list of > testsuite results, results which are often clogged with expected > failures and testsuite bug noise. > Yes, we need humans between the tests and the developers. Someone who has good experience with the tests and who can say "hey, something changed when I do X". If nothing changed, we don't hear anything. 
It's a developer role, not a testing role. All testing is, really. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:02 ` Andrew Morton @ 2004-12-02 7:26 ` Martin J. Bligh 2004-12-02 7:31 ` Jeff Garzik 2004-12-02 18:43 ` page fault scalability patch V12 [0/7]: Overview and performance tests cliff white 2004-12-02 16:24 ` Gerrit Huizenga 2004-12-02 17:34 ` cliff white 2 siblings, 2 replies; 286+ messages in thread From: Martin J. Bligh @ 2004-12-02 7:26 UTC (permalink / raw) To: Andrew Morton, Jeff Garzik Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel --Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800): > Jeff Garzik <jgarzik@pobox.com> wrote: >> >> Andrew Morton wrote: >> > We need to be be achieving higher-quality major releases than we did in >> > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer >> > stabilisation periods. >> >> >> I'm still hoping that distros (like my employer) and orgs like OSDL will >> step up, and hook 2.6.x BK snapshots into daily test harnesses. > > I believe that both IBM and OSDL are doing this, or are getting geared up > to do this. With both Linus bk and -mm. I already run a bunch of tests on a variety of machines for every new kernel ... but don't have an automated way to compare the results as yet, so don't actually look at them much ;-(. Sometime soon (quite possibly over Christmas) things will calm down enough I'll get a couple of days to write the appropriate perl script, and start publishing stuff. > However I have my doubts about how useful it will end up being. These test > suites don't seem to pick up many regressions. I've challenged Gerrit to > go back through a release cycle's bugfixes and work out how many of those > bugs would have been detected by the test suite. > > My suspicion is that the answer will be "a very small proportion", and that > really is the bottom line. Yeah, probably. Though the stress tests catch a lot more than the functionality ones. 
The big pain in the ass is drivers, because I don't have a hope in hell of testing more than 1% of them. M. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:26 ` Martin J. Bligh @ 2004-12-02 7:31 ` Jeff Garzik 2004-12-02 18:10 ` cliff white 2004-12-02 18:43 ` page fault scalability patch V12 [0/7]: Overview and performance tests cliff white 1 sibling, 1 reply; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 7:31 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Martin J. Bligh wrote: > Yeah, probably. Though the stress tests catch a lot more than the > functionality ones. The big pain in the ass is drivers, because I don't > have a hope in hell of testing more than 1% of them. My dream is that hardware vendors rotate their current machines through a test shop :) It would be nice to make sure that the popular drivers get daily test coverage. Jeff, dreaming on ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:31 ` Jeff Garzik @ 2004-12-02 18:10 ` cliff white 2004-12-02 18:17 ` Gerrit Huizenga ` (2 more replies) 0 siblings, 3 replies; 286+ messages in thread From: cliff white @ 2004-12-02 18:10 UTC (permalink / raw) To: Jeff Garzik Cc: mbligh, akpm, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Thu, 02 Dec 2004 02:31:35 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > Martin J. Bligh wrote: > > Yeah, probably. Though the stress tests catch a lot more than the > > functionality ones. The big pain in the ass is drivers, because I don't > > have a hope in hell of testing more than 1% of them. > > My dream is that hardware vendors rotate their current machines through > a test shop :) It would be nice to make sure that the popular drivers > get daily test coverage. > > Jeff, dreaming on OSDL has recently re-done the donation policy, and we're much better positioned to support that sort of thing now - Contact Tom Hanrahan at OSDL if you are a vendor, or know a vendor. ( Or you can become a vendor ) cliffw > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> > -- The church is near, but the road is icy. The bar is far, but i will walk carefully. - Russian proverb ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 18:10 ` cliff white @ 2004-12-02 18:17 ` Gerrit Huizenga 2004-12-02 20:25 ` linux-os 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter 2 siblings, 0 replies; 286+ messages in thread From: Gerrit Huizenga @ 2004-12-02 18:17 UTC (permalink / raw) To: cliff white Cc: Jeff Garzik, mbligh, akpm, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Thu, 02 Dec 2004 10:10:29 PST, cliff white wrote: > On Thu, 02 Dec 2004 02:31:35 -0500 > Jeff Garzik <jgarzik@pobox.com> wrote: > > > Martin J. Bligh wrote: > > > Yeah, probably. Though the stress tests catch a lot more than the > > > functionality ones. The big pain in the ass is drivers, because I don't > > > have a hope in hell of testing more than 1% of them. > > > > My dream is that hardware vendors rotate their current machines through > > a test shop :) It would be nice to make sure that the popular drivers > > get daily test coverage. > > > > Jeff, dreaming on > > OSDL has recently re-done the donation policy, and we're much better positioned > to support that sort of thing now - Contact Tom Hanrahan at OSDL if you > are a vendor, or know a vendor. ( Or you can become a vendor ) Specifically Tom Hanrahan == hanrahat@osdl.org gerrit ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 18:10 ` cliff white 2004-12-02 18:17 ` Gerrit Huizenga @ 2004-12-02 20:25 ` linux-os 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter 2 siblings, 0 replies; 286+ messages in thread From: linux-os @ 2004-12-02 20:25 UTC (permalink / raw) To: cliff white Cc: Jeff Garzik, mbligh, akpm, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Thu, 2 Dec 2004, cliff white wrote: > On Thu, 02 Dec 2004 02:31:35 -0500 > Jeff Garzik <jgarzik@pobox.com> wrote: > >> Martin J. Bligh wrote: >>> Yeah, probably. Though the stress tests catch a lot more than the >>> functionality ones. The big pain in the ass is drivers, because I don't >>> have a hope in hell of testing more than 1% of them. >> >> My dream is that hardware vendors rotate their current machines through >> a test shop :) It would be nice to make sure that the popular drivers >> get daily test coverage. >> >> Jeff, dreaming on > It isn't going to happen until the time when the vendors call somebody a liar, try to get them fired, and then that somebody takes them to court and they lose 100 million dollars or so. Until that happens, vendors will continue to make junk and they will continue to lie about the performance of that junk. It doesn't help that Software Engineering has become a "hardware junk fixing" job. Basically many vendors in the PC and PC peripheral business are, for lack of a better word, liars who are in the business of perpetrating fraud upon the unsuspecting PC user. We have vendors who convincingly change mega-bits to mega-bytes, improving performance 8-fold without any expense at all. We have vendors reducing the size of a kilobyte and a megabyte, then getting the new lies entered into dictionaries, etc. The scheme goes on. 
In the meantime, if you try to perform DMA across a PCI/Bus at or near the specified rates, you will learn that the specifications are for "this chip" or "that chip", and have nothing to do with the performance when these chips get connected together. You will find that real performance is about 20 percent of the specification. Occasionally you find a vendor that doesn't lie and the same chip-set magically performs close to the published specifications. This is becoming rare because it costs money to build motherboards that work. This might require two or more prototypes to get the timing just right so the artificial delays and re-clocking, used to make junk work, isn't required. Once the PC (and not just the desk-top PC) became a commodity, everything points to the bottom-line. You get into the business by making something that looks and smells new. Then you sell it by writing specifications that are better than the most expensive on the market. Your sales-price is set below average market so you can unload this junk as rapidly as possible. Then, you do this over again, claiming that your equipment is "state-of-the-art"! And if anybody ever tests the junk and claims that it doesn't work as specified, you contact the president of his company and try to kill the messenger. Cheers, Dick Johnson Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips). Notice : All mail here is now cached for review by John Ashcroft. 98.36% of all statistics are fiction. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Anticipatory prefaulting in the page fault handler V1 2004-12-02 18:10 ` cliff white 2004-12-02 18:17 ` Gerrit Huizenga 2004-12-02 20:25 ` linux-os @ 2004-12-08 17:24 ` Christoph Lameter 2004-12-08 17:33 ` Jesse Barnes ` (5 more replies) 2 siblings, 6 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-08 17:24 UTC (permalink / raw) To: nickpiggin Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel

The page fault handler for anonymous pages can generate significant overhead apart from its essential function, which is to clear and set up a new page table entry for a never accessed memory location. This overhead increases significantly in an SMP environment. In the page table scalability patches, we addressed the issue by changing the locking scheme so that multiple fault handlers can be processed concurrently on multiple cpus.

This patch attempts to aggregate multiple page faults into a single one. It does that by noting anonymous page faults generated in sequence by an application. If a fault occurred for page x and is then followed by a fault for page x+1, then it may be reasonable to expect another page fault at x+2 in the future. If the page table entries for x+1 and x+2 are prepared while handling the fault for page x+1, then the overhead of taking a fault for x+2 is avoided. However, page x+2 may never be used, and thus we may have increased the rss of an application unnecessarily. The swapper will take care of removing that page if memory should get tight.

The following patch makes the anonymous fault handler anticipate future faults. For each fault a prediction is made of where the next fault will occur (assuming linear access by the application). If the prediction turns out to be right (the next fault is where expected), then a number of pages is preallocated in order to avoid a series of future faults. The order of the preallocation increases by a power of two for each success in sequence.
The first successful prediction leads to one additional page being allocated. The second successful prediction leads to 2 additional pages being allocated, the third to 4 pages, and so on. The max order is 3 by default. In a large continuous allocation the number of faults is reduced by a factor of 8.

The patch may be combined with the page fault scalability patch (another edition of the patch is needed which will be forthcoming after the page fault scalability patch has been included). The combined patches will triple the possible page fault rate from ~1 mio faults/sec to 3 mio faults/sec.

Standard kernel on a 512 cpu machine allocating 32GB with an increasing number of threads (and thus increasing parallelism of page faults):

 Gb Rep Threads   User      System     Wall    flt/cpu/s fault/wsec
 32   3    1    1.416s    138.165s 139.050s 45073.831  45097.498
 32   3    2    1.397s    148.523s  78.044s 41965.149  80201.646
 32   3    4    1.390s    152.618s  44.044s 40851.258 141545.239
 32   3    8    1.500s    374.008s  53.001s 16754.519 118671.950
 32   3   16    1.415s   1051.759s  73.094s  5973.803  85087.358
 32   3   32    1.867s   3400.417s 117.003s  1849.186  53754.928
 32   3   64    5.361s  11633.040s 197.034s   540.577  31881.112
 32   3  128   23.387s  39386.390s 332.055s   159.642  18918.599
 32   3  256   15.409s  20031.450s 168.095s   313.837  37237.918
 32   3  512   18.720s  10338.511s  86.047s   607.446  72752.686

Patched kernel:

 Gb Rep Threads   User      System     Wall    flt/cpu/s fault/wsec
 32   3    1    1.098s    138.544s 139.063s 45053.657  45057.920
 32   3    2    1.022s    127.770s  67.086s 48849.350  92707.085
 32   3    4    0.995s    119.666s  37.045s 52141.503 167955.292
 32   3    8    0.928s     87.400s  18.034s 71227.407 342934.242
 32   3   16    1.067s     72.943s  11.035s 85007.293 553989.377
 32   3   32    1.248s    133.753s  10.038s 46602.680 606062.151
 32   3   64    5.557s    438.634s  13.093s 14163.802 451418.617
 32   3  128   17.860s   1496.797s  19.048s  4153.714 322808.509
 32   3  256   13.382s    766.063s  10.016s  8071.695 618816.838
 32   3  512   17.067s    369.106s   5.041s 16291.764 1161285.521

These numbers are roughly equal to what can be accomplished with the page fault scalability
patches.

Kernel patched with both the page fault scalability patches and prefaulting:

 Gb Rep Threads   User      System     Wall    flt/cpu/s fault/wsec
 32  10    1    4.103s    456.384s 460.046s  45541.992   45544.369
 32  10    2    4.005s    415.119s 221.095s  50036.407   94484.174
 32  10    4    3.855s    371.317s 111.076s  55898.259  187635.724
 32  10    8    3.902s    308.673s  67.094s  67092.476  308634.397
 32  10   16    4.011s    224.213s  37.016s  91889.781  564241.062
 32  10   32    5.483s    209.391s  27.046s  97598.647  763495.417
 32  10   64   19.166s    219.925s  26.030s  87713.212  797286.395
 32  10  128   53.482s    342.342s  27.024s  52981.744  769687.791
 32  10  256   67.334s    180.321s  15.036s  84679.911 1364614.334
 32  10  512   66.516s     93.098s   9.015s 131387.893 2291548.865

The fault rate doubles when both patches are applied.

And on the high end (512 processors allocating 256G; no numbers for regular kernels or for a low number of threads, because those runs are extremely slow).

With prefaulting:

 Gb Rep Threads   User      System     Wall    flt/cpu/s fault/wsec
256   3    4    8.241s   1414.348s 449.016s  35380.301  112056.239
256   3    8    8.306s   1300.982s 247.025s  38441.977  203559.271
256   3   16    8.368s   1223.853s 154.089s  40846.272  324940.924
256   3   32    8.536s   1284.041s 110.097s  38938.970  453556.624
256   3   64   13.025s   3587.203s 110.010s  13980.123  457131.492
256   3  128   25.025s  11460.700s 145.071s   4382.104  345404.909
256   3  256   26.150s   6061.649s  75.086s   8267.625  663414.482
256   3  512   20.637s   3037.097s  38.062s  16460.435 1302993.019

Page fault scalability patch and prefaulting.
Max prefault order increased to 5 (max preallocation of 32 pages):

 Gb Rep Threads   User      System     Wall    flt/cpu/s fault/wsec
256  10    8   33.571s   4516.293s 863.021s  36874.099  194356.930
256  10   16   33.103s   3737.688s 461.028s  44492.553  363704.484
256  10   32   35.094s   3436.561s 321.080s  48326.262  521352.840
256  10   64   46.675s   2899.997s 245.020s  56936.124  684214.256
256  10  128   85.493s   2890.198s 203.008s  56380.890  826122.524
256  10  256   74.299s   1374.973s  99.088s 115762.963 1679630.272
256  10  512   62.760s    706.559s  53.027s 218078.311 3149273.714

We are getting into almost linear scalability in the high end with both patches and end up with a fault rate > 3 mio faults per second.

The one thing that still takes up a lot of time is the zeroing of pages in the page fault handler. There is another set of patches that I am working on which will prezero pages and lead to a further increase in performance by a factor of 2-4 (if prezeroed pages are available, which may not always be the case). Maybe we can reach 10 mio faults/sec that way.

Patch against 2.6.10-rc3-bk3:

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
 #endif
 	struct list_head tasks;
+	unsigned long anon_fault_next_addr;	/* Predicted sequential fault address */
+	int anon_fault_order;			/* Last order of allocation on fault */
 	/*
 	 * ptrace_list/ptrace_children forms the list of my children
 	 * that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/pagevec.h>

 #ifndef CONFIG_DISCONTIGMEM
 /* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
 		unsigned long addr)
 {
 	pte_t entry;
-	struct page * page = ZERO_PAGE(addr);
+	struct page * page;
+
+	addr &= PAGE_MASK;
+
+	if (current->anon_fault_next_addr == addr) {
+		unsigned long end_addr;
+		int order = current->anon_fault_order;
+
+		/* Sequence of page faults detected. Perform preallocation of pages */
+		/* The order of preallocations increases with each successful prediction */
+		order++;
+
+		if ((1 << order) < PAGEVEC_SIZE)
+			end_addr = addr + (1 << (order + PAGE_SHIFT));
+		else
+			end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+		if (end_addr > vma->vm_end)
+			end_addr = vma->vm_end;
+		if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+			end_addr &= PMD_MASK;
+
+		current->anon_fault_next_addr = end_addr;
+		current->anon_fault_order = order;
+
+		if (write_access) {
+
+			struct pagevec pv;
+			unsigned long a;
+			struct page **p;
+
+			pte_unmap(page_table);
+			spin_unlock(&mm->page_table_lock);
+
+			pagevec_init(&pv, 0);
+
+			if (unlikely(anon_vma_prepare(vma)))
+				return VM_FAULT_OOM;
+
+			/* Allocate the necessary pages */
+			for(a = addr; a < end_addr; a += PAGE_SIZE) {
+				struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+				if (p) {
+					clear_user_highpage(p, a);
+					pagevec_add(&pv, p);
+				} else
+					break;
+			}
+			end_addr = a;
+
+			spin_lock(&mm->page_table_lock);
+
+			for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+				page_table = pte_offset_map(pmd, addr);
+				if (!pte_none(*page_table)) {
+					/* Someone else got there first */
+					page_cache_release(*p);
+					pte_unmap(page_table);
+					continue;
+				}
+
+				entry =
maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+							vma->vm_page_prot)),
+							vma);
+				mm->rss++;
+				lru_cache_add_active(*p);
+				mark_page_accessed(*p);
+				page_add_anon_rmap(*p, vma, addr);
+
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+				/* No need to invalidate - it was non-present before */
+				update_mmu_cache(vma, addr, entry);
+			}
+		} else {
+			/* Read */
+			for(; addr < end_addr; addr += PAGE_SIZE) {
+				page_table = pte_offset_map(pmd, addr);
+				entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+				/* No need to invalidate - it was non-present before */
+				update_mmu_cache(vma, addr, entry);
+
+			};
+		}
+		spin_unlock(&mm->page_table_lock);
+		return VM_FAULT_MINOR;
+	}
+
+	current->anon_fault_next_addr = addr + PAGE_SIZE;
+	current->anon_fault_order = 0;
+
+	page = ZERO_PAGE(addr);

 	/* Read-only mapping of ZERO_PAGE. */
 	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));

^ permalink raw reply [flat|nested] 286+ messages in thread
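The window computation at the top of the patch is self-contained enough to model directly. Below is a userspace sketch of it; the PAGE_SHIFT, pmd span, and PAGEVEC_SIZE values are illustrative stand-ins (the benchmark machines are ia64 and use different constants), but the clamping logic mirrors the hunk above:

```c
#include <assert.h>

/* Illustrative constants: 4KB pages, a 2MB pmd span, and the usual
 * 16-entry pagevec. These are assumptions for the sketch, not values
 * taken from the posting. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PMD_MASK	(~((1UL << 21) - 1))
#define PAGEVEC_SIZE	16

/* Model of the preallocation-window computation in do_anonymous_page:
 * given the faulting address and the running success count (order),
 * return the first address past the window. The window doubles with
 * each correct prediction, caps at one pagevec, stays inside the vma,
 * and never crosses a pmd boundary. */
static unsigned long prefault_end(unsigned long addr, int *order,
				  unsigned long vm_end)
{
	unsigned long end_addr;

	(*order)++;

	if ((1 << *order) < PAGEVEC_SIZE)
		end_addr = addr + (1UL << (*order + PAGE_SHIFT));
	else
		end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;

	if (end_addr > vm_end)			/* stay inside the vma */
		end_addr = vm_end;
	if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
		end_addr &= PMD_MASK;		/* do not cross a pmd */
	return end_addr;
}
```

Note that the prediction state (`anon_fault_next_addr`, `anon_fault_order`) lives in `task_struct`, so each thread's sequential pattern is tracked independently; that is what keeps the heuristic well-behaved when many threads fault in parallel.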
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter @ 2004-12-08 17:33 ` Jesse Barnes 2004-12-08 17:56 ` Christoph Lameter 2004-12-08 17:55 ` Dave Hansen ` (4 subsequent siblings) 5 siblings, 1 reply; 286+ messages in thread From: Jesse Barnes @ 2004-12-08 17:33 UTC (permalink / raw) To: Christoph Lameter Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wednesday, December 8, 2004 9:24 am, Christoph Lameter wrote: > Page fault scalability patch and prefaulting. Max prefault order > increased to 5 (max preallocation of 32 pages): > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930 > 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484 > 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840 > 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256 > 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524 > 256 10 256 74.299s 1374.973s 99.088s115762.963 1679630.272 > 256 10 512 62.760s 706.559s 53.027s218078.311 3149273.714 > > We are getting into an almost linear scalability in the high end with > both patches and end up with a fault rate > 3 mio faults per second. Nice results! Any idea how many applications benefit from this sort of anticipatory faulting? It has implications for NUMA allocation. Imagine an app that allocates a large virtual address space and then tries to fault in pages near each CPU in turn. With this patch applied, CPU 2 would be referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which would then be used by CPUs 4-6. Unless I'm missing something... And again, I'm not sure how important that is, maybe this approach will work well in the majority of cases (obviously it's a big win in faults/sec for your benchmark, but I wonder about subsequent references from other CPUs to those pages). 
You can look at /sys/devices/platform/nodeN/meminfo to see where the pages are coming from. Jesse ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:33 ` Jesse Barnes @ 2004-12-08 17:56 ` Christoph Lameter 2004-12-08 18:33 ` Jesse Barnes 2004-12-08 21:26 ` David S. Miller 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-08 17:56 UTC (permalink / raw) To: Jesse Barnes Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wed, 8 Dec 2004, Jesse Barnes wrote: > Nice results! Any idea how many applications benefit from this sort of > anticipatory faulting? It has implications for NUMA allocation. Imagine an > app that allocates a large virtual address space and then tries to fault in > pages near each CPU in turn. With this patch applied, CPU 2 would be > referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which > would then be used by CPUs 4-6. Unless I'm missing something... Faults are predicted for each thread executing on a different processor. So each processor does its own predictions which will not generate preallocations on a different processor (unless the thread is moved to another processor but that is a very special situation). > And again, I'm not sure how important that is, maybe this approach will work > well in the majority of cases (obviously it's a big win in faults/sec for > your benchmark, but I wonder about subsequent references from other CPUs to > those pages). You can look at /sys/devices/platform/nodeN/meminfo to see > where the pages are coming from. The origin of the pages has not changed and the existing locality constraints are observed. A patch like this is important for applications that allocate and preset large amounts of memory on startup. It will drastically reduce the startup times. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:56 ` Christoph Lameter @ 2004-12-08 18:33 ` Jesse Barnes 2004-12-08 21:26 ` David S. Miller 1 sibling, 0 replies; 286+ messages in thread From: Jesse Barnes @ 2004-12-08 18:33 UTC (permalink / raw) To: Christoph Lameter Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wednesday, December 8, 2004 9:56 am, Christoph Lameter wrote: > > And again, I'm not sure how important that is, maybe this approach will > > work well in the majority of cases (obviously it's a big win in > > faults/sec for your benchmark, but I wonder about subsequent references > > from other CPUs to those pages). You can look at > > /sys/devices/platform/nodeN/meminfo to see where the pages are coming > > from. > > The origin of the pages has not changed and the existing locality > constraints are observed. > > A patch like this is important for applications that allocate and preset > large amounts of memory on startup. It will drastically reduce the startup > times. Ok, that sounds good. My case was probably a bit contrived, but I'm glad to see that you had already thought of it anyway. Jesse ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:56 ` Christoph Lameter 2004-12-08 18:33 ` Jesse Barnes @ 2004-12-08 21:26 ` David S. Miller 2004-12-08 21:42 ` Linus Torvalds 1 sibling, 1 reply; 286+ messages in thread From: David S. Miller @ 2004-12-08 21:26 UTC (permalink / raw) To: Christoph Lameter Cc: jbarnes, nickpiggin, jgarzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wed, 8 Dec 2004 09:56:00 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote: > A patch like this is important for applications that allocate and preset > large amounts of memory on startup. It will drastically reduce the startup > times. I see. Yet I noticed that while the patch makes system time decrease, for some reason the wall time is increasing with the patch applied. Why is that, or am I misreading your tables? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 21:26 ` David S. Miller @ 2004-12-08 21:42 ` Linus Torvalds 0 siblings, 0 replies; 286+ messages in thread From: Linus Torvalds @ 2004-12-08 21:42 UTC (permalink / raw) To: David S. Miller Cc: Christoph Lameter, jbarnes, nickpiggin, jgarzik, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wed, 8 Dec 2004, David S. Miller wrote: > > I see. Yet I noticed that while the patch makes system time decrease, > for some reason the wall time is increasing with the patch applied. > Why is that, or am I misreading your tables? I assume that you're looking at the final "both patches applied" case. It has ten repetitions, while the other two tables only have three. That would explain the discrepancy. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter 2004-12-08 17:33 ` Jesse Barnes @ 2004-12-08 17:55 ` Dave Hansen 2004-12-08 19:07 ` Martin J. Bligh ` (3 subsequent siblings) 5 siblings, 0 replies; 286+ messages in thread From: Dave Hansen @ 2004-12-08 17:55 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Jeff Garzik, Linus Torvalds, hugh, Benjamin Herrenschmidt, linux-mm, linux-ia64, Linux Kernel Mailing List On Wed, 2004-12-08 at 09:24, Christoph Lameter wrote: > The page fault handler for anonymous pages can generate significant overhead > apart from its essential function which is to clear and setup a new page > table entry for a never accessed memory location. This overhead increases > significantly in an SMP environment. do_anonymous_page() is a relatively compact function at this point. This would probably be a lot more readable if it was broken out into at least another function or two that do_anonymous_page() calls into. That way, you also get a much cleaner separation if anyone needs to turn it off in the future. Speaking of that, have you seen this impair performance on any other workloads? -- Dave ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter 2004-12-08 17:33 ` Jesse Barnes 2004-12-08 17:55 ` Dave Hansen @ 2004-12-08 19:07 ` Martin J. Bligh 2004-12-08 22:50 ` Martin J. Bligh ` (2 subsequent siblings) 5 siblings, 0 replies; 286+ messages in thread From: Martin J. Bligh @ 2004-12-08 19:07 UTC (permalink / raw) To: Christoph Lameter, nickpiggin Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel

> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function which is to clear and setup a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers are able to be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting
> anonymous page faults generated in sequence by an application.
>
> If a fault occurred for page x and is then followed by page x+1 then it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 would be prepared in the fault handling for
> page x+1 then the overhead of taking a fault for x+2 is avoided. However
> page x+2 may never be used and thus we may have increased the rss
> of an application unnecessarily. The swapper will take care of removing
> that page if memory should get tight.
>
> The following patch makes the anonymous fault handler anticipate future
> faults. For each fault a prediction is made where the fault would occur
> (assuming linear access by the application). If the prediction turns out to
> be right (next fault is where expected) then a number of pages is
> preallocated in order to avoid a series of future faults.
> The order of the preallocation increases by the power of two for each
> success in sequence.
>
> The first successful prediction leads to an additional page being allocated.
> Second successful prediction leads to 2 additional pages being allocated.
> Third to 4 pages and so on. The max order is 3 by default. In a large
> continuous allocation the number of faults is reduced by a factor of 8.
>
> The patch may be combined with the page fault scalability patch (another
> edition of the patch is needed which will be forthcoming after the
> page fault scalability patch has been included). The combined patches
> will triple the possible page fault rate from ~1 million faults/sec to
> ~3 million faults/sec.
>
> Standard kernel on a 512 CPU machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):

Mmmm ... we tried doing this before for filebacked pages by sniffing the pagecache, but it crippled forky workloads (like kernel compile) with the extra cost in zap_pte_range, etc. Perhaps the locality is better for the anon stuff, but the cost is also higher.

Exactly what benchmark were you running on this? If you just run a microbenchmark that allocates memory, then it will definitely be faster. On other things, I suspect not ...

M. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter ` (2 preceding siblings ...) 2004-12-08 19:07 ` Martin J. Bligh @ 2004-12-08 22:50 ` Martin J. Bligh 2004-12-09 19:32 ` Christoph Lameter 2004-12-09 10:57 ` Pavel Machek 2004-12-14 15:28 ` Adam Litke 5 siblings, 1 reply; 286+ messages in thread From: Martin J. Bligh @ 2004-12-08 22:50 UTC (permalink / raw) To: Christoph Lameter, nickpiggin Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel > The page fault handler for anonymous pages can generate significant overhead > apart from its essential function which is to clear and setup a new page > table entry for a never accessed memory location. This overhead increases > significantly in an SMP environment. > > In the page table scalability patches, we addressed the issue by changing > the locking scheme so that multiple fault handlers are able to be processed > concurrently on multiple cpus. This patch attempts to aggregate multiple > page faults into a single one. It does that by noting > anonymous page faults generated in sequence by an application. > > If a fault occurred for page x and is then followed by page x+1 then it may > be reasonable to expect another page fault at x+2 in the future. If page > table entries for x+1 and x+2 would be prepared in the fault handling for > page x+1 then the overhead of taking a fault for x+2 is avoided. However > page x+2 may never be used and thus we may have increased the rss > of an application unnecessarily. The swapper will take care of removing > that page if memory should get tight. I tried benchmarking it ... but processes just segfault all the time. Any chance you could try it out on SMP ia32 system? M. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 22:50 ` Martin J. Bligh @ 2004-12-09 19:32 ` Christoph Lameter 2004-12-10 2:13 ` [OT:HUMOR] " Adam Heath 2004-12-13 14:30 ` Akinobu Mita 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 19:32 UTC (permalink / raw) To: Martin J. Bligh Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Wed, 8 Dec 2004, Martin J. Bligh wrote: > I tried benchmarking it ... but processes just segfault all the time. > Any chance you could try it out on SMP ia32 system? I tried it on my i386 system and it works fine. Sorry about the puny memory sizes (the system is a PIII-450 with 384k memory) clameter@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500 0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723 clameter@schroedinger:~/pfault/code$ uname -a Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST 2004 i686 GNU/Linux Could you send me your .config? ^ permalink raw reply [flat|nested] 286+ messages in thread
* [OT:HUMOR] Re: Anticipatory prefaulting in the page fault handler V1 2004-12-09 19:32 ` Christoph Lameter @ 2004-12-10 2:13 ` Adam Heath 2004-12-13 14:30 ` Akinobu Mita 1 sibling, 0 replies; 286+ messages in thread From: Adam Heath @ 2004-12-10 2:13 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-kernel On Thu, 9 Dec 2004, Christoph Lameter wrote: > On Wed, 8 Dec 2004, Martin J. Bligh wrote: > > > I tried benchmarking it ... but processes just segfault all the time. > > Any chance you could try it out on SMP ia32 system? > > I tried it on my i386 system and it works fine. Sorry about the puny > memory sizes (the system is a PIII-450 with 384k memory) You probably get such fast numbers, because I mean, how many pages can 384k of memory have? I only figure 96 total pages. > clameter@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1 > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500 > 0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723 > > clameter@schroedinger:~/pfault/code$ uname -a > Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST > 2004 i686 GNU/Linux ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-09 19:32 ` Christoph Lameter 2004-12-10 2:13 ` [OT:HUMOR] " Adam Heath @ 2004-12-13 14:30 ` Akinobu Mita 2004-12-13 17:10 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Akinobu Mita @ 2004-12-13 14:30 UTC (permalink / raw) To: Christoph Lameter, Martin J. Bligh Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel

On Friday 10 December 2004 04:32, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> > I tried benchmarking it ... but processes just segfault all the time.
> > Any chance you could try it out on SMP ia32 system?
>
> I tried it on my i386 system and it works fine. Sorry about the puny
> memory sizes (the system is a PIII-450 with 384k memory)

I also encountered process segfaults. The patch below fixes several problems:

1) if no pages could be allocated, return VM_FAULT_OOM
2) fix a duplicated pte_offset_map() call
3) don't set_pte() for an entry which has already been set

Actually, 3) fixes my segfault problem.

--- 2.6-rc/mm/memory.c.orig 2004-12-13 22:17:04.000000000 +0900 +++ 2.6-rc/mm/memory.c 2004-12-13 22:22:14.000000000 +0900 @@ -1483,6 +1483,8 @@ do_anonymous_page(struct mm_struct *mm, } else break; } + if (a == addr) + goto no_mem; end_addr = a; spin_lock(&mm->page_table_lock); @@ -1514,8 +1516,17 @@ do_anonymous_page(struct mm_struct *mm, } } else { /* Read */ + int first = 1; + for(;addr < end_addr; addr += PAGE_SIZE) { - page_table = pte_offset_map(pmd, addr); + if (!first) + page_table = pte_offset_map(pmd, addr); + first = 0; + if (!pte_none(*page_table)) { + /* Someone else got there first */ + pte_unmap(page_table); + continue; + } entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); set_pte(page_table, entry); pte_unmap(page_table); ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-13 14:30 ` Akinobu Mita @ 2004-12-13 17:10 ` Christoph Lameter 2004-12-13 22:16 ` Martin J. Bligh 2004-12-14 12:24 ` Anticipatory prefaulting in the page fault handler V1 Akinobu Mita 0 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-13 17:10 UTC (permalink / raw) To: Akinobu Mita Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Mon, 13 Dec 2004, Akinobu Mita wrote: > I also encountered processes segfault. > Below patch fix several problems. > > 1) if no pages could allocated, returns VM_FAULT_OOM > 2) fix duplicated pte_offset_map() call I also saw these two issues and I think I dealt with them in a forthcoming patch. > 3) don't set_pte() for the entry which already have been set Not sure how this could have happened in the patch. Could you try my updated version: Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-12-08 15:01:48.801457702 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-12-08 15:02:04.286479345 -0800 @@ -537,6 +537,8 @@ #endif struct list_head tasks; + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */ + int anon_fault_order; /* Last order of allocation on fault */ /* * ptrace_list/ptrace_children forms the list of my children * that were stolen by a ptracer. 
Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-08 15:01:50.668339751 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-09 14:21:17.090061608 -0800 @@ -55,6 +55,7 @@ #include <linux/swapops.h> #include <linux/elf.h> +#include <linux/pagevec.h> #ifndef CONFIG_DISCONTIGMEM /* use the per-pgdat data instead for discontigmem - mbligh */ @@ -1432,52 +1433,99 @@ unsigned long addr) { pte_t entry; - struct page * page = ZERO_PAGE(addr); - - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + unsigned long end_addr; + + addr &= PAGE_MASK; + + if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) { + /* Single page */ + current->anon_fault_order = 0; + end_addr = addr + PAGE_SIZE; + } else { + /* Sequence of faults detect. Perform preallocation */ + int order = ++current->anon_fault_order; + + if ((1 << order) < PAGEVEC_SIZE) + end_addr = addr + (PAGE_SIZE << order); + else + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE; - /* ..except if it's a write access */ + if (end_addr > vma->vm_end) + end_addr = vma->vm_end; + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) + end_addr &= PMD_MASK; + } if (write_access) { - /* Allocate our own private page. 
*/ + + unsigned long a; + struct page **p; + struct pagevec pv; + pte_unmap(page_table); spin_unlock(&mm->page_table_lock); + pagevec_init(&pv, 0); + if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_OOM; + + /* Allocate the necessary pages */ + for(a = addr; a < end_addr ; a += PAGE_SIZE) { + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a); + + if (likely(p)) { + clear_user_highpage(p, a); + pagevec_add(&pv, p); + } else { + if (a == addr) + return VM_FAULT_OOM; + break; + } + } spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { + for(p = pv.pages; addr < a; addr += PAGE_SIZE, p++) { + + page_table = pte_offset_map(pmd, addr); + if (unlikely(!pte_none(*page_table))) { + /* Someone else got there first */ + pte_unmap(page_table); + page_cache_release(*p); + continue; + } + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p, + vma->vm_page_prot)), + vma); + + mm->rss++; + lru_cache_add_active(*p); + mark_page_accessed(*p); + page_add_anon_rmap(*p, vma, addr); + + set_pte(page_table, entry); pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, addr, entry); + } + } else { + /* Read */ + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); +nextread: + set_pte(page_table, entry); + pte_unmap(page_table); + update_mmu_cache(vma, addr, entry); + addr += PAGE_SIZE; + if (unlikely(addr < end_addr)) { + pte_offset_map(pmd, addr); + goto nextread; } - mm->rss++; - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - - set_pte(page_table, entry); - pte_unmap(page_table); - - /* No need to invalidate - it was non-present 
before */ - update_mmu_cache(vma, addr, entry); + current->anon_fault_next_addr = addr; spin_unlock(&mm->page_table_lock); -out: return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-13 17:10 ` Christoph Lameter @ 2004-12-13 22:16 ` Martin J. Bligh 2004-12-14 1:32 ` Anticipatory prefaulting in the page fault handler V2 Christoph Lameter 2004-12-14 12:24 ` Anticipatory prefaulting in the page fault handler V1 Akinobu Mita 1 sibling, 1 reply; 286+ messages in thread From: Martin J. Bligh @ 2004-12-13 22:16 UTC (permalink / raw) To: Christoph Lameter, Akinobu Mita Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel >> I also encountered processes segfault. >> Below patch fix several problems. >> >> 1) if no pages could allocated, returns VM_FAULT_OOM >> 2) fix duplicated pte_offset_map() call > > I also saw these two issues and I think I dealt with them in a forthcoming > patch. > >> 3) don't set_pte() for the entry which already have been set > > Not sure how this could have happened in the patch. > > Could you try my updated version: Urgle. There was a fix from Hugh too ... any chance you could just stick a whole new patch somewhere? I'm too idle/stupid to work it out ;-) M. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Anticipatory prefaulting in the page fault handler V2 2004-12-13 22:16 ` Martin J. Bligh @ 2004-12-14 1:32 ` Christoph Lameter 2004-12-14 19:31 ` Adam Litke 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-12-14 1:32 UTC (permalink / raw) To: Martin J. Bligh Cc: Akinobu Mita, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel Changes from V1 to V2: - Eliminate duplicate code and reorganize things - Use SetReferenced instead of mark_accessed (Hugh Dickins) - Fix the problem of the preallocation order increasing out of bounds (leading to memory being overwritten with pointers to struct page) - Return VM_FAULT_OOM if not able to allocate a single page - Tested on i386 and ia64 - New performance test for low cpu counts (up to 8 so that this does not seem to be too exotic) The page fault handler for anonymous pages can generate significant overhead apart from its essential function which is to clear and setup a new page table entry for a never accessed memory location. This overhead increases significantly in an SMP environment. In the page table scalability patches, we addressed the issue by changing the locking scheme so that multiple fault handlers are able to be processed concurrently on multiple cpus. This patch attempts to aggregate multiple page faults into a single one. It does that by noting anonymous page faults generated in sequence by an application. If a fault occurred for page x and is then followed by page x+1 then it may be reasonable to expect another page fault at x+2 in the future. If page table entries for x+1 and x+2 would be prepared in the fault handling for page x+1 then the overhead of taking a fault for x+2 is avoided. However page x+2 may never be used and thus we may have increased the rss of an application unnecessarily. The swapper will take care of removing that page if memory should get tight. The following patch makes the anonymous fault handler anticipate future faults. 
For each fault a prediction is made where the fault would occur (assuming linear access by the application). If the prediction turns out to be right (next fault is where expected) then a number of pages is preallocated in order to avoid a series of future faults. The order of the preallocation increases by the power of two for each success in sequence.

The first successful prediction leads to an additional page being allocated. Second successful prediction leads to 2 additional pages being allocated. Third to 4 pages and so on. The max order is 3 by default. In a large continuous allocation the number of faults is reduced by a factor of 8.

Standard kernel on an 8 CPU machine allocating 1 and 4GB with an increasing number of threads (and thus increasing parallelism of page faults):

ia64 2.6.10-rc3-bk7
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 3 1 0.047s 2.163s 2.021s 88925.153 88859.030
1 3 2 0.040s 3.215s 1.069s 60385.889 115677.685
1 3 4 0.041s 3.509s 1.023s 55370.338 158971.609
1 3 8 0.047s 4.130s 1.014s 47049.904 172405.990

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.155s 11.277s 11.043s 68788.420 68747.223
4 3 2 0.161s 16.459s 8.061s 47315.277 91322.962
4 3 4 0.170s 14.708s 4.079s 52852.007 164043.773
4 3 8 0.171s 23.257s 4.028s 33565.604 183348.574

ia64 Patched kernel:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 3 1 0.008s 2.080s 2.008s 94121.792 94101.359
1 3 2 0.015s 3.128s 1.064s 62523.771 119563.496
1 3 4 0.008s 2.714s 1.012s 72185.910 175020.971
1 3 8 0.016s 2.963s 0.087s 65965.457 223921.949

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.034s 10.861s 10.089s 72179.444 72181.353
4 3 2 0.050s 14.303s 7.072s 54786.447 101738.901
4 3 4 0.038s 13.478s 4.044s 58182.649 176913.840
4 3 8 0.063s 13.584s 3.007s 57620.638 256109.927

i386 2.6.10-rc3-bk3 256M allocation 2x Pentium III 500 MHz
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.020s 1.566s 1.058s 123827.513 123842.098
0 3 2 0.017s 2.439s
1.043s 79999.154 136931.671 i386 2.6.10-rc3-bk3 patches Gb Rep Threads User System Wall flt/cpu/s fault/wsec 0 3 1 0.020s 1.527s 1.039s126945.181 140930.664 0 3 2 0.016s 2.417s 1.026s 80754.809 155162.903 Patch against 2.6.10-rc3-bk7: Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-12-13 15:14:40.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-12-13 15:15:55.000000000 -0800 @@ -537,6 +537,8 @@ #endif struct list_head tasks; + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */ + int anon_fault_order; /* Last order of allocation on fault */ /* * ptrace_list/ptrace_children forms the list of my children * that were stolen by a ptracer. Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-13 15:14:40.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-13 16:49:31.000000000 -0800 @@ -55,6 +55,7 @@ #include <linux/swapops.h> #include <linux/elf.h> +#include <linux/pagevec.h> #ifndef CONFIG_DISCONTIGMEM /* use the per-pgdat data instead for discontigmem - mbligh */ @@ -1432,52 +1433,102 @@ unsigned long addr) { pte_t entry; - struct page * page = ZERO_PAGE(addr); - - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + unsigned long end_addr; + + addr &= PAGE_MASK; + + if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) { + /* Single page */ + current->anon_fault_order = 0; + end_addr = addr + PAGE_SIZE; + } else { + /* Sequence of faults detect. 
Perform preallocation */ + int order = ++current->anon_fault_order; + + if ((1 << order) < PAGEVEC_SIZE) + end_addr = addr + (PAGE_SIZE << order); + else { + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE; + current->anon_fault_order = 3; + } - /* ..except if it's a write access */ + if (end_addr > vma->vm_end) + end_addr = vma->vm_end; + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) + end_addr &= PMD_MASK; + } if (write_access) { - /* Allocate our own private page. */ + + unsigned long a; + int i; + struct pagevec pv; + pte_unmap(page_table); spin_unlock(&mm->page_table_lock); + pagevec_init(&pv, 0); + if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_OOM; + + /* Allocate the necessary pages */ + for(a = addr; a < end_addr ; a += PAGE_SIZE) { + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a); + + if (likely(p)) { + clear_user_highpage(p, a); + pagevec_add(&pv, p); + } else { + if (a == addr) + return VM_FAULT_OOM; + break; + } + } spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { + for(i = 0; addr < a; addr += PAGE_SIZE, i++) { + struct page *p = pv.pages[i]; + + page_table = pte_offset_map(pmd, addr); + if (unlikely(!pte_none(*page_table))) { + /* Someone else got there first */ + pte_unmap(page_table); + page_cache_release(p); + continue; + } + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(p, + vma->vm_page_prot)), + vma); + + mm->rss++; + lru_cache_add_active(p); + SetPageReferenced(p); + page_add_anon_rmap(p, vma, addr); + + set_pte(page_table, entry); pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, addr, entry); + } + } else { + /* Read */ + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); +nextread: + 
set_pte(page_table, entry); + pte_unmap(page_table); + update_mmu_cache(vma, addr, entry); + addr += PAGE_SIZE; + if (unlikely(addr < end_addr)) { + pte_offset_map(pmd, addr); + goto nextread; } - mm->rss++; - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - - set_pte(page_table, entry); - pte_unmap(page_table); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); + current->anon_fault_next_addr = addr; spin_unlock(&mm->page_table_lock); -out: return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V2 2004-12-14 1:32 ` Anticipatory prefaulting in the page fault handler V2 Christoph Lameter @ 2004-12-14 19:31 ` Adam Litke 2004-12-15 19:03 ` Anticipatory prefaulting in the page fault handler V3 Christoph Lameter 2005-01-05 0:29 ` Anticipatory prefaulting in the page fault handler V4 Christoph Lameter 0 siblings, 2 replies; 286+ messages in thread From: Adam Litke @ 2004-12-14 19:31 UTC (permalink / raw) To: Christoph Lameter Cc: Martin J. Bligh, Akinobu Mita, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel Just to add another data point: This works on my 4-way ppc64 (Power4) box. I am seeing no degradation when running this on kernbench (which is expected). For the curious, here are the results: Kernbench results with anon-prefault: 349.86user 49.64system 1:57.85elapsed 338%CPU (0avgtext+0avgdata 0maxresident)k 349.65user 49.81system 1:58.31elapsed 337%CPU (0avgtext+0avgdata 0maxresident)k 349.48user 50.00system 1:53.70elapsed 351%CPU (0avgtext+0avgdata 0maxresident)k 349.73user 49.69system 1:57.67elapsed 339%CPU (0avgtext+0avgdata 0maxresident)k 349.75user 49.85system 1:52.71elapsed 354%CPU (0avgtext+0avgdata 0maxresident)k Elapsed: 116.048s User: 349.694s System: 49.798s CPU: 343.8% Kernbench results without anon-prefault: 350.86user 52.54system 1:53.45elapsed 355%CPU (0avgtext+0avgdata 0maxresident)k 350.99user 52.36system 1:52.05elapsed 359%CPU (0avgtext+0avgdata 0maxresident)k 350.92user 52.68system 1:54.14elapsed 353%CPU (0avgtext+0avgdata 0maxresident)k 350.98user 52.38system 1:56.17elapsed 347%CPU (0avgtext+0avgdata 0maxresident)k 351.16user 52.31system 1:53.90elapsed 354%CPU (0avgtext+0avgdata 0maxresident)k Elapsed: 113.942s User: 350.982s System: 52.454s CPU: 353.6% On Mon, 2004-12-13 at 19:32, Christoph Lameter wrote: > Changes from V1 to V2: > - Eliminate duplicate code and reorganize things > - Use SetReferenced instead of mark_accessed (Hugh Dickins) > - 
Fix the problem of the preallocation order increasing out of bounds > (leading to memory being overwritten with pointers to struct page) > - Return VM_FAULT_OOM if not able to allocate a single page > - Tested on i386 and ia64 > - New performance test for low cpu counts (up to 8 so that this does not > seem to be too exotic) -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ^ permalink raw reply [flat|nested] 286+ messages in thread
* Anticipatory prefaulting in the page fault handler V3 2004-12-14 19:31 ` Adam Litke @ 2004-12-15 19:03 ` Christoph Lameter 2005-01-05 0:29 ` Anticipatory prefaulting in the page fault handler V4 Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-15 19:03 UTC (permalink / raw) To: Adam Litke Cc: Martin J. Bligh, Akinobu Mita, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel

Changes from V2 to V3:
- check for empty pte before setting additional pte's in aggregate read

The page fault handler for anonymous pages can generate significant overhead apart from its essential function which is to clear and setup a new page table entry for a never accessed memory location. This overhead increases significantly in an SMP environment.

In the page table scalability patches, we addressed the issue by changing the locking scheme so that multiple fault handlers are able to be processed concurrently on multiple cpus. This patch attempts to aggregate multiple page faults into a single one. It does that by noting anonymous page faults generated in sequence by an application.

If a fault occurred for page x and is then followed by page x+1 then it may be reasonable to expect another page fault at x+2 in the future. If page table entries for x+1 and x+2 would be prepared in the fault handling for page x+1 then the overhead of taking a fault for x+2 is avoided. However page x+2 may never be used and thus we may have increased the rss of an application unnecessarily. The swapper will take care of removing that page if memory should get tight.

The following patch makes the anonymous fault handler anticipate future faults. For each fault a prediction is made where the fault would occur (assuming linear access by the application). If the prediction turns out to be right (next fault is where expected) then a number of pages is preallocated in order to avoid a series of future faults.
The order of the preallocation increases by a power of two for each success in sequence. The first successful prediction leads to one additional page being allocated, the second successful prediction to 2 additional pages, the third to 4 pages, and so on. The max order is 3 by default. In a large continuous allocation the number of faults is reduced by a factor of 8. Patch against 2.6.10-rc3-bk7: Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-12-13 15:14:40.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-12-14 12:21:26.000000000 -0800 @@ -537,6 +537,8 @@ #endif struct list_head tasks; + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */ + int anon_fault_order; /* Last order of allocation on fault */ /* * ptrace_list/ptrace_children forms the list of my children * that were stolen by a ptracer. Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-13 15:14:40.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-14 12:23:36.000000000 -0800 @@ -55,6 +55,7 @@ #include <linux/swapops.h> #include <linux/elf.h> +#include <linux/pagevec.h> #ifndef CONFIG_DISCONTIGMEM /* use the per-pgdat data instead for discontigmem - mbligh */ @@ -1432,52 +1433,103 @@ unsigned long addr) { pte_t entry; - struct page * page = ZERO_PAGE(addr); - - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + unsigned long end_addr; + + addr &= PAGE_MASK; + + if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) { + /* Single page */ + current->anon_fault_order = 0; + end_addr = addr + PAGE_SIZE; + } else { + /* Sequence of faults detect. 
Perform preallocation */ + int order = ++current->anon_fault_order; + + if ((1 << order) < PAGEVEC_SIZE) + end_addr = addr + (PAGE_SIZE << order); + else { + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE; + current->anon_fault_order = 3; + } - /* ..except if it's a write access */ + if (end_addr > vma->vm_end) + end_addr = vma->vm_end; + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) + end_addr &= PMD_MASK; + } if (write_access) { - /* Allocate our own private page. */ + + unsigned long a; + int i; + struct pagevec pv; + pte_unmap(page_table); spin_unlock(&mm->page_table_lock); + pagevec_init(&pv, 0); + if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_OOM; + + /* Allocate the necessary pages */ + for(a = addr; a < end_addr ; a += PAGE_SIZE) { + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a); + + if (likely(p)) { + clear_user_highpage(p, a); + pagevec_add(&pv, p); + } else { + if (a == addr) + return VM_FAULT_OOM; + break; + } + } spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { + for(i = 0; addr < a; addr += PAGE_SIZE, i++) { + struct page *p = pv.pages[i]; + + page_table = pte_offset_map(pmd, addr); + if (unlikely(!pte_none(*page_table))) { + /* Someone else got there first */ + pte_unmap(page_table); + page_cache_release(p); + continue; + } + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(p, + vma->vm_page_prot)), + vma); + + mm->rss++; + lru_cache_add_active(p); + SetPageReferenced(p); + page_add_anon_rmap(p, vma, addr); + + set_pte(page_table, entry); pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, addr, entry); + } + } else { + /* Read */ + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); +nextread: + 
set_pte(page_table, entry); + pte_unmap(page_table); + update_mmu_cache(vma, addr, entry); + addr += PAGE_SIZE; + if (unlikely(addr < end_addr)) { + page_table = pte_offset_map(pmd, addr); + if (likely(pte_none(*page_table))) + goto nextread; } - mm->rss++; - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - - set_pte(page_table, entry); - pte_unmap(page_table); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); + current->anon_fault_next_addr = addr; spin_unlock(&mm->page_table_lock); -out: return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* ^ permalink raw reply [flat|nested] 286+ messages in thread
* Anticipatory prefaulting in the page fault handler V4 2004-12-14 19:31 ` Adam Litke 2004-12-15 19:03 ` Anticipatory prefaulting in the page fault handler V3 Christoph Lameter @ 2005-01-05 0:29 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2005-01-05 0:29 UTC (permalink / raw) To: Adam Litke Cc: Martin J. Bligh, Akinobu Mita, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel Changes from V3 to V4: - Add /proc/sys/vm/max_prealloc_order to limit preallocations - Tested against 2.6.10-bk7 (This version of the patch does not depend on atomic pte operations and will conflict with the page fault scalability patchset. I have another patch that works with atomic pte operations) The page fault handler for anonymous pages can generate significant overhead apart from its essential function, which is to clear and set up a new page table entry for a never accessed memory location. This overhead increases significantly in an SMP environment. In the page table scalability patches, we addressed the issue by changing the locking scheme so that multiple fault handlers are able to be processed concurrently on multiple cpus. This patch attempts to aggregate multiple page faults into a single one. It does that by noting anonymous page faults generated in sequence by an application. If a fault occurred for page x and is then followed by page x+1, then it may be reasonable to expect another page fault at x+2 in the future. If page table entries for x+1 and x+2 are prepared during the fault handling for page x+1, then the overhead of taking a fault for x+2 is avoided. However, page x+2 may never be used and thus we may have increased the rss of an application unnecessarily. The swapper will take care of removing that page if memory should get tight. The following patch makes the anonymous fault handler anticipate future faults. 
For each fault a prediction is made of where the next fault would occur (assuming linear access by the application). If the prediction turns out to be right (the next fault is where expected) then a number of pages are preallocated in order to avoid a series of future faults. The order of the preallocation increases by a power of two for each success in sequence. The first successful prediction leads to one additional page being allocated, the second successful prediction to 2 additional pages, the third to 4 pages, and so on. The max order is 3 by default. In a large continuous allocation the number of faults is reduced by a factor of 8. Patch against 2.6.10-bk7: Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.10/include/linux/sched.h =================================================================== --- linux-2.6.10.orig/include/linux/sched.h 2005-01-04 13:55:00.000000000 -0800 +++ linux-2.6.10/include/linux/sched.h 2005-01-04 14:00:27.000000000 -0800 @@ -537,6 +537,8 @@ #endif struct list_head tasks; + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */ + int anon_fault_order; /* Last order of allocation on fault */ /* * ptrace_list/ptrace_children forms the list of my children * that were stolen by a ptracer. 
Index: linux-2.6.10/mm/memory.c =================================================================== --- linux-2.6.10.orig/mm/memory.c 2005-01-04 13:55:00.000000000 -0800 +++ linux-2.6.10/mm/memory.c 2005-01-04 14:00:27.000000000 -0800 @@ -57,6 +57,7 @@ #include <linux/swapops.h> #include <linux/elf.h> +#include <linux/pagevec.h> #ifndef CONFIG_DISCONTIGMEM /* use the per-pgdat data instead for discontigmem - mbligh */ @@ -1626,6 +1627,8 @@ return ret; } +int sysctl_max_prealloc_order = 4; + /* * We are called with the MM semaphore and page_table_lock * spinlock held to protect against concurrent faults in @@ -1637,52 +1640,105 @@ unsigned long addr) { pte_t entry; - struct page * page = ZERO_PAGE(addr); + unsigned long end_addr; + + addr &= PAGE_MASK; - /* Read-only mapping of ZERO_PAGE. */ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + if (likely((vma->vm_flags & VM_RAND_READ) + || current->anon_fault_next_addr != addr) + || current->anon_fault_order >= sysctl_max_prealloc_order) { + /* Single page */ + current->anon_fault_order = 0; + end_addr = addr + PAGE_SIZE; + } else { + /* Sequence of faults detect. Perform preallocation */ + int order = ++current->anon_fault_order; + + if ((1 << order) < PAGEVEC_SIZE) + end_addr = addr + (PAGE_SIZE << order); + else { + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE; + current->anon_fault_order = 3; + } - /* ..except if it's a write access */ + if (end_addr > vma->vm_end) + end_addr = vma->vm_end; + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) + end_addr &= PMD_MASK; + } if (write_access) { - /* Allocate our own private page. 
*/ + + unsigned long a; + int i; + struct pagevec pv; + pte_unmap(page_table); spin_unlock(&mm->page_table_lock); + pagevec_init(&pv, 0); + if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_OOM; + + /* Allocate the necessary pages */ + for(a = addr; a < end_addr ; a += PAGE_SIZE) { + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a); + + if (likely(p)) { + clear_user_highpage(p, a); + pagevec_add(&pv, p); + } else { + if (a == addr) + return VM_FAULT_OOM; + break; + } + } spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { + for(i = 0; addr < a; addr += PAGE_SIZE, i++) { + struct page *p = pv.pages[i]; + + page_table = pte_offset_map(pmd, addr); + if (unlikely(!pte_none(*page_table))) { + /* Someone else got there first */ + pte_unmap(page_table); + page_cache_release(p); + continue; + } + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(p, + vma->vm_page_prot)), + vma); + + mm->rss++; + lru_cache_add_active(p); + SetPageReferenced(p); + page_add_anon_rmap(p, vma, addr); + + set_pte(page_table, entry); pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, addr, entry); + } + } else { + /* Read */ + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); +nextread: + set_pte(page_table, entry); + pte_unmap(page_table); + update_mmu_cache(vma, addr, entry); + addr += PAGE_SIZE; + if (unlikely(addr < end_addr)) { + page_table = pte_offset_map(pmd, addr); + if (likely(pte_none(*page_table))) + goto nextread; } - mm->rss++; - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - lru_cache_add_active(page); - SetPageReferenced(page); - page_add_anon_rmap(page, vma, addr); } - - set_pte(page_table, entry); - 
pte_unmap(page_table); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); + current->anon_fault_next_addr = addr; spin_unlock(&mm->page_table_lock); -out: return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; } /* Index: linux-2.6.10/kernel/sysctl.c =================================================================== --- linux-2.6.10.orig/kernel/sysctl.c 2005-01-04 13:55:00.000000000 -0800 +++ linux-2.6.10/kernel/sysctl.c 2005-01-04 14:00:27.000000000 -0800 @@ -56,6 +56,7 @@ extern int C_A_D; extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; +extern int sysctl_max_prealloc_order; extern int max_threads; extern int sysrq_enabled; extern int core_uses_pid; @@ -826,6 +827,16 @@ .strategy = &sysctl_jiffies, }, #endif + { + .ctl_name = VM_MAX_PREFAULT_ORDER, + .procname = "max_prealloc_order", + .data = &sysctl_max_prealloc_order, + .maxlen = sizeof(sysctl_max_prealloc_order), + .mode = 0644, + .proc_handler = &proc_dointvec, + .strategy = &sysctl_intvec, + .extra1 = &zero, + }, { .ctl_name = 0 } }; Index: linux-2.6.10/include/linux/sysctl.h =================================================================== --- linux-2.6.10.orig/include/linux/sysctl.h 2005-01-04 13:55:00.000000000 -0800 +++ linux-2.6.10/include/linux/sysctl.h 2005-01-04 14:00:27.000000000 -0800 @@ -169,6 +169,7 @@ VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */ VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */ VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */ + VM_MAX_PREFAULT_ORDER=29, /* max prefault order during anonymous page faults */ }; ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-13 17:10 ` Christoph Lameter 2004-12-13 22:16 ` Martin J. Bligh @ 2004-12-14 12:24 ` Akinobu Mita 2004-12-14 15:25 ` Akinobu Mita 2004-12-14 20:25 ` Christoph Lameter 1 sibling, 2 replies; 286+ messages in thread From: Akinobu Mita @ 2004-12-14 12:24 UTC (permalink / raw) To: Christoph Lameter Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Tuesday 14 December 2004 02:10, Christoph Lameter wrote: > On Mon, 13 Dec 2004, Akinobu Mita wrote: > > 3) don't set_pte() for the entry which already have been set > > Not sure how this could have happened in the patch. This is why I inserted a pte_none() check for each page_table in the read fault case too. If a read access fault occurred for the address "addr", it is completely unnecessary to check pte_none() on the page table entry for "addr", because the page_table_lock is never released until do_anonymous_page returns (in the read access fault case). But there is no guarantee that the page tables for addr+PAGE_SIZE, addr+2*PAGE_SIZE, ... have not been mapped yet. Anyway, I will try your V2 patch. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-14 12:24 ` Anticipatory prefaulting in the page fault handler V1 Akinobu Mita @ 2004-12-14 15:25 ` Akinobu Mita 2004-12-14 20:25 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Akinobu Mita @ 2004-12-14 15:25 UTC (permalink / raw) To: Christoph Lameter Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Tuesday 14 December 2004 21:24, Akinobu Mita wrote: > But there is no guarantee that the page tables for addr+PAGE_SIZE, > addr+2*PAGE_SIZE, ... have not been mapped yet. > > Anyway, I will try your V2 patch. > The patch below fixes the V2 patch and adds a debug printk. The output coincides with segfaulted processes. # dmesg | grep ^comm: comm: xscreensaver, addr_orig: ccdc40, addr: cce000, pid: 2995 comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029 comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029 comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029 comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029 comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029 comm: X, addr_orig: 87e8000, addr: 87e9000, pid: 2874 comm: X, addr_orig: 87ea000, addr: 87eb000, pid: 2874 --- The read access prefaulting may overwrite a page table entry which has already been mapped. This patch fixes that, and it shows which processes might suffer from this problem. 
--- 2.6-rc/mm/memory.c.orig 2004-12-14 22:06:08.000000000 +0900 +++ 2.6-rc/mm/memory.c 2004-12-14 23:42:34.000000000 +0900 @@ -1434,6 +1434,7 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; unsigned long end_addr; + unsigned long addr_orig = addr; addr &= PAGE_MASK; @@ -1517,9 +1518,15 @@ do_anonymous_page(struct mm_struct *mm, /* Read */ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); nextread: - set_pte(page_table, entry); - pte_unmap(page_table); - update_mmu_cache(vma, addr, entry); + if (!pte_none(*page_table)) { + printk("comm: %s, addr_orig: %lx, addr: %lx, pid: %d\n", + current->comm, addr_orig, addr, current->pid); + pte_unmap(page_table); + } else { + set_pte(page_table, entry); + pte_unmap(page_table); + update_mmu_cache(vma, addr, entry); + } addr += PAGE_SIZE; if (unlikely(addr < end_addr)) { pte_offset_map(pmd, addr); ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-14 12:24 ` Anticipatory prefaulting in the page fault handler V1 Akinobu Mita 2004-12-14 15:25 ` Akinobu Mita @ 2004-12-14 20:25 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-14 20:25 UTC (permalink / raw) To: Akinobu Mita Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Tue, 14 Dec 2004, Akinobu Mita wrote: > This is why I inserted a pte_none() check for each page_table in the > read fault case too. > > If a read access fault occurred for the address "addr", it is > completely unnecessary to check pte_none() on the page table entry for > "addr", because the page_table_lock is never released until > do_anonymous_page returns (in the read access fault case). > > But there is no guarantee that the page tables for addr+PAGE_SIZE, > addr+2*PAGE_SIZE, ... have not been mapped yet. Right. Thanks for pointing that out. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter ` (3 preceding siblings ...) 2004-12-08 22:50 ` Martin J. Bligh @ 2004-12-09 10:57 ` Pavel Machek 2004-12-09 11:32 ` Nick Piggin 2004-12-09 17:05 ` Christoph Lameter 2004-12-14 15:28 ` Adam Litke 5 siblings, 2 replies; 286+ messages in thread From: Pavel Machek @ 2004-12-09 10:57 UTC (permalink / raw) To: Christoph Lameter Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel Hi! > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing > number of threads (and thus increasing parallellism of page faults): > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498 ... > Patched kernel: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920 ... > These number are roughly equal to what can be accomplished with the > page fault scalability patches. > > Kernel patches with both the page fault scalability patches and > prefaulting: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369 ... > > The fault rate doubles when both patches are applied. ... > We are getting into an almost linear scalability in the high end with > both patches and end up with a fault rate > 3 mio faults per second. Well, with both patches you also slow single-threaded case more than twice. What are the effects of this patch on UP system? Pavel -- People were complaining that M$ turns users into beta-testers... ...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl! ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-09 10:57 ` Pavel Machek @ 2004-12-09 11:32 ` Nick Piggin 2004-12-09 17:05 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-12-09 11:32 UTC (permalink / raw) To: Pavel Machek Cc: Christoph Lameter, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel Pavel Machek wrote: > Hi! > > >>Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing >>number of threads (and thus increasing parallellism of page faults): >> >> Gb Rep Threads User System Wall flt/cpu/s fault/wsec >> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498 > > ... > >>Patched kernel: >> >>Gb Rep Threads User System Wall flt/cpu/s fault/wsec >> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920 > > ... > >>These number are roughly equal to what can be accomplished with the >>page fault scalability patches. >> >>Kernel patches with both the page fault scalability patches and >>prefaulting: >> >> Gb Rep Threads User System Wall flt/cpu/s fault/wsec >> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369 > > ... > >>The fault rate doubles when both patches are applied. > > ... > >>We are getting into an almost linear scalability in the high end with >>both patches and end up with a fault rate > 3 mio faults per second. > > > Well, with both patches you also slow single-threaded case more than > twice. What are the effects of this patch on UP system? fault/wsec is the important number. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-09 10:57 ` Pavel Machek 2004-12-09 11:32 ` Nick Piggin @ 2004-12-09 17:05 ` Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 17:05 UTC (permalink / raw) To: Pavel Machek Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, Pavel Machek wrote: > Hi! > > > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing > > number of threads (and thus increasing parallellism of page faults): > > > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498 > ... > > Patched kernel: > > > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920 > ... > > These number are roughly equal to what can be accomplished with the > > page fault scalability patches. > > > > Kernel patches with both the page fault scalability patches and > > prefaulting: > > > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369 > ... > > > > The fault rate doubles when both patches are applied. > ... > > We are getting into an almost linear scalability in the high end with > > both patches and end up with a fault rate > 3 mio faults per second. > > Well, with both patches you also slow single-threaded case more than > twice. What are the effects of this patch on UP system? The faults per second are slightly increased, so its faster. The last numbers are 10 repetitions and not 3. Do not look at the wall time. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1 2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter ` (4 preceding siblings ...) 2004-12-09 10:57 ` Pavel Machek @ 2004-12-14 15:28 ` Adam Litke 5 siblings, 0 replies; 286+ messages in thread From: Adam Litke @ 2004-12-14 15:28 UTC (permalink / raw) To: Christoph Lameter Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel What benchmark are you using to generate the following results? I'd like to run this on some of my hardware and see how the results compare. On Wed, 2004-12-08 at 11:24, Christoph Lameter wrote: > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing > number of threads (and thus increasing parallellism of page faults): > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498 > 32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646 > 32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239 > 32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950 > 32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358 > 32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928 > 32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112 > 32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599 > 32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918 > 32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686 > > Patched kernel: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920 > 32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085 > 32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292 > 32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242 > 32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377 > 32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151 > 32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617 > 32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509 > 32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838 > 32 3 512 17.067s 369.106s 
5.041s 16291.764 1161285.521 > > These number are roughly equal to what can be accomplished with the > page fault scalability patches. > > Kernel patches with both the page fault scalability patches and > prefaulting: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369 > 32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174 > 32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724 > 32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397 > 32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062 > 32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417 > 32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395 > 32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791 > 32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334 > 32 10 512 66.516s 93.098s 9.015s131387.893 2291548.865 > > The fault rate doubles when both patches are applied. > > And on the high end (512 processors allocating 256G) (No numbers > for regular kernels because they are extremely slow, also no > number for a low number of threads. Also very slow) > > With prefaulting: > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239 > 256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271 > 256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924 > 256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624 > 256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492 > 256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909 > 256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482 > 256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019 > > Page fault scalability patch and prefaulting. 
Max prefault order > increased to 5 (max preallocation of 32 pages): > > Gb Rep Threads User System Wall flt/cpu/s fault/wsec > 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930 > 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484 > 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840 > 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256 > 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524 > 256 10 256 74.299s 1374.973s 99.088s115762.963 1679630.272 > 256 10 512 62.760s 706.559s 53.027s218078.311 3149273.714 > > We are getting into an almost linear scalability in the high end with > both patches and end up with a fault rate > 3 mio faults per second. > > The one thing that takes up a lot of time is still be the zeroing > of pages in the page fault handler. There is a another > set of patches that I am working on which will prezero pages > and led to another an increase in performance by a factor of 2-4 > (if prezeroed pages are available which may not always be the case). > Maybe we can reach 10 mio fault /sec that way. > > Patch against 2.6.10-rc3-bk3: > > Index: linux-2.6.9/include/linux/sched.h > =================================================================== > --- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800 > +++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800 > @@ -537,6 +537,8 @@ > #endif > > struct list_head tasks; > + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */ > + int anon_fault_order; /* Last order of allocation on fault */ > /* > * ptrace_list/ptrace_children forms the list of my children > * that were stolen by a ptracer. 
> Index: linux-2.6.9/mm/memory.c > =================================================================== > --- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800 > +++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800 > @@ -55,6 +55,7 @@ > > #include <linux/swapops.h> > #include <linux/elf.h> > +#include <linux/pagevec.h> > > #ifndef CONFIG_DISCONTIGMEM > /* use the per-pgdat data instead for discontigmem - mbligh */ > @@ -1432,8 +1433,106 @@ > unsigned long addr) > { > pte_t entry; > - struct page * page = ZERO_PAGE(addr); > + struct page * page; > + > + addr &= PAGE_MASK; > + > + if (current->anon_fault_next_addr == addr) { > + unsigned long end_addr; > + int order = current->anon_fault_order; > + > + /* Sequence of page faults detected. Perform preallocation of pages */ > > + /* The order of preallocations increases with each successful prediction */ > + order++; > + > + if ((1 << order) < PAGEVEC_SIZE) > + end_addr = addr + (1 << (order + PAGE_SHIFT)); > + else > + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE; > + > + if (end_addr > vma->vm_end) > + end_addr = vma->vm_end; > + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) > + end_addr &= PMD_MASK; > + > + current->anon_fault_next_addr = end_addr; > + current->anon_fault_order = order; > + > + if (write_access) { > + > + struct pagevec pv; > + unsigned long a; > + struct page **p; > + > + pte_unmap(page_table); > + spin_unlock(&mm->page_table_lock); > + > + pagevec_init(&pv, 0); > + > + if (unlikely(anon_vma_prepare(vma))) > + return VM_FAULT_OOM; > + > + /* Allocate the necessary pages */ > + for(a = addr;a < end_addr ; a += PAGE_SIZE) { > + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a); > + > + if (p) { > + clear_user_highpage(p, a); > + pagevec_add(&pv,p); > + } else > + break; > + } > + end_addr = a; > + > + spin_lock(&mm->page_table_lock); > + > + for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) { > + > + page_table = pte_offset_map(pmd, addr); > + if 
(!pte_none(*page_table)) { > + /* Someone else got there first */ > + page_cache_release(*p); > + pte_unmap(page_table); > + continue; > + } > + > + entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p, > + vma->vm_page_prot)), > + vma); > + > + mm->rss++; > + lru_cache_add_active(*p); > + mark_page_accessed(*p); > + page_add_anon_rmap(*p, vma, addr); > + > + set_pte(page_table, entry); > + pte_unmap(page_table); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, addr, entry); > + } > + } else { > + /* Read */ > + for(;addr < end_addr; addr += PAGE_SIZE) { > + page_table = pte_offset_map(pmd, addr); > + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); > + set_pte(page_table, entry); > + pte_unmap(page_table); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, addr, entry); > + > + }; > + } > + spin_unlock(&mm->page_table_lock); > + return VM_FAULT_MINOR; > + } > + > + current->anon_fault_next_addr = addr + PAGE_SIZE; > + current->anon_fault_order = 0; > + > + page = ZERO_PAGE(addr); > /* Read-only mapping of ZERO_PAGE. */ > entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:26 ` Martin J. Bligh 2004-12-02 7:31 ` Jeff Garzik @ 2004-12-02 18:43 ` cliff white 2004-12-06 19:33 ` Marcelo Tosatti 1 sibling, 1 reply; 286+ messages in thread From: cliff white @ 2004-12-02 18:43 UTC (permalink / raw) To: Martin J. Bligh Cc: akpm, jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Wed, 01 Dec 2004 23:26:59 -0800 "Martin J. Bligh" <mbligh@aracnet.com> wrote: > --Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800): > > > Jeff Garzik <jgarzik@pobox.com> wrote: > >> > >> Andrew Morton wrote: > >> > We need to be be achieving higher-quality major releases than we did in > >> > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > >> > stabilisation periods. > >> > >> > >> I'm still hoping that distros (like my employer) and orgs like OSDL will > >> step up, and hook 2.6.x BK snapshots into daily test harnesses. > > > > I believe that both IBM and OSDL are doing this, or are getting geared up > > to do this. With both Linus bk and -mm. > > I already run a bunch of tests on a variety of machines for every new > kernel ... but don't have an automated way to compare the results as yet, > so don't actually look at them much ;-(. Sometime soon (quite possibly over > Christmas) things will calm down enough I'll get a couple of days to write > the appropriate perl script, and start publishing stuff. We've had the most success when one person has an itch to scratch, and works with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very glad he had the time to do such excellent work. We worked with Con Kolivas, likewise. We've done tools to automate LTP comparisons ( bryce@osdl.org has posted results ) and reaim, we've been able to post some regression to lkml, and tied in with developers to get bugs fixed. But OSDL has been limited by manpower. 
One of the issues with the performance tests is the amount of data produced - for example, the deep IO tests produce ton's o' numbers, but the developer community wants a single "+/- 5%" type response- we need some opinions and help on how to do the data reduction necessary. What would be really kewl is some test/analysis code that could be re-used, so the Martin's of the future have a good starting place. cliffw OSDL > > > However I have my doubts about how useful it will end up being. These test > > suites don't seem to pick up many regressions. I've challenged Gerrit to > > go back through a release cycle's bugfixes and work out how many of those > > bugs would have been detected by the test suite. > > > > My suspicion is that the answer will be "a very small proportion", and that > > really is the bottom line. > > Yeah, probably. Though the stress tests catch a lot more than the > functionality ones. The big pain in the ass is drivers, because I don't > have a hope in hell of testing more than 1% of them. > > M. > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> > -- The church is near, but the road is icy. The bar is far, but i will walk carefully. - Russian proverb ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 18:43 ` page fault scalability patch V12 [0/7]: Overview and performance tests cliff white @ 2004-12-06 19:33 ` Marcelo Tosatti 0 siblings, 0 replies; 286+ messages in thread From: Marcelo Tosatti @ 2004-12-06 19:33 UTC (permalink / raw) To: cliff white Cc: Martin J. Bligh, akpm, jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Thu, Dec 02, 2004 at 10:43:30AM -0800, cliff white wrote: > On Wed, 01 Dec 2004 23:26:59 -0800 > "Martin J. Bligh" <mbligh@aracnet.com> wrote: > > > --Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800): > > > > > Jeff Garzik <jgarzik@pobox.com> wrote: > > >> > > >> Andrew Morton wrote: > > >> > We need to be be achieving higher-quality major releases than we did in > > >> > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > > >> > stabilisation periods. > > >> > > >> > > >> I'm still hoping that distros (like my employer) and orgs like OSDL will > > >> step up, and hook 2.6.x BK snapshots into daily test harnesses. > > > > > > I believe that both IBM and OSDL are doing this, or are getting geared up > > > to do this. With both Linus bk and -mm. > > > > I already run a bunch of tests on a variety of machines for every new > > kernel ... but don't have an automated way to compare the results as yet, > > so don't actually look at them much ;-(. Sometime soon (quite possibly over > > Christmas) things will calm down enough I'll get a couple of days to write > > the appropriate perl script, and start publishing stuff. > > We've had the most success when one person has an itch to scratch, and works > with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very > glad he had the time to do such excellent work. We worked with Con Kolivas, likewise. 
> > We've done tools to automate LTP comparisons ( bryce@osdl.org has posted results ) > and reaim, we've been able to post some regression to lkml, and tied in with developers > to get bugs fixed. But OSDL has been limited by manpower. > > One of the issues with the performance tests is the amount of data produced - > for example, the deep IO tests produce ton's o' numbers, but the developer community wants > a single "+/- 5%" type response- we need some opinions and help on how to do the data reduction > necessary. Yep, reaim produces a single "global throughput" result in MB/s, which is wonderful for readability. Now iozone on the other extreme produces output for each kind of operation (read, write, rw, sync version of those) for each client IIRC. tiobench also has detailed output for each operation. We ought to reduce all benchmark results to "read", "write" and "global" (read+write/2) numbers. I'm willing to work on the data reduction and graphic generation scripts for STP results. I think I can do that. > > What would be really kewl is some test/analysis code that could be re-used, so the Martin's of the future > have a good starting place. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:02 ` Andrew Morton 2004-12-02 7:26 ` Martin J. Bligh @ 2004-12-02 16:24 ` Gerrit Huizenga 2004-12-02 17:34 ` cliff white 2 siblings, 0 replies; 286+ messages in thread From: Gerrit Huizenga @ 2004-12-02 16:24 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Wed, 01 Dec 2004 23:02:17 PST, Andrew Morton wrote: > Jeff Garzik <jgarzik@pobox.com> wrote: > > > > Andrew Morton wrote: > > > We need to be be achieving higher-quality major releases than we did in > > > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > > > stabilisation periods. > > > > > > I'm still hoping that distros (like my employer) and orgs like OSDL will > > step up, and hook 2.6.x BK snapshots into daily test harnesses. > > I believe that both IBM and OSDL are doing this, or are getting geared up > to do this. With both Linus bk and -mm. > > However I have my doubts about how useful it will end up being. These test > suites don't seem to pick up many regressions. I've challenged Gerrit to > go back through a release cycle's bugfixes and work out how many of those > bugs would have been detected by the test suite. > > My suspicion is that the answer will be "a very small proportion", and that > really is the bottom line. Yeah, sort of what Martin said. LTP, for instance, doesn't find a lot of what is in our internal bugzilla or the bugme database. Automated testing tends not to cover all the range of desktop peripherals and drivers that make up a large quantity of the code but gets very little coverage. Our stress testing is extensive and was finding 3 year old problems when we first ran it but it is pretty expensive to run those types of tests (machines, people, data analysis) so we typically run those tests on distros rather than mainline to help validate distro quality. 
However, that said, the LTP stuff is still *necessary* - it would catch quite a number of regressions if we were to regress. The good thing is that most changes today haven't been leading to regressions. That could change at any time, and one of the keys is to make sure that when we do find regressions we get a test into LTP to make sure that that particular regression never happens again. I haven't looked at the code coverage for LTP in a while but it is actually a high line count coverage test for core kernel. I don't remember if it was over 80% or not, but usually 85-88% is the point of diminishing returns for a regression suite. I think a more important proactive step here is to understand what regressions we *do* have an whether or not we can construct a test that in the future will catch that regression (or better, a class of regressions). And, maybe we need some kind of filter person or group for lkml that can see what the key regressions are (e.g. akpm, if you know of a set of regressions that you are working, maybe periodically sending those to the ltp mailing list) we could focus on creating tests for those regressions. We are also working to set up large ISV applications in a couple of spots - both inside IBM and there is a similar effort underway at OSDL. Those ISV applications will catch a class of real world usage models and also check for regressions. I don't know if it is possible to set up a better testing environment for the wild, whacky and weird things that people do but, yes, Bless them. ;-) > We simply get far better coverage testing by releasing code, because of all > the wild, whacky and weird things which people do with their computers. > Bless them. > > > Something like John Cherry's reports to lkml on warnings and errors > > would be darned useful. His reports are IMO an ideal model: show > > day-to-day _changes_ in test results. 
Don't just dump a huge list of > > testsuite results, results which are often clogged with expected > > failures and testsuite bug noise. > > Yes, we need humans between the tests and the developers. Someone who has > good experience with the tests and who can say "hey, something changed > when I do X". If nothing changed, we don't hear anything. > > It's a developer role, not a testing role. All testing is, really. Yep. However, smart developers continue to write scripts to automate the rote and mundane tasks that they hate doing. Towards that end, there was a recent effort at Bull on the NPTL work which serves as a very interesting model: http://nptl.bullopensource.org/Tests/results/run-browse.php Basically, you can compare results from any test run with any other and get a summary of differences. That helps give a quick status check and helps you focus on the correct issues when tracking down defects. gerrit ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:02 ` Andrew Morton 2004-12-02 7:26 ` Martin J. Bligh 2004-12-02 16:24 ` Gerrit Huizenga @ 2004-12-02 17:34 ` cliff white 2 siblings, 0 replies; 286+ messages in thread From: cliff white @ 2004-12-02 17:34 UTC (permalink / raw) To: Andrew Morton Cc: jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Wed, 1 Dec 2004 23:02:17 -0800 Andrew Morton <akpm@osdl.org> wrote: > Jeff Garzik <jgarzik@pobox.com> wrote: > > > > Andrew Morton wrote: > > > We need to be be achieving higher-quality major releases than we did in > > > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer > > > stabilisation periods. > > > > > > I'm still hoping that distros (like my employer) and orgs like OSDL will > > step up, and hook 2.6.x BK snapshots into daily test harnesses. > > I believe that both IBM and OSDL are doing this, or are getting geared up > to do this. With both Linus bk and -mm. Gee, OSDL has been doing this sort of testing for > 1 years now. Getting bandwidth to look at the results has been a problem. We need more eyeballs and community support badly, i'm very glad Marcelo has shown recent interest. > > However I have my doubts about how useful it will end up being. These test > suites don't seem to pick up many regressions. I've challenged Gerrit to > go back through a release cycle's bugfixes and work out how many of those > bugs would have been detected by the test suite. > > My suspicion is that the answer will be "a very small proportion", and that > really is the bottom line. > > We simply get far better coverage testing by releasing code, because of all > the wild, whacky and weird things which people do with their computers. > Bless them. > > > Something like John Cherry's reports to lkml on warnings and errors > > would be darned useful. His reports are IMO an ideal model: show > > day-to-day _changes_ in test results. 
Don't just dump a huge list of > > testsuite results, results which are often clogged with expected > > failures and testsuite bug noise. > > > > Yes, we need humans between the tests and the developers. Someone who has > good experience with the tests and who can say "hey, something changed > when I do X". If nothing changed, we don't hear anything. I would agree, and would do almost anything to help/assist/enable any humans interested. We need some expertise on when to run certain tests, to avoid data overload. I've noticed that when developer's submit test results with a patch, it sometimes helps in the decision on patch acceptance. Is there a way to promote this sort of behaviour? cliffw OSDL > > It's a developer role, not a testing role. All testing is, really. > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> > -- The church is near, but the road is icy. The bar is far, but i will walk carefully. - Russian proverb ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:48 ` Jeff Garzik 2004-12-02 7:02 ` Andrew Morton @ 2004-12-02 19:48 ` Diego Calleja 2004-12-02 20:12 ` Jeff Garzik 1 sibling, 1 reply; 286+ messages in thread From: Diego Calleja @ 2004-12-02 19:48 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-kernel On Thu, 02 Dec 2004 01:48:25 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > I'm still hoping that distros (like my employer) and orgs like OSDL will > step up, and hook 2.6.x BK snapshots into daily test harnesses. Automated .deb's and .rpm's for the -bk snapshots (and yum/apt repositories) would be nice for all those people who run unsupported distros.
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 19:48 ` Diego Calleja @ 2004-12-02 20:12 ` Jeff Garzik 2004-12-02 20:30 ` Diego Calleja ` (2 more replies) 0 siblings, 3 replies; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 20:12 UTC (permalink / raw) To: Diego Calleja; +Cc: linux-kernel Diego Calleja wrote: > On Thu, 02 Dec 2004 01:48:25 -0500 Jeff Garzik <jgarzik@pobox.com> > wrote: > > > >>I'm still hoping that distros (like my employer) and orgs like OSDL will >>step up, and hook 2.6.x BK snapshots into daily test harnesses. > > > Automated .deb's and .rpm's for the -bk snapshots (and yum/apt repositories) > would be nice for all those people who run unsupported distros. Now, that's a darned good idea... Should be simple for rpm at least, given the "make rpm" target. I wonder if we have, or could add, a 'make deb' target. Jeff
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 20:12 ` Jeff Garzik @ 2004-12-02 20:30 ` Diego Calleja 2004-12-02 21:08 ` Wichert Akkerman 2004-12-03 0:07 ` Francois Romieu 2 siblings, 0 replies; 286+ messages in thread From: Diego Calleja @ 2004-12-02 20:30 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-kernel, sam On Thu, 02 Dec 2004 15:12:22 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > > Automated .deb's and .rpm's for the -bk snapshots (and yum/apt > > repositories) would be nice for all those people who run unsupported > > distros. > > Now, that's a darned good idea... > > Should be simple for rpm at least, given the "make rpm" target. I > wonder if we have, or could add, a 'make deb' target. There was a patch for that a long time ago, before 2.6 was out, IIRC? I don't know where it went (CC'ing Sam who should know ;)
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 20:12 ` Jeff Garzik 2004-12-02 20:30 ` Diego Calleja @ 2004-12-02 21:08 ` Wichert Akkerman 2004-12-03 0:07 ` Francois Romieu 2 siblings, 0 replies; 286+ messages in thread From: Wichert Akkerman @ 2004-12-02 21:08 UTC (permalink / raw) To: Jeff Garzik; +Cc: Diego Calleja, linux-kernel Previously Jeff Garzik wrote: > Should be simple for rpm at least, given the "make rpm" target. I > wonder if we have, or could add, a 'make deb' target. make deb-pkg has been there for a while. Wichert. -- Wichert Akkerman <wichert@wiggy.net> It is simple to make things. http://www.wiggy.net/ It is hard to make things simple.
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 20:12 ` Jeff Garzik 2004-12-02 20:30 ` Diego Calleja 2004-12-02 21:08 ` Wichert Akkerman @ 2004-12-03 0:07 ` Francois Romieu 2 siblings, 0 replies; 286+ messages in thread From: Francois Romieu @ 2004-12-03 0:07 UTC (permalink / raw) To: Jeff Garzik; +Cc: Diego Calleja, linux-kernel Jeff Garzik <jgarzik@pobox.com> : [...] > Should be simple for rpm at least, given the "make rpm" target. I > wonder if we have, or could add, a 'make deb' target. http://www.wiggy.net/files/kerneldeb-1.2.ptc ? -- Ueimor
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:34 ` Andrew Morton 2004-12-02 6:48 ` Jeff Garzik @ 2004-12-02 7:00 ` Jeff Garzik 2004-12-02 7:05 ` Benjamin Herrenschmidt 2004-12-02 14:30 ` Andy Warner 2004-12-02 18:27 ` Grant Grundler 2004-12-07 10:51 ` Pavel Machek 3 siblings, 2 replies; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 7:00 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, benh, linux-kernel, linux-ide Andrew Morton wrote: > We need an -rc3 yet. And I need to do another pass through the > regressions-since-2.6.9 list. We've made pretty good progress there > recently. Mid to late December is looking like the 2.6.10 date. another for that list, BTW: I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, but works on UP. Potentially related to >=4GB of RAM. Details, in case anyone is interested: Unless my code is screwed up (certainly possible), PIO data-in [using the insw() call] seems to return all zeroes on a true-blue SMP machine, for the identify-device command. When this happens, libata (correctly) detects a bad id page and bails. (problem doesn't show up on single CPU w/ HT) What changed from 2.6.8 to 2.6.9 is 2.6.8: bitbang ATA taskfile registers (loads command) bitbang ATA data register (read id page) 2.6.9: bitbang ATA taskfile registers queue_work() workqueue thread bitbangs ATA data register (read id page) So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the on-board SATA PCI device, then immediately after that, CPU 1 sending I/O traffic. Anyway, back to debugging... :) Jeff ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:00 ` Jeff Garzik @ 2004-12-02 7:05 ` Benjamin Herrenschmidt 2004-12-02 7:11 ` Jeff Garzik 2004-12-02 14:30 ` Andy Warner 1 sibling, 1 reply; 286+ messages in thread From: Benjamin Herrenschmidt @ 2004-12-02 7:05 UTC (permalink / raw) To: Jeff Garzik Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide On Thu, 2004-12-02 at 02:00 -0500, Jeff Garzik wrote: > > 2.6.9: > bitbang ATA taskfile registers > queue_work() > workqueue thread bitbangs ATA data register (read id page) > > So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the > on-board SATA PCI device, then immediately after that, CPU 1 sending I/O > traffic. > > Anyway, back to debugging... :) They may not end up in order if they are stores (the stores to the taskfile may be out of order vs; the loads/stores to/from the data register) unless you have a spinlock protecting both or a full sync (on ppc), but then, I don't know the ordering things on x86_64. This could certainly be a problem on ppc & ppc64 too. Ben. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:05 ` Benjamin Herrenschmidt @ 2004-12-02 7:11 ` Jeff Garzik 2004-12-02 11:16 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 286+ messages in thread From: Jeff Garzik @ 2004-12-02 7:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide Benjamin Herrenschmidt wrote: > They may not end up in order if they are stores (the stores to the > taskfile may be out of order vs; the loads/stores to/from the data > register) unless you have a spinlock protecting both or a full sync (on > ppc), but then, I don't know the ordering things on x86_64. This could > certainly be a problem on ppc & ppc64 too. Is synchronization beyond in[bwl] needed, do you think? This specific problem is only on Intel ICHx AFAICS, which is PIO not MMIO and x86-only. I presumed insw() by its very nature already has synchronization, but perhaps not... Jeff ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:11 ` Jeff Garzik @ 2004-12-02 11:16 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 286+ messages in thread From: Benjamin Herrenschmidt @ 2004-12-02 11:16 UTC (permalink / raw) To: Jeff Garzik Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide On Thu, 2004-12-02 at 02:11 -0500, Jeff Garzik wrote: > Benjamin Herrenschmidt wrote: > > They may not end up in order if they are stores (the stores to the > > taskfile may be out of order vs; the loads/stores to/from the data > > register) unless you have a spinlock protecting both or a full sync (on > > ppc), but then, I don't know the ordering things on x86_64. This could > > certainly be a problem on ppc & ppc64 too. > > > Is synchronization beyond in[bwl] needed, do you think? Yes, when potentially hop'ing between CPUs, definitely. > This specific problem is only on Intel ICHx AFAICS, which is PIO not > MMIO and x86-only. I presumed insw() by its very nature already has > synchronization, but perhaps not... Hrm... on "pure" x86, I would expect so at the HW level, not sure about x86_64... but there would be definitely an issue on ppc with your scheme. You need at least a full barrier before you trigger the workqueue. That may not be the problem you are facing now, but it would become one. Ben. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 7:00 ` Jeff Garzik 2004-12-02 7:05 ` Benjamin Herrenschmidt @ 2004-12-02 14:30 ` Andy Warner 2005-01-06 23:40 ` Jeff Garzik 1 sibling, 1 reply; 286+ messages in thread From: Andy Warner @ 2004-12-02 14:30 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andrew Morton, torvalds, benh, linux-kernel, linux-ide [-- Attachment #1: Type: text/plain, Size: 2854 bytes --] Jeff Garzik wrote: > [...] > I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes > ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, > but works on UP. Potentially related to >=4GB of RAM. > > > > Details, in case anyone is interested: > Unless my code is screwed up (certainly possible), PIO data-in [using > the insw() call] seems to return all zeroes on a true-blue SMP machine, > for the identify-device command. When this happens, libata (correctly) > detects a bad id page and bails. (problem doesn't show up on single CPU > w/ HT) Ah, I might have been here recently, with the pass-thru stuff. What I saw was that in an SMP machine: 1. queue_work() can result in the work running (on another CPU) instantly. 2. Having one CPU beat on PIO registers reading data from one port would significantly alter the timing of the CMD->BSY->DRQ sequence used in PIO. This behaviour was far worse for competing ports within one chip, which I put down to arbitration problems. 3. CPU utilisation would go through the roof. Effectively the entire pio_task state machine reduced to a busy spin loop. 4. The state machine needed some tweaks, especially in error handling cases. I made some changes, which effectively solved the problem for promise TX4-150 cards, and was going to test the results on other chipsets next week before speaking up. Specifically, I have seen some issues with SiI 3114 cards. 
I was trying to explore using interrupts instead of polling state but for some reason, I was not getting them for PIO data operations, or I misunderstand the spec, after removing ata_qc_set_polling() - again I saw a difference in behaviour between the Promise & SiI cards here. I'm about to go offline for 3 days, and hadn't prepared for this yet. The best I can do is provide a patch (attached) that applies against 2.6.9. It also seems to apply against libata-2.6, but barfs a bit against libata-dev-2.6. The changes boil down to these: 1. Minor changes in how status/error regs are read. Including attempts to use altstatus, while I was exploring interrupts. 2. State machine logic changes. 3. Replace calls to queue_work() with queue_delayed_work() to stop SMP machines going crazy. With these changes, on a platform consisting of 2.6.9 and Promise TX4-150 cards, I can move terabytes of parallel PIO data, without error. My gut says that the PIO mechanism should be overhauled, I composed a "how much should we pay for this muffler" email to linux-ide at least twice while working on this, but never sent it - wanting to send a solution in rather than just making more comments from the peanut gallery. I'll pick up the thread on this next week, when I'm back online. I hope this helps. -- andyw@pobox.com Andy Warner Voice: (612) 801-8549 Fax: (208) 575-5634 [-- Attachment #2: 2.6.9-pio-smp.patch --] [-- Type: text/plain, Size: 1862 bytes --] diff -r -u -X dontdiff linux-2.6.9-vanilla/drivers/scsi/libata-core.c linux-2.6.9/drivers/scsi/libata-core.c --- linux-2.6.9-vanilla/drivers/scsi/libata-core.c 2004-10-18 16:53:06.000000000 -0500 +++ linux-2.6.9/drivers/scsi/libata-core.c 2004-11-24 11:01:40.000000000 -0600 @@ -2099,7 +2099,7 @@ } drv_stat = ata_wait_idle(ap); - if (!ata_ok(drv_stat)) { + if (drv_stat & (ATA_ERR | ATA_DF)) { ap->pio_task_state = PIO_ST_ERR; return; } @@ -2254,23 +2254,17 @@ * chk-status again. If still busy, fall back to * PIO_ST_POLL state. 
*/ - status = ata_busy_wait(ap, ATA_BUSY, 5); - if (status & ATA_BUSY) { + status = ata_altstatus(ap) ; + if (!(status & ATA_DRQ)) { msleep(2); - status = ata_busy_wait(ap, ATA_BUSY, 10); - if (status & ATA_BUSY) { + status = ata_altstatus(ap) ; + if (!(status & ATA_DRQ)) { ap->pio_task_state = PIO_ST_POLL; ap->pio_task_timeout = jiffies + ATA_TMOUT_PIO; return; } } - /* handle BSY=0, DRQ=0 as error */ - if ((status & ATA_DRQ) == 0) { - ap->pio_task_state = PIO_ST_ERR; - return; - } - qc = ata_qc_from_tag(ap, ap->active_tag); assert(qc != NULL); @@ -2321,17 +2315,15 @@ case PIO_ST_TMOUT: case PIO_ST_ERR: ata_pio_error(ap); - break; + return ; } - if ((ap->pio_task_state != PIO_ST_IDLE) && - (ap->pio_task_state != PIO_ST_TMOUT) && - (ap->pio_task_state != PIO_ST_ERR)) { + if (ap->pio_task_state != PIO_ST_IDLE) { if (timeout) queue_delayed_work(ata_wq, &ap->pio_task, timeout); else - queue_work(ata_wq, &ap->pio_task); + queue_delayed_work(ata_wq, &ap->pio_task, 2); } } @@ -2624,7 +2616,7 @@ ata_qc_set_polling(qc); ata_tf_to_host_nolock(ap, &qc->tf); ap->pio_task_state = PIO_ST; - queue_work(ata_wq, &ap->pio_task); + queue_delayed_work(ata_wq, &ap->pio_task, 2); break; case ATA_PROT_ATAPI: ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 14:30 ` Andy Warner @ 2005-01-06 23:40 ` Jeff Garzik 0 siblings, 0 replies; 286+ messages in thread From: Jeff Garzik @ 2005-01-06 23:40 UTC (permalink / raw) To: Andy Warner; +Cc: Andrew Morton, torvalds, benh, linux-kernel, linux-ide Andy Warner wrote: > Jeff Garzik wrote: > >>[...] >>I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes >>ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, >>but works on UP. Potentially related to >=4GB of RAM. >> >> >> >>Details, in case anyone is interested: >>Unless my code is screwed up (certainly possible), PIO data-in [using >>the insw() call] seems to return all zeroes on a true-blue SMP machine, >>for the identify-device command. When this happens, libata (correctly) >>detects a bad id page and bails. (problem doesn't show up on single CPU >>w/ HT) > > > Ah, I might have been here recently, with the pass-thru stuff. > > What I saw was that in an SMP machine: > > 1. queue_work() can result in the work running (on another > CPU) instantly. > > 2. Having one CPU beat on PIO registers reading data from one port > would significantly alter the timing of the CMD->BSY->DRQ sequence > used in PIO. This behaviour was far worse for competing ports > within one chip, which I put down to arbitration problems. > > 3. CPU utilisation would go through the roof. Effectively the > entire pio_task state machine reduced to a busy spin loop. > > 4. The state machine needed some tweaks, especially in error > handling cases. > > I made some changes, which effectively solved the problem for promise > TX4-150 cards, and was going to test the results on other chipsets > next week before speaking up. Specifically, I have seen some > issues with SiI 3114 cards. 
> > I was trying to explore using interrupts instead of polling state > but for some reason, I was not getting them for PIO data operations, > or I misunderstand the spec, after removing ata_qc_set_polling() - again > I saw a difference in behaviour between the Promise & SiI cards > here. > > I'm about to go offline for 3 days, and hadn't prepared for this > yet. The best I can do is provide a patch (attached) that applies > against 2.6.9. It also seems to apply against libata-2.6, but > barfs a bit against libata-dev-2.6. > > The changes boil down to these: > > 1. Minor changes in how status/error regs are read. > Including attempts to use altstatus, while I was > exploring interrupts. > > 2. State machine logic changes. > > 3. Replace calls to queue_work() with queue_delayed_work() > to stop SMP machines going crazy. > > With these changes, on a platform consisting of 2.6.9 and > Promise TX4-150 cards, I can move terabytes of parallel > PIO data, without error. > > My gut says that the PIO mechanism should be overhauled, I > composed a "how much should we pay for this muffler" email > to linux-ide at least twice while working on this, but never > sent it - wanting to send a solution in rather than just > making more comments from the peanut gallery. > > I'll pick up the thread on this next week, when I'm back online. > I hope this helps. Please let me know if you still have problems? The PIO SMP problems seem to be fixed here. Jeff ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:34 ` Andrew Morton 2004-12-02 6:48 ` Jeff Garzik 2004-12-02 7:00 ` Jeff Garzik @ 2004-12-02 18:27 ` Grant Grundler 2004-12-02 18:33 ` Andrew Morton 2004-12-02 18:36 ` Christoph Hellwig 2004-12-07 10:51 ` Pavel Machek 3 siblings, 2 replies; 286+ messages in thread From: Grant Grundler @ 2004-12-02 18:27 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Wed, Dec 01, 2004 at 10:34:41PM -0800, Andrew Morton wrote: > Of course, nobody will test -rc3 and a zillion people will test final > 2.6.10, which is when we get lots of useful bug reports. If this keeps on > happening then we'll need to get more serious about the 2.6.10.n process. > > Or start alternating between stable and flakey releases, so 2.6.11 will be > a feature release with a 2-month development period and 2.6.12 will be a > bugfix-only release, with perhaps a 2-week development period, so people > know that the even-numbered releases are better stabilised. No matter what scheme you adopt, I (and others) will adapt as well. When working on a new feature or bug fix, I don't chase -bk releases since I don't want to find new, unrelated issues that interfere with the issue I was originally chasing. I roll to a new release when the issue I care about is "cooked". Anything that takes longer than a month or so is just hopeless since I fall too far behind. (e.g. IRQ handling in parisc-linux needs to be completely rewritten to pickup irq_affinity support - I just don't have enough time to get it done in < 2 monthes. We started on this last year and gave up.) I see "2.6.10.n process" as the right way to handle bug fix only releases. I'm happy to work on 2.6.10.0 and understand the initial release was a "best effort". 2.6.odd/.even release described above is a variant of 2.6.10.n releases where n = {0, 1}. 
The question is how many parallel releases do people (you and linus) want us to keep "alive" at the same time? odd/even implies only one vs several if the 2.6.X.n scheme is continued beyond 2.6.8.1. We also need to think about how well any scheme aligns with what distros need to support their releases. Like the "Adopt-a-Highway" program in California to pick up trash along highways, I'm wondering if distros would be willing/interested in adopting a particular release and maintaining it in bk. e.g. SuSE clearly has interest in some sort of 2.6.5.n series for SLES9. ditto for RHEL4 (but for 2.6.9.n). The question of *who* (at the respective distro) would be the release maintainer is a titanic-sized rathole. But there is a release manager today at each distro and perhaps it's easier if s/he remains invisible to us. hth, grant ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 18:27 ` Grant Grundler @ 2004-12-02 18:33 ` Andrew Morton 2004-12-02 18:36 ` Christoph Hellwig 1 sibling, 0 replies; 286+ messages in thread From: Andrew Morton @ 2004-12-02 18:33 UTC (permalink / raw) To: Grant Grundler Cc: jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Grant Grundler <iod00d@hp.com> wrote: > > 2.6.odd/.even release described above is a variant of 2.6.10.n releases > where n = {0, 1}. The question is how many parallel releases do people > (you and linus) want us keep "alive" at the same time? 2.6.odd/.even is actually a significantly different process. a) because there's only one tree, linearly growing. That's considerably simpler than maintaining a branch. And b) because everyone knows that there won't be a new development tree opened until we've all knuckled down and fixed the bugs which we put into the previous one, dammit. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 18:27 ` Grant Grundler 2004-12-02 18:33 ` Andrew Morton @ 2004-12-02 18:36 ` Christoph Hellwig 1 sibling, 0 replies; 286+ messages in thread From: Christoph Hellwig @ 2004-12-02 18:36 UTC (permalink / raw) To: Grant Grundler Cc: Andrew Morton, Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Thu, Dec 02, 2004 at 10:27:16AM -0800, Grant Grundler wrote: > Also need to think about how well any scheme align's with what distro's > need to support releases. Like the "Adopt-a-Highway" program in > California to pickup trash along highways, I'm wondering if distros > would be willing/interested in adopting a particular release > and maintain it in bk. e.g. SuSE clearly has interest in some sort > of 2.6.5.n series for SLES9. ditto for RHEL4 (but for 2.6.9.n). Unfortunately the SLES9 kernels don't really look anything like 2.6.5 except from the version number. There's far too much trash from Business Partners in there. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-02 6:34 ` Andrew Morton ` (2 preceding siblings ...) 2004-12-02 18:27 ` Grant Grundler @ 2004-12-07 10:51 ` Pavel Machek 3 siblings, 0 replies; 286+ messages in thread From: Pavel Machek @ 2004-12-07 10:51 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hi! > Or start alternating between stable and flakey releases, so 2.6.11 will be > a feature release with a 2-month development period and 2.6.12 will be a > bugfix-only release, with perhaps a 2-week development period, so people > know that the even-numbered releases are better stabilised. If you expect "feature 2.6.11", you might as well call it 2.7.0, followed by 2.8.0. Pavel -- People were complaining that M$ turns users into beta-testers... ...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl! ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (7 preceding siblings ...) 2004-12-02 0:10 ` page fault scalability patch V12 [0/7]: Overview and performance tests Linus Torvalds @ 2004-12-09 8:00 ` Nick Piggin 2004-12-09 17:03 ` Christoph Lameter 2004-12-09 18:37 ` Hugh Dickins 9 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-12-09 8:00 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > Changes from V11->V12 of this patch: > - dump sloppy_rss in favor of list_rss (Linus' proposal) > - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) > [snip] > For more than 8 cpus the page fault rate increases by orders > of magnitude. For more than 64 cpus the improvement in performace > is 10 times better. Those numbers are pretty impressive. I thought you'd said with earlier patches that performance was about doubled from 8 to 512 CPUS. Did I remember correctly? If so, where is the improvement coming from? The per-thread RSS I guess? On another note, these patches are basically only helpful to new anonymous page faults. I guess this is the main thing you are concerned about at the moment, but I wonder if you would see improvements with my patch to remove the ptl from the other types of faults as well? The downside of my patch - well the main downsides - compared to yours are its intrusiveness, and the extra cost involved in copy_page_range which yours appears not to require. As I've said earlier though, I wouldn't mind your patches going in. At least they should probably get into -mm soon, when Andrew has time (and after the 4level patches are sorted out). That wouldn't stop my patch (possibly) being merged some time after that if and when it was found worthy... 
^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-09 8:00 ` Nick Piggin @ 2004-12-09 17:03 ` Christoph Lameter 2004-12-10 4:30 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 17:03 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, Nick Piggin wrote: > > For more than 8 cpus the page fault rate increases by orders > > of magnitude. For more than 64 cpus the improvement in performace > > is 10 times better. > > Those numbers are pretty impressive. I thought you'd said with earlier > patches that performance was about doubled from 8 to 512 CPUS. Did I > remember correctly? If so, where is the improvement coming from? The > per-thread RSS I guess? Right. The per-thread RSS seems to have made a big difference for high CPU counts. Also I was conservative in the estimates in an earlier post since I did not have the numbers for the very high cpu counts. > On another note, these patches are basically only helpful to new > anonymous page faults. I guess this is the main thing you are concerned > about at the moment, but I wonder if you would see improvements with > my patch to remove the ptl from the other types of faults as well? I can try that but I am frankly a bit sceptical since the ptl protects many other variables. It may be more efficient to have the ptl in these cases than doing the atomic ops all over the place. Do you have any numbers you could post? I believe I sent you a copy of the code that I use for performance tests last week or so. > The downside of my patch - well the main downsides - compared to yours > are its intrusiveness, and the extra cost involved in copy_page_range > which yours appears not to require. Is the patch known to be okay for ia64? I can try to see how it does. > As I've said earlier though, I wouldn't mind your patches going in. 
At > least they should probably get into -mm soon, when Andrew has time (and > after the 4level patches are sorted out). That wouldn't stop my patch > (possibly) being merged some time after that if and when it was found > worthy... I'd certainly be willing to poke around and see how beneficial this is. If it turns out to accelerate other functionality of the vm then you have my full support. ^ permalink raw reply [flat|nested] 286+ messages in thread
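The per-thread RSS idea credited above for the high-CPU-count gains is simple: each thread counts its own faults in its task structure, and the shared total is computed only when someone reads it. A minimal user-space sketch of that idea — `struct task`, `count_fault`, and `get_rss` are illustrative stand-ins, not the kernel's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and its thread list. */
struct task {
    long rss;           /* counted locally, no shared lock or atomic op */
    long anon_rss;
    struct task *next;  /* next thread sharing the same mm */
};

/* The fault path touches only the faulting thread's own counter,
 * so nothing bounces between CPUs. */
static void count_fault(struct task *t, int anon)
{
    t->rss++;
    if (anon)
        t->anon_rss++;
}

/* Readers (rare, e.g. /proc) pay the cost: walk the thread list.
 * One pass yields both totals at once. */
static void get_rss(struct task *list, long *rss, long *anon_rss)
{
    *rss = 0;
    *anon_rss = 0;
    for (struct task *t = list; t; t = t->next) {
        *rss += t->rss;
        *anon_rss += t->anon_rss;
    }
}
```

Hugh's later suggestion that task_statm make a single pass totalling both rss and anon_rss corresponds to `get_rss` returning both sums from one walk.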
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-09 17:03 ` Christoph Lameter @ 2004-12-10 4:30 ` Nick Piggin 0 siblings, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-12-10 4:30 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > On Thu, 9 Dec 2004, Nick Piggin wrote: > > >>>For more than 8 cpus the page fault rate increases by orders >>>of magnitude. For more than 64 cpus the improvement in performace >>>is 10 times better. >> >>Those numbers are pretty impressive. I thought you'd said with earlier >>patches that performance was about doubled from 8 to 512 CPUS. Did I >>remember correctly? If so, where is the improvement coming from? The >>per-thread RSS I guess? > > > Right. The per-thread RSS seems to have made a big difference for high CPU > counts. Also I was conservative in the estimates in earlier post since I > did not have the numbers for the very high cpu counts. > Ah OK. > >>On another note, these patches are basically only helpful to new >>anonymous page faults. I guess this is the main thing you are concerned >>about at the moment, but I wonder if you would see improvements with >>my patch to remove the ptl from the other types of faults as well? > > > I can try that but I am frankly a bit sceptical since the ptl protects > many other variables. It may be more efficient to have the ptl in these > cases than doing the atomic ops all over the place. Do you have any number > you could post? I believe I send you a copy of the code that I use for > performance tests last week or so, > Yep I have your test program. No real numbers because the biggest thing I have to test on is a 4-way - there is improvement, but it is not so impressive as your 512 way tests! 
:) > >>The downside of my patch - well the main downsides - compared to yours >>are its intrusiveness, and the extra cost involved in copy_page_range >>which yours appears not to require. > > > Is the patch known to be okay for ia64? I can try to see how it > does. > I think it just needs one small fix to the swapping code, and it should be pretty stable. So in fact it would probably work for you as is (if you don't swap), but I'd rather have something more stable before I ask you to test. I'll try to find time to do that in the next few days. > >>As I've said earlier though, I wouldn't mind your patches going in. At >>least they should probably get into -mm soon, when Andrew has time (and >>after the 4level patches are sorted out). That wouldn't stop my patch >>(possibly) being merged some time after that if and when it was found >>worthy... > > > I'd certainly be willing to poke around and see how beneficial this is. If > it turns out to accellerate other functionality of the vm then you > have my full support. > Great, thanks. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter ` (8 preceding siblings ...) 2004-12-09 8:00 ` Nick Piggin @ 2004-12-09 18:37 ` Hugh Dickins 2004-12-09 22:02 ` page fault scalability patch V12: rss tasklist vs sloppy rss Christoph Lameter ` (3 more replies) 9 siblings, 4 replies; 286+ messages in thread From: Hugh Dickins @ 2004-12-09 18:37 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 6655 bytes --] On Wed, 1 Dec 2004, Christoph Lameter wrote: > > Changes from V11->V12 of this patch: > - dump sloppy_rss in favor of list_rss (Linus' proposal) > - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) > > This is a series of patches that increases the scalability of > the page fault handler for SMP. Here are some performance results > on a machine with 512 processors allocating 32 GB with an increasing > number of threads (that are assigned a processor each). Your V12 patches would apply well to 2.6.10-rc3, except that (as noted before) your mailer or whatever is eating trailing whitespace: trivial patch attached to apply before yours, removing that whitespace so yours apply. But what your patches need to apply to would be 2.6.10-mm. Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would have tested out okay up to 4GB but not above: trivial patch attached. Your scalability figures show a superb improvement. But they are (I presume) for the best case: intense initial faulting of distinct areas of anonymous memory by parallel cpus running a multithreaded process. This is not a common case: how much do real-world apps benefit? 
Since you also avoid taking the page_table_lock in handle_pte_fault, there should be some scalability benefit to all kinds of page fault: do you have any results to show how much (perhaps hard to quantify, since even tmpfs file faults introduce other scalability issues)? How do the scalability figures compare if you omit patch 7/7 i.e. revert the per-task rss complications you added in for Linus? I remain a fan of sloppy rss, which you earlier showed to be accurate enough (I'd say), though I guess should be checked on other architectures than your ia64. I can't see the point of all that added ugliness for numbers which don't need to be precise - but perhaps there's no way of rearranging fields, and the point at which mm->(anon_)rss is updated (near up of mmap_sem?), to avoid destructive cacheline bounce. What I'm asking is, do you have numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus this time, but only up to 32 with sloppy rss? The ratios do look better with the latest, but the numbers are altogether lower so we don't know. The split rss patch, if it stays, needs some work. For example, task_statm uses "get_shared" to total up rss-anon_rss from the tasks, but assumes mm->rss is already accurate. Scrap the separate get_rss, get_anon_rss, get_shared functions: just one get_rss to make a single pass through the tasks adding up both rss and anon_rss at the same time. I am bothered that every read of /proc/<pid>/status or /proc/<pid>/statm is going to reread through all of that task_list each time; yet in that massively parallel case that concerns you, there should be little change to rss after startup. Perhaps a later optimization would be to avoid task_list completely for singly threaded processes. I'd like get_rss to update mm->rss and mm->anon_rss and flag it uptodate to avoid subsequent task_list iterations, but the locking might defeat your whole purpose. 
Updating current->rss in do_anonymous_page, current->anon_rss in page_add_anon_rmap, is not always correct: ptrace's access_process_vm uses get_user_pages on another task. You need to check that current->mm == mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss, and fall back to mm (or vma->vm_mm) in the rare case it is not (taking page_table_lock for that). You'll also need to check !(current->flags & PF_BORROWED_MM), to guard against use_mm. Or... just go back to sloppy rss. Moving to the main patch, 1/7, the major issue I see there is the way do_anonymous_page does update_mmu_cache after setting the pte, without any page_table_lock to bracket them together. Obviously no problem on architectures where update_mmu_cache is a no-op! But although there's been plenty of discussion, particularly with Ben and Nick, I've not noticed anything to guarantee that as safe on all architectures. I do think it's fine for you to post your patches before completing hooks in all the arches, but isn't this a significant issue which needs to be sorted before your patches go into -mm? You hazily refer to such issues in 0/7, but now you need to work with arch maintainers to settle them and show the patches. A lesser issue with the reordering in do_anonymous_page: don't you need to move the lru_cache_add_active after the page_add_anon_rmap, to avoid the very slight chance that vmscan will pick the page off the LRU and unmap it before you've counted it in, hitting page_remove_rmap's BUG_ON(page_mapcount(page) < 0)? (I do wonder why do_anonymous_page calls mark_page_accessed as well as lru_cache_add_active. The other instances of lru_cache_add_active for an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, why here? But that's nothing new with your patch, and although you've reordered the calls, the final page state is the same as before.) 
Where handle_pte_fault does "entry = *pte" without page_table_lock: you're quite right to pass down precisely that entry to the fault handlers below, but there's still a problem on the 32bit architectures supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints of entry may be out of synch. Not a problem for do_anonymous_page, or anything else relying on ptep_cmpxchg to check; but a problem for do_wp_page (which could find !pfn_valid and kill the process) and probably others (harder to think through). Your 4/7 patch for i386 has an unused atomic get_64bit function from Nick, I think you'll have to define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. Hmm, that will only work if you're using atomic set_64bit rather than relying on page_table_lock in the complementary places which matter. Which I believe you are indeed doing in your 3level set_pte. Shouldn't __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock? But by making every set_pte use set_64bit, you are significantly slowing down many operations which do not need that atomicity. This is quite visible in the fork/exec/shell results from lmbench on i386 PAE (and is the only interesting difference, for good or bad, that I noticed with your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are no faults on dst mm (nor on src mm) while copy_page_range is copying, so its set_ptes don't need to be atomic; likewise during zap_pte_range (either mmap_sem is held exclusively, or it's in the final exit_mmap). Probably revert set_pte and set_pte_atomic to what they were, and use set_pte_atomic where it's needed. Hugh [-- Attachment #2: Remove trailing whitespace before C.L. 
patches --] [-- Type: TEXT/PLAIN, Size: 1736 bytes --] --- 2.6.10-rc3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000 +++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000 @@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo #define cmpxchg(ptr,o,n)\ ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ (unsigned long)(n),sizeof(*(ptr)))) - + #ifdef __KERNEL__ -struct alt_instr { +struct alt_instr { __u8 *instr; /* original instruction */ __u8 *replacement; __u8 cpuid; /* cpuid bit set for replacement */ --- 2.6.10-rc3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100 +++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000 @@ -99,7 +99,7 @@ static inline void pgd_populate(struct m #endif /* __s390x__ */ -static inline void +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) { #ifndef __s390x__ --- 2.6.10-rc3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000 +++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000 @@ -1424,7 +1424,7 @@ out: /* * We are called with the MM semaphore and page_table_lock * spinlock held to protect against concurrent faults in - * multithreaded programs. + * multithreaded programs. 
*/ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, @@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct * Fall back to the linear mapping if the fs does not support * ->populate: */ - if (!vma->vm_ops || !vma->vm_ops->populate || + if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); return do_no_page(mm, vma, address, write_access, pte, pmd); [-- Attachment #3: 3level ptep_cmpxchg use cmpxchg8b --] [-- Type: TEXT/PLAIN, Size: 570 bytes --] --- 2.6.10-rc3-cl/include/asm-i386/pgtable-3level.h 2004-12-05 14:01:11.000000000 +0000 +++ linux/include/asm-i386/pgtable-3level.h 2004-12-09 13:17:44.000000000 +0000 @@ -147,7 +147,7 @@ static inline pmd_t pfn_pmd(unsigned lon static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) { - return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); + return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); } ^ permalink raw reply [flat|nested] 286+ messages in thread
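Hugh's cmpxchg8b and get_64bit points above rest on the same hardware property: a 32-bit CPU cannot load a 64-bit PTE in a single instruction, but a 64-bit compare-and-exchange returns the current contents atomically even when the comparison fails, so a "failed" cmpxchg doubles as an atomic read. A hedged user-space illustration using GCC's `__sync` builtins rather than the kernel's actual get_64bit/cmpxchg8b helpers:

```c
#include <assert.h>

/* Atomically read a 64-bit value even where no 64-bit load exists:
 * compare-and-swap with an arbitrary expected value returns the old
 * contents in one atomic operation whether or not it succeeds.
 * Using 0 as both expected and new value makes it a pure read
 * (if the word happens to be 0, storing 0 changes nothing). */
static unsigned long long get_64bit(volatile unsigned long long *p)
{
    return __sync_val_compare_and_swap(p, 0ULL, 0ULL);
}
```

This is why the 3level ptep_cmpxchg fix matters above 4GB: a plain 32-bit `cmpxchg` on a 64-bit pte would compare and swap only the low half, silently ignoring the upper bits that hold the high pfn.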
* page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 18:37 ` Hugh Dickins @ 2004-12-09 22:02 ` Christoph Lameter 2004-12-09 22:52 ` Andrew Morton 2004-12-09 22:52 ` William Lee Irwin III 2004-12-10 4:26 ` page fault scalability patch V12 [0/7]: Overview and performance tests Nick Piggin ` (2 subsequent siblings) 3 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 22:02 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, Hugh Dickins wrote: > How do the scalability figures compare if you omit patch 7/7 i.e. revert > the per-task rss complications you added in for Linus? I remain a fan > of sloppy rss, which you earlier showed to be accurate enough (I'd say), > though I guess should be checked on other architectures than your ia64. > I can't see the point of all that added ugliness for numbers which don't > need to be precise - but perhaps there's no way of rearranging fields, > and the point at which mm->(anon_)rss is updated (near up of mmap_sem?), > to avoid destructive cacheline bounce. What I'm asking is, do you have > numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus > this time, but only up to 32 with sloppy rss? The ratios do look better > with the latest, but the numbers are altogether lower so we don't know. Here is a full set of numbers for sloppy and tasklist. The sloppy version is 2.6.9-rc2-bk14 with the prefault patch also applied and the tasklist version is 2.6.9-rc2-bk12 w/o prefault (you can get the numbers of 2.6.9-rc2-bk12 w prefault in the post titled "anticipatory prefaulting in the page fault handler")). Even with this handicap tasklist is still slightly better! I would expect tasklist to increase in importance for combination patches which increase the fault rate even more. 
The tasklist is likely to be unavoidable once I get the prezeroing patch debugged and integrated which should at least give us a peak pulse performance for page faults > 5 mio faults /sec. I was not also able to get the high numbers of > 3 mio faults with atomic rss + prefaulting but was able to get that with tasklist + prefault. The atomic version shares the locality problems with the sloppy approach. sloppy (2.6.10-bk14-rss-sloppy-prefault): Gb Rep Threads User System Wall flt/cpu/s fault/wsec 1 10 1 0.040s 6.505s 6.054s100117.616 100072.760 1 10 2 0.041s 7.394s 4.005s 88138.739 161535.358 1 10 4 0.049s 7.863s 2.049s 82819.743 262839.190 1 10 8 0.093s 8.657s 1.077s 74889.898 369606.184 1 10 16 0.621s 13.278s 1.076s 47150.165 371506.561 1 10 32 3.154s 35.337s 2.029s 17025.784 285469.956 1 10 64 11.602s 77.548s 2.086s 7351.089 228908.831 1 10 128 41.999s 217.106s 4.030s 2529.316 152087.458 1 10 256 40.482s 106.627s 3.022s 4454.885 203363.548 1 10 512 63.673s 61.361s 3.040s 5241.403 192528.941 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 4 10 1 0.176s 41.276s 41.045s 63238.628 63237.008 4 10 2 0.154s 31.074s 16.095s 83943.753 154606.489 4 10 4 0.193s 31.886s 9.096s 81715.471 263190.941 4 10 8 0.210s 33.577s 6.061s 77584.707 396402.083 4 10 16 0.473s 52.997s 6.036s 49025.701 411640.587 4 10 32 3.331s 142.296s 7.093s 18000.934 330197.326 4 10 64 10.820s 318.485s 8.088s 7960.503 295042.520 4 10 128 56.012s 928.004s 12.037s 2664.019 211812.600 4 10 256 46.197s 464.579s 7.026s 5132.263 360940.189 4 10 512 57.396s 225.876s 4.081s 9254.125 544185.485 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 16 10 1 0.948s 221.167s 222.009s 47208.624 47212.786 16 10 2 0.824s 205.021s 110.022s 50939.876 95134.456 16 10 4 0.689s 168.670s 53.055s 61914.226 195802.740 16 10 8 0.683s 137.278s 27.034s 76004.706 383471.968 16 10 16 0.969s 216.288s 24.031s 48264.109 431329.422 16 10 32 3.932s 587.987s 30.002s 17714.820 349219.905 16 10 64 13.542s 1253.834s 32.051s 8273.588 
322528.516 16 10 128 54.197s 3161.896s 38.064s 3260.403 271357.849 16 10 256 57.610s 1668.913s 21.038s 6073.335 490410.386 16 10 512 36.721s 833.691s 11.069s 12046.872 896970.623 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 32 10 1 2.080s 470.722s 472.075s 44355.728 44360.409 32 10 2 1.836s 456.343s 242.088s 45771.267 86344.100 32 10 4 1.671s 432.569s 131.065s 48294.609 159291.360 32 10 8 1.457s 354.825s 71.027s 58862.070 294242.410 32 10 16 1.660s 431.057s 48.038s 48464.636 433466.055 32 10 32 3.639s 1190.388s 59.040s 17563.676 353012.708 32 10 64 14.623s 2490.393s 63.040s 8371.808 330750.309 32 10 128 68.481s 6415.265s 76.053s 3234.476 274023.655 32 10 256 63.428s 3216.337s 39.044s 6394.212 531665.931 32 10 512 50.693s 1644.307s 21.035s 12372.572 982183.559 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 64 10 1 4.457s 1021.948s1026.030s 40863.994 40868.119 64 10 2 3.929s 994.825s 525.030s 41995.308 79844.658 64 10 4 3.661s 931.523s 269.014s 44849.990 155838.443 64 10 8 3.355s 858.565s 153.098s 48662.260 272381.402 64 10 16 3.130s 904.485s 101.090s 46212.285 411581.778 64 10 32 5.007s 2366.494s 116.079s 17686.275 359107.203 64 10 64 17.472s 5195.222s 126.012s 8046.325 332545.646 64 10 128 65.249s 12515.845s 147.053s 3333.815 284290.928 64 10 256 61.328s 6706.566s 78.061s 6197.354 533523.711 64 10 512 60.656s 3201.068s 39.095s 12859.162 1049637.054 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 128 10 8 7.481s 1875.297s 318.049s 44554.389 263386.340 128 10 16 7.128s 2048.919s 230.060s 40799.672 363757.736 128 10 32 9.584s 4758.868s 241.094s 17591.883 346711.571 128 10 64 17.955s 10135.674s 249.025s 8261.684 336547.279 128 10 128 66.939s 25006.914s 287.019s 3345.560 292086.404 128 10 256 62.454s 12892.242s 149.035s 6475.341 561653.696 128 10 512 59.082s 6456.965s 77.002s 12873.768 1089026.647 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 256 10 8 17.201s 4672.781s 860.094s 35772.446 194870.225 256 10 16 16.641s 5071.433s 588.076s 32973.603 
284954.772 256 10 32 17.745s 9193.335s 478.005s 18214.166 350950.045 256 10 64 25.474s 20440.137s 510.037s 8197.759 328725.189 256 10 128 65.451s 50015.195s 572.044s 3350.040 293079.914 256 10 256 61.296s 25191.675s 290.084s 6643.660 576852.282 256 10 512 58.911s 12589.530s 149.012s 13264.255 1125015.367 tasklist (2.6.10-rc2-bk12-rss-tasklist): Gb Rep Threads User System Wall flt/cpu/s fault/wsec 1 3 1 0.045s 2.042s 2.009s 94121.837 94039.902 1 3 2 0.053s 2.217s 1.022s 86554.869 160093.661 1 3 4 0.036s 2.325s 0.074s 83261.622 265213.249 1 3 8 0.065s 2.507s 0.053s 76404.784 370587.422 1 3 16 0.168s 4.727s 0.057s 40152.877 341385.368 1 3 32 0.829s 11.408s 0.070s 16066.277 280690.973 1 3 64 4.324s 25.591s 0.093s 6571.995 209956.473 1 3 128 19.370s 81.568s 1.055s 1947.799 126774.712 1 3 256 13.042s 46.608s 1.009s 3295.950 179708.774 1 3 512 19.410s 28.085s 0.092s 4139.454 211823.959 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 4 3 1 0.161s 12.698s 12.086s 61156.292 61149.853 4 3 2 0.152s 10.469s 5.073s 74037.518 137039.041 4 3 4 0.179s 9.401s 2.098s 82081.949 263750.289 4 3 8 0.156s 10.194s 1.098s 75979.430 395361.526 4 3 16 0.407s 18.084s 2.010s 42527.778 373673.111 4 3 32 0.824s 44.316s 2.031s 17421.815 339975.566 4 3 64 4.706s 96.587s 2.066s 7763.856 295588.217 4 3 128 17.453s 259.672s 3.053s 2837.813 222395.530 4 3 256 17.090s 136.816s 2.017s 5109.777 361440.098 4 3 512 13.466s 78.242s 1.043s 8575.295 548859.306 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 16 3 1 0.678s 61.548s 62.023s 50551.998 50544.748 16 3 2 0.691s 63.381s 34.027s 49095.790 91791.474 16 3 4 0.663s 52.083s 16.086s 59639.041 186542.124 16 3 8 0.585s 43.339s 9.031s 71614.583 337721.897 16 3 16 0.744s 75.174s 8.003s 41435.328 391278.035 16 3 32 1.713s 171.942s 8.086s 18114.674 354760.887 16 3 64 4.720s 366.803s 9.055s 8467.079 329273.168 16 3 128 22.637s 849.059s 10.093s 3608.741 287572.764 16 3 256 15.849s 472.565s 6.009s 6440.683 515916.601 16 3 512 15.479s 245.305s 3.046s 
12062.521 909147.611 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 32 3 1 1.451s 140.151s 141.060s 44430.367 44428.115 32 3 2 1.399s 136.349s 73.041s 45673.303 85699.793 32 3 4 1.321s 129.760s 39.027s 47996.303 160197.217 32 3 8 1.279s 100.648s 20.039s 61724.641 308454.557 32 3 16 1.414s 153.975s 15.090s 40488.236 395681.716 32 3 32 2.534s 337.021s 17.016s 18528.487 366445.400 32 3 64 4.271s 709.872s 18.057s 8809.787 338656.440 32 3 128 18.734s 1805.094s 21.084s 3449.586 288005.644 32 3 256 14.698s 963.787s 11.078s 6429.787 534077.540 32 3 512 15.299s 453.990s 5.098s 13406.321 1050416.414 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 64 3 1 3.018s 301.014s 304.004s 41386.617 41384.901 64 3 2 2.941s 296.780s 157.005s 41981.967 80116.179 64 3 4 2.810s 280.803s 82.047s 44366.266 152575.551 64 3 8 2.763s 268.745s 48.099s 46344.377 256813.576 64 3 16 2.764s 332.029s 34.030s 37584.030 366744.317 64 3 32 3.337s 704.321s 34.074s 17781.025 362195.710 64 3 64 7.395s 1475.497s 36.078s 8485.379 342026.888 64 3 128 22.227s 3188.934s 40.044s 3918.492 311115.971 64 3 256 18.004s 1834.246s 21.093s 6793.308 573753.797 64 3 512 19.367s 861.324s 10.099s 14287.531 1144168.224 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 128 3 4 5.857s 626.055s 189.010s 39824.798 133076.331 128 3 8 5.837s 592.587s 107.080s 42053.423 233443.791 128 3 16 5.852s 666.252s 71.008s 37443.301 354011.649 128 3 32 6.305s 1365.184s 69.075s 18349.259 360755.364 128 3 64 8.450s 2914.730s 72.046s 8609.057 347288.474 128 3 128 21.188s 6719.590s 79.078s 3733.370 315402.750 128 3 256 18.263s 3672.379s 43.049s 6818.817 578587.427 128 3 512 17.625s 1901.969s 22.082s 13109.967 1102629.479 128 3 256 24.035s 3392.117s 40.074s 7366.714 617628.607 128 3 512 17.000s 1820.242s 21.072s 13697.601 1158632.106 Gb Rep Threads User System Wall flt/cpu/s fault/wsec 256 3 4 11.976s 1660.924s 514.023s 30086.443 97877.018 256 3 8 11.618s 1301.448s 223.063s 38331.361 225057.902 256 3 16 11.696s 1409.158s 148.074s 
35423.488 338379.838 256 3 32 12.678s 2668.417s 140.042s 18772.788 358421.926 256 3 64 15.933s 5833.804s 145.068s 8604.085 345487.685 256 3 128 32.640s 13437.080s 159.079s 3736.651 314981.569 256 3 256 23.875s 6835.241s 81.007s 7337.919 620777.397 256 3 512 17.566s 3392.148s 41.003s 14761.249 1226507.319 256 3 256 21.314s 6648.629s 79.085s 7546.038 630270.726 256 3 512 15.994s 3400.378s 40.087s 14732.481 1231399.906 ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 22:02 ` page fault scalability patch V12: rss tasklist vs sloppy rss Christoph Lameter @ 2004-12-09 22:52 ` Andrew Morton 2004-12-09 22:52 ` William Lee Irwin III 1 sibling, 0 replies; 286+ messages in thread From: Andrew Morton @ 2004-12-09 22:52 UTC (permalink / raw) To: Christoph Lameter Cc: hugh, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Christoph Lameter <clameter@sgi.com> wrote: > > On Thu, 9 Dec 2004, Hugh Dickins wrote: > > > How do the scalability figures compare if you omit patch 7/7 i.e. revert > > the per-task rss complications you added in for Linus? I remain a fan > > of sloppy rss, which you earlier showed to be accurate enough (I'd say), > > though I guess should be checked on other architectures than your ia64. > > I can't see the point of all that added ugliness for numbers which don't > > need to be precise - but perhaps there's no way of rearranging fields, > > and the point at which mm->(anon_)rss is updated (near up of mmap_sem?), > > to avoid destructive cacheline bounce. What I'm asking is, do you have > > numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus > > this time, but only up to 32 with sloppy rss? The ratios do look better > > with the latest, but the numbers are altogether lower so we don't know. > > Here is a full set of numbers for sloppy and tasklist. Yes, but that only tests the thing-which-you're-trying-to-improve. We also need to work out the impact of that tasklist walk on other people's worst cases. > sloppy (2.6.10-bk14-rss-sloppy-prefault): It would be helpful if you could generate a brief summary of benchmarking results as well as dumping the raw numbers, please. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 22:02 ` page fault scalability patch V12: rss tasklist vs sloppy rss Christoph Lameter 2004-12-09 22:52 ` Andrew Morton @ 2004-12-09 22:52 ` William Lee Irwin III 2004-12-09 23:07 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-12-09 22:52 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, Dec 09, 2004 at 02:02:37PM -0800, Christoph Lameter wrote: > I was not also able to get the high numbers of > 3 mio faults with atomic > rss + prefaulting but was able to get that with tasklist + prefault. The > atomic version shares the locality problems with the sloppy approach. The implementation of the atomic version at least improperly places the counter's cacheline, so the results for that are gibberish. Unless the algorithms being compared are properly implemented, they're straw men, not valid comparisons. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 22:52 ` William Lee Irwin III @ 2004-12-09 23:07 ` Christoph Lameter 2004-12-09 23:29 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 23:07 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, William Lee Irwin III wrote: > Unless the algorithms being compared are properly implemented, they're > straw men, not valid comparisons. Sloppy rss left the rss in the section of mm that contained the counters. So that has a separate cacheline. The idea of putting the atomic ops in a group was to only have one exclusive cacheline for mmap_sem and the rss. Which could lead to more bouncing of a single cache line rather than bouncing multiple cache lines less. But it seems to me that the problem essentially remains the same if the rss counter is not split. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 23:07 ` Christoph Lameter @ 2004-12-09 23:29 ` William Lee Irwin III 2004-12-09 23:49 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-12-09 23:29 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, Dec 09, 2004 at 03:07:13PM -0800, Christoph Lameter wrote: > Sloppy rss left the rss in the section of mm that contained the counters. > So that has a separate cacheline. The idea of putting the atomic ops in a > group was to only have one exclusive cacheline for mmap_sem and the rss. > Which could lead to more bouncing of a single cache line rather than > bouncing multiple cache lines less. But it seems to me that the problem > essentially remains the same if the rss counter is not split. The prior results Robin Holt cited were that the counter needed to be in a different cacheline from the ->mmap_sem and ->page_table_lock. We shouldn't need to evaluate splitting for the atomic RSS algorithm. A faithful implementation would just move the atomic counters away from the ->mmap_sem and ->page_table_lock (just shuffle some mm fields). Obviously a complete set of results won't be needed unless it's very surprisingly competitive with the stronger algorithms. Things should be fine just making sure that behaves similarly to the one with the shared cacheline with ->mmap_sem in the sense of having a curve of similar shape on smaller systems. The absolute difference probably doesn't matter, but there is something to prove, and the largest risk of not doing so is exaggerating the low-end performance benefits of stronger algorithms. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12: rss tasklist vs sloppy rss 2004-12-09 23:29 ` William Lee Irwin III @ 2004-12-09 23:49 ` Christoph Lameter 0 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-09 23:49 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, William Lee Irwin III wrote: > On Thu, Dec 09, 2004 at 03:07:13PM -0800, Christoph Lameter wrote: > > Sloppy rss left the rss in the section of mm that contained the counters. > > So that has a separate cacheline. The idea of putting the atomic ops in a > > group was to only have one exclusive cacheline for mmap_sem and the rss. > > Which could lead to more bouncing of a single cache line rather than > > bouncing multiple cache lines less. But it seems to me that the problem > > essentially remains the same if the rss counter is not split. > > The prior results Robin Holt cited were that the counter needed to be > in a different cacheline from the ->mmap_sem and ->page_table_lock. > We shouldn't need to evaluate splitting for the atomic RSS algorithm. Ok. Then we would need rss and rss_anon on two additional cache lines? Both rss and anon_rss on one line? mmap_sem and the page_table_lock also each on different cache lines? > A faithful implementation would just move the atomic counters away from > the ->mmap_sem and ->page_table_lock (just shuffle some mm fields). > Obviously a complete set of results won't be needed unless it's very > surprisingly competitive with the stronger algorithms. Things should be > fine just making sure that behaves similarly to the one with the shared > cacheline with ->mmap_sem in the sense of having a curve of similar shape > on smaller systems. 
The absolute difference probably doesn't matter, > but there is something to prove, and the largest risk of not doing so > is exaggerating the low-end performance benefits of stronger algorithms. The advantage in the split rss solution is that it can be placed on the same cacheline as other stuff from task that is already needed. So there is minimal overhead involved. But I can certainly give it a spin and see what the results are. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-09 18:37 ` Hugh Dickins 2004-12-09 22:02 ` page fault scalability patch V12: rss tasklist vs sloppy rss Christoph Lameter @ 2004-12-10 4:26 ` Nick Piggin 2004-12-10 4:54 ` Nick Piggin 2004-12-10 18:43 ` Christoph Lameter 2004-12-10 20:03 ` pfault V12 : correction to tasklist rss Christoph Lameter 3 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-12-10 4:26 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Hugh Dickins wrote: > On Wed, 1 Dec 2004, Christoph Lameter wrote: > >>Changes from V11->V12 of this patch: >>- dump sloppy_rss in favor of list_rss (Linus' proposal) >>- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14) >> >>This is a series of patches that increases the scalability of >>the page fault handler for SMP. Here are some performance results >>on a machine with 512 processors allocating 32 GB with an increasing >>number of threads (that are assigned a processor each). > > > Your V12 patches would apply well to 2.6.10-rc3, except that (as noted > before) your mailer or whatever is eating trailing whitespace: trivial > patch attached to apply before yours, removing that whitespace so yours > apply. But what your patches need to apply to would be 2.6.10-mm. > > Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would > have tested out okay up to 4GB but not above: trivial patch attached. > That looks obviously correct. Probably the reason why Martin was getting crashes. [snip] > Moving to the main patch, 1/7, the major issue I see there is the way > do_anonymous_page does update_mmu_cache after setting the pte, without > any page_table_lock to bracket them together. Obviously no problem on > architectures where update_mmu_cache is a no-op! 
But although there's > been plenty of discussion, particularly with Ben and Nick, I've not > noticed anything to guarantee that as safe on all architectures. I do > think it's fine for you to post your patches before completing hooks in > all the arches, but isn't this a significant issue which needs to be > sorted before your patches go into -mm? You hazily refer to such issues > in 0/7, but now you need to work with arch maintainers to settle them > and show the patches. > Yep, the update_mmu_cache issue is real. There is a parallel problem that is update_mmu_cache can be called on a pte who's page has since been evicted and reused. Again, that looks safe on IA64, but maybe not on other architectures. It can be solved by moving lru_cache_add to after update_mmu_cache in all cases but the "update accessed bit" type fault. I solved that by simply defining that out for architectures that don't need it - a raced fault will simply get repeated if need be. > A lesser issue with the reordering in do_anonymous_page: don't you need > to move the lru_cache_add_active after the page_add_anon_rmap, to avoid > the very slight chance that vmscan will pick the page off the LRU and > unmap it before you've counted it in, hitting page_remove_rmap's > BUG_ON(page_mapcount(page) < 0)? > That's what I had been doing too. Seems to be the right way to go. > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > lru_cache_add_active. The other instances of lru_cache_add_active for > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > why here? But that's nothing new with your patch, and although you've > reordered the calls, the final page state is the same as before.) 
> > Where handle_pte_fault does "entry = *pte" without page_table_lock: > you're quite right to passing down precisely that entry to the fault > handlers below, but there's still a problem on the 32bit architectures > supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints > of entry may be out of synch. Not a problem for do_anonymous_page, or > anything else relying on ptep_cmpxchg to check; but a problem for > do_wp_page (which could find !pfn_valid and kill the process) and > probably others (harder to think through). Your 4/7 patch for i386 has > an unused atomic get_64bit function from Nick, I think you'll have to > define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. > Indeed. This was a real problem for my patch, definitely. > Hmm, that will only work if you're using atomic set_64bit rather than > relying on page_table_lock in the complementary places which matter. > Which I believe you are indeed doing in your 3level set_pte. Shouldn't > __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock? > That's what I was wondering. It could be that the actual 64-bit store is still atomic without the lock prefix (just not the entire rmw), which I think would be sufficient. In that case, get_64bit may be able to drop the lock prefix as well. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 4:26 ` page fault scalability patch V12 [0/7]: Overview and performance tests Nick Piggin @ 2004-12-10 4:54 ` Nick Piggin 2004-12-10 5:06 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-12-10 4:54 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: > Yep, the update_mmu_cache issue is real. There is a parallel problem > that is update_mmu_cache can be called on a pte who's page has since > been evicted and reused. Again, that looks safe on IA64, but maybe > not on other architectures. > > It can be solved by moving lru_cache_add to after update_mmu_cache in > all cases but the "update accessed bit" type fault. I solved that by > simply defining that out for architectures that don't need it - a raced > fault will simply get repeated if need be. > The page-freed-before-update_mmu_cache issue can be solved in that way, not the set_pte and update_mmu_cache not performed under the same ptl section issue that you raised. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 4:54 ` Nick Piggin @ 2004-12-10 5:06 ` Benjamin Herrenschmidt 2004-12-10 5:19 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Benjamin Herrenschmidt @ 2004-12-10 5:06 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm, linux-ia64, Linux Kernel list On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote: > Nick Piggin wrote: > > > Yep, the update_mmu_cache issue is real. There is a parallel problem > > that is update_mmu_cache can be called on a pte who's page has since > > been evicted and reused. Again, that looks safe on IA64, but maybe > > not on other architectures. > > > > It can be solved by moving lru_cache_add to after update_mmu_cache in > > all cases but the "update accessed bit" type fault. I solved that by > > simply defining that out for architectures that don't need it - a raced > > fault will simply get repeated if need be. > > > > The page-freed-before-update_mmu_cache issue can be solved in that way, > not the set_pte and update_mmu_cache not performed under the same ptl > section issue that you raised. What is the problem with update_mmu_cache ? It doesn't need to be done in the same lock section since it's approx. equivalent to a HW fault, which doesn't take the ptl... Ben. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 5:06 ` Benjamin Herrenschmidt @ 2004-12-10 5:19 ` Nick Piggin 2004-12-10 12:30 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-12-10 5:19 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm, linux-ia64, Linux Kernel list Benjamin Herrenschmidt wrote: > On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote: > >>Nick Piggin wrote: >> >>The page-freed-before-update_mmu_cache issue can be solved in that way, >>not the set_pte and update_mmu_cache not performed under the same ptl >>section issue that you raised. > > > What is the problem with update_mmu_cache ? It doesn't need to be done > in the same lock section since it's approx. equivalent to a HW fault, > which doesn't take the ptl... > I don't think a problem has been observed, I think Hugh was just raising it as a general issue. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 5:19 ` Nick Piggin @ 2004-12-10 12:30 ` Hugh Dickins 0 siblings, 0 replies; 286+ messages in thread From: Hugh Dickins @ 2004-12-10 12:30 UTC (permalink / raw) To: Nick Piggin Cc: Benjamin Herrenschmidt, Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm, linux-ia64, Linux Kernel list On Fri, 10 Dec 2004, Nick Piggin wrote: > Benjamin Herrenschmidt wrote: > > On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote: > >> > >>The page-freed-before-update_mmu_cache issue can be solved in that way, > >>not the set_pte and update_mmu_cache not performed under the same ptl > >>section issue that you raised. > > > > What is the problem with update_mmu_cache ? It doesn't need to be done > > in the same lock section since it's approx. equivalent to a HW fault, > > which doesn't take the ptl... > > I don't think a problem has been observed, I think Hugh was just raising > it as a general issue. That's right, I know little of the arches on which update_mmu_cache does something, so cannot say that separation is a problem. And I did see mail from Ben a month ago in which he arrived at the conclusion that it's not a problem - but assumed he was speaking for ppc and ppc64. (He was also writing in the context of your patches rather than Christoph's.) Perhaps Ben has in mind a logical argument that if update_mmu_cache does just what its name implies, then doing it under a separate acquisition of page_table_lock cannot introduce incorrectness on any architecture. Maybe, but I'd still rather we heard that from an expert in each of the affected architectures. As it stands in Christoph's patches, update_mmu_cache is sometimes called inside page_table_lock and sometimes outside: I'd be surprised if that doesn't require adjustment for some architecture. 
Your idea to raise do_anonymous_page's update_mmu_cache before the lru_cache_add_active sounds just right; perhaps it should then even be subsumed into the architectural ptep_cmpxchg. But once we get this far, I do wonder again whether it's right to be changing the rules in do_anonymous_page alone (Christoph's patches) rather than all the other faults together (your patches). But there's no doubt that the do_anonymous_page case is easier, or more obviously easy, to deal with - it helps a lot to know that the page cannot yet be exposed to vmscan.c and rmap.c. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-09 18:37 ` Hugh Dickins 2004-12-09 22:02 ` page fault scalability patch V12: rss tasklist vs sloppy rss Christoph Lameter 2004-12-10 4:26 ` page fault scalability patch V12 [0/7]: Overview and performance tests Nick Piggin @ 2004-12-10 18:43 ` Christoph Lameter 2004-12-10 21:43 ` Hugh Dickins 2004-12-12 7:54 ` Nick Piggin 2004-12-10 20:03 ` pfault V12 : correction to tasklist rss Christoph Lameter 3 siblings, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-10 18:43 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel Thank you for the thorough review of my patches. Comments below On Thu, 9 Dec 2004, Hugh Dickins wrote: > Your V12 patches would apply well to 2.6.10-rc3, except that (as noted > before) your mailer or whatever is eating trailing whitespace: trivial > patch attached to apply before yours, removing that whitespace so yours > apply. But what your patches need to apply to would be 2.6.10-mm. I am still mystified as to why this is an issue at all. The patches apply just fine to the kernel sources as is. I have patched kernels numerous times with this patchset and never ran into any issue. quilt removes trailing whitespace from patches when they are generated as far as I can tell. Patches will be made against mm after Nick's modifications to the 4 level patches are in. > Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would > have tested out okay up to 4GB but not above: trivial patch attached. Thanks for the patch. > Your scalability figures show a superb improvement. But they are (I > presume) for the best case: intense initial faulting of distinct areas > of anonymous memory by parallel cpus running a multithreaded process. > This is not a common case: how much do what real-world apps benefit? 
This is common during the startup of distributed applications on our large machines. They seem to freeze for minutes on bootup. I am not sure how much real-world apps benefit. The numbers show that the benefit would mostly be for SMP applications. UP has only very minor improvements. > Since you also avoid taking the page_table_lock in handle_pte_fault, > there should be some scalability benefit to all kinds of page fault: > do you have any results to show how much (perhaps hard to quantify, > since even tmpfs file faults introduce other scalability issues)? I have not done such tests (yet). > The split rss patch, if it stays, needs some work. For example, > task_statm uses "get_shared" to total up rss-anon_rss from the tasks, > but assumes mm->rss is already accurate. Scrap the separate get_rss, > get_anon_rss, get_shared functions: just one get_rss to make a single > pass through the tasks adding up both rss and anon_rss at the same time. Next rev will have that. > Updating current->rss in do_anonymous_page, current->anon_rss in > page_add_anon_rmap, is not always correct: ptrace's access_process_vm > uses get_user_pages on another task. You need check that current->mm == > mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss, > fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock > for that). You'll also need to check !(current->flags & PF_BORROWED_MM), > to guard against use_mm. Or... just go back to sloppy rss. I will look into this issue. > Moving to the main patch, 1/7, the major issue I see there is the way > do_anonymous_page does update_mmu_cache after setting the pte, without > any page_table_lock to bracket them together. Obviously no problem on > architectures where update_mmu_cache is a no-op! But although there's > been plenty of discussion, particularly with Ben and Nick, I've not > noticed anything to guarantee that as safe on all architectures. 
I do > think it's fine for you to post your patches before completing hooks in > all the arches, but isn't this a significant issue which needs to be > sorted before your patches go into -mm? You hazily refer to such issues > in 0/7, but now you need to work with arch maintainers to settle them > and show the patches. I have worked with a couple of arches and received feedback that was integrated. I certainly welcome more feedback. A vague idea if there is more trouble on that front: One could take the ptl in the cmpxchg emulation and then unlock on update_mmu_cache. > A lesser issue with the reordering in do_anonymous_page: don't you need > to move the lru_cache_add_active after the page_add_anon_rmap, to avoid > the very slight chance that vmscan will pick the page off the LRU and > unmap it before you've counted it in, hitting page_remove_rmap's > BUG_ON(page_mapcount(page) < 0)? Changed. > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > lru_cache_add_active. The other instances of lru_cache_add_active for > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > why here? But that's nothing new with your patch, and although you've > reordered the calls, the final page state is the same as before.) The mark_page_accessed is likely there to avoid a future fault just to set the accessed bit. > Where handle_pte_fault does "entry = *pte" without page_table_lock: > you're quite right to passing down precisely that entry to the fault > handlers below, but there's still a problem on the 32bit architectures > supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints > of entry may be out of synch. Not a problem for do_anonymous_page, or > anything else relying on ptep_cmpxchg to check; but a problem for > do_wp_page (which could find !pfn_valid and kill the process) and > probably others (harder to think through). 
Your 4/7 patch for i386 has > an unused atomic get_64bit function from Nick, I think you'll have to > define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. That would be a performance issue. > Hmm, that will only work if you're using atomic set_64bit rather than > relying on page_table_lock in the complementary places which matter. > Which I believe you are indeed doing in your 3level set_pte. Shouldn't > __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock? > But by making every set_pte use set_64bit, you are significantly slowing > down many operations which do not need that atomicity. This is quite > visible in the fork/exec/shell results from lmbench on i386 PAE (and is > the only interesting difference, for good or bad, that I noticed with > your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are > no faults on dst mm (nor on src mm) while copy_page_range is copying, > so its set_ptes don't need to be atomic; likewise during zap_pte_range > (either mmap_sem is held exclusively, or it's in the final exit_mmap). > Probably revert set_pte and set_pte_atomic to what they were, and use > set_pte_atomic where it's needed. Good suggestions. Will see what I can do, but I will need some assistance; my main platform is ia64, and the hardware and opportunities for testing on i386 are limited. Again thanks for the detailed review. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 18:43 ` Christoph Lameter @ 2004-12-10 21:43 ` Hugh Dickins 2004-12-10 22:12 ` Andrew Morton 2004-12-12 7:54 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-12-10 21:43 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Fri, 10 Dec 2004, Christoph Lameter wrote: > On Thu, 9 Dec 2004, Hugh Dickins wrote: > > > Your V12 patches would apply well to 2.6.10-rc3, except that (as noted > > before) your mailer or whatever is eating trailing whitespace: trivial > > patch attached to apply before yours, removing that whitespace so yours > > apply. But what your patches need to apply to would be 2.6.10-mm. > > I am still mystified as to why this is an issue at all. The patches apply > just fine to the kernel sources as is. I have patched kernels numerous > times with this patchset and never ran into any issue. quilt removes trailing > whitespace from patches when they are generated as far as I can tell. Perhaps you've only tried applying your original patches, not the ones as received through the mail. It discourages people from trying them when "patch -p1" fails with rejects, however trivial. Or am I alone in seeing this? never had such a problem with other patches before. > > Your scalability figures show a superb improvement. But they are (I > > presume) for the best case: intense initial faulting of distinct areas > > of anonymous memory by parallel cpus running a multithreaded process. > > This is not a common case: how much do what real-world apps benefit? > > This is common during the startup of distributed applications on our large > machines. They seem to freeze for minutes on bootup. I am not sure how > much real-world apps benefit. The numbers show that the benefit would > mostly be for SMP applications. UP has only very minor improvements. 
How much do your patches speed the startup of these applications? Can you name them? > I have worked with a couple of arches and received feedback that was > integrated. I certainly welcome more feedback. A vague idea if there is > more trouble on that front: One could take the ptl in the cmpxchg > emulation and then unlock on update_mmu cache. Or move the update_mmu_cache into the ptep_cmpxchg emulation perhaps. > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > > lru_cache_add_active. The other instances of lru_cache_add_active for > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > > why here? But that's nothing new with your patch, and although you've > > reordered the calls, the final page state is the same as before.) > > The mark_page_accessed is likely there avoid a future fault just to set > the accessed bit. No, mark_page_accessed is an operation on the struct page (and the accessed bit of the pte is preset too anyway). > > Where handle_pte_fault does "entry = *pte" without page_table_lock: > > you're quite right to passing down precisely that entry to the fault > > handlers below, but there's still a problem on the 32bit architectures > > supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints > > of entry may be out of synch. Not a problem for do_anonymous_page, or > > anything else relying on ptep_cmpxchg to check; but a problem for > > do_wp_page (which could find !pfn_valid and kill the process) and > > probably others (harder to think through). Your 4/7 patch for i386 has > > an unused atomic get_64bit function from Nick, I think you'll have to > > define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. > > That would be a performance issue. Sadly, yes, but correctness must take precedence over performance. It may be possible to avoid it in most cases, doing the atomic later when in doubt: but would need careful thought. > Good suggestions. 
Will see what I can do but I will need some assistence > my main platform is ia64 and the hardware and opportunities for testing on > i386 are limited. There's plenty of us can be trying i386. It's other arches worrying me. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 21:43 ` Hugh Dickins @ 2004-12-10 22:12 ` Andrew Morton 2004-12-10 23:52 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-12-10 22:12 UTC (permalink / raw) To: Hugh Dickins Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > > > lru_cache_add_active. The other instances of lru_cache_add_active for > > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > > > why here? But that's nothing new with your patch, and although you've > > > reordered the calls, the final page state is the same as before.) > > > > The mark_page_accessed is likely there avoid a future fault just to set > > the accessed bit. > > No, mark_page_accessed is an operation on the struct page > (and the accessed bit of the pte is preset too anyway). The point is a good one - I guess that code is a holdover from earlier implementations. This is equivalent, no? --- 25/mm/memory.c~do_anonymous_page-use-setpagereferenced Fri Dec 10 14:11:32 2004 +++ 25-akpm/mm/memory.c Fri Dec 10 14:11:42 2004 @@ -1464,7 +1464,7 @@ do_anonymous_page(struct mm_struct *mm, vma->vm_page_prot)), vma); lru_cache_add_active(page); - mark_page_accessed(page); + SetPageReferenced(page); page_add_anon_rmap(page, vma, addr); } _ ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 22:12 ` Andrew Morton @ 2004-12-10 23:52 ` Hugh Dickins 2004-12-11 0:18 ` Andrew Morton 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-12-10 23:52 UTC (permalink / raw) To: Andrew Morton Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Fri, 10 Dec 2004, Andrew Morton wrote: > Hugh Dickins <hugh@veritas.com> wrote: > > > > > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > > > > lru_cache_add_active. The other instances of lru_cache_add_active for > > > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > > > > why here? But that's nothing new with your patch, and although you've > > > > reordered the calls, the final page state is the same as before.) > > The point is a good one - I guess that code is a holdover from earlier > implementations. > > This is equivalent, no? Yes, it is equivalent to use SetPageReferenced(page) there instead. But why is do_anonymous_page adding anything to lru_cache_add_active, when its other callers leave it at that? What's special about the do_anonymous_page case? Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 23:52 ` Hugh Dickins @ 2004-12-11 0:18 ` Andrew Morton 2004-12-11 0:44 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-12-11 0:18 UTC (permalink / raw) To: Hugh Dickins Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > On Fri, 10 Dec 2004, Andrew Morton wrote: > > Hugh Dickins <hugh@veritas.com> wrote: > > > > > > > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as > > > > > lru_cache_add_active. The other instances of lru_cache_add_active for > > > > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too, > > > > > why here? But that's nothing new with your patch, and although you've > > > > > reordered the calls, the final page state is the same as before.) > > > > The point is a good one - I guess that code is a holdover from earlier > > implementations. > > > > This is equivalent, no? > > Yes, it is equivalent to use SetPageReferenced(page) there instead. > But why is do_anonymous_page adding anything to lru_cache_add_active, > when its other callers leave it at that? What's special about the > do_anonymous_page case? do_swap_page() is effectively doing the same as do_anonymous_page(). do_wp_page() and do_no_page() appear to be errant. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-11 0:18 ` Andrew Morton @ 2004-12-11 0:44 ` Hugh Dickins 2004-12-11 0:57 ` Andrew Morton 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-12-11 0:44 UTC (permalink / raw) To: Andrew Morton Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Fri, 10 Dec 2004, Andrew Morton wrote: > Hugh Dickins <hugh@veritas.com> wrote: > > But why is do_anonymous_page adding anything to lru_cache_add_active, > > when its other callers leave it at that? What's special about the > > do_anonymous_page case? > > do_swap_page() is effectively doing the same as do_anonymous_page(). > do_wp_page() and do_no_page() appear to be errant. Demur. do_swap_page has to mark_page_accessed because the page from the swap cache is already on the LRU, and for who knows how long. The others (and count in fs/exec.c's install_arg_page) are dealing with a freshly allocated page they are putting onto the active LRU. My inclination would be simply to remove the mark_page_accessed from do_anonymous_page; but I have no numbers to back that hunch. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-11 0:44 ` Hugh Dickins @ 2004-12-11 0:57 ` Andrew Morton 2004-12-11 9:23 ` Hugh Dickins 0 siblings, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-12-11 0:57 UTC (permalink / raw) To: Hugh Dickins Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > On Fri, 10 Dec 2004, Andrew Morton wrote: > > Hugh Dickins <hugh@veritas.com> wrote: > > > But why is do_anonymous_page adding anything to lru_cache_add_active, > > > when its other callers leave it at that? What's special about the > > > do_anonymous_page case? > > > > do_swap_page() is effectively doing the same as do_anonymous_page(). > > do_wp_page() and do_no_page() appear to be errant. > > Demur. do_swap_page has to mark_page_accessed because the page from > the swap cache is already on the LRU, and for who knows how long. Well. Some of the time. If the page was just read from swap, it's known to be on the active list. > The others (and count in fs/exec.c's install_arg_page) are dealing > with a freshly allocated page they are putting onto the active LRU. > > My inclination would be simply to remove the mark_page_accessed > from do_anonymous_page; but I have no numbers to back that hunch. > With the current implementation of page_referenced() the software-referenced bit doesn't matter anyway, as long as the pte's referenced bit got set. So as long as the thing is on the active list, we can simply remove the mark_page_accessed() call. Except one day the VM might get smarter about pages which are both software-referenced and pte-referenced. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-11 0:57 ` Andrew Morton @ 2004-12-11 9:23 ` Hugh Dickins 0 siblings, 0 replies; 286+ messages in thread From: Hugh Dickins @ 2004-12-11 9:23 UTC (permalink / raw) To: Andrew Morton Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel On Fri, 10 Dec 2004, Andrew Morton wrote: > Hugh Dickins <hugh@veritas.com> wrote: > > > > My inclination would be simply to remove the mark_page_accessed > > from do_anonymous_page; but I have no numbers to back that hunch. > > With the current implementation of page_referenced() the > software-referenced bit doesn't matter anyway, as long as the pte's > referenced bit got set. So as long as the thing is on the active list, we > can simply remove the mark_page_accessed() call. Yes, you're right. So we don't need numbers, can just delete that line. > Except one day the VM might get smarter about pages which are both > software-referenced and pte-referenced. And on that day, we'd be making other changes, which might well involve restoring the mark_page_accessed to do_anonymous_page and adding it in the similar places which currently lack it. But for now... --- 2.6.10-rc3/mm/memory.c 2004-12-05 12:56:12.000000000 +0000 +++ linux/mm/memory.c 2004-12-11 09:18:39.000000000 +0000 @@ -1464,7 +1464,6 @@ do_anonymous_page(struct mm_struct *mm, vma->vm_page_prot)), vma); lru_cache_add_active(page); - mark_page_accessed(page); page_add_anon_rmap(page, vma, addr); } ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-10 18:43 ` Christoph Lameter 2004-12-10 21:43 ` Hugh Dickins @ 2004-12-12 7:54 ` Nick Piggin 2004-12-12 9:33 ` Hugh Dickins 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-12-12 7:54 UTC (permalink / raw) To: Christoph Lameter Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > Thank you for the thorough review of my patches. Comments below > On Thu, 9 Dec 2004, Hugh Dickins wrote: > >>Your V12 patches would apply well to 2.6.10-rc3, except that (as noted >>before) your mailer or whatever is eating trailing whitespace: trivial >>patch attached to apply before yours, removing that whitespace so yours >>apply. But what your patches need to apply to would be 2.6.10-mm. > > I am still mystified as to why this is an issue at all. The patches apply > just fine to the kernel sources as is. I have patched kernels numerous > times with this patchset and never ran into any issue. quilt removes trailing > whitespace from patches when they are generated as far as I can tell. > > Patches will be made against mm after Nick's modifications to the 4 level > patches are in. > I've been a bit slow with them, sorry.... but there hasn't been a hard decision to go one way or the other with the 4level patches yet. Fortunately, it looks like 2.6.10 is having a longish drying out period, so I should have something before it is released. I would just sit on them for a while, and submit them to -mm when the 4level patches get merged / ready to merge into 2.6. It shouldn't slow down the progress of your patch too much - they may have to wait until after 2.6.11 anyway I'd say (probably depends on the progress of other changes going in). >>probably others (harder to think through). 
Your 4/7 patch for i386 has >>an unused atomic get_64bit function from Nick, I think you'll have to >>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. > > > That would be a performance issue. > > Problems were pretty trivial to reproduce here with non-atomic 64-bit loads being cut in half by atomic 64-bit stores. I don't see a way around them, unfortunately. Test case is to run with CONFIG_HIGHMEM (you needn't have > 4 GB of memory in the system, of course), and run 2-4 threads on a dual CPU system, doing parallel faulting of the *same* anonymous pages. What happens is that the load (`entry = *pte`) in handle_pte_fault gets cut in half, and handle_pte_fault drops down to do_swap_page, and you get an infinite loop trying to read in a non-existent swap entry IIRC. >>Hmm, that will only work if you're using atomic set_64bit rather than >>relying on page_table_lock in the complementary places which matter. >>Which I believe you are indeed doing in your 3level set_pte. Shouldn't >>__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock? > > >>But by making every set_pte use set_64bit, you are significantly slowing >>down many operations which do not need that atomicity. This is quite >>visible in the fork/exec/shell results from lmbench on i386 PAE (and is >>the only interesting difference, for good or bad, that I noticed with >>your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are >>no faults on dst mm (nor on src mm) while copy_page_range is copying, >>so its set_ptes don't need to be atomic; likewise during zap_pte_range >>(either mmap_sem is held exclusively, or it's in the final exit_mmap). >>Probably revert set_pte and set_pte_atomic to what they were, and use >>set_pte_atomic where it's needed. > > > Good suggestions. Will see what I can do but I will need some assistance; > my main platform is ia64 and the hardware and opportunities for testing on > i386 are limited. 
> I think I (and/or others) should be able to help with i386 if you are having trouble :) Nick ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-12 7:54 ` Nick Piggin @ 2004-12-12 9:33 ` Hugh Dickins 2004-12-12 9:48 ` Nick Piggin 2004-12-12 21:24 ` William Lee Irwin III 0 siblings, 2 replies; 286+ messages in thread From: Hugh Dickins @ 2004-12-12 9:33 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel On Sun, 12 Dec 2004, Nick Piggin wrote: > Christoph Lameter wrote: > > On Thu, 9 Dec 2004, Hugh Dickins wrote: > > >>probably others (harder to think through). Your 4/7 patch for i386 has > >>an unused atomic get_64bit function from Nick, I think you'll have to > >>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. > > > > That would be a performance issue. > > Problems were pretty trivial to reproduce here with non atomic 64-bit > loads being cut in half by atomic 64 bit stores. I don't see a way > around them, unfortunately. Of course, it'll only be a performance issue in the 64-on-32 cases: the 64-on-64 and 32-on-32 macro should reduce to exactly the present "entry = *pte". I've had the impression that Christoph and SGI have to care a great deal more about ia64 than the others; and as x86_64 advances, so i386 PAE grows less important. Just so long as a get_64bit there isn't a serious degradation from present behaviour, it's okay. Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock similarly racy, in both the 64-on-32 cases, and on architectures which have a more complex pmd_t (sparc, m68k, h8300)? Sigh. Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-12 9:33 ` Hugh Dickins @ 2004-12-12 9:48 ` Nick Piggin 2004-12-12 21:24 ` William Lee Irwin III 1 sibling, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-12-12 9:48 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Hugh Dickins wrote: > On Sun, 12 Dec 2004, Nick Piggin wrote: > >>Christoph Lameter wrote: >> >>>On Thu, 9 Dec 2004, Hugh Dickins wrote: >> >>>>probably others (harder to think through). Your 4/7 patch for i386 has >>>>an unused atomic get_64bit function from Nick, I think you'll have to >>>>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases. >>> >>>That would be a performance issue. >> >>Problems were pretty trivial to reproduce here with non atomic 64-bit >>loads being cut in half by atomic 64 bit stores. I don't see a way >>around them, unfortunately. > > > Of course, it'll only be a performance issue in the 64-on-32 cases: > the 64-on-64 and 32-on-32 macro should reduce to exactly the present > "entry = *pte". > That's right, yep. There is no ordering requirement, only that the actual store and load be atomic. > I've had the impression that Christoph and SGI have to care a great > deal more about ia64 than the others; and as x86_64 advances, so > i386 PAE grows less important. Just so long as a get_64bit there > isn't a serious degradation from present behaviour, it's okay. > I don't think it was particularly serious for PAE. Probably not worth holding off until 2.7. We'll see. > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock > similarly racy, in both the 64-on-32 cases, and on architectures > which have a more complex pmd_t (sparc, m68k, h8300)? Sigh. > Can't comment on a specific architecture... some may have problems. 
I think i386 prepopulates pmds, so it is no problem; but generally: I think you can get away with it if you write the "unimportant" word(s) first, do a wmb(), then write the word containing the present bit. I guess this has to be done this way otherwise the hardware walker will blow up... Of course, the hardware walker would be doing either atomic or correctly ordered reads, while a plain dereference doesn't guarantee anything. I'm not sure of the history behind the code, but I would be in favour of making _all_ pagetable access go through accessor functions, even if nobody quite needs them yet. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-12 9:33 ` Hugh Dickins 2004-12-12 9:48 ` Nick Piggin @ 2004-12-12 21:24 ` William Lee Irwin III 2004-12-17 3:31 ` Christoph Lameter 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 1 sibling, 2 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-12-12 21:24 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Christoph Lameter, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel On Sun, Dec 12, 2004 at 09:33:11AM +0000, Hugh Dickins wrote: > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock > similarly racy, in both the 64-on-32 cases, and on architectures > which have a more complex pmd_t (sparc, m68k, h8300)? Sigh. yes. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests 2004-12-12 21:24 ` William Lee Irwin III @ 2004-12-17 3:31 ` Christoph Lameter 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 1 sibling, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:31 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel On Sun, 12 Dec 2004, William Lee Irwin III wrote: > On Sun, Dec 12, 2004 at 09:33:11AM +0000, Hugh Dickins wrote: > > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock > > similarly racy, in both the 64-on-32 cases, and on architectures > > which have a more complex pmd_t (sparc, m68k, h8300)? Sigh. > > yes. Those may fall back to use the page_table_lock for individual operations that cannot be realized in an atomic way. ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V13 [0/8]: Overview 2004-12-12 21:24 ` William Lee Irwin III 2004-12-17 3:31 ` Christoph Lameter @ 2004-12-17 3:32 ` Christoph Lameter 2004-12-17 3:33 ` page fault scalability patch V13 [1/8]: Reduce the use of the page_table_lock Christoph Lameter ` (7 more replies) 1 sibling, 8 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:32 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changes from V12->V13 of this patch: - list_rss: Fix issues Hugh Dickins pointed out. - i386: Hugh Dickins patch. Fall back to not using get_64 if the ptl is used, to restore performance on PAE. - introduce get_pte_atomic for non-ptl access to a pte. - i386 PAE: get_pte_atomic uses get_64bit - ptep_cmpxchg must now include update_mmu_cache functionality. All arches updated. - add optional prefault patch [8/8] which is controllable via /proc/sys/vm/max_pre_alloc - This patch series was tested by me on i386 PAE and ia64 Potential issues: - I avoided checking mm == current->mm when incrementing current->rss in the anonymous fault handler. Instead I added some code around handle_mm_fault in memory.c to deal with the situation if mm != current->mm. Potential unintended benefits from mm containing a tasklist: - Threads are now freed via rcu, thus the tasklist may be traversed without acquiring the tasklist lock if the other list operations for task_struct are also switched to be of rcu type. - The list of threads of an mm is determined in two locations by a walk of all the tasks. The code could be changed to use the tasklist in mm. This is a series of patches that increases the scalability of the page fault handler for SMP. 
The performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem) through new atomic operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's (pgd_test_and_populate, pmd_test_and_populate). The page table lock can be avoided in the following situations: 1. An empty pte or pmd entry is populated This is safe since the swapper may only depopulate them and the swapper code has been changed to never set a pte to be empty until the page has been evicted. The population of an empty pte is frequent if a process touches newly allocated memory. 2. Modifications of flags in a pte entry (write/accessed). These modifications are done by the CPU or by low level handlers on various platforms also bypassing the page_table_lock. So this seems to be safe too. One essential change in the VM is the use of pte_cmpxchg (or its generic emulation) on page table entries before doing an update_mmu_cache without holding the page table lock. However, we do similar things now with other atomic pte operations such as ptep_get_and_clear and ptep_test_and_clear_dirty. These operations clear a pte *after* doing an operation on it. The ptep_cmpxchg as used in this patch operates on a *cleared* pte and replaces it with a pte pointing to valid memory. The effect of this change on various architectures has to be thought through. Local definitions of ptep_cmpxchg and ptep_xchg may be necessary. For IA64 an icache coherency issue may arise that potentially requires the flushing of the icache (as done via update_mmu_cache on IA64) prior to the use of ptep_cmpxchg. Similar issues may arise on other platforms. The patch introduces a split counter for rss handling to avoid atomic operations and locks currently necessary for rss modifications. In addition to mm->rss, tsk->rss is introduced. 
tsk->rss is defined to be in the same cache line as tsk->mm (which is already used by the fault handler) and thus tsk->rss can be incremented without locks in a fast way. The cache line does not need to be shared between processors for the page table handler. A tasklist is generated for each mm (rcu based). Values in that list are added up to calculate rss or anon_rss values. The patchset is composed of 8 patches (and was tested against 2.6.10-rc3-bk10): 1/8: Avoid page_table_lock in handle_mm_fault This patch defers the acquisition of the page_table_lock as much as possible and uses atomic operations for allocating anonymous memory. These atomic operations are simulated by acquiring the page_table_lock for very small time frames if an architecture does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a pte will not be set to empty if a page is in transition to swap. If only the first two patches are applied then the time that the page_table_lock is held is simply reduced. The lock may then be acquired multiple times during a page fault. 2/8: Atomic pte operations for ia64 3/8: Make cmpxchg generally available on i386 The atomic operations on the page table rely heavily on cmpxchg instructions. This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486 cpus. The emulations are only included if a kernel is built for these old cpus and are skipped for the real cmpxchg instructions if the kernel that is built for 386 or 486 is then run on a more recent cpu. This patch may be used independently of the other patches. 4/8: Atomic pte operations for i386 A generally available cmpxchg (last patch) must be available for this patch to preserve the ability to build kernels for 386 and 486. 5/8: Atomic pte operations for x86_64 6/8: Atomic pte operations for s390 7/8: Split counter implementation for rss Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic to calculate rss from tasklist. 
8/8: Prefaulting for the page table scalability patchset. Note that this patch is significantly different from the patches posted under the title "Anticipatory prefaulting" because the handling of the pte's differs significantly. This prefault patch can reach higher orders during prefaulting since no pagevec is needed to store the preallocated pages. The maximum order of preallocation can be controlled via /proc/sys/vm/max_prealloc_order and is set to 3 by default. Setting max_prealloc_order to zero switches off preallocation altogether. Signed-off-by: Christoph Lameter <clameter@sgi.com> ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V13 [1/8]: Reduce the use of the page_table_lock 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter @ 2004-12-17 3:33 ` Christoph Lameter 2004-12-17 3:33 ` page fault scalability patch V13 [2/8]: ia64 atomic pte operations Christoph Lameter ` (6 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Increase parallelism in SMP configurations by deferring the acquisition of page_table_lock in handle_mm_fault * Anonymous memory page faults bypass the page_table_lock through the use of atomic page table operations * Swapper does not set pte to empty in transition to swap * Simulate atomic page table operations using the page_table_lock if an arch does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides a performance benefit since the page_table_lock is held for shorter periods of time. Signed-off-by: Christoph Lameter <clameter@sgi.com Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-15 15:00:22.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-16 10:03:27.000000000 -0800 @@ -1330,8 +1330,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. 
+ * We hold the mm semaphore */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1343,15 +1342,13 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1374,8 +1371,7 @@ lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1422,14 +1418,12 @@ } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held. */ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; struct page * page = ZERO_PAGE(addr); @@ -1441,7 +1435,6 @@ if (write_access) { /* Allocate our own private page. 
*/ pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_mem; @@ -1450,30 +1443,34 @@ goto no_mem; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { - pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; - } - mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma); - lru_cache_add_active(page); - mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - set_pte(page_table, entry); + /* update the entry */ + if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { + if (write_access) { + pte_unmap(page_table); + page_cache_release(page); + } + goto out; + } + if (write_access) { + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. + */ + page_add_anon_rmap(page, vma, addr); + lru_cache_add_active(page); + mm->rss++; + + } pte_unmap(page_table); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); out: return VM_FAULT_MINOR; no_mem: @@ -1489,12 +1486,12 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held. 
*/ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1505,9 +1502,8 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1605,7 +1601,7 @@ * nonlinear vmas. */ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1618,13 +1614,12 @@ if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } - pgoff = pte_to_pgoff(*pte); + pgoff = pte_to_pgoff(entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1643,49 +1638,46 @@ * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. - * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. 
- * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; - entry = *pte; + /* + * This must be a atomic operation since the page_table_lock is + * not held. If a pte_t larger than the word size is used an + * incorrect value could be read because another processor is + * concurrently updating the multi-word pte. The i386 PAE mode + * is raising its ugly head here. + */ + entry = get_pte_atomic(pte); if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + /* + * This is the case in which we only update some bits in the pte. 
+ */ + new_entry = pte_mkyoung(entry); if (write_access) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + /* do_wp_page expects us to hold the page_table_lock */ + spin_lock(&mm->page_table_lock); return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + } + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + ptep_cmpxchg(vma, address, pte, entry, new_entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -1703,22 +1695,45 @@ inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. + * We rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd */ - spin_lock(&mm->page_table_lock); - pmd = pmd_alloc(mm, pgd, address); + if (unlikely(pgd_none(*pgd))) { + pmd_t *new = pmd_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + /* Insure that the update is done in an atomic way */ + if (!pgd_test_and_populate(mm, pgd, new)) + pmd_free(new); + } + + pmd = pmd_offset(pgd, address); + + if (likely(pmd)) { + pte_t *pte; - if (pmd) { - pte_t * pte = pte_alloc_map(mm, pmd, address); - if (pte) + if (!pmd_present(*pmd)) { + struct page *new; + + new = pte_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else + inc_page_state(nr_page_table_pages); + } + + pte = pte_offset_map(pmd, address); + if (likely(pte)) return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } - spin_unlock(&mm->page_table_lock); return VM_FAULT_OOM; } Index: linux-2.6.9/include/asm-generic/pgtable.h =================================================================== --- 
linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700 +++ linux-2.6.9/include/asm-generic/pgtable.h 2004-12-16 09:59:58.000000000 -0800 @@ -28,6 +28,11 @@ #endif /* __HAVE_ARCH_SET_PTE_ATOMIC */ #endif +/* Get a pte entry without the page table lock */ +#ifndef __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__x) *(__x) +#endif + #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Largely same as above, but only sets the access flags (dirty, @@ -134,4 +139,61 @@ #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) #endif +#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS +/* + * If atomic page table operations are not available then use + * the page_table_lock to insure some form of locking. + * Note thought that low level operations as well as the + * page_table_handling of the cpu may bypass all locking. + */ + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \ +({ \ + int __rc; \ + spin_lock(&__vma->vm_mm->page_table_lock); \ + __rc = pte_same(*(__ptep), __oldval); \ + if (__rc) { set_pte(__ptep, __newval); \ + update_mmu_cache(__vma, __addr, __newval); } \ + spin_unlock(&__vma->vm_mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pmd) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pgd_present(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pmd); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __p = __pte(xchg(&pte_val(*(__ptep)), 
pte_val(__pteval)));\ + flush_tlb_page(__vma, __address); \ + __p; \ +}) + +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-12-15 15:00:22.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-12-16 09:59:58.000000000 -0800 @@ -424,7 +424,10 @@ * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped * - * The caller needs to hold the mm->page_table_lock. + * The caller needs to hold the mm->page_table_lock if page + * is pointing to something that is known by the vm. + * The lock does not need to be held if page is pointing + * to a newly allocated page. */ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) @@ -568,11 +571,6 @@ /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -587,11 +585,15 @@ list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; page_remove_rmap(page); page_cache_release(page); @@ -678,15 +680,21 @@ if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); + /* + * There would be a race here with handle_mm_fault and do_anonymous_page + * which bypasses the page_table_lock if we would zap the pte before + * putting something into it. 
On the other hand we need to
+	 * have the dirty flag setting at the time we replaced the value. */

 	/* If nonlinear, store the file page offset in the pte. */
 	if (page->index != linear_page_index(vma, address))
-		set_pte(pte, pgoff_to_pte(page->index));
+		pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+	else
+		pteval = ptep_get_and_clear(pte);

-	/* Move the dirty bit to the physical page now the pte is gone. */
+	/* Move the dirty bit to the physical page now that the pte is gone. */
 	if (pte_dirty(pteval))
 		set_page_dirty(page);

^ permalink raw reply	[flat|nested] 286+ messages in thread
* page fault scalability patch V13 [2/8]: ia64 atomic pte operations 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 2004-12-17 3:33 ` page fault scalability patch V13 [1/8]: Reduce the use of the page_table_lock Christoph Lameter @ 2004-12-17 3:33 ` Christoph Lameter 2004-12-17 3:34 ` page fault scalability patch V13 [3/8]: universal cmpxchg for i386 Christoph Lameter ` (5 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for ia64 * Enhanced parallelism in page fault handler if applied together with the generic patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PGD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -78,12 +82,19 @@ preempt_enable(); } + static inline void pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) { pgd_val(*pgd_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE; +} static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) @@ -132,6 +143,13 @@ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* Atomic populate */ +static inline int 
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + static inline void pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { Index: linux-2.6.9/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800 @@ -30,6 +30,8 @@ #define _PAGE_P_BIT 0 #define _PAGE_A_BIT 5 #define _PAGE_D_BIT 6 +#define _PAGE_IG_BITS 53 +#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */ #define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */ #define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */ @@ -58,6 +60,7 @@ #define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL) #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_PROTNONE (__IA64_UL(1) << 63) +#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT) /* Valid only for a PTE with the present bit cleared: */ #define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */ @@ -270,6 +273,8 @@ #define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0) #define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0) #define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0) +#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0) + /* * Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the * access rights: @@ -281,8 +286,15 @@ #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A)) #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D)) #define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D)) +#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK)) /* + * Lock functions for pte's + */ +#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep) +#define ptep_unlock(ptep) { 
clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); } +#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val)) +/* * Macro to a page protection value as "uncacheable". Note that "protection" is really a * misnomer here as the protection value contains the memory attribute bits, dirty bits, * and various other bits as well. @@ -342,7 +354,6 @@ #define pte_unmap_nested(pte) do { } while (0) /* atomic versions of the some PTE manipulations: */ - static inline int ptep_test_and_clear_young (pte_t *ptep) { @@ -414,6 +425,26 @@ #endif } +/* + * IA-64 doesn't have any external MMU info: the page tables contain all the necessary + * information. However, we use this routine to take care of any (delayed) i-cache + * flushing that may be necessary. + */ +extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); + +static inline int +ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval) +{ + /* + * IA64 defers icache flushes. If the new pte is executable we may + * have to flush the icache to insure cache coherency immediately + * after the cmpxchg. + */ + if (pte_exec(newval)) + update_mmu_cache(vma, addr, newval); + return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte; +} + static inline int pte_same (pte_t a, pte_t b) { @@ -476,13 +507,6 @@ struct vm_area_struct * prev, unsigned long start, unsigned long end); #endif -/* - * IA-64 doesn't have any external MMU info: the page tables contain all the necessary - * information. However, we use this routine to take care of any (delayed) i-cache - * flushing that may be necessary. 
- */ -extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); - #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Update PTEP with ENTRY, which is guaranteed to be a less @@ -560,6 +584,8 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE +#define __HAVE_ARCH_ATOMIC_TABLE_OPS +#define __HAVE_ARCH_LOCK_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _ASM_IA64_PGTABLE_H */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V13 [3/8]: universal cmpxchg for i386 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 2004-12-17 3:33 ` page fault scalability patch V13 [1/8]: Reduce the use of the page_table_lock Christoph Lameter 2004-12-17 3:33 ` page fault scalability patch V13 [2/8]: ia64 atomic pte operations Christoph Lameter @ 2004-12-17 3:34 ` Christoph Lameter 2004-12-17 3:35 ` page fault scalability patch V13 [4/8]: atomic pte operations " Christoph Lameter ` (4 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:34 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Make cmpxchg and cmpxchg8b generally available on the i386 platform. * Provide emulation of cmpxchg suitable for uniprocessor if build and run on 386. * Provide emulation of cmpxchg8b suitable for uniprocessor systems if build and run on 386 or 486. 
* Provide an inline function to atomically get a 64 bit value via cmpxchg8b in an SMP system (courtesy of Nick Piggin) (important for i386 PAE mode and other places where atomic 64 bit operations are useful) Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/arch/i386/Kconfig =================================================================== --- linux-2.6.9.orig/arch/i386/Kconfig 2004-12-10 09:58:03.000000000 -0800 +++ linux-2.6.9/arch/i386/Kconfig 2004-12-10 09:59:27.000000000 -0800 @@ -351,6 +351,11 @@ depends on !M386 default y +config X86_CMPXCHG8B + bool + depends on !M386 && !M486 + default y + config X86_XADD bool depends on !M386 Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c =================================================================== --- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-12-06 17:23:49.000000000 -0800 +++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-12-10 09:59:27.000000000 -0800 @@ -6,6 +6,7 @@ #include <linux/bitops.h> #include <linux/smp.h> #include <linux/thread_info.h> +#include <linux/module.h> #include <asm/processor.h> #include <asm/msr.h> @@ -287,5 +288,103 @@ return 0; } +#ifndef CONFIG_X86_CMPXCHG +unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new) +{ + u8 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u8)); + + /* Poor man's cmpxchg for 386. 
Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u8 *)ptr; + if (prev == old) + *(u8 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u8); + +unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new) +{ + u16 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u16)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u16 *)ptr; + if (prev == old) + *(u16 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u16); + +unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) +{ + u32 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u32)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u32 *)ptr; + if (prev == old) + *(u32 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u32); +#endif + +#ifndef CONFIG_X86_CMPXCHG8B +unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + unsigned long flags; + + /* + * Check if the kernel was compiled for an old cpu but + * we are running really on a cpu capable of cmpxchg8b + */ + + if (cpu_has(cpu_data, X86_FEATURE_CX8)) + return __cmpxchg8b(ptr, old, newv); + + /* Poor mans cmpxchg8b for 386 and 486. 
Not suitable for SMP */ + local_irq_save(flags); + prev = *ptr; + if (prev == old) + *ptr = newv; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg8b_486); +#endif + // arch_initcall(intel_cpu_init); Index: linux-2.6.9/include/asm-i386/system.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/system.h 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/include/asm-i386/system.h 2004-12-10 10:00:49.000000000 -0800 @@ -149,6 +149,9 @@ #define __xg(x) ((struct __xchg_dummy *)(x)) +#define ll_low(x) *(((unsigned int*)&(x))+0) +#define ll_high(x) *(((unsigned int*)&(x))+1) + /* * The semantics of XCHGCMP8B are a bit strange, this is why * there is a loop and the loading of %%eax and %%edx has to @@ -184,8 +187,6 @@ { __set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL)); } -#define ll_low(x) *(((unsigned int*)&(x))+0) -#define ll_high(x) *(((unsigned int*)&(x))+1) static inline void __set_64bit_var (unsigned long long *ptr, unsigned long long value) @@ -203,6 +204,26 @@ __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \ __set_64bit(ptr, ll_low(value), ll_high(value)) ) +static inline unsigned long long __get_64bit(unsigned long long * ptr) +{ + unsigned long long ret; + __asm__ __volatile__ ( + "\n1:\t" + "movl (%1), %%eax\n\t" + "movl 4(%1), %%edx\n\t" + "movl %%eax, %%ebx\n\t" + "movl %%edx, %%ecx\n\t" + LOCK_PREFIX "cmpxchg8b (%1)\n\t" + "jnz 1b" + : "=A"(ret) + : "D"(ptr) + : "ebx", "ecx", "memory"); + return ret; +} + +#define get_64bit(ptr) __get_64bit(ptr) + + /* * Note: no "lock" prefix even on SMP: xchg always implies lock anyway * Note 2: xchg has side effect, so that attribute volatile is necessary, @@ -240,7 +261,41 @@ */ #ifdef CONFIG_X86_CMPXCHG + #define __HAVE_ARCH_CMPXCHG 1 +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) + +#else + +/* + * Building a 
kernel capable running on 80386. It may be necessary to + * simulate the cmpxchg on the 80386 CPU. For that purpose we define + * a function for each of the sizes we support. + */ + +extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8); +extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16); +extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32); + +static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, + unsigned long new, int size) +{ + switch (size) { + case 1: + return cmpxchg_386_u8(ptr, old, new); + case 2: + return cmpxchg_386_u16(ptr, old, new); + case 4: + return cmpxchg_386_u32(ptr, old, new); + } + return old; +} + +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) #endif static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, @@ -270,12 +325,34 @@ return old; } -#define cmpxchg(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ - (unsigned long)(n),sizeof(*(ptr)))) - +static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + __asm__ __volatile__( + LOCK_PREFIX "cmpxchg8b (%4)" + : "=A" (prev) + : "0" (old), "c" ((unsigned long)(newv >> 32)), + "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr) + : "memory"); + return prev; +} + +#ifdef CONFIG_X86_CMPXCHG8B +#define cmpxchg8b __cmpxchg8b +#else +/* + * Building a kernel capable of running on 80486 and 80386. Both + * do not support cmpxchg8b. Call a function that emulates the + * instruction if necessary. 
+ */ +extern unsigned long long cmpxchg8b_486(volatile unsigned long long *, + unsigned long long, unsigned long long); +#define cmpxchg8b cmpxchg8b_486 +#endif + #ifdef __KERNEL__ -struct alt_instr { +struct alt_instr { __u8 *instr; /* original instruction */ __u8 *replacement; __u8 cpuid; /* cpuid bit set for replacement */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V13 [4/8]: atomic pte operations for i386 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter ` (2 preceding siblings ...) 2004-12-17 3:34 ` page fault scalability patch V13 [3/8]: universal cmpxchg for i386 Christoph Lameter @ 2004-12-17 3:35 ` Christoph Lameter 2004-12-17 3:36 ` page fault scalability patch V13 [5/8]: atomic pte operations for AMD64 Christoph Lameter ` (3 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:35 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Atomic pte operations for i386 in regular and PAE modes Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-12-15 15:00:20.000000000 -0800 +++ linux-2.6.9/include/asm-i386/pgtable.h 2004-12-16 10:08:38.000000000 -0800 @@ -413,6 +413,7 @@ #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME +#define __HAVE_ARCH_ATOMIC_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _I386_PGTABLE_H */ Index: linux-2.6.9/include/asm-i386/pgtable-3level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-12-16 10:13:11.000000000 -0800 @@ -6,7 +6,8 @@ * tables on PPro+ CPUs. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ + * August 26, 2004 added ptep_cmpxchg <christoph@lameter.com> +*/ #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low) @@ -42,21 +43,11 @@ return pte_x(pte); } -/* Rules for using set_pte: the pte being assigned *must* be - * either not present or in a state where the hardware will - * not attempt to update the pte. In places where this is - * not possible, use pte_get_and_clear to obtain the old pte - * value and then use set_pte to update it. -ben - */ -static inline void set_pte(pte_t *ptep, pte_t pte) -{ - ptep->pte_high = pte.pte_high; - smp_wmb(); - ptep->pte_low = pte.pte_low; -} #define __HAVE_ARCH_SET_PTE_ATOMIC #define set_pte_atomic(pteptr,pteval) \ set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) +#define set_pte(pteptr,pteval) \ + *(unsigned long long *)(pteptr) = pte_val(pteval) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pgd(pgdptr,pgdval) \ @@ -142,4 +133,25 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val }) +/* Atomic PTE operations */ +#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \ +({ pte_t __r; \ + /* xchg acts as a barrier before the setting of the high bits. 
*/\ + __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \ + __r.pte_high = (__ptep)->pte_high; \ + (__ptep)->pte_high = (__newval).pte_high; \ + flush_tlb_page(__vma, __addr); \ + (__r); \ +}) + +#define __HAVE_ARCH_PTEP_XCHG_FLUSH + +static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + +#define __HAVE_ARCH_GET_PTE_ATOMIC +#define get_pte_atomic(__ptep) __pte(get_64bit((unsigned long long *)(__ptep))) + #endif /* _I386_PGTABLE_3LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-12-16 10:08:38.000000000 -0800 @@ -82,4 +82,7 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) +/* Atomic PTE operations */ +#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low) + #endif /* _I386_PGTABLE_2LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-12-16 10:08:38.000000000 -0800 @@ -4,9 +4,12 @@ #include <linux/config.h> #include <asm/processor.h> #include <asm/fixmap.h> +#include <asm/system.h> #include <linux/threads.h> #include <linux/mm.h> /* for struct page */ +#define PMD_NONE 0L + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) @@ -16,6 +19,19 @@ ((unsigned long long)page_to_pfn(pte) << (unsigned long long) PAGE_SHIFT))); } + +/* Atomic version */ +static inline int 
pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ +#ifdef CONFIG_X86_PAE + return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE + + ((unsigned long long)page_to_pfn(pte) << + (unsigned long long) PAGE_SHIFT) ) == PMD_NONE; +#else + return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +#endif +} + /* * Allocate and free page tables. */ @@ -49,6 +65,7 @@ #define pmd_free(x) do { } while (0) #define __pmd_free_tlb(tlb,x) do { } while (0) #define pgd_populate(mm, pmd, pte) BUG() +#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; }) #define check_pgt_cache() do { } while (0) ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V13 [5/8]: atomic pte operations for AMD64 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter ` (3 preceding siblings ...) 2004-12-17 3:35 ` page fault scalability patch V13 [4/8]: atomic pte operations " Christoph Lameter @ 2004-12-17 3:36 ` Christoph Lameter 2004-12-17 3:38 ` page fault scalability patch V13 [7/8]: Split RSS Christoph Lameter ` (2 subsequent siblings) 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:36 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for x86_64 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700 +++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-23 10:59:01.000000000 -0800 @@ -7,16 +7,26 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pgd_populate(mm, pgd, pmd) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd))) +#define pgd_test_and_populate(mm, pgd, pmd) \ + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE) static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +} + extern __inline__ pmd_t *get_pmd(void) { return (pmd_t *)get_zeroed_page(GFP_KERNEL); Index: linux-2.6.9/include/asm-x86_64/pgtable.h 
=================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-22 15:08:43.000000000 -0800 +++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-23 10:59:01.000000000 -0800 @@ -437,6 +437,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V13 [7/8]: Split RSS 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter ` (4 preceding siblings ...) 2004-12-17 3:36 ` page fault scalability patch V13 [5/8]: atomic pte operations for AMD64 Christoph Lameter @ 2004-12-17 3:38 ` Christoph Lameter 2004-12-17 3:39 ` page fault scalability patch V13 [8/8]: Prefaulting using ptep_cmpxchg Christoph Lameter 2004-12-17 5:55 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:38 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Changelog * Split rss counter into the task structure * remove 3 checks of rss in mm/rmap.c * increment current->rss instead of mm->rss in the page fault handler * move incrementing of anon_rss out of page_add_anon_rmap to group the increments more tightly and allow a better cache utilization Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-11-30 20:33:31.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-11-30 20:33:50.000000000 -0800 @@ -30,6 +30,7 @@ #include <linux/pid.h> #include <linux/percpu.h> #include <linux/topology.h> +#include <linux/rcupdate.h> struct exec_domain; @@ -217,6 +218,7 @@ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + long rss, anon_rss; struct list_head mmlist; /* List of maybe swapped mm's. 
These are globally strung * together off init_mm.mmlist, and are protected @@ -226,7 +228,7 @@ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ @@ -236,6 +238,8 @@ /* Architecture-specific MM context */ mm_context_t context; + struct list_head task_list; /* Tasks using this mm */ + struct rcu_head rcu_head; /* For freeing mm via rcu */ /* Token based thrashing protection. */ unsigned long swap_token_time; @@ -545,6 +549,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Split counters from mm */ + long rss; + long anon_rss; /* task state */ struct linux_binfmt *binfmt; @@ -578,6 +585,9 @@ struct completion *vfork_done; /* for vfork() */ int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ + + /* List of other tasks using the same mm */ + struct list_head mm_tasks; unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; @@ -1111,6 +1121,14 @@ #endif +unsigned long get_rss(struct mm_struct *mm); +unsigned long get_anon_rss(struct mm_struct *mm); +unsigned long get_shared(struct mm_struct *mm); + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk); +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif + Index: linux-2.6.9/fs/proc/task_mmu.c =================================================================== --- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-30 20:33:26.000000000 -0800 +++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-30 20:33:50.000000000 -0800 @@ -22,7 +22,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << 
(PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + get_rss(mm) << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -37,7 +37,7 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + *shared = get_shared(mm); *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; Index: linux-2.6.9/fs/proc/array.c =================================================================== --- linux-2.6.9.orig/fs/proc/array.c 2004-11-30 20:33:26.000000000 -0800 +++ linux-2.6.9/fs/proc/array.c 2004-11-30 20:33:50.000000000 -0800 @@ -420,7 +420,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? mm->end_code : 0, Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-30 20:33:46.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-30 20:33:50.000000000 -0800 @@ -263,8 +263,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -438,7 +436,7 @@ BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; + current->anon_rss++; anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; @@ -510,8 +508,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -799,8 +795,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { 
try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.9/kernel/fork.c =================================================================== --- linux-2.6.9.orig/kernel/fork.c 2004-11-30 20:33:42.000000000 -0800 +++ linux-2.6.9/kernel/fork.c 2004-11-30 20:33:50.000000000 -0800 @@ -151,6 +151,7 @@ *tsk = *orig; tsk->thread_info = ti; ti->task = tsk; + tsk->rss = 0; /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); @@ -292,6 +293,7 @@ atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); + INIT_LIST_HEAD(&mm->task_list); mm->core_waiters = 0; mm->nr_ptes = 0; spin_lock_init(&mm->page_table_lock); @@ -323,6 +325,13 @@ return mm; } +static void rcu_free_mm(struct rcu_head *head) +{ + struct mm_struct *mm = container_of(head ,struct mm_struct, rcu_head); + + free_mm(mm); +} + /* * Called when the last reference to the mm * is dropped: either by a lazy thread or by @@ -333,7 +342,7 @@ BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); - free_mm(mm); + call_rcu(&mm->rcu_head, rcu_free_mm); } /* @@ -400,6 +409,8 @@ /* Get rid of any cached register state */ deactivate_mm(tsk, mm); + if (mm) + mm_remove_thread(mm, tsk); /* notify parent sleeping on vfork() */ if (vfork_done) { @@ -447,8 +458,8 @@ * new threads start up in user mode using an mm, which * allows optimizing out ipis; the tlb_gather_mmu code * is an example. + * (mm_add_thread does use the ptl .... 
) */ - spin_unlock_wait(&oldmm->page_table_lock); goto good_mm; } @@ -470,6 +481,7 @@ goto free_pt; good_mm: + mm_add_thread(mm, tsk); tsk->mm = mm; tsk->active_mm = mm; return 0; Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-11-30 20:33:46.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-11-30 20:33:50.000000000 -0800 @@ -1467,7 +1467,7 @@ */ lru_cache_add_active(page); page_add_anon_rmap(page, vma, addr); - mm->rss++; + current->rss++; } pte_unmap(page_table); @@ -1859,3 +1859,87 @@ } #endif + +unsigned long get_rss(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +unsigned long get_anon_rss(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->anon_rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +unsigned long get_shared(struct mm_struct *mm) +{ + struct list_head *y; + struct task_struct *t; + long rss; + + if (!mm) + return 0; + + rcu_read_lock(); + rss = mm->rss - mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss += t->rss - t->anon_rss; + } + if (rss < 0) + rss = 0; + rcu_read_unlock(); + return rss; +} + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + if (!mm) + return; + + spin_lock(&mm->page_table_lock); + mm->rss += tsk->rss; + mm->anon_rss += tsk->anon_rss; + list_del_rcu(&tsk->mm_tasks); + spin_unlock(&mm->page_table_lock); +} + +void mm_add_thread(struct mm_struct *mm, struct 
task_struct *tsk) +{ + spin_lock(&mm->page_table_lock); + list_add_rcu(&tsk->mm_tasks, &mm->task_list); + spin_unlock(&mm->page_table_lock); +} + + Index: linux-2.6.9/include/linux/init_task.h =================================================================== --- linux-2.6.9.orig/include/linux/init_task.h 2004-11-30 20:33:30.000000000 -0800 +++ linux-2.6.9/include/linux/init_task.h 2004-11-30 20:33:50.000000000 -0800 @@ -42,6 +42,7 @@ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ + .task_list = LIST_HEAD_INIT(name.task_list), \ } #define INIT_SIGNALS(sig) { \ @@ -112,6 +113,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \ } Index: linux-2.6.9/fs/exec.c =================================================================== --- linux-2.6.9.orig/fs/exec.c 2004-11-30 20:33:41.000000000 -0800 +++ linux-2.6.9/fs/exec.c 2004-11-30 20:33:50.000000000 -0800 @@ -543,6 +543,7 @@ active_mm = tsk->active_mm; tsk->mm = mm; tsk->active_mm = mm; + mm_add_thread(mm, current); activate_mm(active_mm, mm); task_unlock(tsk); arch_pick_mmap_layout(mm); ^ permalink raw reply [flat|nested] 286+ messages in thread
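[Editorial note: the split-RSS accounting in the patch above can be modeled outside the kernel. The following is a minimal userspace sketch, not the kernel code: the struct names and the plain singly linked list stand in for task_struct, mm_struct, and the RCU-protected mm->task_list. It shows the reader-side idea of get_rss(): each task accumulates a private delta without taking page_table_lock, readers sum the baseline plus all live deltas, and the sum is clamped at zero because racy updates (one task unmapping pages another faulted in) can make it transiently negative.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for task_struct / mm_struct. */
struct task {
    long rss;            /* per-task delta, updated without the ptl */
    struct task *next;   /* stands in for the mm->task_list RCU list */
};

struct mm {
    long rss;            /* baseline; deltas are folded in on exit */
    struct task *tasks;
};

/* Model of get_rss(): walk the task list and sum the deltas.
 * Because deltas are updated racily, the sum can dip below zero,
 * so it is clamped exactly as the patch does. */
static long get_rss(const struct mm *mm)
{
    long rss = mm->rss;
    for (const struct task *t = mm->tasks; t; t = t->next)
        rss += t->rss;
    return rss < 0 ? 0 : rss;
}
```

In the real patch the walk runs under rcu_read_lock(), and the mm is freed via call_rcu() so a concurrent reader never walks a freed list.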
* page fault scalability patch V13 [8/8]: Prefaulting using ptep_cmpxchg 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter ` (5 preceding siblings ...) 2004-12-17 3:38 ` page fault scalability patch V13 [7/8]: Split RSS Christoph Lameter @ 2004-12-17 3:39 ` Christoph Lameter 2004-12-17 5:55 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 3:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel The page fault handler for anonymous pages can generate significant overhead apart from its essential function, which is to clear and set up a new page table entry for a never accessed memory location. This overhead increases significantly in an SMP environment. If a fault occurred for page x and is then followed by a fault for page x+1 then it may be reasonable to expect another page fault at x+2 in the future. If page table entries for x+1 and x+2 were prepared in the fault handling for page x+1 then the overhead of taking a fault for x+2 would be avoided. However page x+2 may never be used and thus we may have increased the rss of an application unnecessarily. The swapper will take care of removing that page if memory should get tight. The following patch makes the anonymous fault handler anticipate future faults. For each fault a prediction is made of where the next fault would occur (assuming linear access by the application). If the prediction turns out to be right (next fault is where expected) then a number of pages is preallocated in order to avoid a series of future faults. The order of the preallocation increases by a power of two for each success in sequence. The first successful prediction leads to an additional page being allocated. The second successful prediction leads to 2 additional pages being allocated.
Third to 4 pages and so on. The max order is 3 by default. In a large continous allocation the number of faults is reduced by a factor of 8. The order of preallocation may be controlled through setting the maximum order in /proc/sys/vm/max_prealloc_order. Setting it to zero will disable preallocations. Signed_off_by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-12-16 10:59:26.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-12-16 11:06:37.000000000 -0800 @@ -548,6 +548,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Prefaulting */ + unsigned long anon_fault_next_addr; + int anon_fault_order; /* Split counters from mm */ long rss; long anon_rss; Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-16 10:59:26.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-16 11:06:37.000000000 -0800 @@ -1437,6 +1437,8 @@ return ret; } +int sysctl_max_prealloc_order = 3; + /* * We are called with the MM semaphore held. */ @@ -1445,57 +1447,103 @@ pte_t *page_table, pmd_t *pmd, int write_access, unsigned long addr, pte_t orig_entry) { - pte_t entry; - struct page * page = ZERO_PAGE(addr); + unsigned long end_addr; + + addr &= PAGE_MASK; + + /* Check if there is a sequential allocation of pages */ + if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) { + + /* Single page */ + current->anon_fault_order = 0; + end_addr = addr + PAGE_SIZE; + + } else { + int order = ++current->anon_fault_order; + + /* + * Calculate the number of pages to preallocate. The order of preallocations + * increases with each successful prediction + */ + if (unlikely(order > sysctl_max_prealloc_order)) + order = current->anon_fault_order = sysctl_max_prealloc_order; - /* Read-only mapping of ZERO_PAGE. 
*/ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + end_addr = addr + (PAGE_SIZE << order); + + /* Do not prefault beyond vm limits */ + if (end_addr > vma->vm_end) + end_addr = vma->vm_end; + + /* Stay in pmd */ + if ((addr & PMD_MASK) != (end_addr & PMD_MASK)) + end_addr &= PMD_MASK; + } - /* ..except if it's a write access */ if (write_access) { - /* Allocate our own private page. */ - pte_unmap(page_table); + int count = 0; if (unlikely(anon_vma_prepare(vma))) - goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); - if (!page) - goto no_mem; - clear_user_highpage(page, addr); + return VM_FAULT_OOM; - page_table = pte_offset_map(pmd, addr); + do { + pte_t entry; + struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, addr); + + if (unlikely(!page)) { + if (!count) + return VM_FAULT_OOM; + else + break; + } + + clear_user_highpage(page, addr); + + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, + vma->vm_page_prot)), + vma); + + /* update the entry */ + if (unlikely(!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry))) { + pte_unmap(page_table); + page_cache_release(page); + break; + } + + page_add_anon_rmap(page, vma, addr); + lru_cache_add_active(page); + count++; - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); - } + pte_unmap(page_table); + addr += PAGE_SIZE; + if (addr >= end_addr) + break; + page_table = pte_offset_map(pmd, addr); + orig_entry = *page_table; + + } while (pte_none(orig_entry)); + + current->rss += count; + current->anon_rss += count; + + } else { + pte_t entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + /* Read */ + do { + if (unlikely(!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry))) + break; - /* update the entry */ - if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { - if (write_access) { pte_unmap(page_table); - page_cache_release(page); - } - goto out; - } - if (write_access) { - /* - * These two functions must come after the cmpxchg 
- * because if the page is on the LRU then try_to_unmap may come - * in and unmap the pte. - */ - page_add_anon_rmap(page, vma, addr); - lru_cache_add_active(page); - mm->rss++; - mm->anon_rss++; - + addr += PAGE_SIZE; + + if (addr >= end_addr) + break; + page_table = pte_offset_map(pmd, addr); + orig_entry = *page_table; + } while (pte_none(orig_entry)); } - pte_unmap(page_table); -out: - return VM_FAULT_MINOR; -no_mem: - return VM_FAULT_OOM; + current->anon_fault_next_addr = addr; + return VM_FAULT_MINOR; } /* Index: linux-2.6.9/kernel/sysctl.c =================================================================== --- linux-2.6.9.orig/kernel/sysctl.c 2004-12-15 15:00:22.000000000 -0800 +++ linux-2.6.9/kernel/sysctl.c 2004-12-16 11:06:37.000000000 -0800 @@ -56,6 +56,7 @@ extern int C_A_D; extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; +extern int sysctl_max_prealloc_order; extern int max_threads; extern int sysrq_enabled; extern int core_uses_pid; @@ -816,6 +817,16 @@ .strategy = &sysctl_jiffies, }, #endif + { + .ctl_name = VM_MAX_PREFAULT_ORDER, + .procname = "max_prealloc_order", + .data = &sysctl_max_prealloc_order, + .maxlen = sizeof(sysctl_max_prealloc_order), + .mode = 0644, + .proc_handler = &proc_dointvec, + .strategy = &sysctl_intvec, + .extra1 = &zero, + }, { .ctl_name = 0 } }; Index: linux-2.6.9/include/linux/sysctl.h =================================================================== --- linux-2.6.9.orig/include/linux/sysctl.h 2004-12-15 15:00:22.000000000 -0800 +++ linux-2.6.9/include/linux/sysctl.h 2004-12-16 11:06:37.000000000 -0800 @@ -168,6 +168,7 @@ VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */ VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */ VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */ + VM_MAX_PREFAULT_ORDER=29, /* max prefault order during anonymous page faults */ }; ^ permalink raw reply [flat|nested] 286+ messages in thread
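[Editorial note: the prefault window arithmetic in the patch above can be checked in isolation. The sketch below is a userspace model of just the prediction logic in the patched do_anonymous_page(), with illustrative page/pmd constants (PAGE_SHIFT 12, PMD_SHIFT 21 are assumptions, not taken from the patch): on a missed prediction the window is one page, on each hit the order grows by one (doubling the window) up to the sysctl cap, and the window is clipped to the vma end and to the current pmd.]

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_SHIFT  21
#define PMD_MASK   (~((1UL << PMD_SHIFT) - 1))

static int max_prealloc_order = 3;  /* /proc/sys/vm/max_prealloc_order */

/* Model of the window calculation: returns the end address of the
 * range to populate, updating the per-task prediction order. */
static unsigned long prefault_end(unsigned long addr, unsigned long next_addr,
                                  unsigned long vm_end, int *order)
{
    unsigned long end;

    addr &= ~(PAGE_SIZE - 1);
    if (next_addr != addr) {        /* prediction miss: single page */
        *order = 0;
        return addr + PAGE_SIZE;
    }
    if (++*order > max_prealloc_order)
        *order = max_prealloc_order;
    end = addr + (PAGE_SIZE << *order);
    if (end > vm_end)               /* do not prefault beyond the vma */
        end = vm_end;
    if ((addr & PMD_MASK) != (end & PMD_MASK))
        end &= PMD_MASK;            /* stay within one pmd */
    return end;
}
```

With the default cap of 3, a linear walk settles into faulting once per 8 pages, which is the factor-of-8 fault reduction claimed in the posting.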
* Re: page fault scalability patch V13 [0/8]: Overview 2004-12-17 3:32 ` page fault scalability patch V13 [0/8]: Overview Christoph Lameter ` (6 preceding siblings ...) 2004-12-17 3:39 ` page fault scalability patch V13 [8/8]: Prefaulting using ptep_cmpxchg Christoph Lameter @ 2004-12-17 5:55 ` Christoph Lameter 7 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-12-17 5:55 UTC (permalink / raw) Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 149 bytes --] It's confirmed that my MUA (pine) eats blanks and reformats patches. An archive of all the patches is attached. Use the reformatted stuff for comments only. [-- Attachment #2: page fault scalability patches v13 --] [-- Type: APPLICATION/x-gtar, Size: 15736 bytes --] ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: pfault V12 : correction to tasklist rss 2004-12-09 18:37 ` Hugh Dickins ` (2 preceding siblings ...) 2004-12-10 18:43 ` Christoph Lameter @ 2004-12-10 20:03 ` Christoph Lameter 2004-12-10 21:24 ` Hugh Dickins 3 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-12-10 20:03 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Thu, 9 Dec 2004, Hugh Dickins wrote: > Updating current->rss in do_anonymous_page, current->anon_rss in > page_add_anon_rmap, is not always correct: ptrace's access_process_vm > uses get_user_pages on another task. You need check that current->mm == > mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss, > fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock > for that). You'll also need to check !(current->flags & PF_BORROWED_MM), > to guard against use_mm. Or... just go back to sloppy rss. Use_mm can simply attach the kernel thread to the mm via mm_add_thread and will then update mm->rss when being detached again. The issue with ptrace and get_user_pages is a bit thorny. I did the check for mm = current->mm in the following patch. If mm != current->mm then do the sloppy thing and increment mm->rss without the page table lock. This should be a very special rare case. One could also set current to the target task in get_user_pages but then faults for the actual current task may increment the wrong counters. Could we live with that? Or simply leave as is. The pages are after all allocated by the ptrace process and it should be held responsible for it. My favorite rss solution is still just getting rid of rss and anon_rss and do the long loops in procfs. Whichever process wants to know better be willing to pay the price in cpu time and the code for incrementing rss can be removed from the page fault handler. We have no real way of establishing the ownership of shared pages anyways. 
Its counted when allocated. But the page may live on afterwards in another process and then not be accounted for although its only user is the new process. IMHO vm scans may be the only way of really getting an accurate count. But here is the improved list_rss patch: Index: linux-2.6.9/include/linux/sched.h =================================================================== --- linux-2.6.9.orig/include/linux/sched.h 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/include/linux/sched.h 2004-12-10 11:39:00.000000000 -0800 @@ -30,6 +30,7 @@ #include <linux/pid.h> #include <linux/percpu.h> #include <linux/topology.h> +#include <linux/rcupdate.h> struct exec_domain; @@ -217,6 +218,7 @@ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */ + long rss, anon_rss; struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung * together off init_mm.mmlist, and are protected @@ -226,7 +228,7 @@ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; - unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; + unsigned long total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ @@ -236,6 +238,8 @@ /* Architecture-specific MM context */ mm_context_t context; + struct list_head task_list; /* Tasks using this mm */ + struct rcu_head rcu_head; /* For freeing mm via rcu */ /* Token based thrashing protection. 
*/ unsigned long swap_token_time; @@ -545,6 +549,9 @@ struct list_head ptrace_list; struct mm_struct *mm, *active_mm; + /* Split counters from mm */ + long rss; + long anon_rss; /* task state */ struct linux_binfmt *binfmt; @@ -578,6 +585,9 @@ struct completion *vfork_done; /* for vfork() */ int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ + + /* List of other tasks using the same mm */ + struct list_head mm_tasks; unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; @@ -1124,6 +1134,12 @@ #endif +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss); + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk); +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk); + #endif /* __KERNEL__ */ #endif + Index: linux-2.6.9/fs/proc/task_mmu.c =================================================================== --- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-12-06 17:23:54.000000000 -0800 +++ linux-2.6.9/fs/proc/task_mmu.c 2004-12-10 11:39:00.000000000 -0800 @@ -6,8 +6,9 @@ char *task_mem(struct mm_struct *mm, char *buffer) { - unsigned long data, text, lib; + unsigned long data, text, lib, rss, anon_rss; + get_rss(mm, &rss, &anon_rss); data = mm->total_vm - mm->shared_vm - mm->stack_vm; text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; @@ -22,7 +23,7 @@ "VmPTE:\t%8lu kB\n", (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10), mm->locked_vm << (PAGE_SHIFT-10), - mm->rss << (PAGE_SHIFT-10), + rss << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10); @@ -37,11 +38,14 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - *shared = mm->rss - mm->anon_rss; + unsigned long rss, anon_rss; + + get_rss(mm, &rss, &anon_rss); + *shared = rss - anon_rss; 
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT; *data = mm->total_vm - mm->shared_vm; - *resident = mm->rss; + *resident = rss; return mm->total_vm; } Index: linux-2.6.9/fs/proc/array.c =================================================================== --- linux-2.6.9.orig/fs/proc/array.c 2004-12-06 17:23:54.000000000 -0800 +++ linux-2.6.9/fs/proc/array.c 2004-12-10 11:39:00.000000000 -0800 @@ -302,7 +302,7 @@ static int do_task_stat(struct task_struct *task, char * buffer, int whole) { - unsigned long vsize, eip, esp, wchan = ~0UL; + unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL; long priority, nice; int tty_pgrp = -1, tty_nr = 0; sigset_t sigign, sigcatch; @@ -325,6 +325,7 @@ vsize = task_vsize(mm); eip = KSTK_EIP(task); esp = KSTK_ESP(task); + get_rss(mm, &rss, &anon_rss); } get_task_comm(tcomm, task); @@ -420,7 +421,7 @@ jiffies_to_clock_t(task->it_real_value), start_time, vsize, - mm ? mm->rss : 0, /* you might want to shift this left 3 */ + mm ? rss : 0, /* you might want to shift this left 3 */ rsslim, mm ? mm->start_code : 0, mm ? 
mm->end_code : 0, Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-12-10 11:11:26.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-12-10 11:46:07.000000000 -0800 @@ -263,8 +263,6 @@ pte_t *pte; int referenced = 0; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -438,7 +436,10 @@ BUG_ON(PageReserved(page)); BUG_ON(!anon_vma); - vma->vm_mm->anon_rss++; + if (current->mm == vma->vm_mm) + current->anon_rss++; + else + vma->vm_mm->anon_rss++; anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; index = (address - vma->vm_start) >> PAGE_SHIFT; @@ -510,8 +511,6 @@ pte_t pteval; int ret = SWAP_AGAIN; - if (!mm->rss) - goto out; address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -799,8 +798,7 @@ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) continue; cursor = (unsigned long) vma->vm_private_data; - while (vma->vm_mm->rss && - cursor < max_nl_cursor && + while (cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { try_to_unmap_cluster(cursor, &mapcount, vma); cursor += CLUSTER_SIZE; Index: linux-2.6.9/kernel/fork.c =================================================================== --- linux-2.6.9.orig/kernel/fork.c 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/kernel/fork.c 2004-12-10 11:39:00.000000000 -0800 @@ -151,6 +151,7 @@ *tsk = *orig; tsk->thread_info = ti; ti->task = tsk; + tsk->rss = 0; /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); @@ -292,6 +293,7 @@ atomic_set(&mm->mm_count, 1); init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); + INIT_LIST_HEAD(&mm->task_list); mm->core_waiters = 0; mm->nr_ptes = 0; spin_lock_init(&mm->page_table_lock); @@ -323,6 +325,13 @@ return mm; } +static void rcu_free_mm(struct rcu_head *head) +{ + struct mm_struct *mm = container_of(head ,struct mm_struct, rcu_head); + + free_mm(mm); +} + /* * Called when the 
last reference to the mm * is dropped: either by a lazy thread or by @@ -333,7 +342,7 @@ BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); - free_mm(mm); + call_rcu(&mm->rcu_head, rcu_free_mm); } /* @@ -400,6 +409,8 @@ /* Get rid of any cached register state */ deactivate_mm(tsk, mm); + if (mm) + mm_remove_thread(mm, tsk); /* notify parent sleeping on vfork() */ if (vfork_done) { @@ -447,8 +458,8 @@ * new threads start up in user mode using an mm, which * allows optimizing out ipis; the tlb_gather_mmu code * is an example. + * (mm_add_thread does use the ptl .... ) */ - spin_unlock_wait(&oldmm->page_table_lock); goto good_mm; } @@ -470,6 +481,7 @@ goto free_pt; good_mm: + mm_add_thread(mm, tsk); tsk->mm = mm; tsk->active_mm = mm; return 0; Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-10 11:12:44.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-10 11:45:00.000000000 -0800 @@ -1467,8 +1467,10 @@ */ page_add_anon_rmap(page, vma, addr); lru_cache_add_active(page); - mm->rss++; - + if (current->mm == mm) + current->rss++; + else + mm->rss++; } pte_unmap(page_table); @@ -1859,3 +1861,49 @@ } #endif + +void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss) +{ + struct list_head *y; + struct task_struct *t; + long rss_sum, anon_rss_sum; + + rcu_read_lock(); + rss_sum = mm->rss; + anon_rss_sum = mm->anon_rss; + list_for_each_rcu(y, &mm->task_list) { + t = list_entry(y, struct task_struct, mm_tasks); + rss_sum += t->rss; + anon_rss_sum += t->anon_rss; + } + if (rss_sum < 0) + rss_sum = 0; + if (anon_rss_sum < 0) + anon_rss_sum = 0; + rcu_read_unlock(); + *rss = rss_sum; + *anon_rss = anon_rss_sum; +} + +void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + if (!mm) + return; + + spin_lock(&mm->page_table_lock); + mm->rss += tsk->rss; + mm->anon_rss += tsk->anon_rss; + list_del_rcu(&tsk->mm_tasks); + 
spin_unlock(&mm->page_table_lock); +} + +void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk) +{ + spin_lock(&mm->page_table_lock); + tsk->rss = 0; + tsk->anon_rss = 0; + list_add_rcu(&tsk->mm_tasks, &mm->task_list); + spin_unlock(&mm->page_table_lock); +} + + Index: linux-2.6.9/include/linux/init_task.h =================================================================== --- linux-2.6.9.orig/include/linux/init_task.h 2004-12-06 17:23:55.000000000 -0800 +++ linux-2.6.9/include/linux/init_task.h 2004-12-10 11:39:00.000000000 -0800 @@ -42,6 +42,7 @@ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ .default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \ + .task_list = LIST_HEAD_INIT(name.task_list), \ } #define INIT_SIGNALS(sig) { \ @@ -112,6 +113,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \ } Index: linux-2.6.9/fs/exec.c =================================================================== --- linux-2.6.9.orig/fs/exec.c 2004-12-06 17:23:54.000000000 -0800 +++ linux-2.6.9/fs/exec.c 2004-12-10 11:39:00.000000000 -0800 @@ -543,6 +543,7 @@ active_mm = tsk->active_mm; tsk->mm = mm; tsk->active_mm = mm; + mm_add_thread(mm, current); activate_mm(active_mm, mm); task_unlock(tsk); arch_pick_mmap_layout(mm); Index: linux-2.6.9/fs/aio.c =================================================================== --- linux-2.6.9.orig/fs/aio.c 2004-12-06 17:23:54.000000000 -0800 +++ linux-2.6.9/fs/aio.c 2004-12-10 11:39:00.000000000 -0800 @@ -575,6 +575,7 @@ atomic_inc(&mm->mm_count); tsk->mm = mm; tsk->active_mm = mm; + mm_add_thread(mm, tsk); activate_mm(active_mm, mm); task_unlock(tsk); @@ -597,6 +598,7 @@ struct task_struct *tsk = current; task_lock(tsk); + mm_remove_thread(mm,tsk); tsk->flags &= ~PF_BORROWED_MM; tsk->mm = NULL; /* active_mm is still 'mm' */ ^ permalink raw reply [flat|nested] 286+ messages in thread
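[Editorial note: the ptrace correction discussed above reduces to one accounting rule, modeled here as a userspace sketch with hypothetical struct names. Credit the cheap per-task counter only when the faulting task owns the mm; for a remote fault (get_user_pages on another task's mm, as ptrace does) fall back to the shared mm counter. The real patch takes page_table_lock, or accepts sloppiness, on that fallback path; locking is omitted in this model.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for mm_struct / task_struct. */
struct mm   { long rss; };
struct task { struct mm *mm; long rss; };

/* Model of the corrected accounting rule: the per-task delta is only
 * safe to bump when the fault is against the caller's own mm. */
static void account_fault(struct task *current, struct mm *mm)
{
    if (current->mm == mm)
        current->rss++;   /* common case: lock-free per-task delta */
    else
        mm->rss++;        /* remote fault: charge the mm directly */
}
```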
* Re: pfault V12 : correction to tasklist rss 2004-12-10 20:03 ` pfault V12 : correction to tasklist rss Christoph Lameter @ 2004-12-10 21:24 ` Hugh Dickins 2004-12-10 21:38 ` Andrew Morton 0 siblings, 1 reply; 286+ messages in thread From: Hugh Dickins @ 2004-12-10 21:24 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt, Nick Piggin, linux-mm, linux-ia64, linux-kernel On Fri, 10 Dec 2004, Christoph Lameter wrote: > On Thu, 9 Dec 2004, Hugh Dickins wrote: > > > Updating current->rss in do_anonymous_page, current->anon_rss in > > page_add_anon_rmap, is not always correct: ptrace's access_process_vm > > uses get_user_pages on another task. You need check that current->mm == > > mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss, > > fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock > > for that). You'll also need to check !(current->flags & PF_BORROWED_MM), > > to guard against use_mm. Or... just go back to sloppy rss. > > Use_mm can simply attach the kernel thread to the mm via mm_add_thread > and will then update mm->rss when being detached again. True. But please add and remove mm outside of the task_lock, there's no need to nest page_table_lock within it, is there? > The issue with ptrace and get_user_pages is a bit thorny. I did the check > for mm = current->mm in the following patch. If mm != current->mm then > do the sloppy thing and increment mm->rss without the page table lock. > This should be a very special rare case. I don't understand why you want to avoid taking mm->page_table_lock in that special rare case. I do prefer the sloppy rss approach, but if you're trying to be exact then it's regrettable to leave sloppy corners. Oh, is it because page_add_anon_rmap is usually called with page_table_lock, but without in your do_anonymous_page case? 
You'll have to move the anon_rss incrementation out of page_add_anon_rmap to its callsites (I was being a little bit lazy when I sited it in that one place, it's probably better to do it near mm->rss anyway.) > One could also set current to the target task in get_user_pages but then > faults for the actual current task may increment the wrong counters. Could > we live with that? No, "current" is not nearly so easy to play with as that. See i386. Even if it were, you might get burnt for heresy. > Or simply leave as is. The pages are after all allocated by the ptrace > process and it should be held responsible for it. No. > My favorite rss solution is still just getting rid of rss and > anon_rss and do the long loops in procfs. Whichever process wants to > know better be willing to pay the price in cpu time and the code for > incrementing rss can be removed from the page fault handler. We all seem to have different favourites. Your favourite makes quite a few people very angry. We've been there, we've done that, we've no wish to return. It'd be fine if just the process which wants to know paid the price; but it's every other that has to pay. > We have no real way of establishing the ownership of shared pages > anyways. Its counted when allocated. But the page may live on afterwards > in another process and then not be accounted for although its only user is > the new process. I didn't understand that bit. > IMHO vm scans may be the only way of really getting an accurate count. > > But here is the improved list_rss patch: Not studied in depth, but... am I going mad, or is your impressive RCUing the wrong way round? While we're scanning the list of tasks sharing the mm, there's no danger of the mm vanishing, but there is a danger of the task vanishing. Isn't it therefore the task which needs to be freed via RCU, not the mm? Hugh ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: pfault V12 : correction to tasklist rss 2004-12-10 21:24 ` Hugh Dickins @ 2004-12-10 21:38 ` Andrew Morton 2004-12-11 6:03 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Andrew Morton @ 2004-12-10 21:38 UTC (permalink / raw) To: Hugh Dickins Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > > We have no real way of establishing the ownership of shared pages > > anyways. Its counted when allocated. But the page may live on afterwards > > in another process and then not be accounted for although its only user is > > the new process. > > I didn't understand that bit. We did lose some accounting accuracy when the pagetable walk and the big tasklist walks were removed. Bill would probably have more details. Given that the code as it stood was a complete showstopper, the tradeoff seemed reasonable. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: pfault V12 : correction to tasklist rss 2004-12-10 21:38 ` Andrew Morton @ 2004-12-11 6:03 ` William Lee Irwin III 0 siblings, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-12-11 6:03 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: >>> We have no real way of establishing the ownership of shared pages >>> anyways. Its counted when allocated. But the page may live on afterwards >>> in another process and then not be accounted for although its only user is >>> the new process. On Fri, Dec 10, 2004 at 01:38:59PM -0800, Andrew Morton wrote: > We did lose some accounting accuracy when the pagetable walk and the big > tasklist walks were removed. Bill would probably have more details. Given > that the code as it stood was a complete showstopper, the tradeoff seemed > reasonable. There are several issues, not listed in order of importance here: (1) Workload monitoring with high multiprogramming levels was infeasible. (2) The long address space walks interfered with mmap() and page faults in the monitored processes, disturbing cluster membership and exceeding maximum response times in monitored workloads. (3) There's a general long-running ongoing effort to take on various places tasklist_lock is abused one-by-one to incrementally resolve or otherwise mitigate the rwlock starvation issues. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:22 ` Linus Torvalds 2004-11-22 22:27 ` Christoph Lameter @ 2004-11-22 22:32 ` Nick Piggin 2004-11-22 22:39 ` Christoph Lameter 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-22 22:32 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Linus Torvalds wrote: > > On Mon, 22 Nov 2004, Christoph Lameter wrote: > >>The problem is then that the proc filesystem must do an extensive scan >>over all threads to find users of a certain mm_struct. > > > The alternative is to just add a simple list into the task_struct and the > head of it into mm_struct. Then, at fork, you just finish the fork() with > > list_add(p->mm_list, p->mm->thread_list); > > and do the proper list_del() in exit_mm() or wherever. > > You'll still loop in /proc, but you'll do the minimal loop necessary. > Yes, that was what I was thinking we'd have to resort to. Not a bad idea. It would be nice if you could have it integrated with the locking that is already there - for example mmap_sem, although that might mean you'd have to take mmap_sem for writing which may limit scalability of thread creation / destruction... maybe a separate lock / semaphore for that list itself would be OK. Deferred rss might be a practical solution, but I'd prefer this if it can be made workable. ^ permalink raw reply [flat|nested] 286+ messages in thread
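[Editorial note: the list Linus suggests here is exactly what the mm_add_thread()/mm_remove_thread() helpers in the posted patches implement. A userspace sketch (hypothetical types, a plain singly linked list instead of the kernel list API, no locking) shows the key invariant: folding an exiting task's delta back into mm->rss on removal leaves the reported total unchanged.]

```c
#include <assert.h>
#include <stddef.h>

struct task { long rss; struct task *next; };
struct mm   { long rss; struct task *tasks; };

/* fork path: attach the task to its mm's list with a zeroed delta. */
static void mm_add_thread(struct mm *mm, struct task *tsk)
{
    tsk->rss = 0;
    tsk->next = mm->tasks;
    mm->tasks = tsk;
}

/* exit path: unlink the task and fold its delta into the baseline,
 * so concurrent get_rss() readers never lose the pages it touched. */
static void mm_remove_thread(struct mm *mm, struct task *tsk)
{
    struct task **p = &mm->tasks;
    while (*p != tsk)
        p = &(*p)->next;
    *p = tsk->next;
    mm->rss += tsk->rss;
}

static long total_rss(const struct mm *mm)  /* what /proc would report */
{
    long rss = mm->rss;
    for (const struct task *t = mm->tasks; t; t = t->next)
        rss += t->rss;
    return rss < 0 ? 0 : rss;
}
```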
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:32 ` deferred rss update instead of sloppy rss Nick Piggin @ 2004-11-22 22:39 ` Christoph Lameter 2004-11-22 23:14 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 22:39 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel On Tue, 23 Nov 2004, Nick Piggin wrote: > Deferred rss might be a practical solution, but I'd prefer this if it can > be made workable. Both result in an additional field in task_struct that is going to be incremented when the page_table_lock is not held. It would be possible to switch to looping in procfs later. The main question with this patchset is: How and when can we get this into the kernel? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: deferred rss update instead of sloppy rss 2004-11-22 22:39 ` Christoph Lameter @ 2004-11-22 23:14 ` Nick Piggin 0 siblings, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-22 23:14 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > On Tue, 23 Nov 2004, Nick Piggin wrote: > > >>Deferred rss might be a practical solution, but I'd prefer this if it can >>be made workable. > > > Both result in an additional field in task_struct that is going to be > incremented when the page_table_lock is not held. It would be possible > to switch to looping in procfs later. The main question with this patchset > is: > Sure. > How and when can we get this into the kernel? > Well, it is a good starting platform for the various PTL reduction patches floating around. I'd say Andrew could be convinced to stick it in -mm after 2.6.10, but we'd probably need a clear path to one of the PTL patches before anything would move into 2.6. ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [2/7]: page fault handler optimizations 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter 2004-11-19 19:43 ` page fault scalability patch V11 [1/7]: sloppy rss Christoph Lameter @ 2004-11-19 19:44 ` Christoph Lameter 2004-11-19 19:44 ` page fault scalability patch V11 [3/7]: ia64 atomic pte operations Christoph Lameter ` (7 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:44 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Increase parallelism in SMP configurations by deferring the acquisition of page_table_lock in handle_mm_fault * Anonymous memory page faults bypass the page_table_lock through the use of atomic page table operations * Swapper does not set pte to empty in transition to swap * Simulate atomic page table operations using the page_table_lock if an arch does not define __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides a performance benefit since the page_table_lock is held for shorter periods of time. Signed-off-by: Christoph Lameter <clameter@sgi.com Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-11-19 06:38:53.000000000 -0800 @@ -1330,8 +1330,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. 
+ * We hold the mm semaphore */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1343,15 +1342,13 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); if (!page) { /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1374,8 +1371,7 @@ lock_page(page); /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. + * Back out if somebody else faulted in this pte */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); @@ -1422,14 +1418,12 @@ } /* - * We are called with the MM semaphore and page_table_lock - * spinlock held to protect against concurrent faults in - * multithreaded programs. + * We are called with the MM semaphore held. */ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table, pmd_t *pmd, int write_access, - unsigned long addr) + unsigned long addr, pte_t orig_entry) { pte_t entry; struct page * page = ZERO_PAGE(addr); @@ -1441,7 +1435,6 @@ if (write_access) { /* Allocate our own private page. 
*/ pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_mem; @@ -1450,30 +1443,37 @@ goto no_mem; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); - if (!pte_none(*page_table)) { - pte_unmap(page_table); - page_cache_release(page); - spin_unlock(&mm->page_table_lock); - goto out; - } - mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma); - lru_cache_add_active(page); mark_page_accessed(page); - page_add_anon_rmap(page, vma, addr); } - set_pte(page_table, entry); + /* update the entry */ + if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) { + if (write_access) { + pte_unmap(page_table); + page_cache_release(page); + } + goto out; + } + if (write_access) { + /* + * These two functions must come after the cmpxchg + * because if the page is on the LRU then try_to_unmap may come + * in and unmap the pte. + */ + lru_cache_add_active(page); + page_add_anon_rmap(page, vma, addr); + mm->rss++; + + } pte_unmap(page_table); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); out: return VM_FAULT_MINOR; no_mem: @@ -1489,12 +1489,12 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held. 
*/ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd) + unsigned long address, int write_access, pte_t *page_table, + pmd_t *pmd, pte_t orig_entry) { struct page * new_page; struct address_space *mapping = NULL; @@ -1505,9 +1505,8 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, - pmd, write_access, address); + pmd, write_access, address, orig_entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (vma->vm_file) { mapping = vma->vm_file->f_mapping; @@ -1605,7 +1604,7 @@ * nonlinear vmas. */ static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma, - unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) + unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry) { unsigned long pgoff; int err; @@ -1618,13 +1617,12 @@ if (!vma->vm_ops || !vma->vm_ops->populate || (write_access && !(vma->vm_flags & VM_SHARED))) { pte_clear(pte); - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); } pgoff = pte_to_pgoff(*pte); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1643,49 +1641,40 @@ * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * - * Note the "page_table_lock". It is to protect against kswapd removing - * pages from under us. Note that kswapd only ever _removes_ pages, never - * adds them. As such, once we have noticed that the page is not present, - * we can drop the lock early. - * - * The adding of pages is protected by the MM semaphore (which we hold), - * so we don't need to worry about a page being suddenly been added into - * our VM. 
- * - * We enter with the pagetable spinlock held, we are supposed to - * release it when done. + * Note that kswapd only ever _removes_ pages, never adds them. + * We need to insure to handle that case properly. */ static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access, pte_t *pte, pmd_t *pmd) { pte_t entry; + pte_t new_entry; entry = *pte; if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) - return do_no_page(mm, vma, address, write_access, pte, pmd); + return do_no_page(mm, vma, address, write_access, pte, pmd, entry); if (pte_file(entry)) - return do_file_page(mm, vma, address, write_access, pte, pmd); + return do_file_page(mm, vma, address, write_access, pte, pmd, entry); return do_swap_page(mm, vma, address, pte, pmd, entry, write_access); } + /* + * This is the case in which we only update some bits in the pte. + */ + new_entry = pte_mkyoung(entry); if (write_access) { - if (!pte_write(entry)) + if (!pte_write(entry)) { + /* do_wp_page expects us to hold the page_table_lock */ + spin_lock(&mm->page_table_lock); return do_wp_page(mm, vma, address, pte, pmd, entry); - - entry = pte_mkdirty(entry); + } + new_entry = pte_mkdirty(new_entry); } - entry = pte_mkyoung(entry); - ptep_set_access_flags(vma, address, pte, entry, write_access); - update_mmu_cache(vma, address, entry); + if (ptep_cmpxchg(vma, address, pte, entry, new_entry)) + update_mmu_cache(vma, address, new_entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -1703,22 +1692,45 @@ inc_page_state(pgfault); - if (is_vm_hugetlb_page(vma)) + if (unlikely(is_vm_hugetlb_page(vma))) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. 
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates. + * to synchronize with kswapd */ - spin_lock(&mm->page_table_lock); - pmd = pmd_alloc(mm, pgd, address); + if (unlikely(pgd_none(*pgd))) { + pmd_t *new = pmd_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + /* Insure that the update is done in an atomic way */ + if (!pgd_test_and_populate(mm, pgd, new)) + pmd_free(new); + } + + pmd = pmd_offset(pgd, address); + + if (likely(pmd)) { + pte_t *pte; + + if (!pmd_present(*pmd)) { + struct page *new; - if (pmd) { - pte_t * pte = pte_alloc_map(mm, pmd, address); - if (pte) + new = pte_alloc_one(mm, address); + if (!new) + return VM_FAULT_OOM; + + if (!pmd_test_and_populate(mm, pmd, new)) + pte_free(new); + else + inc_page_state(nr_page_table_pages); + } + + pte = pte_offset_map(pmd, address); + if (likely(pte)) return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } - spin_unlock(&mm->page_table_lock); return VM_FAULT_OOM; } Index: linux-2.6.9/include/asm-generic/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700 +++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-19 07:54:05.000000000 -0800 @@ -134,4 +134,60 @@ #define pgd_offset_gate(mm, addr) pgd_offset(mm, addr) #endif +#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS +/* + * If atomic page table operations are not available then use + * the page_table_lock to insure some form of locking. + * Note thought that low level operations as well as the + * page_table_handling of the cpu may bypass all locking. 
+ */ + +#ifndef __HAVE_ARCH_PTEP_CMPXCHG +#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \ +({ \ + int __rc; \ + spin_lock(&__vma->vm_mm->page_table_lock); \ + __rc = pte_same(*(__ptep), __oldval); \ + if (__rc) set_pte(__ptep, __newval); \ + spin_unlock(&__vma->vm_mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE +#define pgd_test_and_populate(__mm, __pgd, __pmd) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pgd_present(*(__pgd)); \ + if (__rc) pgd_populate(__mm, __pgd, __pmd); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#ifndef __HAVE_PMD_TEST_AND_POPULATE +#define pmd_test_and_populate(__mm, __pmd, __page) \ +({ \ + int __rc; \ + spin_lock(&__mm->page_table_lock); \ + __rc = !pmd_present(*(__pmd)); \ + if (__rc) pmd_populate(__mm, __pmd, __page); \ + spin_unlock(&__mm->page_table_lock); \ + __rc; \ +}) +#endif + +#endif + +#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\ + flush_tlb_page(__vma, __address); \ + __p; \ +}) + +#endif + #endif /* _ASM_GENERIC_PGTABLE_H */ Index: linux-2.6.9/mm/rmap.c =================================================================== --- linux-2.6.9.orig/mm/rmap.c 2004-11-19 06:38:51.000000000 -0800 +++ linux-2.6.9/mm/rmap.c 2004-11-19 06:38:53.000000000 -0800 @@ -419,7 +419,10 @@ * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped * - * The caller needs to hold the mm->page_table_lock. + * The caller needs to hold the mm->page_table_lock if page + * is pointing to something that is known by the vm. + * The lock does not need to be held if page is pointing + * to a newly allocated page. */ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) @@ -561,11 +564,6 @@ /* Nuke the page table entry. 
*/ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); if (PageAnon(page)) { swp_entry_t entry = { .val = page->private }; @@ -580,11 +578,15 @@ list_add(&mm->mmlist, &init_mm.mmlist); spin_unlock(&mmlist_lock); } - set_pte(pte, swp_entry_to_pte(entry)); + pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); mm->anon_rss--; - } + } else + pteval = ptep_clear_flush(vma, address, pte); + /* Move the dirty bit to the physical page now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); mm->rss--; page_remove_rmap(page); page_cache_release(page); @@ -671,15 +673,21 @@ if (ptep_clear_flush_young(vma, address, pte)) continue; - /* Nuke the page table entry. */ flush_cache_page(vma, address); - pteval = ptep_clear_flush(vma, address, pte); + /* + * There would be a race here with handle_mm_fault and do_anonymous_page + * which bypasses the page_table_lock if we would zap the pte before + * putting something into it. On the other hand we need to + * have the dirty flag setting at the time we replaced the value. + */ /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) - set_pte(pte, pgoff_to_pte(page->index)); + pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index)); + else + pteval = ptep_get_and_clear(pte); - /* Move the dirty bit to the physical page now the pte is gone. */ + /* Move the dirty bit to the physical page now that the pte is gone. */ if (pte_dirty(pteval)) set_page_dirty(page); ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [3/7]: ia64 atomic pte operations 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter 2004-11-19 19:43 ` page fault scalability patch V11 [1/7]: sloppy rss Christoph Lameter 2004-11-19 19:44 ` page fault scalability patch V11 [2/7]: page fault handler optimizations Christoph Lameter @ 2004-11-19 19:44 ` Christoph Lameter 2004-11-19 19:45 ` page fault scalability patch V11 [4/7]: universal cmpxchg for i386 Christoph Lameter ` (6 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:44 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for ia64 * Enhanced parallelism in page fault handler if applied together with the generic patch Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800 @@ -34,6 +34,10 @@ #define pmd_quicklist (local_cpu_data->pmd_quick) #define pgtable_cache_size (local_cpu_data->pgtable_cache_sz) +/* Empty entries of PMD and PGD */ +#define PMD_NONE 0 +#define PGD_NONE 0 + static inline pgd_t* pgd_alloc_one_fast (struct mm_struct *mm) { @@ -78,12 +82,19 @@ preempt_enable(); } + static inline void pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) { pgd_val(*pgd_entry) = __pa(pmd); } +/* Atomic populate */ +static inline int +pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd) +{ + return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE; +} static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) @@ -132,6 +143,13 @@ pmd_val(*pmd_entry) = page_to_phys(pte); } +/* 
Atomic populate */ +static inline int +pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte) +{ + return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE; +} + static inline void pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte) { Index: linux-2.6.9/include/asm-ia64/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800 @@ -414,6 +425,26 @@ #endif } +/* + * IA-64 doesn't have any external MMU info: the page tables contain all the necessary + * information. However, we use this routine to take care of any (delayed) i-cache + * flushing that may be necessary. + */ +extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); + +static inline int +ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval) +{ + /* + * IA64 defers icache flushes. If the new pte is executable we may + * have to flush the icache to insure cache coherency immediately + * after the cmpxchg. + */ + if (pte_exec(newval)) + update_mmu_cache(vma, addr, newval); + return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte; +} + static inline int pte_same (pte_t a, pte_t b) { @@ -476,13 +507,6 @@ struct vm_area_struct * prev, unsigned long start, unsigned long end); #endif -/* - * IA-64 doesn't have any external MMU info: the page tables contain all the necessary - * information. However, we use this routine to take care of any (delayed) i-cache - * flushing that may be necessary. 
- */ -extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte); - #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Update PTEP with ENTRY, which is guaranteed to be a less @@ -560,6 +584,8 @@ #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME #define __HAVE_ARCH_PGD_OFFSET_GATE +#define __HAVE_ARCH_ATOMIC_TABLE_OPS +#define __HAVE_ARCH_LOCK_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _ASM_IA64_PGTABLE_H */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [4/7]: universal cmpxchg for i386 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (2 preceding siblings ...) 2004-11-19 19:44 ` page fault scalability patch V11 [3/7]: ia64 atomic pte operations Christoph Lameter @ 2004-11-19 19:45 ` Christoph Lameter 2004-11-19 19:46 ` page fault scalability patch V11 [5/7]: i386 atomic pte operations Christoph Lameter ` (5 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:45 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Make cmpxchg and cmpxchg8b generally available on the i386 platform. * Provide emulation of cmpxchg suitable for uniprocessor if build and run on 386. * Provide emulation of cmpxchg8b suitable for uniprocessor systems if build and run on 386 or 486. * Provide an inline function to atomically get a 64 bit value via cmpxchg8b in an SMP system (courtesy of Nick Piggin) (important for i386 PAE mode and other places where atomic 64 bit operations are useful) Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/arch/i386/Kconfig =================================================================== --- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800 +++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800 @@ -351,6 +351,11 @@ depends on !M386 default y +config X86_CMPXCHG8B + bool + depends on !M386 && !M486 + default y + config X86_XADD bool depends on !M386 Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c =================================================================== --- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800 +++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800 @@ -6,6 +6,7 @@ #include <linux/bitops.h> #include <linux/smp.h> #include <linux/thread_info.h> +#include 
<linux/module.h> #include <asm/processor.h> #include <asm/msr.h> @@ -287,5 +288,103 @@ return 0; } +#ifndef CONFIG_X86_CMPXCHG +unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new) +{ + u8 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u8)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u8 *)ptr; + if (prev == old) + *(u8 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u8); + +unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new) +{ + u16 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u16)); + + /* Poor man's cmpxchg for 386. Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u16 *)ptr; + if (prev == old) + *(u16 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u16); + +unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) +{ + u32 prev; + unsigned long flags; + /* + * Check if the kernel was compiled for an old cpu but the + * currently running cpu can do cmpxchg after all + * All CPUs except 386 support CMPXCHG + */ + if (cpu_data->x86 > 3) + return __cmpxchg(ptr, old, new, sizeof(u32)); + + /* Poor man's cmpxchg for 386. 
Unsuitable for SMP */ + local_irq_save(flags); + prev = *(u32 *)ptr; + if (prev == old) + *(u32 *)ptr = new; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg_386_u32); +#endif + +#ifndef CONFIG_X86_CMPXCHG8B +unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + unsigned long flags; + + /* + * Check if the kernel was compiled for an old cpu but + * we are running really on a cpu capable of cmpxchg8b + */ + + if (cpu_has(cpu_data, X86_FEATURE_CX8)) + return __cmpxchg8b(ptr, old, newv); + + /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */ + local_irq_save(flags); + prev = *ptr; + if (prev == old) + *ptr = newv; + local_irq_restore(flags); + return prev; +} + +EXPORT_SYMBOL(cmpxchg8b_486); +#endif + // arch_initcall(intel_cpu_init); Index: linux-2.6.9/include/asm-i386/system.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800 @@ -149,6 +149,9 @@ #define __xg(x) ((struct __xchg_dummy *)(x)) +#define ll_low(x) *(((unsigned int*)&(x))+0) +#define ll_high(x) *(((unsigned int*)&(x))+1) + /* * The semantics of XCHGCMP8B are a bit strange, this is why * there is a loop and the loading of %%eax and %%edx has to @@ -184,8 +187,6 @@ { __set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL)); } -#define ll_low(x) *(((unsigned int*)&(x))+0) -#define ll_high(x) *(((unsigned int*)&(x))+1) static inline void __set_64bit_var (unsigned long long *ptr, unsigned long long value) @@ -203,6 +204,26 @@ __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \ __set_64bit(ptr, ll_low(value), ll_high(value)) ) +static inline unsigned long long __get_64bit(unsigned long long * ptr) +{ + unsigned long long ret; + __asm__ __volatile__ ( + "\n1:\t" + "movl 
(%1), %%eax\n\t" + "movl 4(%1), %%edx\n\t" + "movl %%eax, %%ebx\n\t" + "movl %%edx, %%ecx\n\t" + LOCK_PREFIX "cmpxchg8b (%1)\n\t" + "jnz 1b" + : "=A"(ret) + : "D"(ptr) + : "ebx", "ecx", "memory"); + return ret; +} + +#define get_64bit(ptr) __get_64bit(ptr) + + /* * Note: no "lock" prefix even on SMP: xchg always implies lock anyway * Note 2: xchg has side effect, so that attribute volatile is necessary, @@ -240,7 +261,41 @@ */ #ifdef CONFIG_X86_CMPXCHG + #define __HAVE_ARCH_CMPXCHG 1 +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) + +#else + +/* + * Building a kernel capable running on 80386. It may be necessary to + * simulate the cmpxchg on the 80386 CPU. For that purpose we define + * a function for each of the sizes we support. + */ + +extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8); +extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16); +extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32); + +static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old, + unsigned long new, int size) +{ + switch (size) { + case 1: + return cmpxchg_386_u8(ptr, old, new); + case 2: + return cmpxchg_386_u16(ptr, old, new); + case 4: + return cmpxchg_386_u32(ptr, old, new); + } + return old; +} + +#define cmpxchg(ptr,o,n)\ + ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \ + (unsigned long)(n), sizeof(*(ptr)))) #endif static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, @@ -270,10 +325,32 @@ return old; } -#define cmpxchg(ptr,o,n)\ - ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\ - (unsigned long)(n),sizeof(*(ptr)))) - +static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr, + unsigned long long old, unsigned long long newv) +{ + unsigned long long prev; + __asm__ __volatile__( + LOCK_PREFIX "cmpxchg8b (%4)" + : "=A" (prev) + : "0" (old), "c" ((unsigned long)(newv >> 32)), + "b" 
((unsigned long)(newv & 0xffffffffULL)), "D" (ptr) + : "memory"); + return prev; +} + +#ifdef CONFIG_X86_CMPXCHG8B +#define cmpxchg8b __cmpxchg8b +#else +/* + * Building a kernel capable of running on 80486 and 80386. Both + * do not support cmpxchg8b. Call a function that emulates the + * instruction if necessary. + */ +extern unsigned long long cmpxchg8b_486(volatile unsigned long long *, + unsigned long long, unsigned long long); +#define cmpxchg8b cmpxchg8b_486 +#endif + #ifdef __KERNEL__ struct alt_instr { __u8 *instr; /* original instruction */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [5/7]: i386 atomic pte operations 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (3 preceding siblings ...) 2004-11-19 19:45 ` page fault scalability patch V11 [4/7]: universal cmpxchg for i386 Christoph Lameter @ 2004-11-19 19:46 ` Christoph Lameter 2004-11-19 19:46 ` page fault scalability patch V11 [6/7]: x86_64 " Christoph Lameter ` (4 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:46 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Atomic pte operations for i386 in regular and PAE modes Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800 +++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800 @@ -413,6 +413,7 @@ #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define __HAVE_ARCH_PTE_SAME +#define __HAVE_ARCH_ATOMIC_TABLE_OPS #include <asm-generic/pgtable.h> #endif /* _I386_PGTABLE_H */ Index: linux-2.6.9/include/asm-i386/pgtable-3level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800 @@ -6,7 +6,8 @@ * tables on PPro+ CPUs. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ + * August 26, 2004 added ptep_cmpxchg <christoph@lameter.com> +*/ #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low) @@ -42,26 +43,15 @@ return pte_x(pte); } -/* Rules for using set_pte: the pte being assigned *must* be - * either not present or in a state where the hardware will - * not attempt to update the pte. In places where this is - * not possible, use pte_get_and_clear to obtain the old pte - * value and then use set_pte to update it. -ben - */ -static inline void set_pte(pte_t *ptep, pte_t pte) -{ - ptep->pte_high = pte.pte_high; - smp_wmb(); - ptep->pte_low = pte.pte_low; -} -#define __HAVE_ARCH_SET_PTE_ATOMIC -#define set_pte_atomic(pteptr,pteval) \ +#define set_pte(pteptr,pteval) \ set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pgd(pgdptr,pgdval) \ set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval)) +#define set_pte_atomic set_pte + /* * Pentium-II erratum A13: in PAE mode we explicitly have to flush * the TLB via cr3 if the top-level pgd is changed... @@ -142,4 +132,23 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val }) +/* Atomic PTE operations */ +#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \ +({ pte_t __r; \ + /* xchg acts as a barrier before the setting of the high bits. 
*/\ + __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \ + __r.pte_high = (__ptep)->pte_high; \ + (__ptep)->pte_high = (__newval).pte_high; \ + flush_tlb_page(__vma, __addr); \ + (__r); \ +}) + +#define __HAVE_ARCH_PTEP_XCHG_FLUSH + +static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + + #endif /* _I386_PGTABLE_3LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800 @@ -82,4 +82,7 @@ #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) +/* Atomic PTE operations */ +#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low) + #endif /* _I386_PGTABLE_2LEVEL_H */ Index: linux-2.6.9/include/asm-i386/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700 +++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800 @@ -4,9 +4,12 @@ #include <linux/config.h> #include <asm/processor.h> #include <asm/fixmap.h> +#include <asm/system.h> #include <linux/threads.h> #include <linux/mm.h> /* for struct page */ +#define PMD_NONE 0L + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) @@ -16,6 +19,19 @@ ((unsigned long long)page_to_pfn(pte) << (unsigned long long) PAGE_SHIFT))); } + +/* Atomic version */ +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ +#ifdef CONFIG_X86_PAE + return cmpxchg8b( 
((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE + + ((unsigned long long)page_to_pfn(pte) << + (unsigned long long) PAGE_SHIFT) ) == PMD_NONE; +#else + return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +#endif +} + /* * Allocate and free page tables. */ @@ -49,6 +65,7 @@ #define pmd_free(x) do { } while (0) #define __pmd_free_tlb(tlb,x) do { } while (0) #define pgd_populate(mm, pmd, pte) BUG() +#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; }) #define check_pgt_cache() do { } while (0) ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [6/7]: x86_64 atomic pte operations 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (4 preceding siblings ...) 2004-11-19 19:46 ` page fault scalability patch V11 [5/7]: i386 atomic pte operations Christoph Lameter @ 2004-11-19 19:46 ` Christoph Lameter 2004-11-19 19:47 ` page fault scalability patch V11 [7/7]: s390 " Christoph Lameter ` (3 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:46 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for x86_64 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-x86_64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700 +++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-19 08:17:55.000000000 -0800 @@ -7,16 +7,26 @@ #include <linux/threads.h> #include <linux/mm.h> +#define PMD_NONE 0 +#define PGD_NONE 0 + #define pmd_populate_kernel(mm, pmd, pte) \ set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte))) #define pgd_populate(mm, pgd, pmd) \ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd))) +#define pgd_test_and_populate(mm, pgd, pmd) \ + (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE) static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT))); } +static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) +{ + return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE; +} + extern __inline__ pmd_t *get_pmd(void) { return (pmd_t *)get_zeroed_page(GFP_KERNEL); Index: linux-2.6.9/include/asm-x86_64/pgtable.h 
=================================================================== --- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-15 11:13:39.000000000 -0800 +++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-19 08:18:52.000000000 -0800 @@ -437,6 +437,10 @@ #define kc_offset_to_vaddr(o) \ (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o)) + +#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval)) +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY #define __HAVE_ARCH_PTEP_GET_AND_CLEAR ^ permalink raw reply [flat|nested] 286+ messages in thread
* page fault scalability patch V11 [7/7]: s390 atomic pte operations 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (5 preceding siblings ...) 2004-11-19 19:46 ` page fault scalability patch V11 [6/7]: x86_64 " Christoph Lameter @ 2004-11-19 19:47 ` Christoph Lameter 2004-11-19 19:59 ` page fault scalability patch V11 [0/7]: overview Linus Torvalds ` (2 subsequent siblings) 9 siblings, 0 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-19 19:47 UTC (permalink / raw) To: torvalds, akpm, Benjamin Herrenschmidt Cc: Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Changelog * Provide atomic pte operations for s390 Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/include/asm-s390/pgtable.h =================================================================== --- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700 +++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800 @@ -567,6 +567,15 @@ return pte; } +#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \ +({ \ + struct mm_struct *__mm = __vma->vm_mm; \ + pte_t __pte; \ + __pte = ptep_clear_flush(__vma, __address, __ptep); \ + set_pte(__ptep, __pteval); \ + __pte; \ +}) + static inline void ptep_set_wrprotect(pte_t *ptep) { pte_t old_pte = *ptep; @@ -778,6 +787,14 @@ #define kern_addr_valid(addr) (1) +/* Atomic PTE operations */ +#define __HAVE_ARCH_ATOMIC_TABLE_OPS + +static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval) +{ + return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval); +} + /* * No page table caches to initialise */ @@ -791,6 +808,7 @@ #define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH #define __HAVE_ARCH_PTEP_GET_AND_CLEAR #define __HAVE_ARCH_PTEP_CLEAR_FLUSH +#define __HAVE_ARCH_PTEP_XCHG_FLUSH #define __HAVE_ARCH_PTEP_SET_WRPROTECT #define __HAVE_ARCH_PTEP_MKDIRTY #define 
__HAVE_ARCH_PTE_SAME Index: linux-2.6.9/include/asm-s390/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700 +++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800 @@ -97,6 +97,10 @@ pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd); } +static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd) +{ + return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV; +} #endif /* __s390x__ */ static inline void @@ -119,6 +123,18 @@ pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT)); } +static inline int +pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page) +{ + int rc; + spin_lock(&mm->page_table_lock); + + rc = pte_same(*pmd, _PAGE_INVALID_EMPTY); + if (rc) pmd_populate(mm, pmd, page); + spin_unlock(&mm->page_table_lock); + return rc; +} + /* * page table entry allocation/free routines. */ ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (6 preceding siblings ...) 2004-11-19 19:47 ` page fault scalability patch V11 [7/7]: s390 " Christoph Lameter @ 2004-11-19 19:59 ` Linus Torvalds 2004-11-20 1:07 ` Nick Piggin 2004-11-20 2:03 ` William Lee Irwin III 2004-11-20 2:04 ` William Lee Irwin III 2004-11-20 2:06 ` Robin Holt 9 siblings, 2 replies; 286+ messages in thread From: Linus Torvalds @ 2004-11-19 19:59 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, 19 Nov 2004, Christoph Lameter wrote: > > Note that I have posted two other approaches of dealing with the rss problem: You could also make "rss" be a _signed_ integer per-thread. When unmapping a page, you decrement one of the threads that shares the mm (doesn't matter which - which is why the per-thread rss may go negative), and when mapping a page you increment it. Then, anybody who actually wants a global rss can just iterate over threads and add it all up. If you do it under the mmap_sem, it's stable, and if you do it outside the mmap_sem it's imprecise but stable in the long term (ie errors never _accumulate_, like the non-atomic case will do). Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of a lot better than the periodic scan. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
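The per-thread signed-counter scheme Linus describes above can be sketched in plain userspace C (names like `thread_rss`, `map_page`, and `total_rss` are illustrative only; in the kernel the counter would live in `task_struct` and the summation would walk the thread list):

```c
#include <assert.h>

#define NTHREADS 4

/* Hypothetical per-thread counters; signed, because the thread that
 * unmaps a page need not be the thread that mapped it. */
static long thread_rss[NTHREADS];

/* Fault in a page: charge whichever thread took the fault. */
static void map_page(int tid)
{
	thread_rss[tid]++;
}

/* Unmap a page: debit whichever thread does the unmapping -- an
 * individual counter may therefore go negative. */
static void unmap_page(int tid)
{
	thread_rss[tid]--;
}

/* Anybody who wants the global rss iterates over the threads and adds
 * the counters up; the negative entries cancel out. */
static long total_rss(void)
{
	long sum = 0;
	int i;

	for (i = 0; i < NTHREADS; i++)
		sum += thread_rss[i];
	return sum;
}
```

Under mmap_sem held for write the sum is exact; read outside any lock it is imprecise but, as noted above, the errors do not accumulate.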
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-19 19:59 ` page fault scalability patch V11 [0/7]: overview Linus Torvalds @ 2004-11-20 1:07 ` Nick Piggin 2004-11-20 1:29 ` Christoph Lameter 2004-11-20 1:56 ` Linus Torvalds 2004-11-20 2:03 ` William Lee Irwin III 1 sibling, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 1:07 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Linus Torvalds wrote: > > On Fri, 19 Nov 2004, Christoph Lameter wrote: > >>Note that I have posted two other approaches of dealing with the rss problem: > > > You could also make "rss" be a _signed_ integer per-thread. > > When unmapping a page, you decrement one of the threads that shares the mm > (doesn't matter which - which is why the per-thread rss may go negative), > and when mapping a page you increment it. > > Then, anybody who actually wants a global rss can just iterate over > threads and add it all up. If you do it under the mmap_sem, it's stable, > and if you do it outside the mmap_sem it's imprecise but stable in the > long term (ie errors never _accumulate_, like the non-atomic case will > do). > > Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of > a lot better than the periodic scan. > I think this sounds like it might be a good idea. I prefer it to having the unbounded error of sloppy rss (as improbable as it may be in practice). The per thread rss may wrap (maybe not 64-bit counters), but even so, the summation over all threads should still end up being correct I think. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:07 ` Nick Piggin @ 2004-11-20 1:29 ` Christoph Lameter 2004-11-20 1:45 ` Nick Piggin 2004-11-20 1:58 ` Linus Torvalds 2004-11-20 1:56 ` Linus Torvalds 1 sibling, 2 replies; 286+ messages in thread From: Christoph Lameter @ 2004-11-20 1:29 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, Nick Piggin wrote: > I think this sounds like it might be a good idea. I prefer it to having > the unbounded error of sloppy rss (as improbable as it may be in practice). It may also be faster since the processors can have exclusive cache lines. This means we need to move rss into the task struct. But how does one get from mm struct to task struct? current is likely available most of the time. Is that always the case? > The per thread rss may wrap (maybe not 64-bit counters), but even so, > the summation over all threads should still end up being correct I > think. Note though that the mmap_sem is no protection. It is a read lock and may be held by multiple processes while incrementing and decrementing rss. This likely reduces the number of collisions significantly, but it won't be a guarantee like locking or atomic ops. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:29 ` Christoph Lameter @ 2004-11-20 1:45 ` Nick Piggin 2004-11-20 1:58 ` Linus Torvalds 1 sibling, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 1:45 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Christoph Lameter wrote: > On Sat, 20 Nov 2004, Nick Piggin wrote: > > >>I think this sounds like it might be a good idea. I prefer it to having >>the unbounded error of sloppy rss (as improbable as it may be in practice). > > > It may also be faster since the processors can have exclusive cache lines. > Yep. > This means we need to move rss into the task struct. But how does one get > from mm struct to task struct? current is likely available most of > the time. Is that always the case? > It is available everywhere that mm_struct is, I guess. So yes, I think `current` should be OK. > >>The per thread rss may wrap (maybe not 64-bit counters), but even so, >>the summation over all threads should still end up being correct I >>think. > > > Note though that the mmap_sem is no protection. It is a read lock and may > be held by multiple processes while incrementing and decrementing rss. > This is likely reducing the number of collisions significantly but it wont > be a guarantee like locking or atomic ops. > Yeah the read lock won't do anything to serialise it. I think what Linus is saying is that we _don't care_ most of the time (because the error will be bounded). But if it happened that we really do care anywhere, then the write lock should be sufficient. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:29 ` Christoph Lameter 2004-11-20 1:45 ` Nick Piggin @ 2004-11-20 1:58 ` Linus Torvalds 2004-11-20 2:06 ` Linus Torvalds 1 sibling, 1 reply; 286+ messages in thread From: Linus Torvalds @ 2004-11-20 1:58 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, 19 Nov 2004, Christoph Lameter wrote: > > Note though that the mmap_sem is no protection. It is a read lock and may > be held by multiple processes while incrementing and decrementing rss. > This is likely reducing the number of collisions significantly but it wont > be a guarantee like locking or atomic ops. It is, though, if you hold it for a write. The point being that you _can_ get an exact rss value if you want to. Not that I really see any overwhelming evidence of anybody ever really caring, but it's nice to know that you have the option. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:58 ` Linus Torvalds @ 2004-11-20 2:06 ` Linus Torvalds 0 siblings, 0 replies; 286+ messages in thread From: Linus Torvalds @ 2004-11-20 2:06 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, 19 Nov 2004, Linus Torvalds wrote: > > Not that I really see any overwhelming evidence of anybody ever really > caring, but it's nice to know that you have the option. Btw, if you are going to look at doing this rss thing, you need to make sure that thread exit ends up adding its rss to _some_ remaining sibling. I guess that was obvious, but it's worth pointing out. That may actually be the only case where we do _not_ have a nice SMP-safe access: we do have a stable sibling (tsk->thread_leader), but we don't have any good serialization _except_ for taking mmap_sem for writing. Which we currently don't do: we take it for reading (and then we possibly upgrade it to a write lock if we notice that there is a core-dump starting). We can avoid this too by having a per-mm atomic rss "spill" counter. So exit_mm() would basically do: ... tsk->mm = NULL; atomic_add(tsk->rss, &mm->rss_spill); ... and then the algorithm for getting rss would be: rss = atomic_read(mm->rss_spill); for_each_thread(..) rss += tsk->rss; Or does anybody see any better approaches? Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:07 ` Nick Piggin 2004-11-20 1:29 ` Christoph Lameter @ 2004-11-20 1:56 ` Linus Torvalds 2004-11-22 18:06 ` Bill Davidsen 1 sibling, 1 reply; 286+ messages in thread From: Linus Torvalds @ 2004-11-20 1:56 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, Nick Piggin wrote: > > The per thread rss may wrap (maybe not 64-bit counters), but even so, > the summation over all threads should still end up being correct I > think. Yes. As long as the total rss fits in an int, it doesn't matter if any of them wrap. Addition is still associative in twos-complement arithmetic even in the presence of overflows. If you actually want to make it proper standard C, I guess you'd have to make the thing unsigned, which gives you the mod-2**n guarantees even if somebody were to ever make a non-twos-complement machine. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
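The wrap-safety argument above can be checked with a tiny sketch (plain userspace C, hypothetical names): one unsigned per-thread counter underflows past zero while another counts up, yet the mod-2^32 sum is exact as long as the true total fits in the type.

```c
#include <assert.h>

/* Two hypothetical per-thread counters. Unsigned arithmetic in C is
 * defined modulo 2^N, so even if one counter wraps (e.g. a thread that
 * mostly unmaps pages mapped by its siblings), the sum over all
 * counters still yields the exact total. */
static unsigned int rss_a, rss_b;

/* Thread A unmaps pages it never mapped: its counter wraps below zero. */
static void thread_a_unmap(unsigned int pages)
{
	rss_a -= pages;
}

/* Thread B takes the faults and is charged for the mappings. */
static void thread_b_map(unsigned int pages)
{
	rss_b += pages;
}

/* The mod-2^N sum of the per-thread counters. */
static unsigned int total_rss(void)
{
	return rss_a + rss_b;
}
```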
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 1:56 ` Linus Torvalds @ 2004-11-22 18:06 ` Bill Davidsen 0 siblings, 0 replies; 286+ messages in thread From: Bill Davidsen @ 2004-11-22 18:06 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Christoph Lameter, akpm, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Linus Torvalds wrote: > > On Sat, 20 Nov 2004, Nick Piggin wrote: > >>The per thread rss may wrap (maybe not 64-bit counters), but even so, >>the summation over all threads should still end up being correct I >>think. > > > Yes. As long as the total rss fits in an int, it doesn't matter if any of > them wrap. Addition is still associative in twos-complement arithmetic > even in the presense of overflows. > > If you actually want to make it proper standard C, I guess you'd have to > make the thing unsigned, which gives you the mod-2**n guarantees even if > somebody were to ever make a non-twos-complement machine. I think other stuff breaks as well; I think I saw you post some example code using something like (a & -a) or similar within the last few months. Fortunately neither 1's comp nor BCD is likely to return in hardware. Big-end vs. little-end is still an issue, though. -- -bill davidsen (davidsen@tmr.com) "The secret to procrastination is to put things off until the last possible moment - but no longer" -me ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-19 19:59 ` page fault scalability patch V11 [0/7]: overview Linus Torvalds 2004-11-20 1:07 ` Nick Piggin @ 2004-11-20 2:03 ` William Lee Irwin III 2004-11-20 2:25 ` Nick Piggin 2004-11-20 3:37 ` Nick Piggin 1 sibling, 2 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 2:03 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, akpm, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote: > You could also make "rss" be a _signed_ integer per-thread. > When unmapping a page, you decrement one of the threads that shares the mm > (doesn't matter which - which is why the per-thread rss may go negative), > and when mapping a page you increment it. > Then, anybody who actually wants a global rss can just iterate over > threads and add it all up. If you do it under the mmap_sem, it's stable, > and if you do it outside the mmap_sem it's imprecise but stable in the > long term (ie errors never _accumulate_, like the non-atomic case will > do). > Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of > a lot better than the periodic scan. Unprivileged triggers for full-tasklist scans are NMI oops material. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:03 ` William Lee Irwin III @ 2004-11-20 2:25 ` Nick Piggin 2004-11-20 2:41 ` William Lee Irwin III 2004-11-20 3:37 ` Nick Piggin 1 sibling, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 2:25 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote: > >>You could also make "rss" be a _signed_ integer per-thread. >>When unmapping a page, you decrement one of the threads that shares the mm >>(doesn't matter which - which is why the per-thread rss may go negative), >>and when mapping a page you increment it. >>Then, anybody who actually wants a global rss can just iterate over >>threads and add it all up. If you do it under the mmap_sem, it's stable, >>and if you do it outside the mmap_sem it's imprecise but stable in the >>long term (ie errors never _accumulate_, like the non-atomic case will >>do). >>Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of >>a lot better than the periodic scan. > > > Unprivileged triggers for full-tasklist scans are NMI oops material. > What about pushing the per-thread rss delta back into the global atomic rss counter in each schedule()? Pros: This would take the task exiting problem into its stride as a matter of course. Single atomic read to get rss. Cons: would just be moving the atomic op somewhere else if we don't get many page faults per schedule. Not really nice dependencies. Assumes schedule (not context switch) must occur somewhat regularly. At present this is not true for SCHED_FIFO tasks. Too nasty? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:25 ` Nick Piggin @ 2004-11-20 2:41 ` William Lee Irwin III 2004-11-20 2:46 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 2:41 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> Unprivileged triggers for full-tasklist scans are NMI oops material. On Sat, Nov 20, 2004 at 01:25:37PM +1100, Nick Piggin wrote: > What about pushing the per-thread rss delta back into the global atomic > rss counter in each schedule()? > Pros: > This would take the task exiting problem into its stride as a matter of > course. > Single atomic read to get rss. > Cons: > would just be moving the atomic op somewhere else if we don't get > many page faults per schedule. > Not really nice dependancies. > Assumes schedule (not context switch) must occur somewhat regularly. > At present this is not true for SCHED_FIFO tasks. > Too nasty? This doesn't sound too hot. There's enough accounting that can't be done anywhere but schedule(), and this can be done elsewhere. Plus, you're moving an already too-frequent operation to a more frequent callsite. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:41 ` William Lee Irwin III @ 2004-11-20 2:46 ` Nick Piggin 0 siblings, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 2:46 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Unprivileged triggers for full-tasklist scans are NMI oops material. > > > On Sat, Nov 20, 2004 at 01:25:37PM +1100, Nick Piggin wrote: > >>What about pushing the per-thread rss delta back into the global atomic >>rss counter in each schedule()? >>Pros: >>This would take the task exiting problem into its stride as a matter of >>course. >>Single atomic read to get rss. >>Cons: >>would just be moving the atomic op somewhere else if we don't get >>many page faults per schedule. >>Not really nice dependancies. >>Assumes schedule (not context switch) must occur somewhat regularly. >>At present this is not true for SCHED_FIFO tasks. >>Too nasty? > > > This doesn't sound too hot. There's enough accounting that can't be > done anywhere but schedule(), and this can be done elsewhere. Plus, > you're moving an already too-frequent operation to a more frequent > callsite. > No, it won't somehow increase the number of atomic rss operations just because schedule is called more often. The number of ops will be at _most_ the number of page faults. But I agree with your overall evaluation of its 'hotness'. Just another idea. Give this monkey another thousand years at the keys and he'll come up with the perfect solution :P ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:03 ` William Lee Irwin III 2004-11-20 2:25 ` Nick Piggin @ 2004-11-20 3:37 ` Nick Piggin 2004-11-20 3:55 ` William Lee Irwin III 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 3:37 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > On Fri, Nov 19, 2004 at 11:59:03AM -0800, Linus Torvalds wrote: > >>You could also make "rss" be a _signed_ integer per-thread. >>When unmapping a page, you decrement one of the threads that shares the mm >>(doesn't matter which - which is why the per-thread rss may go negative), >>and when mapping a page you increment it. >>Then, anybody who actually wants a global rss can just iterate over >>threads and add it all up. If you do it under the mmap_sem, it's stable, >>and if you do it outside the mmap_sem it's imprecise but stable in the >>long term (ie errors never _accumulate_, like the non-atomic case will >>do). >>Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of >>a lot better than the periodic scan. > > > Unprivileged triggers for full-tasklist scans are NMI oops material. > Hang on, let's come back to this... We already have unprivileged do-for-each-thread triggers in the proc code. It's in do_task_stat, even. Rss reporting would basically just involve one extra addition within that loop. So... hmm, I can't see a problem with it. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:37 ` Nick Piggin @ 2004-11-20 3:55 ` William Lee Irwin III 2004-11-20 4:03 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 3:55 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> Unprivileged triggers for full-tasklist scans are NMI oops material. On Sat, Nov 20, 2004 at 02:37:04PM +1100, Nick Piggin wrote: > Hang on, let's come back to this... > We already have unprivileged do-for-each-thread triggers in the proc > code. It's in do_task_stat, even. Rss reporting would basically just > involve one extra addition within that loop. > So... hmm, I can't see a problem with it. /proc/ triggering NMI oopses was a persistent problem even before that code was merged. I've not bothered testing it as it at best aggravates it. And thread groups can share mm's. do_for_each_thread() won't suffice. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:55 ` William Lee Irwin III @ 2004-11-20 4:03 ` Nick Piggin 2004-11-20 4:06 ` Nick Piggin 2004-11-20 4:23 ` William Lee Irwin III 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 4:03 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Unprivileged triggers for full-tasklist scans are NMI oops material. > > > On Sat, Nov 20, 2004 at 02:37:04PM +1100, Nick Piggin wrote: > >>Hang on, let's come back to this... >>We already have unprivileged do-for-each-thread triggers in the proc >>code. It's in do_task_stat, even. Rss reporting would basically just >>involve one extra addition within that loop. >>So... hmm, I can't see a problem with it. > > > /proc/ triggering NMI oopses was a persistent problem even before that > code was merged. I've not bothered testing it as it at best aggravates it. > It isn't a problem. If it ever became a problem then we can just touch the nmi oopser in the loop. > And thread groups can share mm's. do_for_each_thread() won't suffice. > I think it will be just fine. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 4:03 ` Nick Piggin @ 2004-11-20 4:06 ` Nick Piggin 2004-11-20 4:23 ` William Lee Irwin III 1 sibling, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 4:06 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: > William Lee Irwin III wrote: >> And thread groups can share mm's. do_for_each_thread() won't suffice. >> > > I think it will be just fine. > Sorry, I misread. I think having per-thread rss counters will be fine (regardless of whether or not do_for_each_thread itself will suffice). ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 4:03 ` Nick Piggin 2004-11-20 4:06 ` Nick Piggin @ 2004-11-20 4:23 ` William Lee Irwin III 2004-11-20 4:29 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 4:23 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> /proc/ triggering NMI oopses was a persistent problem even before that >> code was merged. I've not bothered testing it as it at best aggravates it. On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote: > It isn't a problem. If it ever became a problem then we can just > touch the nmi oopser in the loop. Very, very wrong. The tasklist scans hold the read side of the lock and aren't even what's running with interrupts off. The contenders on the write side are what the NMI oopser oopses. And supposing the arch reenables interrupts in the write side's spinloop, you just get a box that silently goes out of service for extended periods of time, breaking cluster membership and more. The NMI oopser is just the report of the problem, not the problem itself. It's not a false report. The box is dead for > 5s at a time. William Lee Irwin III wrote: >> And thread groups can share mm's. do_for_each_thread() won't suffice. On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote: > I think it will be just fine. And that makes it wrong on both counts. The above fails any time LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features are used directly. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 4:23 ` William Lee Irwin III @ 2004-11-20 4:29 ` Nick Piggin 2004-11-20 5:38 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 4:29 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>/proc/ triggering NMI oopses was a persistent problem even before that >>>code was merged. I've not bothered testing it as it at best aggravates it. > > > On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote: > >>It isn't a problem. If it ever became a problem then we can just >>touch the nmi oopser in the loop. > > > Very, very wrong. The tasklist scans hold the read side of the lock > and aren't even what's running with interrupts off. The contenders > on the write side are what the NMI oopser oopses. > *blinks* So explain how this is "very very wrong", then? > And supposing the arch reenables interrupts in the write side's > spinloop, you just get a box that silently goes out of service for > extended periods of time, breaking cluster membership and more. The > NMI oopser is just the report of the problem, not the problem itself. > It's not a false report. The box is dead for > 5s at a time. > The point is, adding a for-each-thread loop or two in /proc isn't going to cause a problem that isn't already there. If you had zero for-each-thread loops then you might have a valid complaint. Seeing as you have more than zero, with slim chances of reducing that number, then there is no valid complaint. > > William Lee Irwin III wrote: > >>>And thread groups can share mm's. do_for_each_thread() won't suffice. > > > On Sat, Nov 20, 2004 at 03:03:17PM +1100, Nick Piggin wrote: > >>I think it will be just fine. > > > And that makes it wrong on both counts. 
The above fails any time > LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features > are used directly. > See my followup. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 4:29 ` Nick Piggin @ 2004-11-20 5:38 ` William Lee Irwin III 2004-11-20 5:50 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 5:38 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> Very, very wrong. The tasklist scans hold the read side of the lock >> and aren't even what's running with interrupts off. The contenders >> on the write side are what the NMI oopser oopses. On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote: > *blinks* > So explain how this is "very very wrong", then? There isn't anything left to explain. So if there's a question, be specific about it. William Lee Irwin III wrote: >> And supposing the arch reenables interrupts in the write side's >> spinloop, you just get a box that silently goes out of service for >> extended periods of time, breaking cluster membership and more. The >> NMI oopser is just the report of the problem, not the problem itself. >> It's not a false report. The box is dead for > 5s at a time. On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote: > The point is, adding a for-each-thread loop or two in /proc isn't > going to cause a problem that isn't already there. > If you had zero for-each-thread loops then you might have a valid > complaint. Seeing as you have more than zero, with slim chances of > reducing that number, then there is no valid complaint. This entire line of argument is bogus. A preexisting bug of a similar nature is not grounds for deliberately introducing any bug. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 5:38 ` William Lee Irwin III @ 2004-11-20 5:50 ` Nick Piggin 2004-11-20 6:23 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 5:50 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Very, very wrong. The tasklist scans hold the read side of the lock >>>and aren't even what's running with interrupts off. The contenders >>>on the write side are what the NMI oopser oopses. > > > On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote: > >>*blinks* >>So explain how this is "very very wrong", then? > > > There isn't anything left to explain. So if there's a question, be > specific about it. > Why am I very very wrong? Why won't touch_nmi_watchdog work from the read loop? And let's just be nice and try not to jump at the chance to point out when people are very very wrong, and keep count of the times they have been very very wrong. I'm trying to be constructive. > > William Lee Irwin III wrote: > >>>And supposing the arch reenables interrupts in the write side's >>>spinloop, you just get a box that silently goes out of service for >>>extended periods of time, breaking cluster membership and more. The >>>NMI oopser is just the report of the problem, not the problem itself. >>>It's not a false report. The box is dead for > 5s at a time. > > > On Sat, Nov 20, 2004 at 03:29:29PM +1100, Nick Piggin wrote: > >>The point is, adding a for-each-thread loop or two in /proc isn't >>going to cause a problem that isn't already there. >>If you had zero for-each-thread loops then you might have a valid >>complaint. Seeing as you have more than zero, with slim chances of >>reducing that number, then there is no valid complaint. > > > This entire line of argument is bogus. 
A preexisting bug of a similar > nature is not grounds for deliberately introducing any bug. > Sure, if that is a bug and someone is just about to fix it then yes you're right, we shouldn't introduce this. I didn't realise it was a bug. Sounds like it would be causing you lots of problems though - have you looked at how to fix it? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 5:50 ` Nick Piggin @ 2004-11-20 6:23 ` William Lee Irwin III 2004-11-20 6:49 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 6:23 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> There isn't anything left to explain. So if there's a question, be >> specific about it. On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote: > Why am I very very wrong? Why won't touch_nmi_watchdog work from > the read loop? > And let's just be nice and try not to jump at the chance to point > out when people are very very wrong, and keep count of the times > they have been very very wrong. I'm trying to be constructive. touch_nmi_watchdog() is only "protection" against local interrupt disablement triggering the NMI oopser because alert_counter[] increments are not atomic. Yet even supposing they were made so, the net effect of "covering up" this gross deficiency is making the user-observable problems it causes undiagnosable, as noted before. William Lee Irwin III wrote: >> This entire line of argument is bogus. A preexisting bug of a similar >> nature is not grounds for deliberately introducing any bug. On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote: > Sure, if that is a bug and someone is just about to fix it then > yes you're right, we shouldn't introduce this. I didn't realise > it was a bug. Sounds like it would be causing you lots of problems > though - have you looked at how to fix it? Kevin Marin was the first to report this issue to lkml. I had seen instances of it in internal corporate bugreports and it was one of the motivators for the work I did on pidhashing (one of the causes of the timeouts was worst cases in pid allocation). 
Manfred Spraul and myself wrote patches attempting to reduce read-side hold time in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh Soni wrote patches to carry out the long iterations in /proc/ locklessly. The last several of these affecting /proc/ have not gained acceptance, though the work has not been halted in any sense, as this problem recurs quite regularly. A considerable amount of sustained effort has gone toward mitigating and resolving rwlock starvation. Aggravating the rwlock starvation destabilizes, not pessimizes, and performance is secondary to stability. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 6:23 ` William Lee Irwin III @ 2004-11-20 6:49 ` Nick Piggin 2004-11-20 6:57 ` Andrew Morton 2004-11-20 7:15 ` William Lee Irwin III 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 6:49 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>There isn't anything left to explain. So if there's a question, be >>>specific about it. > > > On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote: > >>Why am I very very wrong? Why won't touch_nmi_watchdog work from >>the read loop? >>And let's just be nice and try not to jump at the chance to point >>out when people are very very wrong, and keep count of the times >>they have been very very wrong. I'm trying to be constructive. > > > touch_nmi_watchdog() is only "protection" against local interrupt > disablement triggering the NMI oopser because alert_counter[] > increments are not atomic. Yet even supposing they were made so, the That would be a bug in touch_nmi_watchdog then, because you're racy against your own NMI too. So I'm actually not very very wrong at all. I'm technically wrong because touch_nmi_watchdog has a theoretical 'bug'. In practice, multiple races with the non atomic increments to the same counter, and in an unbroken sequence would be about as likely as hardware failure. Anyway, this touch nmi thing is going off topic, sorry list. > net effect of "covering up" this gross deficiency is making the > user-observable problems it causes undiagnosable, as noted before. > Well the loops that are in there now aren't covered up, and they don't seem to be causing problems. Ergo there is no problem (we're being _practical_ here, right?) > > William Lee Irwin III wrote: > >>>This entire line of argument is bogus. 
A preexisting bug of a similar >>>nature is not grounds for deliberately introducing any bug. > > > On Sat, Nov 20, 2004 at 04:50:25PM +1100, Nick Piggin wrote: > >>Sure, if that is a bug and someone is just about to fix it then >>yes you're right, we shouldn't introduce this. I didn't realise >>it was a bug. Sounds like it would be causing you lots of problems >>though - have you looked at how to fix it? > > > Kevin Marin was the first to report this issue to lkml. I had seen > instances of it in internal corporate bugreports and it was one of > the motivators for the work I did on pidhashing (one of the causes > of the timeouts was worst cases in pid allocation). Manfred Spraul > and myself wrote patches attempting to reduce read-side hold time > in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically > subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh > Soni wrote patches to carry out the long iterations in /proc/ locklessly. > > The last several of these affecting /proc/ have not gained acceptance, > though the work has not been halted in any sense, as this problem > recurs quite regularly. A considerable amount of sustained effort has > gone toward mitigating and resolving rwlock starvation. > That's very nice. But there is no problem _now_, is there? > Aggravating the rwlock starvation destabilizes, not pessimizes, > and performance is secondary to stability. > Well luckily we're not going to be aggravating the rwlock starvation. If you found a problem with, and fixed do_task_stat: ?time, ???_flt, et al, then you would apply the same solution to per thread rss to fix it in the same way. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 6:49 ` Nick Piggin @ 2004-11-20 6:57 ` Andrew Morton 2004-11-20 7:04 ` Andrew Morton 2004-11-20 7:13 ` Nick Piggin 2004-11-20 7:15 ` William Lee Irwin III 1 sibling, 2 replies; 286+ messages in thread From: Andrew Morton @ 2004-11-20 6:57 UTC (permalink / raw) To: Nick Piggin Cc: wli, torvalds, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > per thread rss Given that we have contention problems updating a single mm-wide rss and given that the way to fix that up is to spread things out a bit, it seems wildly arbitrary to me that the way in which we choose to spread the counter out is to stick a bit of it into each task_struct. I'd expect that just shoving a pointer into mm_struct which points at a dynamically allocated array[NR_CPUS] of longs would suffice. We probably don't even need to spread them out on cachelines - having four or eight cpus sharing the same cacheline probably isn't going to hurt much. At least, that'd be my first attempt. If it's still not good enough, try something else. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 6:57 ` Andrew Morton @ 2004-11-20 7:04 ` Andrew Morton 2004-11-20 7:13 ` Nick Piggin 1 sibling, 0 replies; 286+ messages in thread From: Andrew Morton @ 2004-11-20 7:04 UTC (permalink / raw) To: nickpiggin, wli, torvalds, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel Andrew Morton <akpm@osdl.org> wrote: > > I'd expect that just shoving a pointer into mm_struct which points at a > dynamically allocated array[NR_CPUS] of longs would suffice. One might even be able to use percpu_counter.h, although that might end up hurting many-cpu fork times, due to all that work in __alloc_percpu(). ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 6:57 ` Andrew Morton 2004-11-20 7:04 ` Andrew Morton @ 2004-11-20 7:13 ` Nick Piggin 2004-11-20 8:00 ` William Lee Irwin III 2004-11-20 16:59 ` Martin J. Bligh 1 sibling, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 7:13 UTC (permalink / raw) To: Andrew Morton Cc: wli, torvalds, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>per thread rss > > > Given that we have contention problems updating a single mm-wide rss and > given that the way to fix that up is to spread things out a bit, it seems > wildly arbitrary to me that the way in which we choose to spread the > counter out is to stick a bit of it into each task_struct. > > I'd expect that just shoving a pointer into mm_struct which points at a > dynamically allocated array[NR_CPUS] of longs would suffice. We probably > don't even need to spread them out on cachelines - having four or eight > cpus sharing the same cacheline probably isn't going to hurt much. > > At least, that'd be my first attempt. If it's still not good enough, try > something else. > > That is what Bill thought too. I guess per-cpu and per-thread rss are the leading candidates. Per thread rss has the benefits of cacheline exclusivity, and not causing task bloat in the common case. Per CPU array has better worst case /proc properties, but shares cachelines (or not, if using percpu_counter as you suggested). I think I'd better leave it to others to finish off the arguments ;) ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 7:13 ` Nick Piggin @ 2004-11-20 8:00 ` William Lee Irwin III 2004-11-20 16:59 ` Martin J. Bligh 1 sibling, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 8:00 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, torvalds, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel Andrew Morton wrote: >> Given that we have contention problems updating a single mm-wide rss and >> given that the way to fix that up is to spread things out a bit, it seems >> wildly arbitrary to me that the way in which we choose to spread the >> counter out is to stick a bit of it into each task_struct. >> I'd expect that just shoving a pointer into mm_struct which points at a >> dynamically allocated array[NR_CPUS] of longs would suffice. We probably >> don't even need to spread them out on cachelines - having four or eight >> cpus sharing the same cacheline probably isn't going to hurt much. >> At least, that'd be my first attempt. If it's still not good enough, try >> something else. On Sat, Nov 20, 2004 at 06:13:03PM +1100, Nick Piggin wrote: > That is what Bill thought too. I guess per-cpu and per-thread rss are > the leading candidates. > Per thread rss has the benefits of cacheline exclusivity, and not > causing task bloat in the common case. > Per CPU array has better worst case /proc properties, but shares > cachelines (or not, if using percpu_counter as you suggested). > I think I'd better leave it to others to finish off the arguments ;) (1) The "task bloat" is more than tolerable on the systems capable of having enough cpus to see significant per-process memory footprint, where "significant" is smaller than a pagetable page even for systems twice as large as now shipped. (2) The cacheline exclusivity is not entirely gone in dense per-cpu arrays, it's merely "approximated" by sharing amongst small groups of adjacent cpus. This is fine for e.g. 
NUMA because those small groups of adjacent cpus will typically be on nearby nodes. (3) The price paid to get "perfect exclusivity" instead of "approximate exclusivity" is unbounded tasklist_lock hold time, which takes boxen down outright in every known instance. The properties are not for /proc/, they are for tasklist_lock. Every read stops all other writes. When you hold tasklist_lock for an extended period of time for read or write, (e.g. exhaustive tasklist search) you stop all fork()'s and exit()'s and execve()'s on a running system. The "worst case" analysis has nothing to do with speed. It has everything to do with taking a box down outright, much like unplugging power cables or dereferencing NULL. Unbounded tasklist_lock hold time kills running boxen dead. Read sides of rwlocks are not licenses to spin for aeons with locks held. And the "question" of sufficiency has in fact already been answered. SGI's own testing during the 2.4 out-of-tree patching cycle determined that an mm-global atomic counter was already sufficient so long as the cacheline was not shared with ->mmap_sem and the like. The "simplest" optimization of moving the field out of the way of ->mmap_sem already worked. The grander ones, if and *ONLY* if they don't have showstoppers like unbounded tasklist_lock hold time or castrating workload monitoring to unusability, will merely be more robust for future systems. Reiterating, this is all just fine so long as they don't cause any showstopping problems, like castrating the ability to monitor processes, or introducing more tasklist_lock starvation. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 7:13 ` Nick Piggin 2004-11-20 8:00 ` William Lee Irwin III @ 2004-11-20 16:59 ` Martin J. Bligh 2004-11-20 17:14 ` Linus Torvalds 1 sibling, 1 reply; 286+ messages in thread From: Martin J. Bligh @ 2004-11-20 16:59 UTC (permalink / raw) To: Nick Piggin, Andrew Morton Cc: wli, torvalds, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel >> Given that we have contention problems updating a single mm-wide rss and >> given that the way to fix that up is to spread things out a bit, it seems >> wildly arbitrary to me that the way in which we choose to spread the >> counter out is to stick a bit of it into each task_struct. >> >> I'd expect that just shoving a pointer into mm_struct which points at a >> dynamically allocated array[NR_CPUS] of longs would suffice. We probably >> don't even need to spread them out on cachelines - having four or eight >> cpus sharing the same cacheline probably isn't going to hurt much. >> >> At least, that'd be my first attempt. If it's still not good enough, try >> something else. >> >> > > That is what Bill thought too. I guess per-cpu and per-thread rss are > the leading candidates. > > Per thread rss has the benefits of cacheline exclusivity, and not > causing task bloat in the common case. > > Per CPU array has better worst case /proc properties, but shares > cachelines (or not, if using percpu_counter as you suggested). Per thread seems much nicer to me - mainly because it degrades cleanly to a single counter for 99% of processes, which are single threaded. M. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 16:59 ` Martin J. Bligh @ 2004-11-20 17:14 ` Linus Torvalds 2004-11-20 19:08 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Linus Torvalds @ 2004-11-20 17:14 UTC (permalink / raw) To: Martin J. Bligh Cc: Nick Piggin, Andrew Morton, wli, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, Martin J. Bligh wrote: > > Per thread seems much nicer to me - mainly because it degrades cleanly to > a single counter for 99% of processes, which are single threaded. I will pretty much guarantee that if you put the per-thread patches next to some abomination with per-cpu allocation for each mm, the choice will be clear. Especially if the per-cpu/per-mm thing tries to avoid false cacheline sharing, which sounds really "interesting" in itself. And without the cacheline sharing avoidance, what's the point of this again? It sure wasn't to make the code simpler. It was about performance and scalability. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 17:14 ` Linus Torvalds @ 2004-11-20 19:08 ` William Lee Irwin III 2004-11-20 19:16 ` Linus Torvalds 2004-11-20 20:25 ` [OT] " Adam Heath 0 siblings, 2 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 19:08 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrew Morton, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, Nov 20, 2004 at 09:14:11AM -0800, Linus Torvalds wrote: > I will pretty much guarantee that if you put the per-thread patches next > to some abomination with per-cpu allocation for each mm, the choice will > be clear. Especially if the per-cpu/per-mm thing tries to avoid false > cacheline sharing, which sounds really "interesting" in itself. > And without the cacheline sharing avoidance, what's the point of this > again? It sure wasn't to make the code simpler. It was about performance > and scalability. "The perfect is the enemy of the good." The "perfect" cacheline separation achieved that way is at the cost of destabilizing the kernel. The dense per-cpu business is only really a concession to the notion that the counter needs to be split up at all, which has never been demonstrated with performance measurements. In fact, Robin Holt has performance measurements demonstrating the opposite. The "good" alternatives are negligibly different wrt. performance, and don't carry the high cost of rwlock starvation that breaks boxen. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 19:08 ` William Lee Irwin III @ 2004-11-20 19:16 ` Linus Torvalds 2004-11-20 19:33 ` William Lee Irwin III 2004-11-20 20:25 ` [OT] " Adam Heath 1 sibling, 1 reply; 286+ messages in thread From: Linus Torvalds @ 2004-11-20 19:16 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Andrew Morton, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, William Lee Irwin III wrote: > > "The perfect is the enemy of the good." Yes. But in this case, my suggestion _is_ the good. You seem to be pushing for a really horrid thing which allocates a per-cpu array for each mm_struct. What is it that you have against the per-thread rss? We already have several places that do the thread-looping, so it's not like "you can't do that" is a valid argument. Linus ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 19:16 ` Linus Torvalds @ 2004-11-20 19:33 ` William Lee Irwin III 2004-11-22 17:44 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 19:33 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrew Morton, clameter, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, William Lee Irwin III wrote: >> "The perfect is the enemy of the good." On Sat, Nov 20, 2004 at 11:16:12AM -0800, Linus Torvalds wrote: > Yes. But in this case, my suggestion _is_ the good. You seem to be pushing > for a really horrid thing which allocates a per-cpu array for each > mm_struct. > What is it that you have against the per-thread rss? We already have > several places that do the thread-looping, so it's not like "you can't do > that" is a valid argument. Okay, first thread groups can share mm's, so it's worse than iterating over a thread group. Second, the long loops under tasklist_lock didn't stop causing rwlock starvation because what patches there were to do something about them didn't get merged. I'm not particularly "stuck on" the per-cpu business, it was merely the most obvious method of splitting the RSS counter without catastrophes elsewhere. Robin Holt's 2.4 performance studies actually show that splitting the counter is not even essential. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 19:33 ` William Lee Irwin III @ 2004-11-22 17:44 ` Christoph Lameter 2004-11-22 22:43 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 17:44 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Nick Piggin, Andrew Morton, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, William Lee Irwin III wrote: > I'm not particularly "stuck on" the per-cpu business, it was merely the > most obvious method of splitting the RSS counter without catastrophes > elsewhere. Robin Holt's 2.4 performance studies actually show that > splitting the counter is not even essential. There is no problem moving back to the atomic approach, that is, if it is okay to also make anon_rss atomic. But it's a pretty significant performance hit (comparison with some old data from V4 of the patch, which makes this data a bit suspect since the test environment is likely slightly different. I should really test this again. Note that the old performance test was only run 3 times instead of 10): atomic vs.
sloppy rss performance 64G allocation:

sloppy rss:
 Gb Rep Threads    User     System      Wall  flt/cpu/s fault/wsec
 16  10       1  1.818s   131.556s  133.038s  78618.592  78615.672
 16  10       2  1.736s   121.167s   65.026s  85317.098 160656.362
 16  10       4  1.835s   120.444s   36.002s  85751.810 291074.998
 16  10       8  1.820s   131.068s   25.049s  78906.310 411304.895
 16  10      16  3.275s   194.971s   22.019s  52892.356 472497.962
 16  10      32 13.006s   496.628s   27.044s  20575.038 381999.865

atomic:
 Gb Rep Threads    User     System      Wall  flt/cpu/s fault/wsec
 16   3       1  0.610s    61.557s   62.016s  50600.438  50599.822
 16   3       2  0.640s    83.116s   43.016s  37557.847  72869.978
 16   3       4  0.621s    73.897s   26.023s  42214.002 119908.246
 16   3       8  0.596s    86.587s   14.098s  36081.229 209962.059
 16   3      16  0.646s    69.601s    7.000s  44780.269 448823.690
 16   3      32  0.903s   185.609s    8.085s  16866.018 355301.694

Let's go for the approach of moving rss into the thread structure but keeping the rss in the mm structure as is (page_table_lock is needed for updates) to consolidate the values. This allows us to keep most of the code as is, and the rss in the task struct is only used if we are not holding page_table_lock. Maybe we can then find some way to regularly update the rss in the mm structure to avoid the loop over the tasklist in proc. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-22 17:44 ` Christoph Lameter @ 2004-11-22 22:43 ` William Lee Irwin III 2004-11-22 22:51 ` Christoph Lameter 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-22 22:43 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Nick Piggin, Andrew Morton, benh, hugh, linux-mm, linux-ia64, linux-kernel On Sat, 20 Nov 2004, William Lee Irwin III wrote: >> I'm not particularly "stuck on" the per-cpu business, it was merely the >> most obvious method of splitting the RSS counter without catastrophes >> elsewhere. Robin Holt's 2.4 performance studies actually show that >> splitting the counter is not even essential. On Mon, Nov 22, 2004 at 09:44:02AM -0800, Christoph Lameter wrote: > There is no problem moving back to the atomic approach that is if it is > okay to also make anon_rss atomic. But its a pretty significant > performance hit (comparison with some old data from V4 of patch which > makes this data a bit suspect since the test environment is likely > slightly different. I should really test this again. Note that the old > performance test was only run 3 times instead of 10): > atomic vs. sloppy rss performance 64G allocation: The specific patches you compared matter a great deal as there are implementation blunders (e.g. poor placement of counters relative to ->mmap_sem) that can ruin the results. URL's to the specific patches would rule out that source of error. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-22 22:43 ` William Lee Irwin III @ 2004-11-22 22:51 ` Christoph Lameter 2004-11-23 2:25 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Christoph Lameter @ 2004-11-22 22:51 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Nick Piggin, Andrew Morton, benh, hugh, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, William Lee Irwin III wrote: > The specific patches you compared matter a great deal as there are > implementation blunders (e.g. poor placement of counters relative to > ->mmap_sem) that can ruin the results. URL's to the specific patches > would rule out that source of error. I mentioned V4 of this patch which was posted to lkml. A simple search should get you there. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-22 22:51 ` Christoph Lameter @ 2004-11-23 2:25 ` William Lee Irwin III 0 siblings, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-23 2:25 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Nick Piggin, Andrew Morton, benh, hugh, linux-mm, linux-ia64, linux-kernel On Mon, 22 Nov 2004, William Lee Irwin III wrote: >> The specific patches you compared matter a great deal as there are >> implementation blunders (e.g. poor placement of counters relative to >> ->mmap_sem) that can ruin the results. URL's to the specific patches >> would rule out that source of error. On Mon, Nov 22, 2004 at 02:51:22PM -0800, Christoph Lameter wrote: > I mentioned V4 of this patch which was posted to lkml. A simple search > should get you there. The counter's placement was poor in that version of the patch. The results are very suspect and likely invalid. It would have been more helpful if you provided some kind of unique identifier when requests for complete disambiguation are made. For instance, the version tags of your patches are not visible in Subject: lines. There are, of course, other issues, e.g. where the arch sweeps went. This discussion has degenerated into non-cooperation making it beyond my power to help, and I'm in the midst of several rather urgent bughunts, of which there are apparently more to come. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* [OT] Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 19:08 ` William Lee Irwin III 2004-11-20 19:16 ` Linus Torvalds @ 2004-11-20 20:25 ` Adam Heath 1 sibling, 0 replies; 286+ messages in thread From: Adam Heath @ 2004-11-20 20:25 UTC (permalink / raw) Cc: linux-kernel On Sat, 20 Nov 2004, William Lee Irwin III wrote: > "The perfect is the enemy of the good." "With the proper course of action made so explicit, we had merely to choose between wisdom and folly. Precisely how we chose folly in this instance is not entirely clear." Quote taken from Andrew Suffield on irc, who said he got it from Penny Arcade. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 6:49 ` Nick Piggin 2004-11-20 6:57 ` Andrew Morton @ 2004-11-20 7:15 ` William Lee Irwin III 2004-11-20 7:29 ` Nick Piggin 1 sibling, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 7:15 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> touch_nmi_watchdog() is only "protection" against local interrupt >> disablement triggering the NMI oopser because alert_counter[] >> increments are not atomic. Yet even supposing they were made so, the On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote: > That would be a bug in touch_nmi_watchdog then, because you're > racy against your own NMI too. > So I'm actually not very very wrong at all. I'm technically wrong > because touch_nmi_watchdog has a theoretical 'bug'. In practice, > multiple races with the non atomic increments to the same counter, > and in an unbroken sequence would be about as likely as hardware > failure. > Anyway, this touch nmi thing is going off topic, sorry list. No, it's on-topic. (1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses, merely not every time, and not on every system. It is not associated with hardware failure. It is, however, tolerable because sysrq's require privilege to trigger and are primarily used when the box is dying anyway. (2) NMI's don't nest. There is no possibility of NMI's racing against themselves while the data is per-cpu. William Lee Irwin III wrote: >> net effect of "covering up" this gross deficiency is making the >> user-observable problems it causes undiagnosable, as noted before. On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote: > Well the loops that are in there now aren't covered up, and they > don't seem to be causing problems. Ergo there is no problem (we're > being _practical_ here, right?)
They are causing problems. They never stopped causing problems. None of the above attempts to reduce rwlock starvation has been successful in reducing it to untriggerable-in-the-field levels, and empirical demonstrations of starvation recurring after those available at the time of testing were put into place did in fact happen. Reduction of frequency and making starvation more difficult to trigger are all that they've achieved thus far. William Lee Irwin III wrote: >> Kevin Marin was the first to report this issue to lkml. I had seen >> instances of it in internal corporate bugreports and it was one of >> the motivators for the work I did on pidhashing (one of the causes >> of the timeouts was worst cases in pid allocation). Manfred Spraul >> and myself wrote patches attempting to reduce read-side hold time >> in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically >> subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh >> Soni wrote patches to carry out the long iterations in /proc/ locklessly. >> The last several of these affecting /proc/ have not gained acceptance, >> though the work has not been halted in any sense, as this problem >> recurs quite regularly. A considerable amount of sustained effort has >> gone toward mitigating and resolving rwlock starvation. On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote: > That's very nice. But there is no problem _now_, is there? There is and has always been. All of the above merely mitigate the issue, with the possible exception of the tasklist RCU patch, for which I know of no testing results. Also note that almost none of the work on /proc/ has been merged. William Lee Irwin III wrote: >> Aggravating the rwlock starvation destabilizes, not pessimizes, >> and performance is secondary to stability. On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote: > Well luckily we're not going to be aggravating the rwlock starvation.
> If you found a problem with, and fixed do_task_stat: ?time, ???_flt, > et al, then you would apply the same solution to per thread rss to > fix it in the same way. You are aggravating the rwlock starvation by introducing gratuitous full tasklist iterations. There is no solution to do_task_stat() because it was recently introduced. There will be one as part of a port of the usual mitigation patches when the perennial problem is reported against a sufficiently recent kernel version, as usual. The already-demonstrated problematic iterations have not been removed. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 7:15 ` William Lee Irwin III @ 2004-11-20 7:29 ` Nick Piggin 2004-11-20 7:45 ` touch_nmi_watchdog (was: page fault scalability patch V11 [0/7]: overview) Nick Piggin 2004-11-20 7:57 ` page fault scalability patch V11 [0/7]: overview Nick Piggin 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 7:29 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>touch_nmi_watchdog() is only "protection" against local interrupt >>>disablement triggering the NMI oopser because alert_counter[] >>>increments are not atomic. Yet even supposing they were made so, the > > > On Sat, Nov 20, 2004 at 05:49:53PM +1100, Nick Piggin wrote: > >>That would be a bug in touch_nmi_watchdog then, because you're >>racy against your own NMI too. >>So I'm actually not very very wrong at all. I'm technically wrong >>because touch_nmi_watchdog has a theoretical 'bug'. In practice, >>multiple races with the non atomic increments to the same counter, >>and in an unbroken sequence would be about as likely as hardware >>failure. >>Anyway, this touch nmi thing is going off topic, sorry list. > > > No, it's on-topic. > (1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses, > merely not every time, and not on every system. It is not > associated with hardware failure. It is, however, tolerable > because sysrq's require privilege to trigger and are primarily > used when the box is dying anyway. OK then put a touch_nmi_watchdog in there if you must. > (2) NMI's don't nest. There is no possibility of NMI's racing against > themselves while the data is per-cpu. > Your point was that touch_nmi_watchdog(), which resets alert_counter, is racy when resetting the counter of other CPUs. Yes it is racy.
It is also racy against the NMI on the _current_ CPU. This has nothing whatsoever to do with NMIs racing against themselves; I don't know how you got that idea when you were the one to bring up this race anyway. [ snip back-and-forth that is going nowhere ] I'll bow out of the argument here. I grant you raise valid concerns WRT the /proc issues, of course. ^ permalink raw reply [flat|nested] 286+ messages in thread
* touch_nmi_watchdog (was: page fault scalability patch V11 [0/7]: overview) 2004-11-20 7:29 ` Nick Piggin @ 2004-11-20 7:45 ` Nick Piggin 2004-11-20 7:57 ` page fault scalability patch V11 [0/7]: overview Nick Piggin 1 sibling, 0 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 7:45 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: >> (2) NMI's don't nest. There is no possibility of NMI's racing against >> themselves while the data is per-cpu. >> > > Your point was that touch_nmi_watchdog() which resets alert_counter, > is racy when resetting the counter of other CPUs. Yes it is racy. > It is also racy against the NMI on the _current_ CPU. Hmm no I think you're right in that it is only a problem WRT the remote CPUs. However that would still be a problem, as the comment in i386 touch_nmi_watchdog attests. ^ permalink raw reply [flat|nested] 286+ messages in thread
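The race the two of them settle on above is easiest to see in code. The sketch below is a hypothetical user-space simplification of the 2.6-era i386 watchdog: the names `alert_counter` and `touch_nmi_watchdog` follow the kernel, but `NR_CPUS`, the `nmi_tick` helper, and the `threshold` parameter are illustrative assumptions, not the real implementation.

```c
#define NR_CPUS 4

/* Plain ints, as in the 2.6-era code: the updates below are ordinary
 * read-modify-write sequences, not atomic operations. */
static int alert_counter[NR_CPUS];

/* Per-cpu NMI tick: returns nonzero once the counter crosses the
 * threshold, i.e. the point at which the NMI oopser would fire. */
static int nmi_tick(int cpu, int threshold)
{
    alert_counter[cpu]++;           /* local cpu only; NMIs don't nest,
                                     * so this cannot race with itself */
    return alert_counter[cpu] > threshold;
}

/* touch_nmi_watchdog(): resets every cpu's counter with plain stores.
 * The store to the local counter is safe, but the stores to remote
 * cpus' counters race with their concurrent increments, so a reset
 * can be lost -- the remote-CPU race conceded above. */
static void touch_nmi_watchdog(void)
{
    int i;

    for (i = 0; i < NR_CPUS; i++)
        alert_counter[i] = 0;
}
```

Single-threaded, the logic behaves as expected; the hazard only appears when `touch_nmi_watchdog()` runs concurrently with another cpu's `nmi_tick()`.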
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 7:29 ` Nick Piggin 2004-11-20 7:45 ` touch_nmi_watchdog (was: page fault scalability patch V11 [0/7]: overview) Nick Piggin @ 2004-11-20 7:57 ` Nick Piggin 2004-11-20 8:25 ` William Lee Irwin III 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 7:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: > William Lee Irwin III wrote: >> No, it's on-topic. >> (1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses, >> merely not every time, and not on every system. It is not >> associated with hardware failure. It is, however, tolerable >> because sysrq's require privilege to trigger and are primarily >> used when the box is dying anyway. > > > OK then put a touch_nmi_watchdog in there if you must. > Duh, there is one in there :\ Still, that doesn't really say much about a normal tasklist traversal because this thing will spend ages writing stuff to serial console. Now I know going over the whole tasklist is crap. Anything O(n) for things like this is crap. I happen to just get frustrated to see concessions being made to support more efficient /proc access. I know you are one of the ones who has to deal with the practical realities of that though. Sigh. Well try to bear with me... :| ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 7:57 ` page fault scalability patch V11 [0/7]: overview Nick Piggin @ 2004-11-20 8:25 ` William Lee Irwin III 0 siblings, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 8:25 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel Nick Piggin wrote: >> OK then put a touch_nmi_watchdog in there if you must. On Sat, Nov 20, 2004 at 06:57:38PM +1100, Nick Piggin wrote: > Duh, there is one in there :\ > Still, that doesn't really say much about a normal tasklist traversal > because this thing will spend ages writing stuff to serial console. > Now I know going over the whole tasklist is crap. Anything O(n) for > things like this is crap. I happen to just get frustrated to see > concessions being made to support more efficient /proc access. I know > you are one of the ones who has to deal with the practical realities > of that though. Sigh. Well try to bear with me... :| I sure as Hell don't have any interest in /proc/ in and of itself, but this stuff does really bite people, and hard, too. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (7 preceding siblings ...) 2004-11-19 19:59 ` page fault scalability patch V11 [0/7]: overview Linus Torvalds @ 2004-11-20 2:04 ` William Lee Irwin III 2004-11-20 2:18 ` Nick Piggin 2004-11-20 2:06 ` Robin Holt 9 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 2:04 UTC (permalink / raw) To: Christoph Lameter Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote: > A. make_rss_atomic. The earlier releases contained that patch but > then another variable (such as anon_rss) was introduced that would > have required additional atomic operations. Atomic rss operations > are also causing slowdowns on machines with a high number of cpus > due to memory contention. > B. remove_rss. Replace rss with a periodic scan over the vm to > determine rss and additional numbers. This was also discussed on > linux-mm and linux-ia64. The scans while displaying /proc data > were undesirable. Split counters easily resolve the issues with both these approaches (and apparently your co-workers are suggesting it too, and have performance results backing it). -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:04 ` William Lee Irwin III @ 2004-11-20 2:18 ` Nick Piggin 2004-11-20 2:34 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 2:18 UTC (permalink / raw) To: William Lee Irwin III Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: > On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote: > >>A. make_rss_atomic. The earlier releases contained that patch but >>then another variable (such as anon_rss) was introduced that would >> have required additional atomic operations. Atomic rss operations >> are also causing slowdowns on machines with a high number of cpus >> due to memory contention. >>B. remove_rss. Replace rss with a periodic scan over the vm to >> determine rss and additional numbers. This was also discussed on >> linux-mm and linux-ia64. The scans while displaying /proc data >> were undesirable. > > > Split counters easily resolve the issues with both these approaches > (and apparently your co-workers are suggesting it too, and have > performance results backing it). > Split counters still require atomic operations though. This is what Christoph's latest effort is directed at removing. And they'll still bounce cachelines around. (I assume we've reached the conclusion that per-cpu split counters per-mm won't fly?). ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:18 ` Nick Piggin @ 2004-11-20 2:34 ` William Lee Irwin III 2004-11-20 2:40 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 2:34 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel William Lee Irwin III wrote: >> Split counters easily resolve the issues with both these approaches >> (and apparently your co-workers are suggesting it too, and have >> performance results backing it). On Sat, Nov 20, 2004 at 01:18:22PM +1100, Nick Piggin wrote: > Split counters still require atomic operations though. This is what > Christoph's latest effort is directed at removing. And they'll still > bounce cachelines around. (I assume we've reached the conclusion > that per-cpu split counters per-mm won't fly?). Split != per-cpu, though it may be. Counterexamples are as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]); Furthermore, see Robin Holt's results regarding the performance of the atomic operations and their relation to cacheline sharing. And frankly, the argument that the space overhead of per-cpu counters is problematic is not compelling. Even at 1024 cpus it's smaller than an ia64 pagetable page, of which there are numerous instances attached to each mm. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
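The one-liner counterexample above can be fleshed out into a minimal user-space sketch of a split (bucketed, not strictly per-cpu) rss counter. Only the `cpu >> RSS_IDX_SHIFT` indexing idea comes from the message; the bucket count, shift value, struct name, and read-side summation are illustrative assumptions, not code from the patch series under discussion.

```c
#include <stdatomic.h>

#define RSS_IDX_SHIFT 2                  /* assumed: 4 cpus share a bucket */
#define RSS_BUCKETS   8                  /* assumed: covers 32 cpus */

struct mm_counters {
    atomic_long rss[RSS_BUCKETS];        /* split, not strictly per-cpu */
};

/* Fault path: each cpu bumps only its own bucket, spreading contention
 * across RSS_BUCKETS counters instead of one globally shared word. */
static void rss_inc(struct mm_counters *mm, int cpu)
{
    atomic_fetch_add(&mm->rss[(cpu >> RSS_IDX_SHIFT) % RSS_BUCKETS], 1);
}

/* Read path (e.g. /proc): sum the buckets. The total may be momentarily
 * stale, which is tolerable for rss reporting. */
static long rss_read(struct mm_counters *mm)
{
    long total = 0;
    int i;

    for (i = 0; i < RSS_BUCKETS; i++)
        total += atomic_load(&mm->rss[i]);
    return total;
}
```

The design point is the trade-off wli names: fewer buckets than cpus keeps the space overhead small, at the cost of some residual sharing within each bucket.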
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:34 ` William Lee Irwin III @ 2004-11-20 2:40 ` Nick Piggin 2004-11-20 3:04 ` William Lee Irwin III 0 siblings, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 2:40 UTC (permalink / raw) To: William Lee Irwin III Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Split counters easily resolve the issues with both these approaches >>>(and apparently your co-workers are suggesting it too, and have >>>performance results backing it). > > > On Sat, Nov 20, 2004 at 01:18:22PM +1100, Nick Piggin wrote: > >>Split counters still require atomic operations though. This is what >>Christoph's latest effort is directed at removing. And they'll still >>bounce cachelines around. (I assume we've reached the conclusion >>that per-cpu split counters per-mm won't fly?). > > > Split != per-cpu, though it may be. Counterexamples are > as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]); Oh yes, I just meant that the only way split counters will relieve the atomic ops and bouncing is by having them per-cpu. But you knew that :) > Furthermore, see Robin Holt's results regarding the performance of the > atomic operations and their relation to cacheline sharing. > Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss means another atomic op. While this doesn't immediately make it a showstopper, it is gradually slowing down the single threaded page fault path too, which is bad. > And frankly, the argument that the space overhead of per-cpu counters > is problematic is not compelling. Even at 1024 cpus it's smaller than > an ia64 pagetable page, of which there are numerous instances attached > to each mm. > 1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably don't even care about 64K on their large machines, but... 
On i386 this would be maybe 32 * 128 byte == 4K per task for distro kernels. Not so good. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 2:40 ` Nick Piggin @ 2004-11-20 3:04 ` William Lee Irwin III 2004-11-20 3:14 ` Nick Piggin 2004-11-20 3:33 ` Robin Holt 0 siblings, 2 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 3:04 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: >> Furthermore, see Robin Holt's results regarding the performance of the >> atomic operations and their relation to cacheline sharing. On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss Irrelevant. Unshare cachelines with hot mm-global ones, and the "problem" goes away. This stuff is going on and on about some purist "no atomic operations anywhere" weirdness even though killing the last atomic operation creates problems and doesn't improve performance. On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > means another atomic op. While this doesn't immediately make it a > showstopper, it is gradually slowing down the single threaded page > fault path too, which is bad. William Lee Irwin III wrote: >> And frankly, the argument that the space overhead of per-cpu counters >> is problematic is not compelling. Even at 1024 cpus it's smaller than >> an ia64 pagetable page, of which there are numerous instances attached >> to each mm. On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > 1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably > don't even care about 64K on their large machines, but... > On i386 this would be maybe 32 * 128 byte == 4K per task for distro > kernels. Not so good. Why the Hell would you bother giving each cpu a separate cacheline? The odds of bouncing significantly merely amongst the counters are not particularly high. 
-- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:04 ` William Lee Irwin III @ 2004-11-20 3:14 ` Nick Piggin 2004-11-20 3:43 ` William Lee Irwin III 2004-11-20 3:33 ` Robin Holt 1 sibling, 1 reply; 286+ messages in thread From: Nick Piggin @ 2004-11-20 3:14 UTC (permalink / raw) To: William Lee Irwin III Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Furthermore, see Robin Holt's results regarding the performance of the >>>atomic operations and their relation to cacheline sharing. > > > On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > >>Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss > > > Irrelevant. Unshare cachelines with hot mm-global ones, and the > "problem" goes away. > That's the idea. > This stuff is going on and on about some purist "no atomic operations > anywhere" weirdness even though killing the last atomic operation > creates problems and doesn't improve performance. > Huh? How is not wanting to impact single threaded performance being "purist weirdness"? Practical, I'd call it. > > On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > >>means another atomic op. While this doesn't immediately make it a >>showstopper, it is gradually slowing down the single threaded page >>fault path too, which is bad. > > > William Lee Irwin III wrote: > >>>And frankly, the argument that the space overhead of per-cpu counters >>>is problematic is not compelling. Even at 1024 cpus it's smaller than >>>an ia64 pagetable page, of which there are numerous instances attached >>>to each mm. > > > On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > >>1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably >>don't even care about 64K on their large machines, but... >>On i386 this would be maybe 32 * 128 byte == 4K per task for distro >>kernels. 
Not so good. > > > Why the Hell would you bother giving each cpu a separate cacheline? > The odds of bouncing significantly merely amongst the counters are not > particularly high. > Hmm yeah I guess I wouldn't put them all on different cachelines. As you can see though, Christoph ran into a wall at 8 CPUs, so having them densely packed still might not be enough. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:14 ` Nick Piggin @ 2004-11-20 3:43 ` William Lee Irwin III 2004-11-20 3:58 ` Nick Piggin 0 siblings, 1 reply; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 3:43 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: >> Irrelevant. Unshare cachelines with hot mm-global ones, and the >> "problem" goes away. On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > That's the idea. William Lee Irwin III wrote: >> This stuff is going on and on about some purist "no atomic operations >> anywhere" weirdness even though killing the last atomic operation >> creates problems and doesn't improve performance. On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > Huh? How is not wanting to impact single threaded performance being > "purist weirdness"? Practical, I'd call it. Empirically demonstrate the impact on single-threaded performance. On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: >> Why the Hell would you bother giving each cpu a separate cacheline? >> The odds of bouncing significantly merely amongst the counters are not >> particularly high. On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > Hmm yeah I guess I wouldn't put them all on different cachelines. > As you can see though, Christoph ran into a wall at 8 CPUs, so > having them densely packed still might not be enough. Please be more specific about the result, and cite the Message-Id. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:43 ` William Lee Irwin III @ 2004-11-20 3:58 ` Nick Piggin 2004-11-20 4:01 ` William Lee Irwin III 2004-11-20 4:34 ` Robin Holt 0 siblings, 2 replies; 286+ messages in thread From: Nick Piggin @ 2004-11-20 3:58 UTC (permalink / raw) To: William Lee Irwin III Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>Irrelevant. Unshare cachelines with hot mm-global ones, and the >>>"problem" goes away. > > > On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > >>That's the idea. > > > > William Lee Irwin III wrote: > >>>This stuff is going on and on about some purist "no atomic operations >>>anywhere" weirdness even though killing the last atomic operation >>>creates problems and doesn't improve performance. > > > On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > >>Huh? How is not wanting to impact single threaded performance being >>"purist weirdness"? Practical, I'd call it. > > > Empirically demonstrate the impact on single-threaded performance. > I can tell you it's worse. I don't have to demonstrate anything; more atomic RMW ops in the page fault path are going to have an impact. I'm not saying we must not compromise *anywhere*, but it would just be nice to try to avoid making the path heavier, that's all. I'm not being purist when I say I'd first rather explore all other options before adding atomics. But never mind arguing, it appears Linus' suggested method will be fine and *does* mean we don't have to compromise. > > On Sat, Nov 20, 2004 at 01:40:40PM +1100, Nick Piggin wrote: > >>>Why the Hell would you bother giving each cpu a separate cacheline? >>>The odds of bouncing significantly merely amongst the counters are not >>>particularly high.
> > > On Sat, Nov 20, 2004 at 02:14:33PM +1100, Nick Piggin wrote: > >>Hmm yeah I guess I wouldn't put them all on different cachelines. >>As you can see though, Christoph ran into a wall at 8 CPUs, so >>having them densely packed still might not be enough. > > > Please be more specific about the result, and cite the Message-Id. > Start of this thread. ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:58 ` Nick Piggin @ 2004-11-20 4:01 ` William Lee Irwin III 2004-11-20 4:34 ` Robin Holt 1 sibling, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 4:01 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt William Lee Irwin III wrote: >> Please be more specific about the result, and cite the Message-Id. On Sat, Nov 20, 2004 at 02:58:36PM +1100, Nick Piggin wrote: > Start of this thread. Those do not have testing results of different RSS counter implementations in isolation. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:58 ` Nick Piggin 2004-11-20 4:01 ` William Lee Irwin III @ 2004-11-20 4:34 ` Robin Holt 1 sibling, 0 replies; 286+ messages in thread From: Robin Holt @ 2004-11-20 4:34 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt On Sat, Nov 20, 2004 at 02:58:36PM +1100, Nick Piggin wrote: > >Please be more specific about the result, and cite the Message-Id. > > > > Start of this thread. Part of the impact was having the page table lock, the mmap_sem, and these two atomic counters in the same cacheline. What about separating the counters from the locks? ^ permalink raw reply [flat|nested] 286+ messages in thread
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:04 ` William Lee Irwin III 2004-11-20 3:14 ` Nick Piggin @ 2004-11-20 3:33 ` Robin Holt 2004-11-20 4:24 ` William Lee Irwin III 1 sibling, 1 reply; 286+ messages in thread From: Robin Holt @ 2004-11-20 3:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel, Robin Holt On Fri, Nov 19, 2004 at 07:04:25PM -0800, William Lee Irwin III wrote: > Why the Hell would you bother giving each cpu a separate cacheline? > The odds of bouncing significantly merely amongst the counters are not > particularly high. Agree, we are currently using atomic ops on a global rss on our 2.4 kernel with 512-cpu systems and not seeing much cacheline contention. I don't remember how little it ended up being, but it was very little. We had gone to dropping the page_table_lock and only reacquiring it if the pte was non-null when we went to insert our new one. I think that was how we had it working. I would have to wake up and actually look at that code as it was many months ago that Ray Bryant did that work. We did make rss atomic. Most of the contention is sorted out by the mmap_sem. Processes acquiring themselves off of mmap_sem were found to have spaced themselves out enough that they were all approximately equal time from doing their atomic_add and therefore had very little contention for the cacheline. At least it was not enough that we could measure it as significant. ^ permalink raw reply [flat|nested] 286+ messages in thread
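The "drop the lock and only worry if the pte turned out non-null" scheme Robin describes is essentially the cmpxchg-based population of an empty pte that this patch series builds on. A hypothetical user-space sketch, with a C11 `atomic_long` standing in for a real `pte_t` and zero standing in for `pte_none()` (the function name and signature are illustrative, not the patch's actual `ptep_cmpxchg` API):

```c
#include <stdatomic.h>

typedef atomic_long pte_t;          /* stand-in for a real pte */

/* Populate an empty pte without holding page_table_lock: install the
 * new entry only if the slot is still empty. Returns nonzero on
 * success; on failure -- someone else populated the slot first -- the
 * caller falls back to a locked slow path or simply retries the fault. */
static int ptep_cmpxchg_install(pte_t *ptep, long newval)
{
    long expected = 0;              /* "empty", i.e. pte_none() */
    return atomic_compare_exchange_strong(ptep, &expected, newval);
}
```

This is safe under the invariant stated in the overview: the swapper never sets a pte back to empty until the page has been evicted, so an empty slot can only be filled, never concurrently emptied and refilled.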
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-20 3:33 ` Robin Holt @ 2004-11-20 4:24 ` William Lee Irwin III 0 siblings, 0 replies; 286+ messages in thread From: William Lee Irwin III @ 2004-11-20 4:24 UTC (permalink / raw) To: Robin Holt Cc: Nick Piggin, Christoph Lameter, torvalds, akpm, Benjamin Herrenschmidt, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, Nov 19, 2004 at 09:33:12PM -0600, Robin Holt wrote: > Agree, we are currently using atomic ops on a global rss on our 2.4 > kernel with 512-cpu systems and not seeing much cacheline contention. > I don't remember how little it ended up being, but it was very little. > We had gone to dropping the page_table_lock and only reacquiring it if > the pte was non-null when we went to insert our new one. I think that > was how we had it working. I would have to wake up and actually look > at that code as it was many months ago that Ray Bryant did that work. > We did make rss atomic. Most of the contention is sorted out by the > mmap_sem. Processes acquiring themselves off of mmap_sem were found > to have spaced themselves out enough that they were all approximately > equal time from doing their atomic_add and therefore had very little > contention for the cacheline. At least it was not enough that we could > measure it as significant. Also, the densely-packed split counter can only get 4-16 cpus to a cacheline with cachelines <= 128B, so there are definite limitations to the amount of cacheline contention in such schemes. -- wli ^ permalink raw reply [flat|nested] 286+ messages in thread
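The "4-16 cpus to a cacheline" figure falls straight out of the sizes involved: a line of `line_bytes` holds `line_bytes / counter_bytes` packed counters. The helper below just makes that arithmetic explicit; the interpretation that the low end corresponds to 16 bytes per cpu group (e.g. rss and anon_rss packed together) is an assumption for illustration, not something stated in the message.

```c
/* Number of packed split counters (hence cpus or cpu groups) sharing
 * one cacheline: 8-byte counters give 8 per 64B line and 16 per 128B
 * line; 16-byte pairs of counters give 4 per 64B line. */
static int counters_per_line(int line_bytes, int counter_bytes)
{
    return line_bytes / counter_bytes;
}
```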
* Re: page fault scalability patch V11 [0/7]: overview 2004-11-19 19:42 ` page fault scalability patch V11 [0/7]: overview Christoph Lameter ` (8 preceding siblings ...) 2004-11-20 2:04 ` William Lee Irwin III @ 2004-11-20 2:06 ` Robin Holt 9 siblings, 0 replies; 286+ messages in thread From: Robin Holt @ 2004-11-20 2:06 UTC (permalink / raw) To: Christoph Lameter Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, linux-mm, linux-ia64, linux-kernel On Fri, Nov 19, 2004 at 11:42:39AM -0800, Christoph Lameter wrote: > Note that I have posted two other approaches of dealing with the rss problem: > > A. make_rss_atomic. The earlier releases contained that patch but then another > variable (such as anon_rss) was introduced that would have required additional > atomic operations. Atomic rss operations are also causing slowdowns on > machines with a high number of cpus due to memory contention. > > B. remove_rss. Replace rss with a periodic scan over the vm to determine > rss and additional numbers. This was also discussed on linux-mm and linux-ia64. > The scans while displaying /proc data were undesirable. Can you run a comparison benchmark between atomic rss and anon_rss and the sloppy rss with the rss and anon_rss in separate cachelines. I am not sure that it is important to separate the two into separate lines, just rss and anon_rss from the lock and sema. If I have the time over the weekend, I might try this myself. If not, can you give it a try. Thanks, Robin ^ permalink raw reply [flat|nested] 286+ messages in thread