linux-kernel.vger.kernel.org archive mirror
* [QUICKLIST 0/4] Arch independent quicklists V2
@ 2007-03-13  7:13 Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 1/4] Generic quicklist implementation Christoph Lameter
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13  7:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

V1->V2
- Add sparc64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages. It is needed on i386 and x86_64 to
avoid special casing in SLUB, because these two platforms use fields in
the struct page (page->index and page->private) that SLUB needs (and in
fact SLAB also needs page->private when debugging is enabled!). There
is also the tendency of arches to use page flags to mark page table
pages, and the slab allocators use page flags as well. Separating page
table page allocation into quicklists avoids the danger of conflicts
and frees up page flags for SLUB and for the arch code.

Page table pages have the characteristics that they are typically zero
or in a known state when they are freed. This is usually exactly the
same state as needed after allocation. So it makes sense to build a list
of freed page table pages and to satisfy new allocations from that list
first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that the
MMU can use them most effectively. Page table pages are used in a sparse
way, so zeroing them in full on allocation is not too useful.

Such an implementation already exists for ia64. However, that
implementation did not support constructors and destructors as needed
by i386 / x86_64, and it only supported a single quicklist. The
implementation here has constructor and destructor support as well as
the ability for an arch to specify how many quicklists are needed.

An arch requests quicklists by defining the necessary number of
quicklists in arch/<arch>/Kconfig. F.e. i386 needs two and thus has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:

quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)

Page table pages can be freed using:

quicklist_free(<quicklist-nr>, <destructor>, <page>)

Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.
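
To make the API concrete, here is a minimal sketch of how a hypothetical
arch might hook its pgd handling up to a dedicated quicklist. The
QUICK_PGD number and the my_pgd_ctor / my_pgd_dtor helpers are made up
for illustration; the real i386 and x86_64 versions are in the patches
that follow.

#define QUICK_PGD 0	/* one of the arch's NR_QUICK quicklists */

static void my_pgd_ctor(void *pgd)
{
	/* establish the definite page state, f.e. copy kernel mappings */
}

static void my_pgd_dtor(void *pgd)
{
	/* undo the constructor before the page leaves the quicklist */
}

static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
	/*
	 * Reuses a page from the quicklist if one is available (it is
	 * already in the constructed state); otherwise a fresh zeroed
	 * page is allocated and the constructor is run on it.
	 */
	return quicklist_alloc(QUICK_PGD, GFP_KERNEL, my_pgd_ctor);
}

static inline void pgd_free(pgd_t *pgd)
{
	/*
	 * Usually just puts the page back on the per cpu quicklist in
	 * its constructed state; the destructor only runs if the page
	 * is handed back to the page allocator (f.e. off node pages).
	 */
	quicklist_free(QUICK_PGD, my_pgd_dtor, pgd);
}

static inline void check_pgt_cache(void)
{
	/* trim the quicklist when it has grown beyond a reasonable size */
	quicklist_check(QUICK_PGD, my_pgd_dtor);
}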




* [QUICKLIST 1/4] Generic quicklist implementation
  2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
@ 2007-03-13  7:13 ` Christoph Lameter
  2007-03-13  9:05   ` Paul Mundt
  2007-03-13  7:13 ` [QUICKLIST 2/4] Quicklist support for i386 Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13  7:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Abstract quicklist from the IA64 implementation

Extract the quicklist implementation from IA64, clean it up
and generalize it to allow multiple quicklists and support
for constructors and destructors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/ia64/Kconfig          |    4 ++
 arch/ia64/mm/contig.c      |    2 -
 arch/ia64/mm/discontig.c   |    2 -
 arch/ia64/mm/init.c        |   51 ---------------------------
 include/asm-ia64/pgalloc.h |   82 ++++++++-------------------------------------
 include/linux/quicklist.h  |   81 ++++++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig                 |    5 ++
 mm/Makefile                |    2 +
 mm/quicklist.c             |   81 ++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 191 insertions(+), 119 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/init.c	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c	2007-03-12 22:49:23.000000000 -0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr;	/* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES			25UL
-#define MAX_PGT_FREES_PER_PASS		16L
-#define PGT_FRACTION_OF_NODE_MEM	16
-
-static inline long
-max_pgt_pages(void)
-{
-	u64 node_free_pages, max_pgt_pages;
-
-#ifndef	CONFIG_NUMA
-	node_free_pages = nr_free_pages();
-#else
-	node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-	max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-	max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-	return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-	long pages_to_free;
-
-	pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-	pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-	return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-	long pages_to_free;
-
-	if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-		return;
-
-	preempt_disable();
-	while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-		while (pages_to_free--) {
-			free_page((unsigned long)pgtable_quicklist_alloc());
-		}
-		preempt_enable();
-		preempt_disable();
-	}
-	preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-ia64/pgalloc.h	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h	2007-03-12 22:49:23.000000000 -0700
@@ -18,71 +18,18 @@
 #include <linux/mm.h>
 #include <linux/page-flags.h>
 #include <linux/threads.h>
+#include <linux/quicklist.h>
 
 #include <asm/mmu_context.h>
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-	long ql_size = 0;
-	int cpuid;
-
-	for_each_online_cpu(cpuid) {
-		ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-	}
-	return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-	unsigned long *ret = NULL;
-
-	preempt_disable();
-
-	ret = pgtable_quicklist;
-	if (likely(ret != NULL)) {
-		pgtable_quicklist = (unsigned long *)(*ret);
-		ret[0] = 0;
-		--pgtable_quicklist_size;
-		preempt_enable();
-	} else {
-		preempt_enable();
-		ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-	}
-
-	return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-	int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-	if (unlikely(nid != numa_node_id())) {
-		free_page((unsigned long)pgtable_entry);
-		return;
-	}
-#endif
-
-	preempt_disable();
-	*(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-	pgtable_quicklist = (unsigned long *)pgtable_entry;
-	++pgtable_quicklist_size;
-	preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 {
-	pgtable_quicklist_free(pgd);
+	quicklist_free(0, NULL, pgd);
 }
 
 #ifdef CONFIG_PGTABLE_4
@@ -94,12 +41,12 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pud_free(pud_t * pud)
 {
-	pgtable_quicklist_free(pud);
+	quicklist_free(0, NULL, pud);
 }
 #define __pud_free_tlb(tlb, pud)	pud_free(pud)
 #endif /* CONFIG_PGTABLE_4 */
@@ -112,12 +59,12 @@ pud_populate(struct mm_struct *mm, pud_t
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t * pmd)
 {
-	pgtable_quicklist_free(pmd);
+	quicklist_free(0, NULL, pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd)	pmd_free(pmd)
@@ -137,28 +84,31 @@ pmd_populate_kernel(struct mm_struct *mm
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 					 unsigned long addr)
 {
-	void *pg = pgtable_quicklist_alloc();
+	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
 	return pg ? virt_to_page(pg) : NULL;
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pte_free(struct page *pte)
 {
-	pgtable_quicklist_free(page_address(pte));
+	quicklist_free(0, NULL, page_address(pte));
 }
 
 static inline void pte_free_kernel(pte_t * pte)
 {
-	pgtable_quicklist_free(pte);
+	quicklist_free(0, NULL, pte);
 }
 
-#define __pte_free_tlb(tlb, pte)	pte_free(pte)
+static inline void check_pgt_cache(void)
+{
+	quicklist_check(0, NULL);
+}
 
-extern void check_pgt_cache(void);
+#define __pte_free_tlb(tlb, pte)	pte_free(pte)
 
 #endif				/* _ASM_IA64_PGALLOC_H */
Index: linux-2.6.21-rc3-mm2/include/linux/quicklist.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc3-mm2/include/linux/quicklist.h	2007-03-12 22:53:23.000000000 -0700
@@ -0,0 +1,81 @@
+#ifndef LINUX_QUICKLIST_H
+#define LINUX_QUICKLIST_H
+/*
+ * Fast allocations and disposal of pages. Pages must be in the condition
+ * as needed after allocation when they are freed. Per cpu lists of pages
+ * are kept that only contain node local pages.
+ *
+ * (C) 2007, SGI. Christoph Lameter <clameter@sgi.com>
+ */
+#include <linux/kernel.h>
+#include <linux/gfp.h>
+#include <linux/percpu.h>
+
+#ifdef CONFIG_NR_QUICK
+
+struct quicklist {
+	void *page;
+	int nr_pages;
+};
+
+DECLARE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
+
+static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
+{
+	struct quicklist *q;
+	void **p = NULL;
+
+	q =&get_cpu_var(quicklist)[nr];
+	p = q->page;
+	if (likely(p)) {
+		q->page = p[0];
+		p[0] = NULL;
+		q->nr_pages--;
+	}
+	put_cpu_var(quicklist);
+	if (likely(p))
+		return p;
+
+	p = (void *)__get_free_page(flags | __GFP_ZERO);
+	if (ctor && p)
+		ctor(p);
+	return p;
+}
+
+static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
+{
+	struct quicklist *q;
+	void **p = pp;
+	struct page *page = virt_to_page(p);
+	int nid = page_to_nid(page);
+
+	if (unlikely(nid != numa_node_id())) {
+		if (dtor)
+			dtor(p);
+		free_page((unsigned long)p);
+		return;
+	}
+
+	q = &get_cpu_var(quicklist)[nr];
+	p[0] = q->page;
+	q->page = p;
+	q->nr_pages++;
+	put_cpu_var(quicklist);
+}
+
+void quicklist_check(int nr, void (*dtor)(void *));
+unsigned long quicklist_total_size(void);
+
+#else
+void quicklist_check(int nr, void (*dtor)(void *))
+{
+}
+
+unsigned long quicklist_total_size(void)
+{
+	return 0;
+}
+#endif
+
+#endif /* LINUX_QUICKLIST_H */
+
Index: linux-2.6.21-rc3-mm2/mm/Makefile
===================================================================
--- linux-2.6.21-rc3-mm2.orig/mm/Makefile	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/mm/Makefile	2007-03-13 00:09:06.000000000 -0700
@@ -30,3 +30,5 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_QUICKLIST) += quicklist.o
+
Index: linux-2.6.21-rc3-mm2/mm/quicklist.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc3-mm2/mm/quicklist.c	2007-03-12 22:51:55.000000000 -0700
@@ -0,0 +1,81 @@
+/*
+ * Quicklist support.
+ *
+ * Quicklists are light weight lists of pages that have a defined state
+ * on alloc and free. Pages must be in the quicklist specific defined state
+ * (zero by default) when the page is freed. It seems that the initial idea
+ * for such lists first came from Dave Miller and then various other people
+ * improved on it.
+ *
+ * Copyright (C) 2007 SGI,
+ * 	Christoph Lameter <clameter@sgi.com>
+ * 		Generalized, added support for multiple lists and
+ * 		constructors / destructors.
+ */
+#include <linux/kernel.h>
+
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/quicklist.h>
+
+DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
+
+#define MIN_PAGES		25
+#define MAX_FREES_PER_PASS	16
+#define FRACTION_OF_NODE_MEM	16
+
+static unsigned long max_pages(void)
+{
+	unsigned long node_free_pages, max;
+
+	node_free_pages = node_page_state(numa_node_id(),
+			NR_FREE_PAGES);
+	max = node_free_pages / FRACTION_OF_NODE_MEM;
+	return max(max, (unsigned long)MIN_PAGES);
+}
+
+static long min_pages_to_free(struct quicklist *q)
+{
+	long pages_to_free;
+
+	pages_to_free = q->nr_pages - max_pages();
+
+	return min(pages_to_free, (long)MAX_FREES_PER_PASS);
+}
+
+void quicklist_check(int nr, void (*dtor)(void *))
+{
+	long pages_to_free;
+	struct quicklist *q;
+
+	q = &get_cpu_var(quicklist)[nr];
+	if (q->nr_pages > MIN_PAGES) {
+		pages_to_free = min_pages_to_free(q);
+
+		while (pages_to_free > 0) {
+			void *p = quicklist_alloc(nr, 0, NULL);
+
+			if (dtor)
+				dtor(p);
+			free_page((unsigned long)p);
+			pages_to_free--;
+		}
+	}
+	put_cpu_var(quicklist);
+}
+
+unsigned long quicklist_total_size(void)
+{
+	unsigned long count = 0;
+	int cpu;
+	struct quicklist *ql, *q;
+
+	for_each_online_cpu(cpu) {
+		ql = per_cpu(quicklist, cpu);
+		for (q = ql; q < ql + CONFIG_NR_QUICK; q++)
+			count += q->nr_pages;
+	}
+	return count;
+}
+
Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/contig.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/contig.c	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/contig.c	2007-03-12 22:49:23.000000000 -0700
@@ -88,7 +88,7 @@ void show_mem(void)
 	printk(KERN_INFO "%d pages shared\n", total_shared);
 	printk(KERN_INFO "%d pages swap cached\n", total_cached);
 	printk(KERN_INFO "Total of %ld pages in page table cache\n",
-	       pgtable_quicklist_total_size());
+	       quicklist_total_size());
 	printk(KERN_INFO "%d free buffer pages\n", nr_free_buffer_pages());
 }
 
Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/discontig.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/discontig.c	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/discontig.c	2007-03-12 22:49:23.000000000 -0700
@@ -563,7 +563,7 @@ void show_mem(void)
 	printk(KERN_INFO "%d pages shared\n", total_shared);
 	printk(KERN_INFO "%d pages swap cached\n", total_cached);
 	printk(KERN_INFO "Total of %ld pages in page table cache\n",
-	       pgtable_quicklist_total_size());
+	       quicklist_total_size());
 	printk(KERN_INFO "%d free buffer pages\n", nr_free_buffer_pages());
 }
 
Index: linux-2.6.21-rc3-mm2/arch/ia64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/Kconfig	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/Kconfig	2007-03-12 22:49:23.000000000 -0700
@@ -29,6 +29,10 @@ config ZONE_DMA
 	def_bool y
 	depends on !IA64_SGI_SN2
 
+config NR_QUICK
+	int
+	default 1
+
 config MMU
 	bool
 	default y
Index: linux-2.6.21-rc3-mm2/mm/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/mm/Kconfig	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/mm/Kconfig	2007-03-13 00:09:50.000000000 -0700
@@ -220,3 +220,8 @@ config DEBUG_READAHEAD
 
 	  Say N for production servers.
 
+config QUICKLIST
+	bool
+	default y if NR_QUICK != 0
+
+


* [QUICKLIST 2/4] Quicklist support for i386
  2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 1/4] Generic quicklist implementation Christoph Lameter
@ 2007-03-13  7:13 ` Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 3/4] Quicklist support for x86_64 Christoph Lameter
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13  7:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

i386: Convert to quicklists

Implement the i386 management of pgds and pmds using quicklists.

The i386 management of page table pages currently uses page-sized slabs.
The page state is therefore mainly determined by the slab code. However,
i386 also uses its own fields in the struct page to mark special pages
and to build a list of pgds using the ->private and ->index fields (yuck!).
This has been finely tuned to work right with SLAB, but SLUB needs more
control over the struct page. Currently the only way for SLUB to support
these slabs is by special casing PAGE_SIZE slabs.

If we use quicklists instead then we can avoid the mess, and also the
overhead of manipulating page-sized objects through slab.

It also allows us to use standard list manipulation macros for the
pgd list using page->lru, thereby simplifying the code.
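
In essence, the open-coded chaining through page->index and
page->private becomes (see the hunks below):

	list_add(&page->lru, &pgd_list);		/* in pgd_ctor */
	list_del(&page->lru);				/* in pgd_dtor */
	list_for_each_entry(page, &pgd_list, lru)	/* in the walkers */
		...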

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/i386/Kconfig          |    4 ++
 arch/i386/kernel/process.c |    1 
 arch/i386/kernel/smp.c     |    2 -
 arch/i386/mm/fault.c       |    5 +--
 arch/i386/mm/init.c        |   25 -----------------
 arch/i386/mm/pageattr.c    |    2 -
 arch/i386/mm/pgtable.c     |   63 +++++++++++++++++----------------------------
 include/asm-i386/pgalloc.h |    2 -
 include/asm-i386/pgtable.h |   13 +++------
 9 files changed, 39 insertions(+), 78 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/i386/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/init.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/init.c	2007-03-12 22:53:27.000000000 -0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-	if (PTRS_PER_PMD > 1) {
-		pmd_cache = kmem_cache_create("pmd",
-					PTRS_PER_PMD*sizeof(pmd_t),
-					PTRS_PER_PMD*sizeof(pmd_t),
-					0,
-					pmd_ctor,
-					NULL);
-		if (!pmd_cache)
-			panic("pgtable_cache_init(): cannot create pmd cache");
-	}
-	pgd_cache = kmem_cache_create("pgd",
-				PTRS_PER_PGD*sizeof(pgd_t),
-				PTRS_PER_PGD*sizeof(pgd_t),
-				0,
-				pgd_ctor,
-				PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-	if (!pgd_cache)
-		panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/pgtable.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c	2007-03-12 22:53:27.000000000 -0700
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
@@ -181,9 +182,12 @@ void reserve_top_address(unsigned long r
 #endif
 }
 
+#define QUICK_PGD 0
+#define QUICK_PT 1
+
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -198,11 +202,6 @@ struct page *pte_alloc_one(struct mm_str
 	return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-	memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,36 +210,15 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-	page->index = (unsigned long)pgd_list;
-	if (pgd_list)
-		set_page_private(pgd_list, (unsigned long)&page->index);
-	pgd_list = page;
-	set_page_private(page, (unsigned long)&pgd_list);
-}
+LIST_HEAD(pgd_list);
 
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *next, **pprev, *page = virt_to_page(pgd);
-	next = (struct page *)page->index;
-	pprev = (struct page **)page_private(page);
-	*pprev = next;
-	if (next)
-		set_page_private(next, (unsigned long)pprev);
-}
-
-void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_ctor(void *pgd)
 {
 	unsigned long flags;
+	struct page *page = virt_to_page(pgd);
 
 	if (PTRS_PER_PMD == 1) {
 		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
@@ -259,31 +237,32 @@ void pgd_ctor(void *pgd, struct kmem_cac
 			__pa(swapper_pg_dir) >> PAGE_SHIFT,
 			USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
-	pgd_list_add(pgd);
+	list_add(&page->lru, &pgd_list);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 /* never called when PTRS_PER_PMD > 1 */
-void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_dtor(void *pgd)
 {
 	unsigned long flags; /* can be called from interrupt context */
+	struct page *page = virt_to_page(pgd);
 
 	paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
 	spin_lock_irqsave(&pgd_lock, flags);
-	pgd_list_del(pgd);
+	list_del(&page->lru);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
-	pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);
+	pgd_t *pgd = quicklist_alloc(QUICK_PGD, GFP_KERNEL, pgd_ctor);
 
 	if (PTRS_PER_PMD == 1 || !pgd)
 		return pgd;
 
 	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+		pmd_t *pmd = quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 		if (!pmd)
 			goto out_oom;
 		paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
@@ -296,9 +275,9 @@ out_oom:
 		pgd_t pgdent = pgd[i];
 		void* pmd = (void *)__va(pgd_val(pgdent)-1);
 		paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-		kmem_cache_free(pmd_cache, pmd);
+		quicklist_free(QUICK_PT, NULL, pmd);
 	}
-	kmem_cache_free(pgd_cache, pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 	return NULL;
 }
 
@@ -312,8 +291,14 @@ void pgd_free(pgd_t *pgd)
 			pgd_t pgdent = pgd[i];
 			void* pmd = (void *)__va(pgd_val(pgdent)-1);
 			paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-			kmem_cache_free(pmd_cache, pmd);
+			quicklist_free(QUICK_PT, NULL, pmd);
 		}
 	/* in the non-PAE case, free_pgtables() clears user pgd entries */
-	kmem_cache_free(pgd_cache, pgd);
+	quicklist_free(QUICK_PGD, pgd_ctor, pgd);
+}
+
+void check_pgt_cache(void)
+{
+	quicklist_check(QUICK_PGD, pgd_dtor);
+	quicklist_check(QUICK_PT, NULL);
 }
Index: linux-2.6.21-rc3-mm2/arch/i386/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/Kconfig	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/Kconfig	2007-03-12 22:53:27.000000000 -0700
@@ -55,6 +55,10 @@ config ZONE_DMA
 	bool
 	default y
 
+config NR_QUICK
+	int
+	default 2
+
 config SBUS
 	bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-i386/pgtable.h	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-i386/pgtable.h	2007-03-12 22:53:27.000000000 -0700
@@ -35,15 +35,12 @@ struct vm_area_struct;
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 extern unsigned long empty_zero_page[1024];
 extern pgd_t swapper_pg_dir[1024];
-extern struct kmem_cache *pgd_cache;
-extern struct kmem_cache *pmd_cache;
-extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
 
-void pmd_ctor(void *, struct kmem_cache *, unsigned long);
-void pgd_ctor(void *, struct kmem_cache *, unsigned long);
-void pgd_dtor(void *, struct kmem_cache *, unsigned long);
-void pgtable_cache_init(void);
+void check_pgt_cache(void);
+
+extern spinlock_t pgd_lock;
+extern struct list_head pgd_list;
+static inline void pgtable_cache_init(void) {};
 void paging_init(void);
 
 /*
Index: linux-2.6.21-rc3-mm2/arch/i386/kernel/smp.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/kernel/smp.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/kernel/smp.c	2007-03-12 22:53:27.000000000 -0700
@@ -437,7 +437,7 @@ void flush_tlb_mm (struct mm_struct * mm
 	}
 	if (!cpus_empty(cpu_mask))
 		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
-
+	check_pgt_cache();
 	preempt_enable();
 }
 
Index: linux-2.6.21-rc3-mm2/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/kernel/process.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/kernel/process.c	2007-03-12 22:53:27.000000000 -0700
@@ -181,6 +181,7 @@ void cpu_idle(void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 
Index: linux-2.6.21-rc3-mm2/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-i386/pgalloc.h	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-i386/pgalloc.h	2007-03-12 22:53:27.000000000 -0700
@@ -65,6 +65,6 @@ do {									\
 #define pud_populate(mm, pmd, pte)	BUG()
 #endif
 
-#define check_pgt_cache()	do { } while (0)
+extern void check_pgt_cache(void);
 
 #endif /* _I386_PGALLOC_H */
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/fault.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/fault.c	2007-03-12 22:53:27.000000000 -0700
@@ -623,11 +623,10 @@ void vmalloc_sync_all(void)
 			struct page *page;
 
 			spin_lock_irqsave(&pgd_lock, flags);
-			for (page = pgd_list; page; page =
-					(struct page *)page->index)
+			list_for_each_entry(page, &pgd_list, lru)
 				if (!vmalloc_sync_one(page_address(page),
 								address)) {
-					BUG_ON(page != pgd_list);
+					BUG();
 					break;
 				}
 			spin_unlock_irqrestore(&pgd_lock, flags);
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/pageattr.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/pageattr.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/pageattr.c	2007-03-12 22:53:27.000000000 -0700
@@ -95,7 +95,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 		return;
 
 	spin_lock_irqsave(&pgd_lock, flags);
-	for (page = pgd_list; page; page = (struct page *)page->index) {
+	list_for_each_entry(page, &pgd_list, lru) {
 		pgd_t *pgd;
 		pud_t *pud;
 		pmd_t *pmd;


* [QUICKLIST 3/4] Quicklist support for x86_64
  2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 1/4] Generic quicklist implementation Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 2/4] Quicklist support for i386 Christoph Lameter
@ 2007-03-13  7:13 ` Christoph Lameter
  2007-03-13  7:13 ` [QUICKLIST 4/4] Quicklist support for sparc64 Christoph Lameter
  2007-03-13  8:53 ` [QUICKLIST 0/4] Arch independent quicklists V2 Andrew Morton
  4 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13  7:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Convert x86_64 to use quicklists

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is useful to separate out PGD handling. We can carry
the initialized pgds over to the next process needing them.

Also clean up the pgd_list handling to use regular list macros.
There is no longer any need to avoid the lru field.

Move the addition / removal of pgds on the pgd list into the
constructor / destructor. That way the implementation is
congruent with i386.
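
Condensed, the resulting pgd paths (taken from the hunks below) are as
follows; the constructor copies the kernel mappings and links the page
into pgd_list, the destructor unlinks it:

static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
	return (pgd_t *)quicklist_alloc(QUICK_PGD,
			 GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
}

static inline void pgd_free(pgd_t *pgd)
{
	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
}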

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86_64/Kconfig          |    4 ++
 arch/x86_64/kernel/process.c |    1 
 arch/x86_64/kernel/smp.c     |    2 -
 arch/x86_64/mm/fault.c       |    5 +-
 include/asm-x86_64/pgalloc.h |   76 +++++++++++++++++++++----------------------
 include/asm-x86_64/pgtable.h |    3 -
 mm/Kconfig                   |    5 ++
 7 files changed, 52 insertions(+), 44 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig	2007-03-12 22:53:28.000000000 -0700
@@ -56,6 +56,10 @@ config ZONE_DMA
 	bool
 	default y
 
+config NR_QUICK
+	int
+	default 2
+
 config ISA
 	bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h	2007-03-12 22:53:28.000000000 -0700
@@ -4,6 +4,10 @@
 #include <asm/pda.h>
 #include <linux/threads.h>
 #include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0	/* We preserve special mappings over free */
+#define QUICK_PT 1	/* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
 	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-	free_page((unsigned long)pmd);
+	quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
 	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-	free_page((unsigned long)pud);
+	quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+	unsigned boundary;
+	pgd_t *pgd = x;
 	struct page *page = virt_to_page(pgd);
 
+	/*
+	 * Copy kernel pointers in from init.
+	 */
+	boundary = pgd_index(__PAGE_OFFSET);
+	memcpy(pgd + boundary,
+		init_level4_pgt + boundary,
+		(PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
 	spin_lock(&pgd_lock);
-	page->index = (pgoff_t)pgd_list;
-	if (pgd_list)
-		pgd_list->private = (unsigned long)&page->index;
-	pgd_list = page;
-	page->private = (unsigned long)&pgd_list;
+	list_add(&page->lru, &pgd_list);
 	spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-	struct page *next, **pprev, *page = virt_to_page(pgd);
+	pgd_t *pgd = x;
+	struct page *page = virt_to_page(pgd);
 
 	spin_lock(&pgd_lock);
-	next = (struct page *)page->index;
-	pprev = (struct page **)page->private;
-	*pprev = next;
-	if (next)
-		next->private = (unsigned long)pprev;
+	list_del(&page->lru);
 	spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	unsigned boundary;
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (!pgd)
-		return NULL;
-	pgd_list_add(pgd);
-	/*
-	 * Copy kernel pointers in from init.
-	 * Could keep a freelist or slab cache of those because the kernel
-	 * part never changes.
-	 */
-	boundary = pgd_index(__PAGE_OFFSET);
-	memset(pgd, 0, boundary * sizeof(pgd_t));
-	memcpy(pgd + boundary,
-	       init_level4_pgt + boundary,
-	       (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+	pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+			 GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
 	return pgd;
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
 	BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
-	pgd_list_del(pgd);
-	free_page((unsigned long)pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 	if (!p)
 		return NULL;
 	return virt_to_page(p);
@@ -111,17 +106,22 @@ static inline struct page *pte_alloc_one
 static inline void pte_free_kernel(pte_t *pte)
 {
 	BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-	free_page((unsigned long)pte); 
+	quicklist_free(QUICK_PT, NULL, pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
 	__free_page(pte);
-} 
+}
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 
+static inline void check_pgt_cache(void)
+{
+	quicklist_check(QUICK_PGD, pgd_dtor);
+	quicklist_check(QUICK_PT, NULL);
+}
 #endif /* _X86_64_PGALLOC_H */
Index: linux-2.6.21-rc3-mm2/mm/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/mm/Kconfig	2007-03-12 22:49:23.000000000 -0700
+++ linux-2.6.21-rc3-mm2/mm/Kconfig	2007-03-12 22:53:28.000000000 -0700
@@ -225,3 +225,8 @@ config QUICKLIST
 	default y if NR_QUICK != 0
 
 
+config QUICKLIST
+	bool
+	default y if NR_QUICK != 0
+
+
Index: linux-2.6.21-rc3-mm2/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/kernel/process.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/kernel/process.c	2007-03-12 22:53:28.000000000 -0700
@@ -207,6 +207,7 @@ void cpu_idle (void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 			if (!idle)
Index: linux-2.6.21-rc3-mm2/arch/x86_64/kernel/smp.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/kernel/smp.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/kernel/smp.c	2007-03-12 22:53:28.000000000 -0700
@@ -242,7 +242,7 @@ void flush_tlb_mm (struct mm_struct * mm
 	}
 	if (!cpus_empty(cpu_mask))
 		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
-
+	check_pgt_cache();
 	preempt_enable();
 }
 EXPORT_SYMBOL(flush_tlb_mm);
Index: linux-2.6.21-rc3-mm2/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/mm/fault.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/mm/fault.c	2007-03-12 22:53:28.000000000 -0700
@@ -585,7 +585,7 @@ do_sigbus:
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
+LIST_HEAD(pgd_list);
 
 void vmalloc_sync_all(void)
 {
@@ -605,8 +605,7 @@ void vmalloc_sync_all(void)
 			if (pgd_none(*pgd_ref))
 				continue;
 			spin_lock(&pgd_lock);
-			for (page = pgd_list; page;
-			     page = (struct page *)page->index) {
+			list_for_each_entry(page, &pgd_list, lru) {
 				pgd_t *pgd;
 				pgd = (pgd_t *)page_address(page) + pgd_index(address);
 				if (pgd_none(*pgd))
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgtable.h	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgtable.h	2007-03-12 22:53:28.000000000 -0700
@@ -402,7 +402,7 @@ static inline pte_t pte_modify(pte_t pte
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
 extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
+extern struct list_head pgd_list;
 void vmalloc_sync_all(void);
 
 #endif /* !__ASSEMBLY__ */
@@ -419,7 +419,6 @@ extern int kern_addr_valid(unsigned long
 #define HAVE_ARCH_UNMAPPED_AREA
 
 #define pgtable_cache_init()   do { } while (0)
-#define check_pgt_cache()      do { } while (0)
 
 #define PAGE_AGP    PAGE_KERNEL_NOCACHE
 #define HAVE_PAGE_AGP 1


* [QUICKLIST 4/4] Quicklist support for sparc64
  2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-03-13  7:13 ` [QUICKLIST 3/4] Quicklist support for x86_64 Christoph Lameter
@ 2007-03-13  7:13 ` Christoph Lameter
  2007-03-13  8:53 ` [QUICKLIST 0/4] Arch independent quicklists V2 Andrew Morton
  4 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13  7:13 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

From: David Miller <davem@davemloft.net>

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on a UP
SunBlade 1500 and a 24-cpu Niagara T1000.

Signed-off-by: David S. Miller <davem@davemloft.net>

---
 arch/sparc64/Kconfig          |    4 ++++
 arch/sparc64/mm/init.c        |   24 ------------------------
 arch/sparc64/mm/tsb.c         |    2 +-
 include/asm-sparc64/pgalloc.h |   26 ++++++++++++++------------
 4 files changed, 19 insertions(+), 37 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/Kconfig	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig	2007-03-12 22:53:30.000000000 -0700
@@ -26,6 +26,10 @@ config MMU
 	bool
 	default y
 
+config NR_QUICK
+	int
+	default 1
+
 config STACKTRACE_SUPPORT
 	bool
 	default y
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/init.c	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c	2007-03-12 22:53:30.000000000 -0700
@@ -176,30 +176,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
-	clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-	pgtable_cache = kmem_cache_create("pgtable_cache",
-					  PAGE_SIZE, PAGE_SIZE,
-					  SLAB_HWCACHE_ALIGN |
-					  SLAB_MUST_HWCACHE_ALIGN,
-					  zero_ctor,
-					  NULL);
-	if (!pgtable_cache) {
-		prom_printf("Could not create pgtable_cache\n");
-		prom_halt();
-	}
-	tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/tsb.c	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c	2007-03-12 22:53:30.000000000 -0700
@@ -252,7 +252,7 @@ static const char *tsb_cache_names[8] = 
 	"tsb_1MB",
 };
 
-void __init tsb_cache_init(void)
+void __init pgtable_cache_init(void)
 {
 	unsigned long i;
 
Index: linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-sparc64/pgalloc.h	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h	2007-03-12 22:53:30.000000000 -0700
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/quicklist.h>
 
 #include <asm/spitfire.h>
 #include <asm/cpudata.h>
@@ -13,52 +14,50 @@
 #include <asm/page.h>
 
 /* Page table allocation/freeing. */
-extern struct kmem_cache *pgtable_cache;
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
-	kmem_cache_free(pgtable_cache, pgd);
+	quicklist_free(0, NULL, pgd);
 }
 
 #define pud_populate(MM, PUD, PMD)	pud_set(PUD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache,
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache, pmd);
+	quicklist_free(0, NULL, pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return kmem_cache_alloc(pgtable_cache,
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 					 unsigned long address)
 {
-	return virt_to_page(pte_alloc_one_kernel(mm, address));
+	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
+	return pg ? virt_to_page(pg) : NULL;
 }
 		
 static inline void pte_free_kernel(pte_t *pte)
 {
-	kmem_cache_free(pgtable_cache, pte);
+	quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
 {
-	pte_free_kernel(page_address(ptepage));
+	quicklist_free(0, NULL, page_address(ptepage));
 }
 
 
@@ -66,6 +65,9 @@ static inline void pte_free(struct page 
 #define pmd_populate(MM,PMD,PTE_PAGE)		\
 	pmd_populate_kernel(MM,PMD,page_address(PTE_PAGE))
 
-#define check_pgt_cache()	do { } while (0)
+static inline void check_pgt_cache(void)
+{
+	quicklist_check(0, NULL);
+}
 
 #endif /* _SPARC64_PGALLOC_H */


* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13  8:53 ` [QUICKLIST 0/4] Arch independent quicklists V2 Andrew Morton
@ 2007-03-13  8:03   ` Nick Piggin
  2007-03-13 11:52     ` Andrew Morton
  2007-03-13 11:17   ` Christoph Lameter
  1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2007-03-13  8:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

Andrew Morton wrote:
>>On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>>Page table pages have the characteristics that they are typically zero
>>or in a known state when they are freed.
> 
> 
> Well if they're zero then perhaps they should be released to the page allocator
> to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
> page, we break even (except we get to remove special-case code).  If that
> __GFP_ZERO allocation was for some application other than a pagetable, we
> win.
> 
> iow, can we just nuke 'em?

Page allocator still requires interrupts to be disabled, which this doesn't.

Considering there isn't much else that frees known zeroed pages, I wonder if
it is worthwhile.

Last time the zeroidle discussion came up, IIRC there was no real performance
gain, just cooking the 1024 CPU threaded pagefault numbers ;)

-- 
SUSE Labs, Novell Inc.


* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-03-13  7:13 ` [QUICKLIST 4/4] Quicklist support for sparc64 Christoph Lameter
@ 2007-03-13  8:53 ` Andrew Morton
  2007-03-13  8:03   ` Nick Piggin
  2007-03-13 11:17   ` Christoph Lameter
  4 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-13  8:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, clameter, linux-kernel

> On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> Page table pages have the characteristics that they are typically zero
> or in a known state when they are freed.

Well if they're zero then perhaps they should be released to the page allocator
to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
page, we break even (except we get to remove special-case code).  If that
__GFP_ZERO allocation was for some application other than a pagetable, we
win.

iow, can we just nuke 'em?

(Will require some work in the page allocator)
(That work will open the path to using the idle thread to prezero pages)


* Re: [QUICKLIST 1/4] Generic quicklist implementation
  2007-03-13  7:13 ` [QUICKLIST 1/4] Generic quicklist implementation Christoph Lameter
@ 2007-03-13  9:05   ` Paul Mundt
  2007-03-15 20:51     ` Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Paul Mundt @ 2007-03-13  9:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 12:13:30AM -0700, Christoph Lameter wrote:
> --- linux-2.6.21-rc3-mm2.orig/mm/Kconfig	2007-03-12 22:49:21.000000000 -0700
> +++ linux-2.6.21-rc3-mm2/mm/Kconfig	2007-03-13 00:09:50.000000000 -0700
> @@ -220,3 +220,8 @@ config DEBUG_READAHEAD
>  
>  	  Say N for production servers.
>  
> +config QUICKLIST
> +	bool
> +	default y if NR_QUICK != 0
> +
> +

This doesn't work, and so CONFIG_QUICKLIST is always set. The NR_QUICK
thing seems a bit backwards anyways, perhaps it would make more sense to
have architectures set CONFIG_GENERIC_QUICKLIST in the same way that the
other GENERIC_xxx bits are defined, and then set NR_QUICK based off of
that. It's obviously going to be 2 or 1 for most people, and x86 seems to
be the only one that needs 2.

How about this?

--

diff --git a/mm/Kconfig b/mm/Kconfig
index 7942b33..2f20860 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -163,3 +163,8 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+config NR_QUICK
+	int
+	depends on GENERIC_QUICKLIST
+	default "2" if X86
+	default "1"


* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:52     ` Andrew Morton
@ 2007-03-13 11:06       ` Nick Piggin
  2007-03-13 12:15         ` Andrew Morton
  2007-03-13 23:58       ` Paul Mackerras
  1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2007-03-13 11:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-mm, linux-kernel

Andrew Morton wrote:
>>On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>Page allocator still requires interrupts to be disabled, which this doesn't.
> 
> 
> Bah.  How many cli/sti statements fit into a single cachemiss?

On a Pentium 4? ;)

Sure, that is a minor detail, considering that you'll usually be allocating
an order of magnitude or three more anon/pagecache pages than page tables.

>>Considering there isn't much else that frees known zeroed pages, I wonder if
>>it is worthwhile.
> 
> 
> If you want a zeroed page for pagecache and someone has just stuffed a
> known-zero, cache-hot page into the pagetable quicklists, you have good
> reason to be upset.

The thing is, pagetable pages are the one really good exception to the
rule that we should keep cache hot and initialise-on-demand. They
typically are fairly sparsely populated and sparsely accessed. Even
for last level page tables, I think it is reasonable to assume they will
usually be pretty cold.

And you want to allocate cache cold pages as well, for the same reasons
(you want to keep your cache hot pages for when they actually will be
used - eg. for the anon/pagecache itself).

> In fact, if you want a _non_-zeroed page and someone has just stuffed a
> known-zero, cache-hot page into the pagetable quicklists, you still have
> reason to be upset.  You *want* that cache-hot page.
> 
> Generally, all these little private lists of pages (such as the ones which
> slab had/has) are a bad deal.  Cache effects preponderate and I do think
> we're generally better off tossing the things into a central pool.

For slab I understand. And a lot of users of slab constructors were also
silly, precisely because we should initialise on demand to keep the cache
hits up.

But cold(ish?) pagetable quicklists make sense, IMO (that is, if you *must*
avoid using slab).

>>Last time the zeroidle discussion came up, IIRC there was no real performance
>>gain, just cooking the 1024 CPU threaded pagefault numbers ;)
> 
> 
> Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
> fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
> page.  But it needed too much support in core VM to bother.  Since then
> we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
> anyone having tried doing it on x86 with non-temporal stores.

You can win on specifically constructed benchmarks, easily.

But considering all the other problems you're going to introduce, we'd need
a significant win on a significant something, IMO.

You waste memory bandwidth. You also use more CPU and memory cycles
speculatively, ergo you waste more power.

-- 
SUSE Labs, Novell Inc.


* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13  8:53 ` [QUICKLIST 0/4] Arch independent quicklists V2 Andrew Morton
  2007-03-13  8:03   ` Nick Piggin
@ 2007-03-13 11:17   ` Christoph Lameter
  2007-03-13 12:27     ` Andrew Morton
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13 11:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Tue, 13 Mar 2007, Andrew Morton wrote:

> > On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > Page table pages have the characteristics that they are typically zero
> > or in a known state when they are freed.
> 
> Well if they're zero then perhaps they should be released to the page 
> allocator to satisfy the next __GFP_ZERO request.  If that request is 
> for a pagetable page, we break even (except we get to remove 
> special-case code).  If that __GFP_ZERO allocation was for some 
> application other than a pagetable, we win.

Nope that wont work.

1. We need to support page states other than zeroed.

2. Prezeroing does not make much sense if a large portion of the
   page is being used. Performance is better if the whole page 
   is zeroed directly before use. Prezeroing only makes sense for sparse
   allocations like page table pages.

> (Will require some work in the page allocator)
> (That work will open the path to using the idle thread to prezero pages)

I already tried that 3 years ago and there was *no* benefit for usual
users of the page allocator. The advantage exists only if a small
portion of the page is used. F.e. for one cacheline there was a 4x
improvement. See the lkml archives for prezeroing.




* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:15         ` Andrew Morton
@ 2007-03-13 11:20           ` Christoph Lameter
  2007-03-13 12:30             ` Andrew Morton
  2007-03-13 11:30           ` Nick Piggin
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-03-13 11:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-mm, linux-kernel

On Tue, 13 Mar 2007, Andrew Morton wrote:

> Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
> anyone having tried it properly...

Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?




* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:15         ` Andrew Morton
  2007-03-13 11:20           ` Christoph Lameter
@ 2007-03-13 11:30           ` Nick Piggin
  2007-03-13 12:47             ` Andrew Morton
  1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2007-03-13 11:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-mm, linux-kernel

Andrew Morton wrote:
>>On Tue, 13 Mar 2007 22:06:46 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>Andrew Morton wrote:
>>
>>>>On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>>...
>>
>>
>>>>Page allocator still requires interrupts to be disabled, which this doesn't.
> 
> 
>>>>it is worthwhile.
>>>
>>>
>>>If you want a zeroed page for pagecache and someone has just stuffed a
>>>known-zero, cache-hot page into the pagetable quicklists, you have good
>>>reason to be upset.
>>
>>The thing is, pagetable pages are the one really good exception to the
>>rule that we should keep cache hot and initialise-on-demand. They
>>typically are fairly sparsely populated and sparsely accessed. Even
>>for last level page tables, I think it is reasonable to assume they will
>>usually be pretty cold.
> 
> 
> eh?  I'd have thought that a pte page which has just gone through
> zap_pte_range() will very often have a _lot_ of hot cachelines, and
> that's a common case.
> 
> Still.   It's pretty easy to test.

Well I guess that would be the case if you had just unmapped a 4MB
chunk that was pretty dense with pages.

My malloc seems to allocate and free in blocks of 128K, so that's
only going to give us 3% of the last level pte page being cache hot
when it gets freed (128K covers 32 of the 1024 ptes in an i386 last
level page table). Not sure what common mmap(file) access patterns
look like.

The majority of programs I run have a smattering of llpt pages
pretty sparsely populated, covering text, libraries, heap, stack,
vdso.

We don't actually have to zap_pte_range the entire page table in
order to free it (IIRC we used to have to, before the 4lpt patches).

But yeah let's see some tests. I would definitely want to avoid this
extra layer of complexity if it is just as good to return the pages
to the pcp lists.

>>>Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
>>>fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
>>>page.  But it needed too much support in core VM to bother.  Since then
>>>we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
>>>anyone having tried doing it on x86 with non-temporal stores.
>>
>>You can win on specifically constructed benchmarks, easily.
>>
>>But considering all the other problems you're going to introduce, we'd need
>>a significant win on a significant something, IMO.
>>
>>You waste memory bandwidth. You also use more CPU and memory cycles
>>speculatively, ergo you waste more power.
> 
> 
> Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
> anyone having tried it properly...

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13  8:03   ` Nick Piggin
@ 2007-03-13 11:52     ` Andrew Morton
  2007-03-13 11:06       ` Nick Piggin
  2007-03-13 23:58       ` Paul Mackerras
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 11:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: clameter, linux-mm, linux-kernel

> On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Andrew Morton wrote:
> >>On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> >>Page table pages have the characteristics that they are typically zero
> >>or in a known state when they are freed.
> > 
> > 
> > Well if they're zero then perhaps they should be released to the page allocator
> > to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
> > page, we break even (except we get to remove special-case code).  If that
> > __GFP_ZERO allocation was or some application other than for a pagetable, we
> > win.
> > 
> > iow, can we just nuke 'em?
> 
> Page allocator still requires interrupts to be disabled, which this doesn't.

Bah.  How many cli/sti statements fit into a single cachemiss?

> Considering there isn't much else that frees known zeroed pages, I wonder if
> it is worthwhile.

If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.

In fact, if you want a _non_-zeroed page and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you still have
reason to be upset.  You *want* that cache-hot page.

Generally, all these little private lists of pages (such as the ones which
slab had/has) are a bad deal.  Cache effects preponderate and I do think
we're generally better off tossing the things into a central pool.

Plus, we can get in a situation where we take a cache-cold, known-zero page
from the pte quicklist when there is a cache-hot, non-zero page sitting in
the page allocator.  I suspect that zeroing the cache-hot page would take a
similar amount of time to a single miss against the cache-cold page.
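
(Back-of-envelope, with rough, assumed 2007-era numbers:

	zeroing a cache-hot 4 KiB page: 64 resident lines, bounded by
	    store throughput               -> a few hundred cycles
	one miss against a cache-cold line -> ~100-300 cycles

so a handful of compulsory misses on the cold page costs about as much
as zeroing the hot one outright.)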

I'm not saying that I _know_ that the quicklists are pointless, but I don't
think it's established that they are pointful.

ISTR that experiments with removing the i386 quicklists made zero
difference, but that was an awfully long time ago.  Significantly, it
predated per-cpu-pages..


> Last time the zeroidle discussion came up, IIRC it was not an actual performance
> gain, just cooking the 1024 CPU threaded pagefault numbers ;)

Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
page.  But it needed too much support in core VM to bother.  Since then
we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
anyone having tried doing it on x86 with non-temporal stores.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:47             ` Andrew Morton
@ 2007-03-13 12:01               ` Nick Piggin
  2007-03-13 13:11                 ` Andrew Morton
  2007-03-13 17:30                 ` Jeremy Fitzhardinge
  2007-03-14  1:12               ` William Lee Irwin III
  1 sibling, 2 replies; 35+ messages in thread
From: Nick Piggin @ 2007-03-13 12:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-mm, linux-kernel

Andrew Morton wrote:
>>On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>We don't actually have to zap_pte_range the entire page table in
>>order to free it (IIRC we used to have to, before the 4lpt patches).
> 
> 
> I'm trying to remember why we ever would have needed to zero out the pagetable
> pages if we're taking down the whole mm?  Maybe it's because "oh, the
> arch wants to put this page into a quicklist to recycle it", which is
> all rather circular.
> 
> It would be interesting to look at a) leave the page full of random garbage
> if we're releasing the whole mm and b) return it straight to the page allocator.

Well we have the 'fullmm' case, which avoids all the locked pte operations
(for those architectures where hardware pt walking requires atomicity).

However we still have to visit those to-be-unmapped parts of the page table,
to find the pages and free them. So we still at least need to bring it into
cache for the read... at which point, the store probably isn't a big burden.
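
For reference, roughly what that fullmm path looks like -- a sketch
from memory, and exact signatures varied between kernel versions:

	/* Exit-time teardown, loosely after the 2.6.2x exit_mmap().
	 * fullmm=1 tells the mmu_gather that no other user context
	 * remains, so per-pte atomics and per-page TLB flushes can
	 * be batched or skipped. */
	struct mmu_gather *tlb;
	unsigned long nr_accounted = 0, end;

	tlb = tlb_gather_mmu(mm, 1);	/* second arg: full mm teardown */
	end = unmap_vmas(&tlb, mm->mmap, 0, -1, &nr_accounted, NULL);
	free_pgtables(&tlb, mm->mmap, FIRST_USER_ADDRESS, 0);
	tlb_finish_mmu(tlb, 0, end);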

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:06       ` Nick Piggin
@ 2007-03-13 12:15         ` Andrew Morton
  2007-03-13 11:20           ` Christoph Lameter
  2007-03-13 11:30           ` Nick Piggin
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 12:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: clameter, linux-mm, linux-kernel

> On Tue, 13 Mar 2007 22:06:46 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Andrew Morton wrote:
> >>On Tue, 13 Mar 2007 19:03:38 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> ...
>
> >>Page allocator still requires interrupts to be disabled, which this doesn't.

> >>Considering there isn't much else that frees known zeroed pages, I wonder if
> >>it is worthwhile.
> > 
> > 
> > If you want a zeroed page for pagecache and someone has just stuffed a
> > known-zero, cache-hot page into the pagetable quicklists, you have good
> > reason to be upset.
> 
> The thing is, pagetable pages are the one really good exception to the
> rule that we should keep cache hot and initialise-on-demand. They
> typically are fairly sparsely populated and sparsely accessed. Even
> for last level page tables, I think it is reasonable to assume they will
> usually be pretty cold.

eh?  I'd have thought that a pte page which has just gone through
zap_pte_range() will very often have a _lot_ of hot cachelines, and
that's a common case.

Still.   It's pretty easy to test.

> > 
> > Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
> > fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
> > page.  But it needed too much support in core VM to bother.  Since then
> > we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
> > anyone having tried doing it on x86 with non-temporal stores.
> 
> You can win on specifically constructed benchmarks, easily.
> 
> But considering all the other problems you're going to introduce, we'd need
> a significant win on a significant something, IMO.
> 
> You waste memory bandwidth. You also use more CPU and memory cycles
> speculatively, ergo you waste more power.

Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
anyone having tried it properly...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 13:11                 ` Andrew Morton
@ 2007-03-13 12:18                   ` Nick Piggin
  0 siblings, 0 replies; 35+ messages in thread
From: Nick Piggin @ 2007-03-13 12:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-mm, linux-kernel

Andrew Morton wrote:
>>On Tue, 13 Mar 2007 23:01:11 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>Andrew Morton wrote:

>>>It would be interesting to look at a) leave the page full of random garbage
>>>if we're releasing the whole mm and b) return it straight to the page allocator.
>>
>>Well we have the 'fullmm' case, which avoids all the locked pte operations
>>(for those architectures where hardware pt walking requires atomicity).
> 
> 
> I suspect there are some tlb operations which could be skipped in that case
> too.

Depends on the tlb flush implementation. The generic one doesn't look like
it is all that smart about optimising the fullmm case. It does skip some
tlb flushing though.

>>However we still have to visit those to-be-unmapped parts of the page table
>>to find the pages and free them. So we still at least need to bring it into
>>cache for the read... at which point, the store probably isn't a big burden.
> 
> 
> It means all that data has to be written back.  Yes, I expect it'll prove
> to be less costly than the initial load.

Still, it is something we could try.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:17   ` Christoph Lameter
@ 2007-03-13 12:27     ` Andrew Morton
  2007-03-15 20:28       ` Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 12:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

> On Tue, 13 Mar 2007 04:17:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 13 Mar 2007, Andrew Morton wrote:
> 
> > > On Tue, 13 Mar 2007 00:13:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > > Page table pages have the characteristics that they are typically zero
> > > or in a known state when they are freed.
> > 
> > Well if they're zero then perhaps they should be released to the page 
> > allocator to satisfy the next __GFP_ZERO request.  If that request is 
> > for a pagetable page, we break even (except we get to remove 
> > special-case code).  If that __GFP_ZERO allocation was for some 
> > application other than a pagetable, we win.
> 
> Nope that wont work.
> 
> 1. We need to support other states of pages other than zeroed.

What does this mean?

> 2. Prezeroing does not make much sense if a large portion of the
>    page is being used. Performance is better if the whole page 
>    is zeroed directly before use.Prezeroing only makes sense for sparse
>    allocations like the page table pages.

This is not related to the above discussion.

> > (Will require some work in the page allocator)
> > (That work will open the path to using the idle thread to prezero pages)
> 
> I already tried that 3 years ago and there was *no* benefit for usual
> users of the page allocator. The advantage exists only if a small
> portion of the page is used. F.e. for one cacheline there was a 4x 
> improvement. See lkml archives for prezeroing.

Unsurprised.  Were non-temporal stores tried?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:20           ` Christoph Lameter
@ 2007-03-13 12:30             ` Andrew Morton
  2007-03-15 20:23               ` Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 12:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: nickpiggin, linux-mm, linux-kernel

> On Tue, 13 Mar 2007 04:20:48 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 13 Mar 2007, Andrew Morton wrote:
> 
> > Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
> > anyone having tried it properly...
> 
> Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?

Failed to provide us a link to it?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:30           ` Nick Piggin
@ 2007-03-13 12:47             ` Andrew Morton
  2007-03-13 12:01               ` Nick Piggin
  2007-03-14  1:12               ` William Lee Irwin III
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 12:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: clameter, linux-mm, linux-kernel

> On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> We don't actually have to zap_pte_range the entire page table in
> order to free it (IIRC we used to have to, before the 4lpt patches).

I'm trying to remember why we ever would have needed to zero out the pagetable
pages if we're taking down the whole mm?  Maybe it's because "oh, the
arch wants to put this page into a quicklist to recycle it", which is
all rather circular.

It would be interesting to look at a) leave the page full of random garbage
if we're releasing the whole mm and b) return it straight to the page allocator.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:01               ` Nick Piggin
@ 2007-03-13 13:11                 ` Andrew Morton
  2007-03-13 12:18                   ` Nick Piggin
  2007-03-13 17:30                 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2007-03-13 13:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: clameter, linux-mm, linux-kernel

> On Tue, 13 Mar 2007 23:01:11 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Andrew Morton wrote:
> >>On Tue, 13 Mar 2007 22:30:19 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >>We don't actually have to zap_pte_range the entire page table in
> >>order to free it (IIRC we used to have to, before the 4lpt patches).
> > 
> > 
> > I'm trying to remember why we ever would have needed to zero out the pagetable
> > pages if we're taking down the whole mm?  Maybe it's because "oh, the
> > arch wants to put this page into a quicklist to recycle it", which is
> > all rather circular.
> > 
> > It would be interesting to look at a) leave the page full of random garbage
> > if we're releasing the whole mm and b) return it straight to the page allocator.
> 
> Well we have the 'fullmm' case, which avoids all the locked pte operations
> (for those architectures where hardware pt walking requires atomicity).

I suspect there are some tlb operations which could be skipped in that case
too.

> However we still have to visit those to-be-unmapped parts of the page table
> to find the pages and free them. So we still at least need to bring it into
> cache for the read... at which point, the store probably isn't a big burden.

It means all that data has to be written back.  Yes, I expect it'll prove
to be less costly than the initial load.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:01               ` Nick Piggin
  2007-03-13 13:11                 ` Andrew Morton
@ 2007-03-13 17:30                 ` Jeremy Fitzhardinge
  2007-03-13 20:03                   ` Matt Mackall
  1 sibling, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 17:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, clameter, linux-mm, linux-kernel

Nick Piggin wrote:
> However we still have to visit those to-be-unmapped parts of the page
> table,
> to find the pages and free them. So we still at least need to bring it
> into
> cache for the read... at which point, the store probably isn't a big
> burden.

Why not try to find a place to stash a linklist pointer and link them
all together?  Saves the pulldown pagetable walk altogether.
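
A minimal sketch of that idea, with invented names, and assuming
page->lru is free for this while the page serves as a page table:

	/* Chain pte pages per-mm at allocation time... */
	static void pte_page_track(struct mm_struct *mm, struct page *page)
	{
		/* mm->pte_pages is hypothetical, not a real field */
		list_add(&page->lru, &mm->pte_pages);
	}

	/* ...so teardown can free them without re-walking the tree. */
	static void pte_pages_free_all(struct mm_struct *mm)
	{
		struct page *page, *next;

		list_for_each_entry_safe(page, next, &mm->pte_pages, lru)
			__free_page(page);
	}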

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 17:30                 ` Jeremy Fitzhardinge
@ 2007-03-13 20:03                   ` Matt Mackall
  2007-03-13 20:17                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Matt Mackall @ 2007-03-13 20:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Andrew Morton, clameter, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 10:30:10AM -0700, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > However we still have to visit those to-be-unmapped parts of the page
> > table,
> > to find the pages and free them. So we still at least need to bring it
> > into
> > cache for the read... at which point, the store probably isn't a big
> > burden.
> 
> Why not try to find a place to stash a linklist pointer and link them
> all together?  Saves the pulldown pagetable walk altogether.

Because we'd need one link per mm that a page is mapped in?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 20:03                   ` Matt Mackall
@ 2007-03-13 20:17                     ` Jeremy Fitzhardinge
  2007-03-13 20:21                       ` Matt Mackall
  0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 20:17 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Nick Piggin, Andrew Morton, clameter, linux-mm, linux-kernel

Matt Mackall wrote:
> On Tue, Mar 13, 2007 at 10:30:10AM -0700, Jeremy Fitzhardinge wrote:
>   
>> Nick Piggin wrote:
>>     
>>> However we still have to visit those to-be-unmapped parts of the page
>>> table,
>>> to find the pages and free them. So we still at least need to bring it
>>> into
>>> cache for the read... at which point, the store probably isn't a big
>>> burden.
>>>       
>> Why not try to find a place to stash a linklist pointer and link them
>> all together?  Saves the pulldown pagetable walk altogether.
>>     
>
> Because we'd need one link per mm that a page is mapped in?
>   

Can pagetable pages be shared between mms?  (Kernel pmds in PAE excepted.)

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 20:17                     ` Jeremy Fitzhardinge
@ 2007-03-13 20:21                       ` Matt Mackall
  2007-03-13 21:07                         ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Matt Mackall @ 2007-03-13 20:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Andrew Morton, clameter, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 01:17:00PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > On Tue, Mar 13, 2007 at 10:30:10AM -0700, Jeremy Fitzhardinge wrote:
> >   
> >> Nick Piggin wrote:
> >>     
> >>> However we still have to visit those to-be-unmapped parts of the page
> >>> table,
> >>> to find the pages and free them. So we still at least need to bring it
> >>> into
> >>> cache for the read... at which point, the store probably isn't a big
> >>> burden.
> >>>       
> >> Why not try to find a place to stash a linklist pointer and link them
> >> all together?  Saves the pulldown pagetable walk altogether.
> >>     
> >
> > Because we'd need one link per mm that a page is mapped in?
> >   
> 
> Can pagetable pages be shared between mms?  (Kernel pmds in PAE excepted.)

Ahh, I think the issue is that we have to walk the page tables to drop
the reference count of the _actual pages_ they point to. The page
tables themselves could all be put on a list or two lists (one for
PMDs, one for everything else), but that wouldn't really be a win over
just walking the tree, especially given the extra list maintenance.

Because the fan-out is large, the bulk of the work is bringing the last
layer of the tree into cache to find all the pages in the address
space. And there's really no way around that.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 20:21                       ` Matt Mackall
@ 2007-03-13 21:07                         ` David Miller
  2007-03-13 21:14                           ` Matt Mackall
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2007-03-13 21:07 UTC (permalink / raw)
  To: mpm; +Cc: jeremy, nickpiggin, akpm, clameter, linux-mm, linux-kernel

From: Matt Mackall <mpm@selenic.com>
Date: Tue, 13 Mar 2007 15:21:25 -0500

> Because the fan-out is large, the bulk of the work is bringing the last
> layer of the tree into cache to find all the pages in the address
> space. And there's really no way around that.

That's right.

And I will note that historically we used to be much worse
in this area, as we used to walk the page table tree twice
on address space teardown (once to hit the PTE entries, once
to free the page tables).

Happily it is a one-pass algorithm now.

But, within active VMA ranges, we do have to walk all
the bits at least one time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 21:07                         ` David Miller
@ 2007-03-13 21:14                           ` Matt Mackall
  2007-03-13 21:36                             ` Jeremy Fitzhardinge
  2007-03-13 21:48                             ` David Miller
  0 siblings, 2 replies; 35+ messages in thread
From: Matt Mackall @ 2007-03-13 21:14 UTC (permalink / raw)
  To: David Miller; +Cc: jeremy, nickpiggin, akpm, clameter, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 02:07:22PM -0700, David Miller wrote:
> From: Matt Mackall <mpm@selenic.com>
> Date: Tue, 13 Mar 2007 15:21:25 -0500
> 
> > Because the fan-out is large, the bulk of the work is bringing the last
> > layer of the tree into cache to find all the pages in the address
> > space. And there's really no way around that.
> 
> That's right.
> 
> And I will note that historically we used to be much worse
> in this area, as we used to walk the page table tree twice
> on address space teardown (once to hit the PTE entries, once
> to free the page tables).
> 
> Happily it is a one-pass algorithm now.
> 
> But, within active VMA ranges, we do have to walk all
> the bits at least one time.

Well you -could- do this:

- reuse a long in struct page as a used map that divides the page up
  into 32 or 64 segments
- every time you set a PTE, set the corresponding bit in the mask
- when we zap, only visit the regions set in the mask

Thus, you avoid visiting most of a PMD page in the sparse case,
assuming PTEs aren't evenly spread across the PMD.

This might not even be too horrible as the appropriate struct page
should be in cache with the appropriate bits of the mm already locked,
etc.
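
A sketch of that used map (names invented, i386-ish constants assumed);
the pte-setting path would call the first helper:

	#define PTE_SEGS	BITS_PER_LONG
	#define PTES_PER_SEG	(PTRS_PER_PTE / PTE_SEGS)

	/* record that pte slot 'idx' in this pte page was populated */
	static inline void mark_pte_seg_used(struct page *pte_page,
					     unsigned long idx)
	{
		__set_bit(idx / PTES_PER_SEG, &pte_page->private);
	}

	/* at zap time, visit only segments that ever held an entry */
	static void zap_used_segs(struct page *pte_page, pte_t *base)
	{
		int seg, i;

		for (seg = 0; seg < PTE_SEGS; seg++) {
			if (!test_bit(seg, &pte_page->private))
				continue;
			for (i = 0; i < PTES_PER_SEG; i++) {
				pte_t *pte = base + seg * PTES_PER_SEG + i;
				if (!pte_none(*pte))
					; /* drop page ref, clear *pte */
			}
		}
	}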

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 21:14                           ` Matt Mackall
@ 2007-03-13 21:36                             ` Jeremy Fitzhardinge
  2007-03-13 21:46                               ` Peter Chubb
  2007-03-13 21:48                             ` David Miller
  1 sibling, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 21:36 UTC (permalink / raw)
  To: Matt Mackall
  Cc: David Miller, nickpiggin, akpm, clameter, linux-mm, linux-kernel

Matt Mackall wrote:
> Well you -could- do this:
>
> - reuse a long in struct page as a used map that divides the page up
>   into 32 or 64 segments
> - every time you set a PTE, set the corresponding bit in the mask
> - when we zap, only visit the regions set in the mask
>
> Thus, you avoid visiting most of a PMD page in the sparse case,
> assuming PTEs aren't evenly spread across the PMD.
>
> This might not even be too horrible as the appropriate struct page
> should be in cache with the appropriate bits of the mm already locked,
> etc.
>   

And do the same in pte pages for actual mapped pages?  Or do you think
they would be too densely populated for it to be worthwhile?

    J


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 21:36                             ` Jeremy Fitzhardinge
@ 2007-03-13 21:46                               ` Peter Chubb
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Chubb @ 2007-03-13 21:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, ianw
  Cc: Matt Mackall, David Miller, nickpiggin, akpm, clameter, linux-mm,
	linux-kernel

>>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy@goop.org> writes:


Jeremy> And do the same in pte pages for actual mapped pages?  Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are.
On 32-bit platforms, they're pretty dense.  On IA64, quite a bit
sparser, depending on the workload of course.  I think that's mostly because
of the larger pagesize on IA64 -- with 64k pages, you don't need very
many to map a small object.

I'm hoping IanW can give more details.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 21:14                           ` Matt Mackall
  2007-03-13 21:36                             ` Jeremy Fitzhardinge
@ 2007-03-13 21:48                             ` David Miller
  1 sibling, 0 replies; 35+ messages in thread
From: David Miller @ 2007-03-13 21:48 UTC (permalink / raw)
  To: mpm; +Cc: jeremy, nickpiggin, akpm, clameter, linux-mm, linux-kernel

From: Matt Mackall <mpm@selenic.com>
Date: Tue, 13 Mar 2007 16:14:35 -0500

> Well you -could- do this:
> 
> - reuse a long in struct page as a used map that divides the page up
>   into 32 or 64 segments
> - every time you set a PTE, set the corresponding bit in the mask
> - when we zap, only visit the regions set in the mask
> 
> Thus, you avoid visiting most of a PMD page in the sparse case,
> assuming PTEs aren't evenly spread across the PMD.
> 
> This might not even be too horrible as the appropriate struct page
> should be in cache with the appropriate bits of the mm already locked,
> etc.

Yes, I've even had that idea before.

You can even hide it behind pmd_none() et al., the generic VM
doesn't even have to know that the page table macros are doing
this optimization.
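
Illustration only, with invented names: the mask check folds into a
pte_none()-style predicate, so the generic walker skips never-touched
segments without knowing why:

	#define sketch_pte_none(ptepage, idx, pte)			  \
		(!test_bit((idx) / PTES_PER_SEG, &(ptepage)->private) || \
		 pte_none(pte))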

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 11:52     ` Andrew Morton
  2007-03-13 11:06       ` Nick Piggin
@ 2007-03-13 23:58       ` Paul Mackerras
  1 sibling, 0 replies; 35+ messages in thread
From: Paul Mackerras @ 2007-03-13 23:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, clameter, linux-mm, linux-kernel

Andrew Morton writes:

> Plus, we can get in a situation where we take a cache-cold, known-zero page
> from the pte quicklist when there is a cache-hot, non-zero page sitting in
> the page allocator.  I suspect that zeroing the cache-hot page would take a
> similar amount of time to a single miss against the cache-cold page.

That is certainly the case on powerpc.

> I'm not saying that I _know_ that the quicklists are pointless, but I don't
> think it's established that they are pointful.

I don't see much point to them.  For powerpc, I would rather grab an
arbitrary page and zero it than get a page off a quicklist.

> Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a

My recollection was that it wasn't a win, but it was a long time ago...

Paul.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:47             ` Andrew Morton
  2007-03-13 12:01               ` Nick Piggin
@ 2007-03-14  1:12               ` William Lee Irwin III
  2007-03-15 23:12                 ` William Lee Irwin III
  1 sibling, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2007-03-14  1:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, clameter, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 04:47:56AM -0800, Andrew Morton wrote:
> I'm trying to remember why we ever would have needed to zero out the
> pagetable pages if we're taking down the whole mm?  Maybe it's
> because "oh, the arch wants to put this page into a quicklist to
> recycle it", which is all rather circular.
> It would be interesting to look at a) leave the page full of random
> garbage if we're releasing the whole mm and b) return it straight to
> the page allocator.

We never did need to modify ptes on exit() or other pagetable prunings
(not that they were ever done outside exit() before 2.6.x). The only
subtlety is that pruning on munmap() needs a TLB flush, so that the TLB
itself drops its references to the pages the PTEs pointed to, in the
presence of hardware pagetable walkers (in the exit() case there are no
user execution contexts left to use the dead translations, so it's less
important). That's handled by tlb_remove_page() and shouldn't need any
updates across such a change.

I believe the zeroing on teardown was largely a result of idiom vs.
any particular need. Essentially using ptep_get_and_clear() to handle
the non-pruning munmap() case in a manner unified with other pagetable
teardowns. It is also likely 2.4.x legacy, from when that and possibly
earlier kernels maintained arch-private quicklists for pagetables.

There are furthermore distinctions to make between fork() and execve().
fork() stomps over the entire process address space copying pagetables
en masse. After execve() a process incrementally faults in PTE's one at
a time. It should be clear that if case analyses are of interest at
all, fork() will want cache-hot pages (cache-preloaded pages?) where
such are largely wasted on incremental faults after execve(). The copy
operations in fork() should probably also be examined in the context of
shared pagetables at some point.


-- wli

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:30             ` Andrew Morton
@ 2007-03-15 20:23               ` Christoph Lameter
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-15 20:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, linux-mm, linux-kernel

On Tue, 13 Mar 2007, Andrew Morton wrote:

> > On Tue, 13 Mar 2007 04:20:48 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > On Tue, 13 Mar 2007, Andrew Morton wrote:
> > 
> > > Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
> > > anyone having tried it properly...
> > 
> > Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?
> 
> Failed to provide us a link to it?

You merged part of it and were involved in the discussions.

General overviews:

http://lwn.net/Articles/117881/
http://lwn.net/Articles/128225/

The details on the problems with prezeroing, and with touching multiple
cachelines of the page, are here:

http://www.gelato.unsw.edu.au/archives/linux-ia64/0412/12252.html


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-13 12:27     ` Andrew Morton
@ 2007-03-15 20:28       ` Christoph Lameter
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-15 20:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Tue, 13 Mar 2007, Andrew Morton wrote:

> > 1. We need to support other states of pages other than zeroed.
> 
> What does this mean?

pgds are not completely zeroed. They contain mappings that are always 
present, so their known state is not the zeroed state.
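
For illustration, roughly what the i386 non-PAE pgd constructor did --
a sketch from memory, details may differ:

	static void pgd_ctor_sketch(pgd_t *pgd)
	{
		/* user half clear, kernel half shared with
		 * swapper_pg_dir: a definite, but non-zero, state */
		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
		memcpy(pgd + USER_PTRS_PER_PGD,
		       swapper_pg_dir + USER_PTRS_PER_PGD,
		       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
	}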

> > 2. Prezeroing does not make much sense if a large portion of the
> >    page is being used. Performance is better if the whole page 
> >    is zeroed directly before use. Prezeroing only makes sense for sparse
> >    allocations like the page table pages.
> 
> This is not related to the above discussion.

Really? I definitely see the word prezeroing in the discussion.

> > I already tried that 3 years ago and there was *no* benefit for usual
> > users of the page allocator. The advantage exists only if a small
> > portion of the page is used. F.e. for one cacheline there was a 4x 
> > improvement. See lkml archives for prezeroing.
> 
> Unsurprised.  Were non-temporal stores tried?

Yes, with no material change. The work led to making ia64 use
non-temporal stores for spin unlock, but it was not useful for prezeroing.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 1/4] Generic quicklist implementation
  2007-03-13  9:05   ` Paul Mundt
@ 2007-03-15 20:51     ` Christoph Lameter
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-03-15 20:51 UTC (permalink / raw)
  To: Paul Mundt; +Cc: akpm, linux-mm, linux-kernel

On Tue, 13 Mar 2007, Paul Mundt wrote:

> This doesn't work, and so CONFIG_QUICKLIST is always set. The NR_QUICK
> thing seems a bit backwards anyways, perhaps it would make more sense to
> have architectures set CONFIG_GENERIC_QUICKLIST in the same way that the
> other GENERIC_xxx bits are defined, and then set NR_QUICK based off of
> that. It's obviously going to be 2 or 1 for most people, and x86 seems to
> be the only one that needs 2.

Both i386 and x86_64 currently need 2, and if other arches start using 
quicklists then they would have the same issue. There may be other 
cases in the future where more quicklists are useful. So I think this 
is too inflexible.

> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7942b33..2f20860 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -163,3 +163,8 @@ config ZONE_DMA_FLAG
>  	default "0" if !ZONE_DMA
>  	default "1"
>  
> +config NR_QUICK
> +	int
> +	depends on GENERIC_QUICKLIST
> +	default "2" if X86
> +	default "1"
> 

Is there a way of checking if a CONFIG_xxx is set to any value?

Then we could do

config QUICKLISTS
	depends on defined(NR_QUICK)

Alternatively we could replace #ifdef CONFIG_QUICKLISTS with
#ifdef CONFIG_NR_QUICK ?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [QUICKLIST 0/4] Arch independent quicklists V2
  2007-03-14  1:12               ` William Lee Irwin III
@ 2007-03-15 23:12                 ` William Lee Irwin III
  0 siblings, 0 replies; 35+ messages in thread
From: William Lee Irwin III @ 2007-03-15 23:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, clameter, linux-mm, linux-kernel

On Tue, Mar 13, 2007 at 06:12:44PM -0700, William Lee Irwin III wrote:
> There are furthermore distinctions to make between fork() and execve().
> fork() stomps over the entire process address space copying pagetables
> en masse. After execve() a process incrementally faults in PTE's one at
> a time. It should be clear that if case analyses are of interest at
> all, fork() will want cache-hot pages (cache-preloaded pages?) where
> such are largely wasted on incremental faults after execve(). The copy
> operations in fork() should probably also be examined in the context of
> shared pagetables at some point.

To make this perfectly clear, we can deal with the varying usage cases
with hot/cold flags to the pagetable allocator functions. Where bulk
copies such as fork() are happening, it makes perfect sense to
precharge the cache by eager zeroing. Where sparse single pte affairs
such as incrementally faulting things in after execve() are involved,
cache cold preconstructed pagetable pages are ideal. Address hints
could furthermore be used to precharge single cachelines (e.g. via
prefetch) in the sparse usage case.
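
A hypothetical interface sketch for that split; none of these
functions exist and the names are invented:

	/* fork(): bulk copy follows, eager zeroing preloads the cache */
	pte_t *pte_alloc_one_hot(struct mm_struct *mm, unsigned long addr);

	/* post-execve faults: prefer a cold, preconstructed page,
	 * optionally prefetching just the cacheline 'addr' lands in */
	pte_t *pte_alloc_one_cold(struct mm_struct *mm, unsigned long addr);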


-- wli

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2007-03-15 23:12 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-13  7:13 [QUICKLIST 0/4] Arch independent quicklists V2 Christoph Lameter
2007-03-13  7:13 ` [QUICKLIST 1/4] Generic quicklist implementation Christoph Lameter
2007-03-13  9:05   ` Paul Mundt
2007-03-15 20:51     ` Christoph Lameter
2007-03-13  7:13 ` [QUICKLIST 2/4] Quicklist support for i386 Christoph Lameter
2007-03-13  7:13 ` [QUICKLIST 3/4] Quicklist support for x86_64 Christoph Lameter
2007-03-13  7:13 ` [QUICKLIST 4/4] Quicklist support for sparc64 Christoph Lameter
2007-03-13  8:53 ` [QUICKLIST 0/4] Arch independent quicklists V2 Andrew Morton
2007-03-13  8:03   ` Nick Piggin
2007-03-13 11:52     ` Andrew Morton
2007-03-13 11:06       ` Nick Piggin
2007-03-13 12:15         ` Andrew Morton
2007-03-13 11:20           ` Christoph Lameter
2007-03-13 12:30             ` Andrew Morton
2007-03-15 20:23               ` Christoph Lameter
2007-03-13 11:30           ` Nick Piggin
2007-03-13 12:47             ` Andrew Morton
2007-03-13 12:01               ` Nick Piggin
2007-03-13 13:11                 ` Andrew Morton
2007-03-13 12:18                   ` Nick Piggin
2007-03-13 17:30                 ` Jeremy Fitzhardinge
2007-03-13 20:03                   ` Matt Mackall
2007-03-13 20:17                     ` Jeremy Fitzhardinge
2007-03-13 20:21                       ` Matt Mackall
2007-03-13 21:07                         ` David Miller
2007-03-13 21:14                           ` Matt Mackall
2007-03-13 21:36                             ` Jeremy Fitzhardinge
2007-03-13 21:46                               ` Peter Chubb
2007-03-13 21:48                             ` David Miller
2007-03-14  1:12               ` William Lee Irwin III
2007-03-15 23:12                 ` William Lee Irwin III
2007-03-13 23:58       ` Paul Mackerras
2007-03-13 11:17   ` Christoph Lameter
2007-03-13 12:27     ` Andrew Morton
2007-03-15 20:28       ` Christoph Lameter
