* [00/14] Virtual Compound Page Support V3
@ 2008-03-21  6:17 ` Christoph Lameter
  0 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Allocations of larger pages are not reliable in Linux. If larger
pages must be allocated, one faces a choice between implementing
graceful fallback by hand or using vmalloc, which carries a
performance penalty due to the use of page tables. Virtual Compound
pages are a simple way out of this dilemma.

A virtual compound allocation first attempts to satisfy the request
with physically contiguous memory. If that is not possible, a
virtually contiguous mapping is created instead.

This has two advantages:

1. Current uses of vmalloc can be converted to allocate virtual
   compounds instead. In most cases physically contiguous
   memory can then be used, which avoids the vmalloc performance
   penalty. See, for example, the e1000 driver patch.

2. Uses of higher order allocations (stacks, buffers etc) can be
   converted to use virtual compounds instead. Physically contiguous
   memory will still be used for those higher order allocs in general
   but the system can degrade to the use of vmalloc should memory
   become heavily fragmented.

There is a compile time option to switch on fallback for
testing purposes. Virtually mapped memory may behave differently,
and the CONFIG_FALLBACK_ALWAYS option ensures that the code is
tested against virtually mapped memory.
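The fallback behavior described above can be sketched in plain userspace C. This is only an illustration of the allocation pattern, not kernel code; the names `try_alloc_contiguous()`, `alloc_fallback()` and `alloc_vcompound_sketch()` are invented for this sketch:

```c
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Stand-in for a higher-order page allocation that may fail under
 * fragmentation; forced to fail here to exercise the fallback path
 * (this is what CONFIG_FALLBACK_ALWAYS does for the real code). */
static void *try_alloc_contiguous(int order)
{
	(void)order;
	return NULL;	/* simulate a fragmented page allocator */
}

/* Fallback: allocate 2^order separate order-0 "pages" plus an array
 * tracking them, mirroring what a vmap()-based fallback must keep. */
static void **alloc_fallback(int order, int *nr_pages)
{
	int n = 1 << order;
	void **pages = malloc(n * sizeof(*pages));
	int i;

	if (!pages)
		return NULL;
	for (i = 0; i < n; i++)
		pages[i] = malloc(PAGE_SIZE);
	*nr_pages = n;
	return pages;
}

void *alloc_vcompound_sketch(int order, void ***pagesp, int *nr_pages)
{
	void *addr = try_alloc_contiguous(order);

	if (addr) {		/* common case: physically contiguous */
		*pagesp = NULL;
		*nr_pages = 0;
		return addr;
	}
	/* degraded case: virtually contiguous out of order-0 pages */
	*pagesp = alloc_fallback(order, nr_pages);
	return *pagesp ? (*pagesp)[0] : NULL;
}
```

The caller sees one allocation either way; only the bookkeeping (a page array in the fallback case) differs.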

V2->V3:
- Put the code into mm/vmalloc.c and leave the page allocator alone.
- Add a series of examples where virtual compound pages can be used.
- Diffed on top of the page flags and the vmalloc info patches
  already in mm.
- Simplify things by omitting some of the more complex code
  that used to be in there.

V1->V2:
- Remove some cleanup patches and the SLUB patches from this set.
- Transparent vcompound support through page_address() and
  virt_to_head_page().
- Additional use cases.
- Factor the code better for easier reading.
- Add configurable stack size.
- Follow up on various suggestions made for V1.

RFC->V1:
- Complete support for all compound functions for virtual compound pages
  (including the compound_nth_page() necessary for LBS mmap support)
- Fix various bugs
- Fix i386 build

-- 


* [01/14] vcompound: Return page array on vunmap
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0001-vcompound-Return-page-array-on-vunmap.patch --]
[-- Type: text/plain, Size: 4342 bytes --]

Make vunmap return the page array that was used at vmap. This is useful
if one has no structures to track the page array but simply stores the
virtual address somewhere. The disposition of the page array can be
decided upon after vunmap. vfree() may now also be used instead of
vunmap which will release the page array after vunmap'ping it.

As noted by Kamezawa: the same subsystem that provides the page array
to vmap must use its own method to dispose of the page array.

If vfree() is called to free the page array then the page array must
have been either

1. Allocated via the slab allocator, or

2. Allocated via vmalloc, in which case VM_VPAGES must have been
   passed at vmap time to specify that a vfree() is needed.
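The changed calling convention can be sketched in userspace C. The names `sketch_vmap()`/`sketch_vunmap()` and the `struct area` layout are illustrative, not the kernel API; the point is only that the area remembers the page array and unmap hands it back:

```c
#include <stdlib.h>

/* Minimal model of a mapped area: the array of backing pages is
 * remembered at map time, as the patched map_vm_area() now does. */
struct area {
	void *addr;
	void **pages;
	int nr_pages;
};

static struct area *sketch_vmap(void **pages, int count)
{
	struct area *a = malloc(sizeof(*a));

	a->pages = pages;	/* remember the caller's page array */
	a->nr_pages = count;
	a->addr = pages[0];	/* stand-in for the new virtual address */
	return a;
}

/* Like the patched vunmap(): tear down the area but return the page
 * array instead of losing track of it, so the caller can decide how
 * to dispose of it afterwards. */
static void **sketch_vunmap(struct area *a)
{
	void **pages = a->pages;

	free(a);
	return pages;
}
```

Before this patch the equivalent of `sketch_vunmap()` returned void, so a caller that kept only the virtual address had no way to recover the pages.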

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/vmalloc.h |    2 +-
 mm/vmalloc.c            |   32 ++++++++++++++++++++++----------
 2 files changed, 23 insertions(+), 11 deletions(-)

Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-18 12:20:12.295837331 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-19 18:17:42.093443900 -0700
@@ -50,7 +50,7 @@ extern void vfree(const void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
-extern void vunmap(const void *addr);
+extern struct page **vunmap(const void *addr);
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-18 13:48:56.344025498 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-19 18:17:42.125444187 -0700
@@ -153,6 +153,7 @@ int map_vm_area(struct vm_struct *area, 
 	unsigned long addr = (unsigned long) area->addr;
 	unsigned long end = addr + area->size - PAGE_SIZE;
 	int err;
+	area->pages = *pages;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
@@ -163,6 +164,8 @@ int map_vm_area(struct vm_struct *area, 
 			break;
 	} while (pgd++, addr = next, addr != end);
 	flush_cache_vmap((unsigned long) area->addr, end);
+
+	area->nr_pages = *pages - area->pages;
 	return err;
 }
 EXPORT_SYMBOL_GPL(map_vm_area);
@@ -372,17 +375,18 @@ struct vm_struct *remove_vm_area(const v
 	return v;
 }
 
-static void __vunmap(const void *addr, int deallocate_pages)
+static struct page **__vunmap(const void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
+	struct page **pages;
 
 	if (!addr)
-		return;
+		return NULL;
 
 	if ((PAGE_SIZE-1) & (unsigned long)addr) {
 		printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr);
 		WARN_ON(1);
-		return;
+		return NULL;
 	}
 
 	area = remove_vm_area(addr);
@@ -390,29 +394,30 @@ static void __vunmap(const void *addr, i
 		printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
 				addr);
 		WARN_ON(1);
-		return;
+		return NULL;
 	}
 
+	pages = area->pages;
 	debug_check_no_locks_freed(addr, area->size);
 
 	if (deallocate_pages) {
 		int i;
 
 		for (i = 0; i < area->nr_pages; i++) {
-			struct page *page = area->pages[i];
+			struct page *page = pages[i];
 
 			BUG_ON(!page);
 			__free_page(page);
 		}
 
 		if (area->flags & VM_VPAGES)
-			vfree(area->pages);
+			vfree(pages);
 		else
-			kfree(area->pages);
+			kfree(pages);
 	}
 
 	kfree(area);
-	return;
+	return pages;
 }
 
 /**
@@ -441,10 +446,10 @@ EXPORT_SYMBOL(vfree);
  *
  *	Must not be called in interrupt context.
  */
-void vunmap(const void *addr)
+struct page **vunmap(const void *addr)
 {
 	BUG_ON(in_interrupt());
-	__vunmap(addr, 0);
+	return __vunmap(addr, 0);
 }
 EXPORT_SYMBOL(vunmap);
 
@@ -457,6 +462,13 @@ EXPORT_SYMBOL(vunmap);
  *
  *	Maps @count pages from @pages into contiguous kernel virtual
  *	space.
+ *
+ *	The page array may be freed via vfree() on the virtual address
+ *	returned. In that case the page array must be allocated via
+ *	the slab allocator. If the page array was allocated via
+ *	vmalloc then VM_VPAGES must be specified in the flags. There is
+ *	no support for vfree() to free a page array allocated via the
+ *	page allocator.
  */
 void *vmap(struct page **pages, unsigned int count,
 		unsigned long flags, pgprot_t prot)

-- 


* [02/14] vcompound: pageflags: Add PageVcompound()
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0004-vcompound-Pageflags-Add-PageVcompound.patch --]
[-- Type: text/plain, Size: 1555 bytes --]

Add a way to mark a compound page as virtually mapped. The mark is
necessary because, when freeing pages, we have to know whether a
virtual mapping must be destroyed.

No additional flag bit is needed. We use PG_swapcache together with
PG_compound (similar to PageHead() and PageTail()) to signal that a
compound page is virtually mapped. PG_swapcache is safe to reuse here
because compound pages cannot be put onto the LRU (yet) and therefore
can never be in the swap cache.
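The flag combination can be modeled in plain C. The bit positions below are illustrative only; the real ones come from the kernel's page-flags.h:

```c
/* Illustrative bit positions; not the kernel's actual values. */
enum { PG_compound = 14, PG_swapcache = 15 };

#define PG_VCOMPOUND_MASK ((1UL << PG_compound) | (1UL << PG_swapcache))

/* Both bits must be set: PG_compound alone is an ordinary compound
 * page, PG_swapcache alone is a swap-cache page. */
static int page_vcompound(unsigned long flags)
{
	return (flags & PG_VCOMPOUND_MASK) == PG_VCOMPOUND_MASK;
}

static unsigned long set_vcompound(unsigned long flags)
{
	return flags | PG_VCOMPOUND_MASK;
}

static unsigned long clear_vcompound(unsigned long flags)
{
	return flags & ~PG_VCOMPOUND_MASK;
}
```

Note that setting the combination also sets PG_compound, which is why `__SetPageVcompound()` in the later core patch makes `PageHead(page)` true as a side effect.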

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/page-flags.h |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Index: linux-2.6.25-rc5-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/page-flags.h	2008-03-20 17:40:16.141487362 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/page-flags.h	2008-03-20 17:41:58.768233703 -0700
@@ -196,6 +196,24 @@ static inline int PageHighMem(struct pag
 }
 #endif
 
+/*
+ * PG_swapcache is used in combination with PG_compound to indicate
+ * that a compound page was allocated via vmalloc.
+ */
+#define PG_vcompound_mask ((1L << PG_compound) | (1L << PG_swapcache))
+#define PageVcompound(page)	((page->flags & PG_vcompound_mask) \
+					== PG_vcompound_mask)
+
+static inline void __SetPageVcompound(struct page *page)
+{
+	page->flags |= PG_vcompound_mask;
+}
+
+static inline void __ClearPageVcompound(struct page *page)
+{
+	page->flags &= ~PG_vcompound_mask;
+}
+
 #ifdef CONFIG_SWAP
 PAGEFLAG(SwapCache, swapcache)
 #else

-- 


* [03/14] vmallocinfo: Support display of vcompound for a virtual compound page
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: vcompound_vmalloc_type --]
[-- Type: text/plain, Size: 1383 bytes --]

Add another flag to the vmalloc subsystem to mark virtual compound pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/vmalloc.h |    1 +
 mm/vmalloc.c            |    3 +++
 2 files changed, 4 insertions(+)

Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-19 18:17:42.093443900 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-19 18:27:20.150422445 -0700
@@ -12,6 +12,7 @@ struct vm_area_struct;
 #define VM_MAP		0x00000004	/* vmap()ed pages */
 #define VM_USERMAP	0x00000008	/* suitable for remap_vmalloc_range */
 #define VM_VPAGES	0x00000010	/* buffer for pages was vmalloc'ed */
+#define VM_VCOMPOUND	0x00000020	/* Page allocator fallback */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-19 18:18:02.689633934 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-19 18:27:20.150422445 -0700
@@ -974,6 +974,9 @@ static int s_show(struct seq_file *m, vo
 	if (v->flags & VM_VPAGES)
 		seq_printf(m, " vpages");
 
+	if (v->flags & VM_VCOMPOUND)
+		seq_printf(m, " vcompound");
+
 	seq_putc(m, '\n');
 	return 0;
 }

-- 


* [04/14] vcompound: Core piece
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: newcore --]
[-- Type: text/plain, Size: 7091 bytes --]

Add support functions to allow the creation and destruction of virtual
compound pages. Virtual compound pages are similar to compound pages in
that if PageTail(page) is true then page->first_page points to the head
page.

	vcompound_head_page(address)

(similar to virt_to_head_page) can be used to determine the head page
from an address.

Another similarity to compound pages is that page[1].lru.prev contains
the order of the virtual compound page. However, the page structs of a
virtual compound page are not in order. So page[1] means the second page
belonging to the virtual compound mapping, which is not necessarily the
page following the head page.

Freeing of virtual compound pages is supported from both preemptible
and non-preemptible contexts (freeing requires a preemptible context,
so we simply defer the free if the caller is not preemptible).

However, allocation of virtual compound pages must at this stage be done
from preemptible contexts only (there are patches to implement
allocation from atomic contexts, but those are unnecessary at this early
stage).
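The compound-style bookkeeping described above can be modeled in userspace C. The field names only approximate the kernel's struct page, and the sketch assumes order > 0, as in the patch, where order-0 requests never reach the fallback path:

```c
/* Toy stand-in for struct page, carrying just the two fields the
 * vcompound convention uses. */
struct spage {
	struct spage *first_page;	/* tail -> head pointer */
	unsigned long order;		/* meaningful only on pages[1] */
};

/* Format an array of (possibly non-contiguous) pages as a virtual
 * compound: every page points at the head, the second page records
 * the order. Assumes order > 0 so pages[1] exists. */
static void format_vcompound(struct spage **pages, int order)
{
	int i, n = 1 << order;

	pages[1]->order = (unsigned long)order;
	for (i = 0; i < n; i++)
		pages[i]->first_page = pages[0];
}

static struct spage *head_page(struct spage *p)
{
	return p->first_page;
}

static int vcompound_order(struct spage **pages)
{
	return (int)pages[1]->order;
}
```

Because the pages are reached through the array rather than by pointer arithmetic from the head, the scheme works even though the page structs are not physically consecutive.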

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/vmalloc.h |   14 +++
 mm/vmalloc.c            |  197 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)

Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-20 23:03:14.600588151 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-20 23:03:14.612588010 -0700
@@ -86,6 +86,20 @@ extern struct vm_struct *alloc_vm_area(s
 extern void free_vm_area(struct vm_struct *area);
 
 /*
+ * Support for virtual compound pages.
+ *
+ * Calls to vcompound alloc will result in the allocation of normal compound
+ * pages unless memory is fragmented.  If insufficient physical linear memory
+ * is available then a virtually contiguous area of memory will be created
+ * using the vmalloc functionality.
+ */
+struct page *alloc_vcompound(gfp_t flags, int order);
+void free_vcompound(struct page *);
+void *__alloc_vcompound(gfp_t flags, int order);
+void __free_vcompound(void *addr);
+struct page *vcompound_head_page(const void *x);
+
+/*
  *	Internals.  Dont't use..
  */
 extern rwlock_t vmlist_lock;
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-20 23:03:14.600588151 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-20 23:06:43.703428350 -0700
@@ -989,3 +989,200 @@ const struct seq_operations vmalloc_op =
 };
 #endif
 
+/*
+ * Virtual Compound Page support.
+ *
+ * Virtual Compound Pages are used to fall back to order 0 allocations if large
+ * linear mappings are not available. They are formatted according to compound
+ * page conventions. I.e. following page->first_page if PageTail(page) is set
+ * can be used to determine the head page.
+ */
+
+/*
+ * Determine the appropriate page struct given a virtual address
+ * (including vmalloced areas).
+ *
+ * Return the head page if this is a compound page.
+ *
+ * Cannot be inlined since VMALLOC_START and VMALLOC_END may contain
+ * complex calculations that depend on multiple arch includes or
+ * even variables.
+ */
+struct page *vcompound_head_page(const void *x)
+{
+	unsigned long addr = (unsigned long)x;
+	struct page *page;
+
+	if (unlikely(is_vmalloc_addr(x)))
+		page = vmalloc_to_page(x);
+	else
+		page = virt_to_page(addr);
+
+	return compound_head(page);
+}
+
+static void __vcompound_free(void *addr)
+{
+
+	struct page **pages;
+	int i;
+	int order;
+
+	pages = vunmap(addr);
+	order = (unsigned long)pages[1]->lru.prev;
+
+	/*
+	 * First page will have zero refcount since it maintains state
+	 * for the compound and was decremented before we got here.
+	 */
+	set_page_address(pages[0], NULL);
+	__ClearPageVcompound(pages[0]);
+	free_hot_page(pages[0]);
+
+	for (i = 1; i < (1 << order); i++) {
+		struct page *page = pages[i];
+		BUG_ON(!PageTail(page) || !PageVcompound(page));
+		set_page_address(page, NULL);
+		__ClearPageVcompound(page);
+		__free_page(page);
+	}
+	kfree(pages);
+}
+
+static void vcompound_free_work(struct work_struct *w)
+{
+	__vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr, struct page *page)
+{
+	struct work_struct *w = addr;
+
+	BUG_ON((!PageVcompound(page) || !PageHead(page)));
+
+	if (!put_page_testzero(page))
+		return;
+
+	if (!preemptible()) {
+		/*
+		 * Need to defer the free until we are in
+		 * a preemptible context.
+		 */
+		INIT_WORK(w, vcompound_free_work);
+		schedule_work(w);
+	} else
+		__vcompound_free(addr);
+}
+
+
+void __free_vcompound(void *addr)
+{
+	struct page *page;
+
+	if (unlikely(is_vmalloc_addr(addr)))
+		vcompound_free(addr, vmalloc_to_page(addr));
+	else {
+		page = virt_to_page(addr);
+		free_pages((unsigned long)addr, compound_order(page));
+	}
+}
+
+void free_vcompound(struct page *page)
+{
+	if (unlikely(PageVcompound(page)))
+		vcompound_free(page_address(page), page);
+	else
+		__free_pages(page, compound_order(page));
+}
+
+static struct vm_struct *____alloc_vcompound(gfp_t gfp_mask, unsigned long order,
+								void *caller)
+{
+	struct page *page;
+	int i;
+	struct vm_struct *vm;
+	int nr_pages = 1 << order;
+	struct page **pages = kmalloc(nr_pages * sizeof(struct page *),
+						gfp_mask & GFP_RECLAIM_MASK);
+	struct page **pages2;
+
+	if (!pages)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_page(gfp_mask);
+		if (!page)
+			goto abort;
+
+		/* Sets PageCompound which makes PageHead(page) true */
+		__SetPageVcompound(page);
+		pages[i] = page;
+	}
+
+	vm = __get_vm_area_node(nr_pages << PAGE_SHIFT, VM_VCOMPOUND,
+		VMALLOC_START, VMALLOC_END, -1, gfp_mask, caller);
+
+	if (!vm)
+		goto abort;
+
+	vm->caller = caller;
+	pages2 = pages;
+	if (map_vm_area(vm, PAGE_KERNEL, &pages2))
+		goto abort;
+
+	pages[1]->lru.prev = (void *)order;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pages[i];
+
+		__SetPageTail(page);
+		page->first_page = pages[0];
+		set_page_address(page, vm->addr + (i << PAGE_SHIFT));
+	}
+	return vm;
+
+abort:
+	while (i-- > 0) {
+		page = pages[i];
+		if (!page)
+			continue;
+		set_page_address(page, NULL);
+		__ClearPageVcompound(page);
+		__free_page(page);
+	}
+	kfree(pages);
+	return NULL;
+}
+
+struct page *alloc_vcompound(gfp_t flags, int order)
+{
+	struct vm_struct *vm;
+	struct page *page;
+
+	page = alloc_pages(flags | __GFP_NORETRY | __GFP_NOWARN, order);
+	if (page || !order)
+		return page;
+
+	vm = ____alloc_vcompound(flags, order, __builtin_return_address(0));
+	if (vm)
+		return vm->pages[0];
+
+	return NULL;
+}
+
+void *__alloc_vcompound(gfp_t flags, int order)
+{
+	struct vm_struct *vm;
+	void *addr;
+
+	addr = (void *)__get_free_pages(flags | __GFP_NORETRY | __GFP_NOWARN,
+								order);
+	if (addr || !order)
+		return addr;
+
+	vm = ____alloc_vcompound(flags, order, __builtin_return_address(0));
+	if (vm)
+		return vm->addr;
+
+	return NULL;
+}

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [04/14] vcompound: Core piece
@ 2008-03-21  6:17   ` Christoph Lameter
  0 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: newcore --]
[-- Type: text/plain, Size: 7317 bytes --]

Add support functions to allow the creation and destruction of virtual compound
pages. Virtual compound pages are similar to compound pages in that if
PageTail(page) is true then page->first points to the first page.

	vcompound_head_page(address)

(similar to virt_to_head_page) can be used to determine the head page from an
address.

Another similarity to compound pages is that page[1].lru.next contains the
order of the virtual compound page. However, the page structs of virtual
compound pages are not in order. So page[1] means the second page belonging
to the virtual compound mapping which is not necessarily the page following
the head page.

Freeing of virtual compound pages is support both from preemptible and
non preemptible context (freeing requires a preemptible context, we simply
defer free if we are not in a preemptible context).

However, allocation of virtual compound pages must at this stage be done from
preemptible contexts only (there are patches to implement allocations from
atomic context but those are unecessary at this early stage).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/vmalloc.h |   14 +++
 mm/vmalloc.c            |  197 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)

Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-20 23:03:14.600588151 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-20 23:03:14.612588010 -0700
@@ -86,6 +86,20 @@ extern struct vm_struct *alloc_vm_area(s
 extern void free_vm_area(struct vm_struct *area);
 
 /*
+ * Support for virtual compound pages.
+ *
+ * Calls to vcompound alloc will result in the allocation of normal compound
+ * pages unless memory is fragmented.  If insufficient physical linear memory
+ * is available then a virtually contiguous area of memory will be created
+ * using the vmalloc functionality.
+ */
+struct page *alloc_vcompound(gfp_t flags, int order);
+void free_vcompound(struct page *);
+void *__alloc_vcompound(gfp_t flags, int order);
+void __free_vcompound(void *addr);
+struct page *vcompound_head_page(const void *x);
+
+/*
  *	Internals.  Dont't use..
  */
 extern rwlock_t vmlist_lock;
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-20 23:03:14.600588151 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-20 23:06:43.703428350 -0700
@@ -989,3 +989,200 @@ const struct seq_operations vmalloc_op =
 };
 #endif
 
+/*
+ * Virtual Compound Page support.
+ *
+ * Virtual Compound Pages are used to fall back to order 0 allocations if large
+ * linear mappings are not available. They are formatted according to compound
+ * page conventions. I.e. following page->first_page if PageTail(page) is set
+ * can be used to determine the head page.
+ */
+
+/*
+ * Determine the appropriate page struct given a virtual address
+ * (including vmalloced areas).
+ *
+ * Return the head page if this is a compound page.
+ *
+ * Cannot be inlined since VMALLOC_START and VMALLOC_END may contain
+ * complex calculations that depend on multiple arch includes or
+ * even variables.
+ */
+struct page *vcompound_head_page(const void *x)
+{
+	unsigned long addr = (unsigned long)x;
+	struct page *page;
+
+	if (unlikely(is_vmalloc_addr(x)))
+		page = vmalloc_to_page(x);
+	else
+		page = virt_to_page(addr);
+
+	return compound_head(page);
+}
+
+static void __vcompound_free(void *addr)
+{
+
+	struct page **pages;
+	int i;
+	int order;
+
+	pages = vunmap(addr);
+	order = (unsigned long)pages[1]->lru.prev;
+
+	/*
+	 * First page will have zero refcount since it maintains state
+	 * for the compound and was decremented before we got here.
+	 */
+	set_page_address(pages[0], NULL);
+	__ClearPageVcompound(pages[0]);
+	free_hot_page(pages[0]);
+
+	for (i = 1; i < (1 << order); i++) {
+		struct page *page = pages[i];
+		BUG_ON(!PageTail(page) || !PageVcompound(page));
+		set_page_address(page, NULL);
+		__ClearPageVcompound(page);
+		__free_page(page);
+	}
+	kfree(pages);
+}
+
+static void vcompound_free_work(struct work_struct *w)
+{
+	__vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr, struct page *page)
+{
+	struct work_struct *w = addr;
+
+	BUG_ON((!PageVcompound(page) || !PageHead(page)));
+
+	if (!put_page_testzero(page))
+		return;
+
+	if (!preemptible()) {
+		/*
+		 * Need to defer the free until we are in
+		 * a preemptible context.
+		 */
+		INIT_WORK(w, vcompound_free_work);
+		schedule_work(w);
+	} else
+		__vcompound_free(addr);
+}
+
+
+void __free_vcompound(void *addr)
+{
+	struct page *page;
+
+	if (unlikely(is_vmalloc_addr(addr)))
+		vcompound_free(addr, vmalloc_to_page(addr));
+	else {
+		page = virt_to_page(addr);
+		free_pages((unsigned long)addr, compound_order(page));
+	}
+}
+
+void free_vcompound(struct page *page)
+{
+	if (unlikely(PageVcompound(page)))
+		vcompound_free(page_address(page), page);
+	else
+		__free_pages(page, compound_order(page));
+}
+
+static struct vm_struct *____alloc_vcompound(gfp_t gfp_mask, unsigned long order,
+								void *caller)
+{
+	struct page *page;
+	int i;
+	struct vm_struct *vm;
+	int nr_pages = 1 << order;
+	struct page **pages = kmalloc(nr_pages * sizeof(struct page *),
+						gfp_mask & GFP_RECLAIM_MASK);
+	struct page **pages2;
+
+	if (!pages)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_page(gfp_mask);
+		if (!page)
+			goto abort;
+
+		/* Sets PageCompound which makes PageHead(page) true */
+		__SetPageVcompound(page);
+		pages[i] = page;
+	}
+
+	vm = __get_vm_area_node(nr_pages << PAGE_SHIFT, VM_VCOMPOUND,
+		VMALLOC_START, VMALLOC_END, -1, gfp_mask, caller);
+
+	if (!vm)
+		goto abort;
+
+	vm->caller = caller;
+	pages2 = pages;
+	if (map_vm_area(vm, PAGE_KERNEL, &pages2))
+		goto abort;
+
+	pages[1]->lru.prev = (void *)order;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pages[i];
+
+		__SetPageTail(page);
+		page->first_page = pages[0];
+		set_page_address(page, vm->addr + (i << PAGE_SHIFT));
+	}
+	return vm;
+
+abort:
+	while (i-- > 0) {
+		page = pages[i];
+		if (!page)
+			continue;
+		set_page_address(page, NULL);
+		__ClearPageVcompound(page);
+		__free_page(page);
+	}
+	kfree(pages);
+	return NULL;
+}
+
+struct page *alloc_vcompound(gfp_t flags, int order)
+{
+	struct vm_struct *vm;
+	struct page *page;
+
+	page = alloc_pages(flags | __GFP_NORETRY | __GFP_NOWARN, order);
+	if (page || !order)
+		return page;
+
+	vm = ____alloc_vcompound(flags, order, __builtin_return_address(0));
+	if (vm)
+		return vm->pages[0];
+
+	return NULL;
+}
+
+void *__alloc_vcompound(gfp_t flags, int order)
+{
+	struct vm_struct *vm;
+	void *addr;
+
+	addr = (void *)__get_free_pages(flags | __GFP_NORETRY | __GFP_NOWARN,
+								order);
+	if (addr || !order)
+		return addr;
+
+	vm = ____alloc_vcompound(flags, order, __builtin_return_address(0));
+	if (vm)
+		return vm->addr;
+
+	return NULL;
+}

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [05/14] vcompound: Debugging aid
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0008-vcompound-Debugging-aid.patch --]
[-- Type: text/plain, Size: 2472 bytes --]

Virtual fallbacks are rare and thus subtle bugs may creep in if we do not
test the fallbacks. CONFIG_VFALLBACK_ALWAYS makes all vcompound allocations
fall back to vmalloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 lib/Kconfig.debug |   11 +++++++++++
 mm/vmalloc.c      |   18 +++++++++++++++---
 2 files changed, 26 insertions(+), 3 deletions(-)

Index: linux-2.6.25-rc5-mm1/lib/Kconfig.debug
===================================================================
--- linux-2.6.25-rc5-mm1.orig/lib/Kconfig.debug	2008-03-20 23:05:12.910212550 -0700
+++ linux-2.6.25-rc5-mm1/lib/Kconfig.debug	2008-03-20 23:06:21.599135107 -0700
@@ -158,6 +158,17 @@ config DETECT_SOFTLOCKUP
 	   can be detected via the NMI-watchdog, on platforms that
 	   support it.)
 
+config VFALLBACK_ALWAYS
+	bool "Always fall back to virtually mapped compound pages"
+	default y
+	help
+	  Virtual compound pages are only allocated if there is no linear
+	  memory available. They are a fallback and errors created by the
+	  use of virtual mappings instead of linear ones may not surface
+	  because of their infrequent use. This option makes every
+	  allocation that allows a fallback to a virtual mapping use
+	  the virtual mapping. May have a significant performance impact.
+
 config SCHED_DEBUG
 	bool "Collect scheduler debugging info"
 	depends on DEBUG_KERNEL && PROC_FS
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-20 23:06:14.875045176 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-20 23:06:21.599135107 -0700
@@ -1159,7 +1159,13 @@ struct page *alloc_vcompound(gfp_t flags
 	struct vm_struct *vm;
 	struct page *page;
 
-	page = alloc_pages(flags | __GFP_NORETRY | __GFP_NOWARN, order);
+#ifdef CONFIG_VFALLBACK_ALWAYS
+	if (system_state == SYSTEM_RUNNING && order)
+		page = NULL;
+	else
+#endif
+		page = alloc_pages(flags | __GFP_NORETRY | __GFP_NOWARN,
+								order);
 	if (page || !order)
 		return page;
 
@@ -1175,8 +1181,14 @@ void *__alloc_vcompound(gfp_t flags, int
 	struct vm_struct *vm;
 	void *addr;
 
-	addr = (void *)__get_free_pages(flags | __GFP_NORETRY | __GFP_NOWARN,
-								order);
+#ifdef CONFIG_VFALLBACK_ALWAYS
+	if (system_state == SYSTEM_RUNNING && order)
+		addr = NULL;
+	else
+#endif
+		addr = (void *)__get_free_pages(
+			flags | __GFP_NORETRY | __GFP_NOWARN, order);
+
 	if (addr || !order)
 		return addr;
 

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [06/14] vcompound: Virtual fallback for sparsemem
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, apw

[-- Attachment #1: 0009-vcompound-Virtual-fallback-for-sparsemem.patch --]
[-- Type: text/plain, Size: 1672 bytes --]

Sparsemem currently attempts to do a physically contiguous mapping
and then falls back to vmalloc. The same thing can now be accomplished
using virtual compound pages.

Cc: apw@shadowen.org
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/sparse.c |   25 ++-----------------------
 1 file changed, 2 insertions(+), 23 deletions(-)

Index: linux-2.6.25-rc5-mm1/mm/sparse.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/sparse.c	2008-03-20 18:04:45.345133447 -0700
+++ linux-2.6.25-rc5-mm1/mm/sparse.c	2008-03-20 19:32:53.361317058 -0700
@@ -327,24 +327,7 @@ static void __kfree_section_memmap(struc
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
 {
-	struct page *page, *ret;
-	unsigned long memmap_size = sizeof(struct page) * nr_pages;
-
-	page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
-	if (page)
-		goto got_map_page;
-
-	ret = vmalloc(memmap_size);
-	if (ret)
-		goto got_map_ptr;
-
-	return NULL;
-got_map_page:
-	ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
-got_map_ptr:
-	memset(ret, 0, memmap_size);
-
-	return ret;
+	return __alloc_vcompound(GFP_KERNEL, get_order(sizeof(struct page) * nr_pages));
 }
 
 static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
@@ -355,11 +338,7 @@ static inline struct page *kmalloc_secti
 
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-	if (is_vmalloc_addr(memmap))
-		vfree(memmap);
-	else
-		free_pages((unsigned long)memmap,
-			   get_order(sizeof(struct page) * nr_pages));
+	__free_vcompound(memmap);
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [07/14] vcompound: bit waitqueue support
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0011-vcompound-bit-waitqueue-support.patch --]
[-- Type: text/plain, Size: 1088 bytes --]

If bit_waitqueue() is passed a vmalloc address then it must use
vcompound_head_page() instead of virt_to_page().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/wait.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc5-mm1/kernel/wait.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/kernel/wait.c	2008-03-20 20:03:51.141901370 -0700
+++ linux-2.6.25-rc5-mm1/kernel/wait.c	2008-03-20 20:07:57.266856571 -0700
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/wait.h>
 #include <linux/hash.h>
+#include <linux/vmalloc.h>
 
 void init_waitqueue_head(wait_queue_head_t *q)
 {
@@ -245,7 +246,7 @@ EXPORT_SYMBOL(wake_up_bit);
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
-	const struct zone *zone = page_zone(virt_to_page(word));
+	const struct zone *zone = page_zone(vcompound_head_page(word));
 	unsigned long val = (unsigned long)word << shift | bit;
 
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [08/14] vcompound: Fallback for zone wait table
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0010-vcompound-Fallback-for-wait-table.patch --]
[-- Type: text/plain, Size: 1021 bytes --]

Currently vmalloc may be used for allocating the zone wait table. Use a
virtual compound page in order to prefer a physically contiguous allocation
that can then be covered by the large kernel TLBs.

Drawback: The zone wait table allocation is rounded up to the next
power-of-two number of pages, which may cost some memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/page_alloc.c	2008-03-20 20:03:50.885900600 -0700
+++ linux-2.6.25-rc5-mm1/mm/page_alloc.c	2008-03-20 20:04:43.282104684 -0700
@@ -2866,7 +2866,8 @@ int zone_wait_table_init(struct zone *zo
 		 * To use this new node's memory, further consideration will be
 		 * necessary.
 		 */
-		zone->wait_table = vmalloc(alloc_size);
+		zone->wait_table = __alloc_vcompound(GFP_KERNEL,
+						get_order(alloc_size));
 	}
 	if (!zone->wait_table)
 		return -ENOMEM;

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [09/14] vcompound: crypto: Fallback for temporary order 2 allocation
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Dan Williams

[-- Attachment #1: 0012-vcompound-crypto-Fallback-for-temporary-order-2-al.patch --]
[-- Type: text/plain, Size: 1035 bytes --]

The crypto subsystem needs an order-2 allocation. This is a temporary buffer
for XOR-ing data, so we can safely allow fallback.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 crypto/xor.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.25-rc5-mm1/crypto/xor.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/crypto/xor.c	2008-03-20 18:04:44.649120096 -0700
+++ linux-2.6.25-rc5-mm1/crypto/xor.c	2008-03-20 19:41:35.383789613 -0700
@@ -101,7 +101,7 @@ calibrate_xor_blocks(void)
 	void *b1, *b2;
 	struct xor_block_template *f, *fastest;
 
-	b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+	b1 = __alloc_vcompound(GFP_KERNEL, 2);
 	if (!b1) {
 		printk(KERN_WARNING "xor: Yikes!  No memory available.\n");
 		return -ENOMEM;
@@ -140,7 +140,7 @@ calibrate_xor_blocks(void)
 
 #undef xor_speed
 
-	free_pages((unsigned long)b1, 2);
+	__free_vcompound(b1);
 
 	active_template = fastest;
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [10/14] vcompound: slub: Use for buffer to correlate allocation addresses
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0015-vcompound-Fallback-for-buffer-to-correlate-alloc-lo.patch --]
[-- Type: text/plain, Size: 1327 bytes --]

The caller table can get quite large if there are many call sites for a
particular slab. Using a virtual compound page allows falling back to vmalloc
in case the caller table gets too big and memory is fragmented. Currently the
operation would simply fail.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/slub.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.25-rc5-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/slub.c	2008-03-20 18:04:44.153110938 -0700
+++ linux-2.6.25-rc5-mm1/mm/slub.c	2008-03-20 19:40:17.103393950 -0700
@@ -21,6 +21,7 @@
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
+#include <linux/vmalloc.h>
 
 /*
  * Lock order:
@@ -3372,8 +3373,7 @@ struct loc_track {
 static void free_loc_track(struct loc_track *t)
 {
 	if (t->max)
-		free_pages((unsigned long)t->loc,
-			get_order(sizeof(struct location) * t->max));
+		__free_vcompound(t->loc);
 }
 
 static int alloc_loc_track(struct loc_track *t, unsigned long max, gfp_t flags)
@@ -3383,7 +3383,7 @@ static int alloc_loc_track(struct loc_tr
 
 	order = get_order(sizeof(struct location) * max);
 
-	l = (void *)__get_free_pages(flags, order);
+	l = __alloc_vcompound(flags, order);
 	if (!l)
 		return 0;
 

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: 0016-vcompound-Fallbacks-for-order-2-stack-allocations.patch --]
[-- Type: text/plain, Size: 3066 bytes --]

This allows fallback for order 1 stack allocations. In the fallback
scenario the stacks will be virtually mapped.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/asm-ia64/thread_info.h   |    5 +++--
 include/asm-x86/thread_info_32.h |    6 +++---
 include/asm-x86/thread_info_64.h |    4 ++--
 3 files changed, 8 insertions(+), 7 deletions(-)

Index: linux-2.6.25-rc5-mm1/include/asm-ia64/thread_info.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-ia64/thread_info.h	2008-03-20 20:03:47.165885870 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-ia64/thread_info.h	2008-03-20 20:04:51.302135777 -0700
@@ -82,8 +82,9 @@ struct thread_info {
 #define end_of_stack(p) (unsigned long *)((void *)(p) + IA64_RBS_OFFSET)
 
 #define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
-#define alloc_task_struct()	((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
-#define free_task_struct(tsk)	free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
+#define alloc_task_struct()	((struct task_struct *)__alloc_vcompound( \
+			GFP_KERNEL, KERNEL_STACK_SIZE_ORDER))
+#define free_task_struct(tsk)	__free_vcompound(tsk)
 
 #define tsk_set_notify_resume(tsk) \
 	set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
Index: linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_32.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-x86/thread_info_32.h	2008-03-20 20:03:47.173885951 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_32.h	2008-03-20 20:04:51.306136067 -0700
@@ -96,13 +96,13 @@ static inline struct thread_info *curren
 /* thread information allocation */
 #ifdef CONFIG_DEBUG_STACK_USAGE
 #define alloc_thread_info(tsk) ((struct thread_info *) \
-	__get_free_pages(GFP_KERNEL| __GFP_ZERO, get_order(THREAD_SIZE)))
+	__alloc_vcompound(GFP_KERNEL| __GFP_ZERO, get_order(THREAD_SIZE)))
 #else
 #define alloc_thread_info(tsk) ((struct thread_info *) \
-	__get_free_pages(GFP_KERNEL, get_order(THREAD_SIZE)))
+	__alloc_vcompound(GFP_KERNEL, get_order(THREAD_SIZE)))
 #endif
 
-#define free_thread_info(info)	free_pages((unsigned long)(info), get_order(THREAD_SIZE))
+#define free_thread_info(info)	__free_vcompound(info)
 
 #else /* !__ASSEMBLY__ */
 
Index: linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-x86/thread_info_64.h	2008-03-20 20:03:47.189886138 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_64.h	2008-03-20 20:04:51.306136067 -0700
@@ -83,9 +83,9 @@ static inline struct thread_info *stack_
 #endif
 
 #define alloc_thread_info(tsk) \
-	((struct thread_info *) __get_free_pages(THREAD_FLAGS, THREAD_ORDER))
+	((struct thread_info *) __alloc_vcompound(THREAD_FLAGS, THREAD_ORDER))
 
-#define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER)
+#define free_thread_info(ti) __free_vcompound(ti)
 
 #else /* !__ASSEMBLY__ */
 

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [12/14] vcompound: Avoid vmalloc in e1000 driver
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: e1000 --]
[-- Type: text/plain, Size: 6064 bytes --]

Switch all uses of vmalloc in the e1000 driver to virtual compounds.
This results in regular memory being used for the ring buffers and
similar structures, avoiding page table overhead.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 drivers/net/e1000/e1000_main.c |   23 +++++++++++------------
 drivers/net/e1000e/netdev.c    |   12 ++++++------
 2 files changed, 17 insertions(+), 18 deletions(-)

Index: linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000e/netdev.c	2008-03-20 21:52:45.962733927 -0700
+++ linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c	2008-03-20 21:57:27.212078371 -0700
@@ -1083,7 +1083,7 @@ int e1000e_setup_tx_resources(struct e10
 	int err = -ENOMEM, size;
 
 	size = sizeof(struct e1000_buffer) * tx_ring->count;
-	tx_ring->buffer_info = vmalloc(size);
+	tx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
 	if (!tx_ring->buffer_info)
 		goto err;
 	memset(tx_ring->buffer_info, 0, size);
@@ -1102,7 +1102,7 @@ int e1000e_setup_tx_resources(struct e10
 
 	return 0;
 err:
-	vfree(tx_ring->buffer_info);
+	__free_vcompound(tx_ring->buffer_info);
 	ndev_err(adapter->netdev,
 	"Unable to allocate memory for the transmit descriptor ring\n");
 	return err;
@@ -1121,7 +1121,7 @@ int e1000e_setup_rx_resources(struct e10
 	int i, size, desc_len, err = -ENOMEM;
 
 	size = sizeof(struct e1000_buffer) * rx_ring->count;
-	rx_ring->buffer_info = vmalloc(size);
+	rx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
 	if (!rx_ring->buffer_info)
 		goto err;
 	memset(rx_ring->buffer_info, 0, size);
@@ -1157,7 +1157,7 @@ err_pages:
 		kfree(buffer_info->ps_pages);
 	}
 err:
-	vfree(rx_ring->buffer_info);
+	__free_vcompound(rx_ring->buffer_info);
 	ndev_err(adapter->netdev,
 	"Unable to allocate memory for the transmit descriptor ring\n");
 	return err;
@@ -1204,7 +1204,7 @@ void e1000e_free_tx_resources(struct e10
 
 	e1000_clean_tx_ring(adapter);
 
-	vfree(tx_ring->buffer_info);
+	__free_vcompound(tx_ring->buffer_info);
 	tx_ring->buffer_info = NULL;
 
 	dma_free_coherent(&pdev->dev, tx_ring->size, tx_ring->desc,
@@ -1231,7 +1231,7 @@ void e1000e_free_rx_resources(struct e10
 		kfree(rx_ring->buffer_info[i].ps_pages);
 	}
 
-	vfree(rx_ring->buffer_info);
+	__free_vcompound(rx_ring->buffer_info);
 	rx_ring->buffer_info = NULL;
 
 	dma_free_coherent(&pdev->dev, rx_ring->size, rx_ring->desc,
Index: linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000/e1000_main.c	2008-03-20 22:06:14.462252441 -0700
+++ linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c	2008-03-20 22:08:46.582009872 -0700
@@ -1609,14 +1609,13 @@ e1000_setup_tx_resources(struct e1000_ad
 	int size;
 
 	size = sizeof(struct e1000_buffer) * txdr->count;
-	txdr->buffer_info = vmalloc(size);
+	txdr->buffer_info = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
+							get_order(size));
 	if (!txdr->buffer_info) {
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the transmit descriptor ring\n");
 		return -ENOMEM;
 	}
-	memset(txdr->buffer_info, 0, size);
-
 	/* round up to nearest 4K */
 
 	txdr->size = txdr->count * sizeof(struct e1000_tx_desc);
@@ -1625,7 +1624,7 @@ e1000_setup_tx_resources(struct e1000_ad
 	txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
 	if (!txdr->desc) {
 setup_tx_desc_die:
-		vfree(txdr->buffer_info);
+		__free_vcompound(txdr->buffer_info);
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the transmit descriptor ring\n");
 		return -ENOMEM;
@@ -1653,7 +1652,7 @@ setup_tx_desc_die:
 			DPRINTK(PROBE, ERR,
 				"Unable to allocate aligned memory "
 				"for the transmit descriptor ring\n");
-			vfree(txdr->buffer_info);
+			__free_vcompound(txdr->buffer_info);
 			return -ENOMEM;
 		} else {
 			/* Free old allocation, new allocation was successful */
@@ -1826,7 +1825,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	int size, desc_len;
 
 	size = sizeof(struct e1000_buffer) * rxdr->count;
-	rxdr->buffer_info = vmalloc(size);
+	rxdr->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
 	if (!rxdr->buffer_info) {
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the receive descriptor ring\n");
@@ -1837,7 +1836,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	rxdr->ps_page = kcalloc(rxdr->count, sizeof(struct e1000_ps_page),
 	                        GFP_KERNEL);
 	if (!rxdr->ps_page) {
-		vfree(rxdr->buffer_info);
+		__free_vcompound(rxdr->buffer_info);
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the receive descriptor ring\n");
 		return -ENOMEM;
@@ -1847,7 +1846,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	                            sizeof(struct e1000_ps_page_dma),
 	                            GFP_KERNEL);
 	if (!rxdr->ps_page_dma) {
-		vfree(rxdr->buffer_info);
+		__free_vcompound(rxdr->buffer_info);
 		kfree(rxdr->ps_page);
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the receive descriptor ring\n");
@@ -1870,7 +1869,7 @@ e1000_setup_rx_resources(struct e1000_ad
 		DPRINTK(PROBE, ERR,
 		"Unable to allocate memory for the receive descriptor ring\n");
 setup_rx_desc_die:
-		vfree(rxdr->buffer_info);
+		__free_vcompound(rxdr->buffer_info);
 		kfree(rxdr->ps_page);
 		kfree(rxdr->ps_page_dma);
 		return -ENOMEM;
@@ -2175,7 +2174,7 @@ e1000_free_tx_resources(struct e1000_ada
 
 	e1000_clean_tx_ring(adapter, tx_ring);
 
-	vfree(tx_ring->buffer_info);
+	__free_vcompound(tx_ring->buffer_info);
 	tx_ring->buffer_info = NULL;
 
 	pci_free_consistent(pdev, tx_ring->size, tx_ring->desc, tx_ring->dma);
@@ -2283,9 +2282,9 @@ e1000_free_rx_resources(struct e1000_ada
 
 	e1000_clean_rx_ring(adapter, rx_ring);
 
-	vfree(rx_ring->buffer_info);
+	__free_vcompound(rx_ring->buffer_info);
 	rx_ring->buffer_info = NULL;
-	kfree(rx_ring->ps_page);
+	__free_vcompound(rx_ring->ps_page);
 	rx_ring->ps_page = NULL;
 	kfree(rx_ring->ps_page_dma);
 	rx_ring->ps_page_dma = NULL;

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [13/14] vcompound: Use vcompound for swap_map
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: fixswapon --]
[-- Type: text/plain, Size: 1734 bytes --]

Use virtual compound pages for the large swap maps. This only works for
swap maps that are smaller than a MAX_ORDER block, though. If the swap map
is larger, there is no way around the use of vmalloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/swapfile.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6.25-rc5-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/swapfile.c	2008-03-20 20:32:12.793950570 -0700
+++ linux-2.6.25-rc5-mm1/mm/swapfile.c	2008-03-20 20:37:43.367821147 -0700
@@ -1312,7 +1312,7 @@ asmlinkage long sys_swapoff(const char _
 	p->flags = 0;
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
-	vfree(swap_map);
+	__free_vcompound(swap_map);
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		struct block_device *bdev = I_BDEV(inode);
@@ -1636,13 +1636,13 @@ asmlinkage long sys_swapon(const char __
 			goto bad_swap;
 
 		/* OK, set up the swap map and apply the bad block list */
-		if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
+		if (!(p->swap_map = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
+					get_order(maxpages * sizeof(short))))) {
 			error = -ENOMEM;
 			goto bad_swap;
 		}
 
 		error = 0;
-		memset(p->swap_map, 0, maxpages * sizeof(short));
 		for (i = 0; i < swap_header->info.nr_badpages; i++) {
 			int page_nr = swap_header->info.badpages[i];
 			if (page_nr <= 0 || page_nr >= swap_header->info.last_page)
@@ -1718,7 +1718,7 @@ bad_swap_2:
 	if (!(swap_flags & SWAP_FLAG_PREFER))
 		++least_priority;
 	spin_unlock(&swap_lock);
-	vfree(swap_map);
+	__free_vcompound(swap_map);
 	if (swap_file)
 		filp_close(swap_file, NULL);
 out:

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-21  6:17   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  6:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: tcpinit --]
[-- Type: text/plain, Size: 1208 bytes --]

Avoid the use of vmalloc for the ehash locks.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/net/inet_hashtables.h |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.25-rc5-mm1/include/net/inet_hashtables.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/net/inet_hashtables.h	2008-03-20 22:21:02.680501729 -0700
+++ linux-2.6.25-rc5-mm1/include/net/inet_hashtables.h	2008-03-20 22:22:15.416565317 -0700
@@ -164,7 +164,8 @@ static inline int inet_ehash_locks_alloc
 	if (sizeof(rwlock_t) != 0) {
 #ifdef CONFIG_NUMA
 		if (size * sizeof(rwlock_t) > PAGE_SIZE)
-			hashinfo->ehash_locks = vmalloc(size * sizeof(rwlock_t));
+			hashinfo->ehash_locks = __alloc_vcompound(GFP_KERNEL,
+				get_order(size * sizeof(rwlock_t)));
 		else
 #endif
 		hashinfo->ehash_locks =	kmalloc(size * sizeof(rwlock_t),
@@ -185,7 +186,7 @@ static inline void inet_ehash_locks_free
 		unsigned int size = (hashinfo->ehash_locks_mask + 1) *
 							sizeof(rwlock_t);
 		if (size > PAGE_SIZE)
-			vfree(hashinfo->ehash_locks);
+			__free_vcompound(hashinfo->ehash_locks);
 		else
 #endif
 		kfree(hashinfo->ehash_locks);

-- 

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21  7:02     ` Eric Dumazet
  -1 siblings, 0 replies; 212+ messages in thread
From: Eric Dumazet @ 2008-03-21  7:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel, Linux Netdev List

Christoph Lameter wrote:
> Avoid the use of vmalloc for the ehash locks.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/net/inet_hashtables.h |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.25-rc5-mm1/include/net/inet_hashtables.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/include/net/inet_hashtables.h	2008-03-20 22:21:02.680501729 -0700
> +++ linux-2.6.25-rc5-mm1/include/net/inet_hashtables.h	2008-03-20 22:22:15.416565317 -0700
> @@ -164,7 +164,8 @@ static inline int inet_ehash_locks_alloc
>  	if (sizeof(rwlock_t) != 0) {
>  #ifdef CONFIG_NUMA
>  		if (size * sizeof(rwlock_t) > PAGE_SIZE)
> -			hashinfo->ehash_locks = vmalloc(size * sizeof(rwlock_t));
> +			hashinfo->ehash_locks = __alloc_vcompound(GFP_KERNEL,
> +				get_order(size * sizeof(rwlock_t)));
>  		else
>  #endif
>  		hashinfo->ehash_locks =	kmalloc(size * sizeof(rwlock_t),
> @@ -185,7 +186,7 @@ static inline void inet_ehash_locks_free
>  		unsigned int size = (hashinfo->ehash_locks_mask + 1) *
>  							sizeof(rwlock_t);
>  		if (size > PAGE_SIZE)
> -			vfree(hashinfo->ehash_locks);
> +			__free_vcompound(hashinfo->ehash_locks);
>  		else
>  #endif
>  		kfree(hashinfo->ehash_locks);
> 

But isn't it defeating the purpose of this *particular* vmalloc() use?

CONFIG_NUMA and vmalloc() at boot time means:

Try to distribute the pages across several nodes.

Memory pressure on ehash_locks[] is so high that we definitely want to spread it.

(for similar uses of vmalloc(), see also hashdist=1)

Also, please CC netdev for network patches :)

Thank you


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  7:02     ` Eric Dumazet
@ 2008-03-21  7:03       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21  7:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-mm, linux-kernel, Linux Netdev List

On Fri, 21 Mar 2008, Eric Dumazet wrote:

> But, isnt it defeating the purpose of this *particular* vmalloc() use ?

I thought that was controlled by hashdist? I did not see it used here, and
so I assumed that round-robin (RR) placement was not intended here.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21  7:25     ` David Miller, Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-21  7:25 UTC (permalink / raw)
  To: clameter; +Cc: linux-mm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Thu, 20 Mar 2008 23:17:14 -0700

> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>

I would be very careful with this especially on IA64.

If the TLB miss or other low-level trap handler depends upon being
able to dereference thread info, task struct, or kernel stack stuff
without causing a fault outside of the linear PAGE_OFFSET area, this
patch will cause problems.

It will be difficult to debug the kinds of crashes this will cause
too.

^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  7:02     ` Eric Dumazet
@ 2008-03-21  7:31       ` David Miller, Eric Dumazet
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-21  7:31 UTC (permalink / raw)
  To: dada1; +Cc: clameter, linux-mm, linux-kernel, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 21 Mar 2008 08:02:11 +0100

> But isn't it defeating the purpose of this *particular* vmalloc() use?
> 
> CONFIG_NUMA and vmalloc() at boot time means:
> 
> Try to distribute the pages on several nodes.
> 
> Memory pressure on ehash_locks[] is so high we definitely want to spread it.
> 
> (for similar uses of vmalloc(), see also hashdist=1)
> 
> Also, please CC netdev for network patches :)

I agree with Eric, converting any of the networking hash
allocations to this new facility is not the right thing
to do.

^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  7:03       ` Christoph Lameter
@ 2008-03-21  7:31         ` David Miller, Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-21  7:31 UTC (permalink / raw)
  To: clameter; +Cc: dada1, linux-mm, linux-kernel, netdev

From: Christoph Lameter <clameter@sgi.com>
Date: Fri, 21 Mar 2008 00:03:51 -0700 (PDT)

> On Fri, 21 Mar 2008, Eric Dumazet wrote:
> 
> > But isn't it defeating the purpose of this *particular* vmalloc() use?
> 
> I thought that was controlled by hashdist? I did not see it used here and 
> so I assumed that the RR was not intended here.

It's intended for all of the major networking hash tables.

^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  7:31         ` David Miller, Christoph Lameter
@ 2008-03-21  7:42           ` Eric Dumazet
  -1 siblings, 0 replies; 212+ messages in thread
From: Eric Dumazet @ 2008-03-21  7:42 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, linux-mm, linux-kernel, netdev

David Miller wrote:
> From: Christoph Lameter <clameter@sgi.com>
> Date: Fri, 21 Mar 2008 00:03:51 -0700 (PDT)
> 
>> On Fri, 21 Mar 2008, Eric Dumazet wrote:
>>
>>> But isn't it defeating the purpose of this *particular* vmalloc() use?
>> I thought that was controlled by hashdist? I did not see it used here and 
>> so I assumed that the RR was not intended here.
> 
> It's intended for all of the major networking hash tables.

Other networking hash tables use alloc_large_system_hash(), which handles 
the hashdist setting.

But this helper is __init only, so we cannot use it for ehash_locks (which 
can be allocated by the DCCP module).



^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [03/14] vmallocinfo: Support display of vcompound for a virtual compound page
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21  7:55     ` Eric Dumazet
  -1 siblings, 0 replies; 212+ messages in thread
From: Eric Dumazet @ 2008-03-21  7:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

Christoph Lameter wrote:
> Add another flag to the vmalloc subsystem to mark virtual compound pages.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/vmalloc.h |    1 +
>  mm/vmalloc.c            |    3 +++
>  2 files changed, 4 insertions(+)
> 
> Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-19 18:17:42.093443900 -0700
> +++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-19 18:27:20.150422445 -0700
> @@ -12,6 +12,7 @@ struct vm_area_struct;
>  #define VM_MAP		0x00000004	/* vmap()ed pages */
>  #define VM_USERMAP	0x00000008	/* suitable for remap_vmalloc_range */
>  #define VM_VPAGES	0x00000010	/* buffer for pages was vmalloc'ed */
> +#define VM_VCOMPOUND	0x00000020	/* Page allocator fallback */
>  /* bits [20..32] reserved for arch specific ioremap internals */
>  
>  /*
> Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-19 18:18:02.689633934 -0700
> +++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-19 18:27:20.150422445 -0700
> @@ -974,6 +974,9 @@ static int s_show(struct seq_file *m, vo
>  	if (v->flags & VM_VPAGES)
>  		seq_printf(m, " vpages");
>  
> +	if (v->flags & VM_VCOMPOUND)
> +		seq_printf(m, " vcompound");
> +
>  	seq_putc(m, '\n');
>  	return 0;
>  }
> 

I would love to see NUMA information in vmallocinfo as well, but I 
currently have no time to prepare a patch.

Counters with numbers of pages per node would be great.

(like in /proc/pid/numa_maps)

N0=2 N1=2 N2=2 N3=2


This way we could check whether hashdist is working, since it depends on 
various NUMA policies :)


^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  7:25     ` David Miller, Christoph Lameter
@ 2008-03-21  8:39       ` Ingo Molnar
  -1 siblings, 0 replies; 212+ messages in thread
From: Ingo Molnar @ 2008-03-21  8:39 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, linux-mm, linux-kernel, Peter Zijlstra


* David Miller <davem@davemloft.net> wrote:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Thu, 20 Mar 2008 23:17:14 -0700
> 
> > This allows fallback for order 1 stack allocations. In the fallback
> > scenario the stacks will be virtually mapped.
> > 
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> I would be very careful with this especially on IA64.
> 
> If the TLB miss or other low-level trap handler depends upon being 
> able to dereference thread info, task struct, or kernel stack stuff 
> without causing a fault outside of the linear PAGE_OFFSET area, this 
> patch will cause problems.
> 
> It will be difficult to debug the kinds of crashes this will cause 
> too. [...]

another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER 
which has been NACK-ed before on x86 by several people and i'm nacking 
this "configurable stack size" aspect of it again.

although it's not being spelled out in the changelog, i believe the 
fundamental problem comes from a cpumask_t taking 512 bytes with 
nr_cpus=4096, and if a few of them are on the kernel stack it can be a 
problem. The correct answer is to not put them on the stack and we've 
been taking patches to that end. Every other object allocator in the 
kernel is able to not put stuff on the kernel stack. We _dont_ want 
higher-order kernel stacks and we dont want to make a special exception 
for cpumask_t either.

i believe time might be better spent increasing PAGE_SIZE on these 
ridiculously large systems and making that work well with our binary 
formats - instead of complicating our kernel VM with virtually mapped 
buffers. That will also solve the kernel stack problem, in a very 
natural way.

	Ingo

^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [12/14] vcompound: Avoid vmalloc in e1000 driver
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21 17:27     ` Kok, Auke
  -1 siblings, 0 replies; 212+ messages in thread
From: Kok, Auke @ 2008-03-21 17:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

Christoph Lameter wrote:
> Switch all the uses of vmalloc in the e1000 driver to virtual compounds.
> This will result in the use of regular memory for the ring buffers etc.,
> avoiding page tables.

hey, cool patch for sure!

I'll see if I can transpose this to e1000e and all the other drivers I maintain
which use vmalloc as well.

This one goes on my queue and I'll merge through Jeff.

Thanks Christoph!

Auke



> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  drivers/net/e1000/e1000_main.c |   23 +++++++++++------------
>  drivers/net/e1000e/netdev.c    |   12 ++++++------
>  2 files changed, 17 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000e/netdev.c	2008-03-20 21:52:45.962733927 -0700
> +++ linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c	2008-03-20 21:57:27.212078371 -0700
> @@ -1083,7 +1083,7 @@ int e1000e_setup_tx_resources(struct e10
>  	int err = -ENOMEM, size;
>  
>  	size = sizeof(struct e1000_buffer) * tx_ring->count;
> -	tx_ring->buffer_info = vmalloc(size);
> +	tx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
>  	if (!tx_ring->buffer_info)
>  		goto err;
>  	memset(tx_ring->buffer_info, 0, size);
> @@ -1102,7 +1102,7 @@ int e1000e_setup_tx_resources(struct e10
>  
>  	return 0;
>  err:
> -	vfree(tx_ring->buffer_info);
> +	__free_vcompound(tx_ring->buffer_info);
>  	ndev_err(adapter->netdev,
>  	"Unable to allocate memory for the transmit descriptor ring\n");
>  	return err;
> @@ -1121,7 +1121,7 @@ int e1000e_setup_rx_resources(struct e10
>  	int i, size, desc_len, err = -ENOMEM;
>  
>  	size = sizeof(struct e1000_buffer) * rx_ring->count;
> -	rx_ring->buffer_info = vmalloc(size);
> +	rx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
>  	if (!rx_ring->buffer_info)
>  		goto err;
>  	memset(rx_ring->buffer_info, 0, size);
> @@ -1157,7 +1157,7 @@ err_pages:
>  		kfree(buffer_info->ps_pages);
>  	}
>  err:
> -	vfree(rx_ring->buffer_info);
> +	__free_vcompound(rx_ring->buffer_info);
>  	ndev_err(adapter->netdev,
>  	"Unable to allocate memory for the transmit descriptor ring\n");
>  	return err;
> @@ -1204,7 +1204,7 @@ void e1000e_free_tx_resources(struct e10
>  
>  	e1000_clean_tx_ring(adapter);
>  
> -	vfree(tx_ring->buffer_info);
> +	__free_vcompound(tx_ring->buffer_info);
>  	tx_ring->buffer_info = NULL;
>  
>  	dma_free_coherent(&pdev->dev, tx_ring->size, tx_ring->desc,
> @@ -1231,7 +1231,7 @@ void e1000e_free_rx_resources(struct e10
>  		kfree(rx_ring->buffer_info[i].ps_pages);
>  	}
>  
> -	vfree(rx_ring->buffer_info);
> +	__free_vcompound(rx_ring->buffer_info);
>  	rx_ring->buffer_info = NULL;
>  
>  	dma_free_coherent(&pdev->dev, rx_ring->size, rx_ring->desc,
> Index: linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000/e1000_main.c	2008-03-20 22:06:14.462252441 -0700
> +++ linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c	2008-03-20 22:08:46.582009872 -0700
> @@ -1609,14 +1609,13 @@ e1000_setup_tx_resources(struct e1000_ad
>  	int size;
>  
>  	size = sizeof(struct e1000_buffer) * txdr->count;
> -	txdr->buffer_info = vmalloc(size);
> +	txdr->buffer_info = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
> +							get_order(size));
>  	if (!txdr->buffer_info) {
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the transmit descriptor ring\n");
>  		return -ENOMEM;
>  	}
> -	memset(txdr->buffer_info, 0, size);
> -
>  	/* round up to nearest 4K */
>  
>  	txdr->size = txdr->count * sizeof(struct e1000_tx_desc);
> @@ -1625,7 +1624,7 @@ e1000_setup_tx_resources(struct e1000_ad
>  	txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
>  	if (!txdr->desc) {
>  setup_tx_desc_die:
> -		vfree(txdr->buffer_info);
> +		__free_vcompound(txdr->buffer_info);
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the transmit descriptor ring\n");
>  		return -ENOMEM;
> @@ -1653,7 +1652,7 @@ setup_tx_desc_die:
>  			DPRINTK(PROBE, ERR,
>  				"Unable to allocate aligned memory "
>  				"for the transmit descriptor ring\n");
> -			vfree(txdr->buffer_info);
> +			__free_vcompound(txdr->buffer_info);
>  			return -ENOMEM;
>  		} else {
>  			/* Free old allocation, new allocation was successful */
> @@ -1826,7 +1825,7 @@ e1000_setup_rx_resources(struct e1000_ad
>  	int size, desc_len;
>  
>  	size = sizeof(struct e1000_buffer) * rxdr->count;
> -	rxdr->buffer_info = vmalloc(size);
> +	rxdr->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
>  	if (!rxdr->buffer_info) {
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the receive descriptor ring\n");
> @@ -1837,7 +1836,7 @@ e1000_setup_rx_resources(struct e1000_ad
>  	rxdr->ps_page = kcalloc(rxdr->count, sizeof(struct e1000_ps_page),
>  	                        GFP_KERNEL);
>  	if (!rxdr->ps_page) {
> -		vfree(rxdr->buffer_info);
> +		__free_vcompound(rxdr->buffer_info);
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the receive descriptor ring\n");
>  		return -ENOMEM;
> @@ -1847,7 +1846,7 @@ e1000_setup_rx_resources(struct e1000_ad
>  	                            sizeof(struct e1000_ps_page_dma),
>  	                            GFP_KERNEL);
>  	if (!rxdr->ps_page_dma) {
> -		vfree(rxdr->buffer_info);
> +		__free_vcompound(rxdr->buffer_info);
>  		kfree(rxdr->ps_page);
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the receive descriptor ring\n");
> @@ -1870,7 +1869,7 @@ e1000_setup_rx_resources(struct e1000_ad
>  		DPRINTK(PROBE, ERR,
>  		"Unable to allocate memory for the receive descriptor ring\n");
>  setup_rx_desc_die:
> -		vfree(rxdr->buffer_info);
> +		__free_vcompound(rxdr->buffer_info);
>  		kfree(rxdr->ps_page);
>  		kfree(rxdr->ps_page_dma);
>  		return -ENOMEM;
> @@ -2175,7 +2174,7 @@ e1000_free_tx_resources(struct e1000_ada
>  
>  	e1000_clean_tx_ring(adapter, tx_ring);
>  
> -	vfree(tx_ring->buffer_info);
> +	__free_vcompound(tx_ring->buffer_info);
>  	tx_ring->buffer_info = NULL;
>  
>  	pci_free_consistent(pdev, tx_ring->size, tx_ring->desc, tx_ring->dma);
> @@ -2283,9 +2282,9 @@ e1000_free_rx_resources(struct e1000_ada
>  
>  	e1000_clean_rx_ring(adapter, rx_ring);
>  
> -	vfree(rx_ring->buffer_info);
> +	__free_vcompound(rx_ring->buffer_info);
>  	rx_ring->buffer_info = NULL;
> -	kfree(rx_ring->ps_page);
> +	__free_vcompound(rx_ring->ps_page);
>  	rx_ring->ps_page = NULL;
>  	kfree(rx_ring->ps_page_dma);
>  	rx_ring->ps_page_dma = NULL;
> 


^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [14/14] vcompound: Avoid vmalloc for ehash_locks
  2008-03-21  7:31       ` David Miller, Eric Dumazet
@ 2008-03-21 17:31         ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 17:31 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, linux-mm, linux-kernel, netdev

On Fri, 21 Mar 2008, David Miller wrote:

> I agree with Eric, converting any of the networking hash
> allocations to this new facility is not the right thing
> to do.

Ok. Going to drop it.
 

^ permalink raw reply	[flat|nested] 212+ messages in thread


* Re: [03/14] vmallocinfo: Support display of vcompound for a virtual compound page
  2008-03-21  7:55     ` Eric Dumazet
@ 2008-03-21 17:32       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 17:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-mm, linux-kernel

On Fri, 21 Mar 2008, Eric Dumazet wrote:

> I would love to see NUMA information as well on vmallocinfo, but have
> currently no available time to prepare a patch.

Ok. Easy to add.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  8:39       ` Ingo Molnar
@ 2008-03-21 17:33         ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 17:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, linux-mm, linux-kernel, Peter Zijlstra

On Fri, 21 Mar 2008, Ingo Molnar wrote:

> another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER 
> which has been NACK-ed before on x86 by several people and i'm nacking 
> this "configurable stack size" aspect of it again.

Huh? Nothing of that nature is in this patchset.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  7:25     ` David Miller, Christoph Lameter
@ 2008-03-21 17:40       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 17:40 UTC (permalink / raw)
  To: David Miller; +Cc: linux-mm, linux-kernel

On Fri, 21 Mar 2008, David Miller wrote:

> I would be very careful with this especially on IA64.
> 
> If the TLB miss or other low-level trap handler depends upon being
> able to dereference thread info, task struct, or kernel stack stuff
> without causing a fault outside of the linear PAGE_OFFSET area, this
> patch will cause problems.

Hmmm. Does not sound good for arches that cannot handle TLB misses in 
hardware. I wonder how arch specific this is? Last time around I was told 
that some arches already virtually map their stacks.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 17:33         ` Christoph Lameter
@ 2008-03-21 19:02           ` Ingo Molnar
  -1 siblings, 0 replies; 212+ messages in thread
From: Ingo Molnar @ 2008-03-21 19:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, linux-mm, linux-kernel, Peter Zijlstra


* Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 21 Mar 2008, Ingo Molnar wrote:
> 
> > another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER 
> > which has been NACK-ed before on x86 by several people and i'm 
> > nacking this "configurable stack size" aspect of it again.
> 
> Huh? Nothing of that nature is in this patchset.

your patch indeed does not introduce it here, but 
KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and 
you refer to multi-order stack allocations in your 0/14 mail :-)

> -#define alloc_task_struct()	((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
> -#define free_task_struct(tsk)	free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
> +#define alloc_task_struct()	((struct task_struct *)__alloc_vcompound( \
> +			GFP_KERNEL, KERNEL_STACK_SIZE_ORDER))

	Ingo

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 19:02           ` Ingo Molnar
@ 2008-03-21 19:04             ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 19:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, linux-mm, linux-kernel, Peter Zijlstra

On Fri, 21 Mar 2008, Ingo Molnar wrote:

> your patch indeed does not introduce it here, but 
> KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and 
> you refer to multi-order stack allocations in your 0/14 mail :-)

Ahh. I see. Remnants from V2 in IA64 code. That portion has to be removed 
because of the software TLB issues on IA64 as pointed out by Dave.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21 21:25     ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-21 21:25 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> Use virtual compound pages for the large swap maps. This only works for
> swap maps that are smaller than a MAX_ORDER block though. If the swap map
> is larger then there is no way around the use of vmalloc.

Have you considered the potential memory wastage from rounding up
to the next page order now? (similar in all the other patches
to change vmalloc). e.g. if the old size was 64k + 1 byte it will
suddenly get 128k now. That is actually not an uncommon situation
in my experience; there are often power of two buffers with 
some small headers.

A long time ago (in 2.4-aa) I did something similar for module loading
as an experiment to avoid too many TLB misses. The module loader
would first try to get a continuous range in the direct mapping and 
only then fall back to vmalloc.

But I used a simple trick to avoid the waste problem: it allocated a
continuous range rounded up to the next page-size order and then freed
the excess pages back into the page allocator. That was called
alloc_exact(). If you replace vmalloc with alloc_pages you should
use something like that too I think.

-Andi


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-21 21:25     ` Andi Kleen
@ 2008-03-21 21:33       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-21 21:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Fri, 21 Mar 2008, Andi Kleen wrote:

> > is larger then there is no way around the use of vmalloc.
> 
> Have you considered the potential memory wastage from rounding up
> to the next page order now? (similar in all the other patches
> to change vmalloc). e.g. if the old size was 64k + 1 byte it will
> suddenly get 128k now. That is actually not an uncommon situation
> in my experience; there are often power of two buffers with 
> some small headers.

Yes, the larger the order, the more significant the problem becomes.

> A long time ago (in 2.4-aa) I did something similar for module loading
> as an experiment to avoid too many TLB misses. The module loader
> would first try to get a continuous range in the direct mapping and 
> only then fall back to vmalloc.
> 
> But I used a simple trick to avoid the waste problem: it allocated a
> continuous range rounded up to the next page-size order and then freed
> the excess pages back into the page allocator. That was called
> alloc_exact(). If you replace vmalloc with alloc_pages you should
> use something like that too I think.

That trick is still in use for alloc_large_system_hash....

But cutting off the tail of compound pages would make treating them as 
order N pages difficult. The vmalloc fallback situation is easy to deal 
with.

Maybe we can think about making compound pages N consecutive pages 
of PAGE_SIZE rather than an order-O page? The API would be a bit 
different then, and it would require changes to the page allocator. It 
would also mean more fragmentation if pages like that are freed.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 17:40       ` Christoph Lameter
@ 2008-03-21 21:57         ` David Miller, Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-21 21:57 UTC (permalink / raw)
  To: clameter; +Cc: linux-mm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Fri, 21 Mar 2008 10:40:18 -0700 (PDT)

> On Fri, 21 Mar 2008, David Miller wrote:
> 
> > I would be very careful with this especially on IA64.
> > 
> > If the TLB miss or other low-level trap handler depends upon being
> > able to dereference thread info, task struct, or kernel stack stuff
> > without causing a fault outside of the linear PAGE_OFFSET area, this
> > patch will cause problems.
> 
> Hmmm. Does not sound good for arches that cannot handle TLB misses in 
> hardware. I wonder how arch specific this is? Last time around I was told 
> that some arches already virtually map their stacks.

I'm not saying there is a problem, I'm saying "tread lightly"
because there might be one.

The thing to do is to first validate the way that IA64
handles recursive TLB misses occurring during an initial
TLB miss, and if there are any limitations therein.

That's the kind of thing I'm talking about.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-21 22:30     ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-21 22:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.

The traditional reason this was discouraged (people seem to reinvent
variants of this patch all the time) was that there used 
to be drivers that did __pa() (or equivalent) on stack addresses
and that doesn't work with vmalloc pages.

I don't know if such drivers still exist, but such a change
is certainly not a no-brainer.

-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [04/14] vcompound: Core piece
  2008-03-21  6:17   ` Christoph Lameter
@ 2008-03-22 12:10     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 212+ messages in thread
From: KOSAKI Motohiro @ 2008-03-22 12:10 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: kosaki.motohiro, linux-mm, linux-kernel

Hi

in general, I like this patch and I found no bug :)

> Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-20 23:03:14.600588151 -0700
> +++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-20 23:03:14.612588010 -0700
> @@ -86,6 +86,20 @@ extern struct vm_struct *alloc_vm_area(s
>  extern void free_vm_area(struct vm_struct *area);
>  
>  /*
> + * Support for virtual compound pages.
> + *
> + * Calls to vcompound alloc will result in the allocation of normal compound
> + * pages unless memory is fragmented.  If insufficient physical linear memory
> + * is available then a virtually contiguous area of memory will be created
> + * using the vmalloc functionality.
> + */
> +struct page *alloc_vcompound_alloc(gfp_t flags, int order);

Where does alloc_vcompound_alloc exist?


> +/*
> + * Virtual Compound Page support.
> + *
> + * Virtual Compound Pages are used to fall back to order 0 allocations if large
> + * linear mappings are not available. They are formatted according to compound
> + * page conventions. I.e. following page->first_page if PageTail(page) is set
> + * can be used to determine the head page.
> + */
> +

Hmm,
IMHO we need more vcompound documentation for beginners in the Documentation/ directory.
If not, nobody will understand the meaning of the vcompound flag in /proc/vmallocinfo.


> +void __free_vcompound(void *addr)
> +void free_vcompound(struct page *page)
> +struct page *alloc_vcompound(gfp_t flags, int order)
> +void *__alloc_vcompound(gfp_t flags, int order)

Maybe we need a DocBook-style comment at the head of each of these 4 functions.




^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [00/14] Virtual Compound Page Support V3
  2008-03-21  6:17 ` Christoph Lameter
@ 2008-03-22 18:40   ` Arjan van de Ven
  -1 siblings, 0 replies; 212+ messages in thread
From: Arjan van de Ven @ 2008-03-22 18:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

On Thu, 20 Mar 2008 23:17:03 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> Allocations of larger pages are not reliable in Linux. If larger
> pages have to be allocated then one faces various choices of allowing
> graceful fallback or using vmalloc with a performance penalty due
> to the use of a page table. Virtual Compound pages are
> a simple solution out of this dilemma.


can you document the drawback of large, frequent vmalloc() allocations at least?
On 32-bit x86, the effective vmalloc space is 64MB or so (after various PCI BARs are ioremapped),
so if this type of allocation is used for something that "scales with nr of ABC", where "ABC" is workload dependent,
there's a rather abrupt upper limit to this.
Not saying that this is a flaw of your patch, just pointing out that we should discourage usage of
the "scales with nr of ABC" (for example "one for each thread") kind of things.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 21:57         ` David Miller, Christoph Lameter
  (?)
@ 2008-03-24 18:27           ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 18:27 UTC (permalink / raw)
  To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64

On Fri, 21 Mar 2008, David Miller wrote:

> The thing to do is to first validate the way that IA64
> handles recursive TLB misses occurring during an initial
> TLB miss, and if there are any limitations therein.

I am familiar with that area and I am reasonably sure that this 
is an issue on IA64 under some conditions (the processor decides to spill 
some registers either onto the stack or into the register backing store 
during tlb processing). Recursion (in the kernel context) still expects 
the stack and register backing store to be available. ccing linux-ia64 for 
any thoughts to the contrary.

The move to 64k page size on IA64 is another way that this issue can be 
addressed though. So I think its best to drop the IA64 portion.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [04/14] vcompound: Core piece
  2008-03-22 12:10     ` KOSAKI Motohiro
@ 2008-03-24 18:28       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 18:28 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel

On Sat, 22 Mar 2008, KOSAKI Motohiro wrote:

> > +struct page *alloc_vcompound_alloc(gfp_t flags, int order);
> 
> Where does alloc_vcompound_alloc exist?

Duh... alloc_vcompound is not used at this point. Typo. The _alloc needs 
to be cut off.

> Hmm,
> IMHO we need more vcompound documentation for beginners in the Documentation/ directory.
> If not, nobody will understand the meaning of the vcompound flag in /proc/vmallocinfo.

Ok.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [00/14] Virtual Compound Page Support V3
  2008-03-22 18:40   ` Arjan van de Ven
@ 2008-03-24 18:31     ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 18:31 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-mm, linux-kernel

On Sat, 22 Mar 2008, Arjan van de Ven wrote:

> can you document the drawback of large, frequent vmalloc() allocations at least?

OK. Let's add some documentation about this issue and some other 
things. A similar suggestion was made by Kosaki-san.

> On 32-bit x86, the effective vmalloc space is 64MB or so (after various PCI BARs are ioremapped),
> so if this type of allocation is used for a "scales with nr of ABC" where "ABC" is workload dependent,
> there's a rather abrupt upper limit to this.
> Not saying that that is a flaw of your patch, just pointing out that we should discourage usage of 
> the "scales with nr of ABC" (for example "one for each thread") kind of things.

I had better take out any patches that do large-scale allocs then.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [00/14] Virtual Compound Page Support V3
  2008-03-24 18:31     ` Christoph Lameter
@ 2008-03-24 19:29       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 19:29 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-mm, linux-kernel

On the other hand: The conversion of vmalloc to vcompound_alloc will 
reduce the amount of virtually mapped space needed. Doing that to 
alloc_large_system_hash() etc may help there.



^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 22:30     ` Andi Kleen
@ 2008-03-24 19:53       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 19:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Fri, 21 Mar 2008, Andi Kleen wrote:

> The traditional reason this was discouraged (people seem to reinvent
> variants of this patch all the time) was that there used 
> to be drivers that did __pa() (or equivalent) on stack addresses
> and that doesn't work with vmalloc pages.
> 
> I don't know if such drivers still exist, but such a change
> is certainly not a no-brainer

I thought that had been cleaned up because some arches already have 
virtually mapped stacks? This could be debugged by testing with
CONFIG_VFALLBACK_ALWAYS set, which results in a stack that is always 
vmalloc'ed; such a driver should then fail.



^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-21 21:25     ` Andi Kleen
@ 2008-03-24 19:54       ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 19:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Fri, 21 Mar 2008, Andi Kleen wrote:

> But I used a simple trick to avoid the waste problem: it allocated a
> contiguous range rounded up to the next page-size order and then freed
> the excess pages back into the page allocator. That was called
> alloc_exact(). If you replace vmalloc with alloc_pages you should
> use something like that too I think.

One way of dealing with it would be to define an additional allocation 
variant that allows limiting the loss. I noted that both the swap
and the wait tables vary significantly in size between allocations. So we 
could specify an upper bound on the loss that is acceptable. If too much
memory would be lost then use vmalloc unconditionally.

---
 include/linux/vmalloc.h |   12 ++++++++----
 mm/page_alloc.c         |    4 ++--
 mm/swapfile.c           |    4 ++--
 mm/vmalloc.c            |   34 ++++++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 8 deletions(-)

Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h	2008-03-24 12:51:47.457231129 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h	2008-03-24 12:52:05.449313572 -0700
@@ -88,14 +88,18 @@ extern void free_vm_area(struct vm_struc
 /*
  * Support for virtual compound pages.
  *
- * Calls to vcompound alloc will result in the allocation of normal compound
- * pages unless memory is fragmented.  If insufficient physical linear memory
- * is available then a virtually contiguous area of memory will be created
- * using the vmalloc functionality.
+ * Calls to vcompound_alloc and friends will result in the allocation of
+ * a normal physically contiguous compound page unless memory is fragmented.
+ * If insufficient physical linear memory is available then a virtually
+ * contiguous area of memory will be created using vmalloc.
  */
 struct page *alloc_vcompound(gfp_t flags, int order);
+struct page *alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+					unsigned long maxloss);
 void free_vcompound(struct page *);
 void *__alloc_vcompound(gfp_t flags, int order);
+void *__alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+					unsigned long maxloss);
 void __free_vcompound(void *addr);
 struct page *vcompound_head_page(const void *x);
 
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c	2008-03-24 12:51:47.485231279 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c	2008-03-24 12:52:05.453313419 -0700
@@ -1198,3 +1198,37 @@ void *__alloc_vcompound(gfp_t flags, int
 
 	return NULL;
 }
+
+/*
+ * Functions to avoid losing memory because of the rounding up to
+ * power-of-two sizes for compound page allocation. If the loss would
+ * be too great then use vmalloc regardless of the fragmentation
+ * situation.
+ */
+struct page *alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+							unsigned long maxloss)
+{
+	int order = get_order(size);
+	unsigned long loss = (PAGE_SIZE << order) - size;
+	void *addr;
+
+	if (loss < maxloss)
+		return alloc_vcompound(flags, order);
+
+	addr = __vmalloc(size, flags, PAGE_KERNEL);
+	if (!addr)
+		return NULL;
+	return vmalloc_to_page(addr);
+}
+
+void *__alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+							unsigned long maxloss)
+{
+	int order = get_order(size);
+	unsigned long loss = (PAGE_SIZE << order) - size;
+
+	if (loss < maxloss)
+		return __alloc_vcompound(flags, order);
+
+	return __vmalloc(size, flags, PAGE_KERNEL);
+}
Index: linux-2.6.25-rc5-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/swapfile.c	2008-03-24 12:52:05.441314302 -0700
+++ linux-2.6.25-rc5-mm1/mm/swapfile.c	2008-03-24 12:52:05.453313419 -0700
@@ -1636,8 +1636,8 @@ asmlinkage long sys_swapon(const char __
 			goto bad_swap;
 
 		/* OK, set up the swap map and apply the bad block list */
-		if (!(p->swap_map = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
-					get_order(maxpages * sizeof(short))))) {
+		if (!(p->swap_map = __alloc_vcompound_maxloss(GFP_KERNEL | __GFP_ZERO,
+					maxpages * sizeof(short), 16 * PAGE_SIZE))) {
 			error = -ENOMEM;
 			goto bad_swap;
 		}
Index: linux-2.6.25-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/page_alloc.c	2008-03-24 12:52:05.389313168 -0700
+++ linux-2.6.25-rc5-mm1/mm/page_alloc.c	2008-03-24 12:52:07.493322559 -0700
@@ -2866,8 +2866,8 @@ int zone_wait_table_init(struct zone *zo
 		 * To use this new node's memory, further consideration will be
 		 * necessary.
 		 */
-		zone->wait_table = __alloc_vcompound(GFP_KERNEL,
-						get_order(alloc_size));
+		zone->wait_table = __alloc_vcompound_maxloss(GFP_KERNEL,
+				alloc_size, 32 * PAGE_SIZE);
 	}
 	if (!zone->wait_table)
 		return -ENOMEM;

^ permalink raw reply	[flat|nested] 212+ messages in thread

* larger default page sizes...
  2008-03-24 18:27           ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
  (?)
@ 2008-03-24 20:37             ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-24 20:37 UTC (permalink / raw)
  To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)

> The move to 64k page size on IA64 is another way that this issue can
> be addressed though.

This is such a huge mistake I wish platforms such as powerpc and IA64
would not make such decisions so lightly.

The memory wastage is just ridiculous.

I already see several distributions moving to 64K pages for powerpc,
so I want to nip this in the bud before this monkey-see-monkey-do
thing gets any more out of hand.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-24 20:37             ` David Miller, Christoph Lameter
  (?)
@ 2008-03-24 21:05               ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-24 21:05 UTC (permalink / raw)
  To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

On Mon, 24 Mar 2008, David Miller wrote:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> 
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
> 
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

It's certainly not a light decision if your customer tells you that the box 
is almost unusable with a 16k page size. For our new 2k- and 4k-processor 
systems this seems to be a requirement. Customers have started hacking 
SLES10 to run with 64k pages....

> The memory wastage is just ridiculous.

Well, yes, if you use such a box for kernel compiles and small files 
then it's a bad move. However, if you have to process terabytes of data 
then this significantly reduces the VM and I/O overhead.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

powerpc also runs HPC codes. They certainly see the same results that we 
see.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-24 18:27           ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
  (?)
@ 2008-03-24 21:13             ` Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: Luck, Tony @ 2008-03-24 21:13 UTC (permalink / raw)
  To: Christoph Lameter, David Miller; +Cc: linux-mm, linux-kernel, linux-ia64

> I am familiar with that area and I am reasonably sure that this 
> is an issue on IA64 under some conditions (the processor decides to spill 
> some registers either onto the stack or into the register backing store 
> during TLB processing). Recursion (in the kernel context) still expects 
> the stack and register backing store to be available. Ccing linux-ia64 for 
> any thoughts to the contrary.

Christoph is correct ... IA64 pins the TLB entry for the kernel stack
(which covers both the normal C stack and the register backing store)
so that it won't have to deal with a TLB miss on the stack while handling
another TLB miss.

-Tony

^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: larger default page sizes...
  2008-03-24 20:37             ` David Miller, Christoph Lameter
  (?)
@ 2008-03-24 21:25               ` Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: Luck, Tony @ 2008-03-24 21:25 UTC (permalink / raw)
  To: David Miller, clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

> The memory wastage is just ridiculous.

In an ideal world we'd have variable-sized pages ... but
since most architectures have no h/w support for these
it may be a long time before that comes to Linux.

In a fixed page size world the right page size to use
depends on the workload and the capacity of the system.

When memory capacity is measured in hundreds of GB, then
a larger page size doesn't look so ridiculous.

-Tony

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:05               ` Christoph Lameter
  (?)
@ 2008-03-24 21:43                 ` David Miller, Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-24 21:43 UTC (permalink / raw)
  To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Mon, 24 Mar 2008 14:05:02 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
> 
> > From: Christoph Lameter <clameter@sgi.com>
> > Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> > 
> > > The move to 64k page size on IA64 is another way that this issue can
> > > be addressed though.
> > 
> > This is such a huge mistake I wish platforms such as powerpc and IA64
> > would not make such decisions so lightly.
> 
> It's certainly not a light decision if your customer tells you that the box 
> is almost unusable with 16k page size. For our new 2k and 4k processor 
> systems this seems to be a requirement. Customers start hacking SLES10 to 
> run with 64k pages....

We should fix the underlying problems.

I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
stuff like contention on the per-zone page allocator locks.

Which is very fixable, without going to larger pages.

> powerpc also runs HPC codes. They certainly see the same results
> that we see.

There are ways to get large pages into the process address space for
compute bound tasks, without suffering the well known negative side
effects of using larger pages for everything.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:25               ` Luck, Tony
  (?)
@ 2008-03-24 21:46                 ` David Miller, Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-24 21:46 UTC (permalink / raw)
  To: tony.luck; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: "Luck, Tony" <tony.luck@intel.com>
Date: Mon, 24 Mar 2008 14:25:11 -0700

> When memory capacity is measured in hundreds of GB, then
> a larger page size doesn't look so ridiculous.

We have hugepages and such for a reason.  And this can be
made more dynamic and flexible, as needed.

Increasing the page size is a "stick your head in the sand"
type solution by my book.

Especially when you can make the hugepage facility stronger
and thus get what you want without the memory wastage side
effects.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-24 20:37             ` David Miller, Christoph Lameter
  (?)
@ 2008-03-25  3:29               ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-25  3:29 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

David Miller writes:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> 
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
> 
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

The performance advantage of using hardware 64k pages is pretty
compelling, on a wide range of programs, and particularly on HPC apps.

> The memory wastage is just ridiculous.

Depends on the distribution of file sizes you have.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

I just tried a kernel compile on a 4.2GHz POWER6 partition with 4
threads (2 cores) and 2GB of RAM, with two kernels.  One was
configured with 4kB pages and the other with 64kB pages, but they
were otherwise identically configured.  Here are the times for the
same kernel compile (total time across all threads, for a fairly
full-featured config):

4kB pages:	444.051s user + 34.406s system time
64kB pages:	419.963s user + 16.869s system time

That's nearly 10% faster with 64kB pages -- on a kernel compile.

Yes, the fragmentation in the page cache can be a pain in some
circumstances, but on the whole I think the performance advantage is
worth that pain, particularly for the sort of applications that people
will tend to be running on RHEL on Power boxes.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29               ` Paul Mackerras
  (?)
@ 2008-03-25  4:15                 ` David Miller, Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-25  4:15 UTC (permalink / raw)
  To: paulus; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: Paul Mackerras <paulus@samba.org>
Date: Tue, 25 Mar 2008 14:29:55 +1100

> The performance advantage of using hardware 64k pages is pretty
> compelling, on a wide range of programs, and particularly on HPC apps.

Please read the rest of my responses in this thread, you
can have your HPC cake and eat it too.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-24 19:53       ` Christoph Lameter
@ 2008-03-25  7:51         ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-25  7:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel, viro

On Mon, Mar 24, 2008 at 12:53:19PM -0700, Christoph Lameter wrote:
> On Fri, 21 Mar 2008, Andi Kleen wrote:
> 
> > The traditional reason this was discouraged (people seem to reinvent
> > variants of this patch all the time) was that there used 
> > to be drivers that did __pa() (or equivalent) on stack addresses
> > and that doesn't work with vmalloc pages.
> > 
> > I don't know if such drivers still exist, but such a change
> > is certainly not a no-brainer
> 
> I thought that had been cleaned up because some arches already have 

Someone posted a patch recently that showed that the cdrom layer
does it. Might be more. It is hard to audit a few million lines
of driver code.

> virtually mapped stacks? This could be debugged by testing with
> CONFIG_VFALLBACK_ALWAYS set. Which results in a stack that is always 
> vmalloc'ed and thus the driver should fail.

It might be a subtle failure.

Maybe sparse could be taught to check for this if it happens
in a single function? (cc'ing Al who might have some thoughts
on this). Of course if it happens spread out over multiple
functions, sparse wouldn't help either.

-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-24 19:54       ` Christoph Lameter
@ 2008-03-25  7:52         ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-25  7:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Mon, Mar 24, 2008 at 12:54:54PM -0700, Christoph Lameter wrote:
> On Fri, 21 Mar 2008, Andi Kleen wrote:
> 
> > But I used a simple trick to avoid the waste problem: it allocated a
> > continuous range rounded up to the next page-size order and then freed
> > the excess pages back into the page allocator. That was called
> > alloc_exact(). If you replace vmalloc with alloc_pages you should
> > use something like that too I think.
> 
> One way of dealing with it would be to define an additional allocation 
> variant that allows the limiting of the loss? I noted that both the swap
> and the wait tables vary significantly between allocations. So we could 
> specify an upper boundary of a loss that is acceptable. If too much memory
> would be lost then use vmalloc unconditionally.

I liked your idea of fixing compound pages to not rely on order
better. OK, it is likely more work to implement. :-)

Also, if anything, preserving memory should be the default, but maybe
skippable with a __GFP_GO_FAST flag.

-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25  4:15                 ` David Miller, Paul Mackerras
  (?)
@ 2008-03-25 11:50                   ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-25 11:50 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

David Miller writes:

> From: Paul Mackerras <paulus@samba.org>
> Date: Tue, 25 Mar 2008 14:29:55 +1100
> 
> > The performance advantage of using hardware 64k pages is pretty
> > compelling, on a wide range of programs, and particularly on HPC apps.
> 
> Please read the rest of my responses in this thread, you
> can have your HPC cake and eat it too.

It's not just HPC, as I pointed out, it's pretty much everything,
including kernel compiles.  And "use hugepages" is a pretty inadequate
answer given the restrictions of hugepages and the difficulty of using
them.  How do I get gcc to use hugepages, for instance?  Using 64k
pages gives us a performance boost for almost everything without the
user having to do anything.

If the hugepage stuff was in a state where it enabled large pages to
be used for mapping an existing program, where possible, without any
changes to the executable, then I would agree with you.  But it isn't,
it's a long way from that, and (as I understand it) Linus has in the
past opposed the suggestion that we should move in that direction.

Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29               ` Paul Mackerras
  (?)
@ 2008-03-25 12:05                 ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-25 12:05 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Paul Mackerras <paulus@samba.org> writes:
> 
> 4kB pages:	444.051s user + 34.406s system time
> 64kB pages:	419.963s user + 16.869s system time
> 
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Do you have some idea where the improvement mainly comes from?
Is it TLB misses or reduced in kernel overhead? Ok I assume both
play together but which part of the equation is more important?

-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-24 21:13             ` Luck, Tony
  (?)
@ 2008-03-25 17:42               ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:42 UTC (permalink / raw)
  To: Luck, Tony; +Cc: David Miller, linux-mm, linux-kernel, linux-ia64

On Mon, 24 Mar 2008, Luck, Tony wrote:

> > I am familiar with that area and I am reasonably sure that this 
> > is an issue on IA64 under some conditions (the processor decides to spill 
> > some registers either onto the stack or into the register backing store 
> > during tlb processing). Recursion (in the kernel context) still expects 
> > the stack and register backing store to be available. ccing linux-ia64 for 
> > any thoughts to the contrary.
> 
> Christoph is correct ... IA64 pins the TLB entry for the kernel stack
> (which covers both the normal C stack and the register backing store)
> so that it won't have to deal with a TLB miss on the stack while handling
> another TLB miss.

I thought the only pinned TLB entry was for the per cpu area? How does it 
pin the TLB entry? The expectation is that a single TLB entry covers the complete 
stack area? Is that a feature of fault handling?


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-25  7:52         ` Andi Kleen
@ 2008-03-25 17:45           ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 25 Mar 2008, Andi Kleen wrote:

> I liked your idea of fixing compound pages to not rely on order
> better. Ok it is likely more work to implement @)

Right. It just requires a page allocator rewrite, which is overdue 
anyway given the fastpath issues. Volunteers?

> Also if anything preserving memory should be default, but maybe
> skippable a with __GFP_GO_FAST flag.

Well. Guess we need a definition of preserving memory. All allocations 
typically have some kind of overhead.



^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:43                 ` David Miller
@ 2008-03-25 17:48                   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:48 UTC (permalink / raw)
  To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

On Mon, 24 Mar 2008, David Miller wrote:

> We should fix the underlying problems.
> 
> I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
> stuff like contention on the per-zone page allocator locks.
> 
> Which is very fixable, without going to larger pages.

No, it's not fixable. You are applying linear optimizations to a slowdown that 
grows exponentially. Going just one order up in page size halves the number of
pages the kernel has to lock and manage.
 
> > powerpc also runs HPC codes. They certainly see the same results
> > that we see.
> 
> There are ways to get large pages into the process address space for
> compute bound tasks, without suffering the well known negative side
> effects of using larger pages for everything.

These hacks have limitations. E.g. they do not deal with I/O, and they 
require application changes.
 

^ permalink raw reply	[flat|nested] 212+ messages in thread
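[Editorial sketch: the "50%" figure in the mail above is simply the arithmetic of page counts. A toy illustration (the memory size here is invented, not from the thread) of how many base pages the kernel must track for a fixed amount of memory as the page size goes up one order at a time:]

```python
# Illustrative arithmetic only: the number of page structs the kernel
# tracks (and the lock/handling work that scales with it) halves each
# time the base page size doubles.

def pages_to_manage(mem_bytes, page_size):
    """Number of base pages the allocator has to account for."""
    return mem_bytes // page_size

GiB = 1 << 30
mem = 64 * GiB                            # hypothetical machine size

for shift in (12, 13, 14, 16):            # 4K, 8K, 16K, 64K pages
    size = 1 << shift
    print(f"{size:>6}-byte pages: {pages_to_manage(mem, size):>9}")
```

Going from order 0 (4K) to order 1 (8K) pages cuts the count from 16777216 to 8388608, which is the halving Christoph refers to.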

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-25 17:55             ` Andi Kleen
@ 2008-03-25 17:51               ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 25 Mar 2008, Andi Kleen wrote:

> Not when the trick of getting high order, returning left over pages
> is used. I meant just updating the GFP_COMPOUND code to always
> use number of pages instead of order so that it could deal with a compound
> where the excess pages are already returned. That is not actually that 
> much work (I reimplemented this recently for dma alloc and it's < 20 LOC) 

Would you post the patch here?

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-25  7:51         ` Andi Kleen
@ 2008-03-25 17:55           ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel, viro

On Tue, 25 Mar 2008, Andi Kleen wrote:

> Maybe sparse could be taught to check for this if it happens
> in a single function? (cc'ing Al who might have some thoughts
> on this). Of course if it happens spread out over multiple
> functions sparse wouldn't help either. 

We could add debugging code to virt_to_page (or __pa) to catch these uses.


^ permalink raw reply	[flat|nested] 212+ messages in thread
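[Editorial sketch: the debug check Christoph proposes amounts to trapping any linear virt-to-page arithmetic applied to a vmalloc address. All constants and the helper below are invented for illustration; the real ranges and `virt_to_page()` are architecture-specific:]

```python
# Hypothetical model of a virt_to_page() debug check: the linear
# virt -> page-frame arithmetic is only valid for the identity-mapped
# region, so vmalloc addresses must be rejected (BUG_ON/WARN_ON in a
# real kernel).  All addresses here are made up.

PAGE_SHIFT    = 12
PAGE_OFFSET   = 0xFFFF880000000000     # start of the linear map
VMALLOC_START = 0xFFFFC90000000000
VMALLOC_END   = 0xFFFFE90000000000

def virt_to_page(addr):
    if VMALLOC_START <= addr < VMALLOC_END:
        # Caller is doing linear-map arithmetic on virtually mapped
        # memory -- exactly the misuse the debug code would catch.
        raise ValueError("virt_to_page() on a vmalloc address")
    return (addr - PAGE_OFFSET) >> PAGE_SHIFT   # page frame number

print(virt_to_page(PAGE_OFFSET + 5 * 4096))     # linear map: fine
```

A static checker (sparse) could catch the single-function cases at compile time; a runtime check like this catches the cross-function ones, at the cost of needing the bad path to actually execute.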

* Re: [13/14] vcompound: Use vcompound for swap_map
  2008-03-25 17:45           ` Christoph Lameter
@ 2008-03-25 17:55             ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-25 17:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Mar 25, 2008 at 10:45:06AM -0700, Christoph Lameter wrote:
> On Tue, 25 Mar 2008, Andi Kleen wrote:
> 
> > I liked your idea of fixing compound pages to not rely on order
> > better. Ok it is likely more work to implement @)
> 
> Right. It just requires a page allocator rewrite. 

Not when the trick of allocating a high order and returning the left-over pages
is used. I meant just updating the GFP_COMPOUND code to always
use number of pages instead of order so that it could deal with a compound
where the excess pages are already returned. That is not actually that 
much work (I reimplemented this recently for dma alloc and it's < 20 LOC) 

Of course the full rewrite would be also great, agreed :)

-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread
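[Editorial sketch: the trick Andi describes, rounding an N-page request up to the next power-of-two order and immediately freeing the unused tail, can be modeled in a few lines. This is a toy user-space stand-in; the helper name echoes, but is not, a kernel API:]

```python
# Toy model of "allocate high order, return the left-over pages":
# satisfy an exact N-page request from an order-based (power-of-two)
# allocator without wasting the tail of the block.

def order_for(pages):
    """Smallest order such that 2**order >= pages."""
    order = 0
    while (1 << order) < pages:
        order += 1
    return order

def alloc_pages_exact(pages):
    order = order_for(pages)
    block = list(range(1 << order))       # stand-in for 2**order pages
    kept, leftover = block[:pages], block[pages:]
    # free_pages(leftover) in the real thing: the excess goes straight
    # back to the allocator, so a 33-page request does not pin a whole
    # 64-page block.
    return kept, len(leftover)

kept, freed = alloc_pages_exact(33)
print(len(kept), freed)                   # 33 kept, 31 returned
```

The compound-page bookkeeping would then have to record the number of pages rather than the order, which is the < 20 LOC change Andi refers to.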

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-25 17:55           ` Christoph Lameter
@ 2008-03-25 18:07             ` Andi Kleen
  -1 siblings, 0 replies; 212+ messages in thread
From: Andi Kleen @ 2008-03-25 18:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel, viro

On Tue, Mar 25, 2008 at 10:55:06AM -0700, Christoph Lameter wrote:
> On Tue, 25 Mar 2008, Andi Kleen wrote:
> 
> > Maybe sparse could be taught to check for this if it happens
> > in a single function? (cc'ing Al who might have some thoughts
> > on this). Of course if it happens spread out over multiple
> > functions sparse wouldn't help neither. 
> 
> We could add debugging code to virt_to_page (or __pa) to catch these uses.

Hard to test all cases. Static checking would be better.

Or just not do it? I didn't think order 1 failures were that big a problem.


-Andi

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29               ` Paul Mackerras
@ 2008-03-25 18:27                 ` Dave Hansen
  -1 siblings, 0 replies; 212+ messages in thread
From: Dave Hansen @ 2008-03-25 18:27 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

On Tue, 2008-03-25 at 14:29 +1100, Paul Mackerras wrote:
> 4kB pages:      444.051s user + 34.406s system time
> 64kB pages:     419.963s user + 16.869s system time
> 
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Can you do the same thing with the 4k MMU pages and 64k PAGE_SIZE?
Wouldn't that easily break out whether the advantage is from the TLB or
from less kernel overhead?

-- Dave


^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-25 17:42               ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
@ 2008-03-25 19:09                 ` Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: Luck, Tony @ 2008-03-25 19:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, linux-mm, linux-kernel, linux-ia64

> I thought the only pinned TLB entry was for the per-cpu area? How does it 
> pin the entry? The expectation is that a single TLB entry covers the complete 
> stack area? Is that a feature of the fault handling?

Pinning TLB entries on ia64 is done using TR registers with the "itr"
instruction.  Currently we have the following pinned mappings:

itr[0] : maps kernel code.  64MB page at virtual 0xA000000100000000
dtr[0] : maps kernel data.  64MB page at virtual 0xA000000100000000

itr[1] : maps PAL code as required by architecture

dtr[1] : maps an area of region 7 that spans kernel stack
         page size is kernel granule size (default 16M).
         This mapping needs to be reset on a context switch
         where we move to a stack in a different granule.

We used to use dtr[2] to map the 64K per-cpu area at 0xFFFFFFFFFFFF0000
but Ken Chen found that performance was better to use a dynamically
inserted DTC entry from the Alt-TLB miss handler which allows this
entry in the TLB to be available for generic use (on most processor
models).

-Tony


^ permalink raw reply	[flat|nested] 212+ messages in thread
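[Editorial sketch: Tony's "move to a stack in a different granule" condition reduces to comparing the old and new stack addresses above the granule shift. A toy model, assuming the default 16MB granule; names and addresses are illustrative:]

```python
# Toy model of the context-switch check implied above: the pinned dtr
# entry covers one kernel granule (16MB by default), so it only needs
# re-inserting when the incoming task's stack lives in a different
# granule than the outgoing one.

GRANULE_SHIFT = 24                        # 16MB granule

def needs_tr_reload(old_stack, new_stack):
    """True iff the two stacks sit in different granules."""
    return (old_stack ^ new_stack) >> GRANULE_SHIFT != 0

a = 0xA000000101000000                    # some region-7 stack address
b = a + 0x4000                            # another stack, same granule
c = a + (1 << GRANULE_SHIFT)              # stack in the next granule

print(needs_tr_reload(a, b), needs_tr_reload(a, c))   # False True
```

This is why a single pinned entry can cover the whole stack plus register backing store: both live inside one granule-sized mapping.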

* RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-25 19:09                 ` Luck, Tony
@ 2008-03-25 19:25                   ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-25 19:25 UTC (permalink / raw)
  To: Luck, Tony; +Cc: David Miller, linux-mm, linux-kernel, linux-ia64

On Tue, 25 Mar 2008, Luck, Tony wrote:

> dtr[1] : maps an area of region 7 that spans kernel stack
>          page size is kernel granule size (default 16M).
>          This mapping needs to be reset on a context switch
>          where we move to a stack in a different granule.

Interesting.... Never realized we were doing these tricks with DTR.


^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 12:05                 ` Andi Kleen
@ 2008-03-25 21:27                   ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-25 21:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Andi Kleen writes:

> Paul Mackerras <paulus@samba.org> writes:
> > 
> > 4kB pages:	444.051s user + 34.406s system time
> > 64kB pages:	419.963s user + 16.869s system time
> > 
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
> 
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in kernel overhead? Ok I assume both
> play together but which part of the equation is more important?

I think that to a first approximation, the improvement in user time
(24 seconds) is due to the increased TLB reach and reduced TLB misses,
and the improvement in system time (18 seconds) is due to the reduced
number of page faults and reductions in other kernel overheads.

As Dave Hansen points out, I can separate the two effects by having
the kernel use 64k pages at the VM level but 4k pages in the hardware
page table, which is easy since we have support for 64k base page size
on machines that don't have hardware 64k page support.  I'll do that
today.

Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread
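[Editorial note: working the numbers quoted in this subthread through makes Paul's attribution concrete, i.e. the user-time delta (TLB reach) versus the system-time delta (page faults and other kernel overhead):]

```python
# The arithmetic behind the kernel-compile numbers quoted upthread.

user_4k, sys_4k   = 444.051, 34.406      # 4kB pages
user_64k, sys_64k = 419.963, 16.869      # 64kB pages

total_4k  = user_4k + sys_4k
total_64k = user_64k + sys_64k

print(f"user time saved:   {user_4k - user_64k:6.3f}s  (TLB reach)")
print(f"system time saved: {sys_4k - sys_64k:6.3f}s  (faults/overhead)")
print(f"overall: {100 * (total_4k - total_64k) / total_4k:.1f}% faster")
```

That comes out to roughly 24s of user time and 17.5s of system time saved, about 8.7% overall, consistent with the "nearly 10%" quoted. The 64k-VM/4k-MMU experiment Paul proposes would separate the two contributions.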

* Re: larger default page sizes...
  2008-03-25 17:48                   ` Christoph Lameter
@ 2008-03-25 23:22                     ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-25 23:22 UTC (permalink / raw)
  To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Tue, 25 Mar 2008 10:48:19 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
> 
> > There are ways to get large pages into the process address space for
> > compute bound tasks, without suffering the well known negative side
> > effects of using larger pages for everything.
> 
> These hacks have limitations. F.e. they do not deal with I/O and 
> require application changes.

Transparent automatic hugepages are definitely doable; I don't know
why you think this requires application changes.

People want these larger pages for HPC apps.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 11:50                   ` Paul Mackerras
@ 2008-03-25 23:32                     ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-25 23:32 UTC (permalink / raw)
  To: paulus; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: Paul Mackerras <paulus@samba.org>
Date: Tue, 25 Mar 2008 22:50:00 +1100

> How do I get gcc to use hugepages, for instance?

Implementing transparent automatic usage of hugepages has been
discussed many times, it's definitely doable and other OSs have
implemented this for years.

This is what I was implying.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:22                     ` David Miller
@ 2008-03-25 23:41                       ` Peter Chubb
  -1 siblings, 0 replies; 212+ messages in thread
From: Peter Chubb @ 2008-03-25 23:41 UTC (permalink / raw)
  To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Christoph Lameter <clameter@sgi.com> Date: Tue, 25 Mar
David> 2008 10:48:19 -0700 (PDT)

>> On Mon, 24 Mar 2008, David Miller wrote:
>> 
>> > There are ways to get large pages into the process address space
>> for > compute bound tasks, without suffering the well known
>> negative side > effects of using larger pages for everything.
>> 
>> These hacks have limitations. F.e. they do not deal with I/O and
>> require application changes.

David> Transparent automatic hugepages are definitely doable, I don't
David> know why you think this requires application changes.

It's actually harder than it looks.  Ian Wienand just finished his
Master's project in this area, so we have *lots* of data.  The main
issue is that, at least on Itanium, you have to turn off the hardware
page table walker for hugepages if you want to mix superpages and
standard pages in the same region. (The long format VHPT isn't the
panacea we'd like it to be because the hash function it uses depends
on the page size).  This means that although you have fewer TLB misses
with larger pages, the cost of those TLB misses is three to four times
higher than with the standard pages.  In addition, setting up a large
page takes more effort, and it turns out there are few applications
where the cost is amortised enough: on SpecCPU, for example, some
tests improved performance slightly and some got slightly worse.

What we saw was essentially that we could almost eliminate DTLB misses,
other than the first, for a huge page.  For most applications, though,
the extra cost of that first miss, plus the cost of setting up the
huge page, was greater than the few hundred DTLB misses we avoided.

I'm expecting Ian to publish the full results soon.

Other architectures (where the page size isn't tied into the hash
function, so the hardware walker can be used for superpages) will have
different tradeoffs.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: larger default page sizes...
  2008-03-25 23:32                     ` David Miller
@ 2008-03-25 23:49                       ` Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: Luck, Tony @ 2008-03-25 23:49 UTC (permalink / raw)
  To: David Miller, paulus
  Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

> > How do I get gcc to use hugepages, for instance?
>
> Implementing transparent automatic usage of hugepages has been
> discussed many times, it's definitely doable and other OSs have
> implemented this for years.
>
> This is what I was implying.

"large" pages, or "super" pages perhaps ... but Linux "huge" pages
seem pretty hard to adapt for generic use by applications.  They
are generally somewhere between a bit too big (2MB on X86) and
way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.

Right now they also suffer from making the sysadmin pick at
boot time how much memory to allocate as huge pages (while it
is possible to break huge pages into normal pages, going in
the reverse direction requires a memory defragmenter that
doesn't exist).

Making an application use huge pages as heap may be simple
(just link with a different library that provides a different
version of malloc()) ... code, stack, mmap'd files are all
a lot harder to do transparently.

-Tony

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:41                       ` Peter Chubb
@ 2008-03-25 23:49                         ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-25 23:49 UTC (permalink / raw)
  To: peterc; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: Peter Chubb <peterc@gelato.unsw.edu.au>
Date: Wed, 26 Mar 2008 10:41:32 +1100

> It's actually harder than it looks.  Ian Wienand just finished his
> Master's project in this area, so we have *lots* of data.  The main
> issue is that, at least on Itanium, you have to turn off the hardware
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region. (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size).  This means that although you have fewer TLB misses
> with larger pages, the cost of those TLB misses is three to four times
> higher than with the standard pages.

If the hugepage is more than 3 to 4 times larger than the base
page size, which it almost certainly is, it's still an enormous
win.

> Other architectures (where the page size isn't tied into the hash
> function, so the hardware walked can be used for superpages) will have
> different tradeoffs.

Right, admittedly this is just one of many strange IA64 quirks.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49                       ` Luck, Tony
@ 2008-03-26  0:16                         ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-26  0:16 UTC (permalink / raw)
  To: tony.luck; +Cc: paulus, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: "Luck, Tony" <tony.luck@intel.com>
Date: Tue, 25 Mar 2008 16:49:23 -0700

> Making an application use huge pages as heap may be simple
> (just link with a different library to provide with a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.

The kernel should be able to do this transparently, at the
very least for the anonymous page case.  It should also be able to
handle chips that provide multiple page sizes just fine, as many do.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49                         ` David Miller
@ 2008-03-26  0:25                           ` Peter Chubb
  -1 siblings, 0 replies; 212+ messages in thread
From: Peter Chubb @ 2008-03-26  0:25 UTC (permalink / raw)
  To: David Miller
  Cc: peterc, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Peter Chubb <peterc@gelato.unsw.edu.au> Date: Wed, 26 Mar
David> 2008 10:41:32 +1100

>> It's actually harder than it looks.  Ian Wienand just finished his
>> Master's project in this area, so we have *lots* of data.  The main
>> issue is that, at least on Itanium, you have to turn off the
>> hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size).  This means that although you
>> have fewer TLB misses with larger pages, the cost of those TLB
>> misses is three to four times higher than with the standard pages.

David> If the hugepage is more than 3 to 4 times larger than the base
David> page size, which it almost certainly is, it's still an enormous
David> win.

That depends on the access pattern.  We measured a small win for some
workloads, and a small loss for others, using 4k base pages, and
allowing up to 4G superpages (the actual sizes used depended on the
size of the objects being allocated, and the amount of contiguous
memory available).

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:25                           ` Peter Chubb
@ 2008-03-26  0:31                             ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-26  0:31 UTC (permalink / raw)
  To: peterc; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: Peter Chubb <peterc@gelato.unsw.edu.au>
Date: Wed, 26 Mar 2008 11:25:58 +1100

> That depends on the access pattern.

Absolutely.

FWIW, I bet it helps enormously for gcc which, even for
small compiles, swims around chaotically in an 8MB pool
of GC'd memory.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:41                       ` Peter Chubb
@ 2008-03-26  0:34                         ` David Mosberger-Tang
  -1 siblings, 0 replies; 212+ messages in thread
From: David Mosberger-Tang @ 2008-03-26  0:34 UTC (permalink / raw)
  To: Peter Chubb
  Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64,
	torvalds, ianw

On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb <peterc@gelato.unsw.edu.au> wrote:
>  The main issue is that, at least on Itanium, you have to turn off the hardware
>  page table walker for hugepages if you want to mix superpages and
>  standard pages in the same region. (The long format VHPT isn't the
>  panacea we'd like it to be because the hash function it uses depends
>  on the page size).

Why not just repeat the PTEs for super-pages?  That won't work for
huge pages, but for superpages that are a reasonable multiple (e.g.,
16 times) of the base-page size, it should work nicely.
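The PTE-repetition idea can be sketched as follows. The PTE layout below is invented for illustration (it is not the real ia64 or sparc64 format), and `map_superpage` is a hypothetical helper: a 64KB superpage backed by 4KB base PTEs simply repeats the same translation across the 16 consecutive slots it covers, so a base-page-sized hardware walk resolves correctly at any offset inside the superpage.

```c
#include <stddef.h>
#include <stdint.h>

#define BASE_SHIFT  12   /* 4KB base pages */
#define SUPER_SHIFT 16   /* 64KB superpage = 16 base pages */
#define PTES_PER_SUPER (1u << (SUPER_SHIFT - BASE_SHIFT))

/* Toy PTE: a physical frame number plus a superpage hint bit. */
typedef struct { uint64_t pfn; unsigned super : 1; } pte_t;

/* Repeat the translation across all base-page slots the superpage
 * covers: slot i maps base frame super_pfn + i, so the hardware
 * walker finds a valid entry at every base-page index without any
 * special superpage support in the walk itself. */
static void map_superpage(pte_t *table, size_t first_slot, uint64_t super_pfn)
{
    for (size_t i = 0; i < PTES_PER_SUPER; i++) {
        table[first_slot + i].pfn   = super_pfn + i;
        table[first_slot + i].super = 1;
    }
}
```

The cost is one page-table entry per base page rather than per superpage, which is why the trick suits modest multiples of the base size rather than truly huge pages.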

  --david
-- 
Mosberger Consulting LLC, http://www.mosberger-consulting.com/

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:34                         ` David Mosberger-Tang
  (?)
@ 2008-03-26  0:39                           ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-26  0:39 UTC (permalink / raw)
  To: dmosberger
  Cc: peterc, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: "David Mosberger-Tang" <dmosberger@gmail.com>
Date: Tue, 25 Mar 2008 18:34:13 -0600

> Why not just repeat the PTEs for super-pages?

This is basically how we implement hugepages in the page
tables on sparc64.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:34                         ` David Mosberger-Tang
  (?)
@ 2008-03-26  0:57                           ` Peter Chubb
  -1 siblings, 0 replies; 212+ messages in thread
From: Peter Chubb @ 2008-03-26  0:57 UTC (permalink / raw)
  To: David Mosberger-Tang
  Cc: Peter Chubb, David Miller, clameter, linux-mm, linux-kernel,
	linux-ia64, torvalds, ianw

>>>>> "David" == David Mosberger-Tang <dmosberger@gmail.com> writes:

David> On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb
David> <peterc@gelato.unsw.edu.au> wrote:
>> The main issue is that, at least on Itanium, you have to turn off
>> the hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size).

David> Why not just repeat the PTEs for super-pages?  That won't work
David> for huge pages, but for superpages that are a reasonable
David> multiple (e.g., 16-times) the base-page size, it should work
David> nicely.

You end up having to repeat PTEs to fit into Linux's page table
structure *anyway* (unless we can change Linux's page table).  But
there's no place in the short format hardware-walked page table (that
reuses the leaf entries in Linux's table) for a page size.  And if you
use some of the holes in the format, the hardware walker doesn't
understand it --- so you have to turn off the hardware walker for
*any* regions where there might be a superpage.  

If you use the long format VHPT, you have a choice:  load the
hash table with just the translation that caused the miss, load all
possible hash entries that could have caused the miss for the page, or
preload the hash table when the page is instantiated, with all
possible entries that could hash to the huge page.  I don't remember
the details, but I seem to remember all these being bad choices for
one reason or other ... Ian, can you elaborate?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:57                           ` Peter Chubb
  (?)
@ 2008-03-26  4:16                             ` John Marvin
  -1 siblings, 0 replies; 212+ messages in thread
From: John Marvin @ 2008-03-26  4:16 UTC (permalink / raw)
  To: linux-ia64; +Cc: linux-mm, linux-kernel

Peter Chubb wrote:

> 
> You end up having to repeat PTEs to fit into Linux's page table
> structure *anyway* (unless we can change Linux's page table).  But
> there's no place in the short format hardware-walked page table (that
> reuses the leaf entries in Linux's table) for a page size.  And if you
> use some of the holes in the format, the hardware walker doesn't
> understand it --- so you have to turn off the hardware walker for
> *any* regions where there might be a superpage.  

No, you can set an illegal memory attribute in the pte for any superpage entry, 
and leave the hardware walker enabled for the base page size. The software tlb 
miss handler can then install the superpage tlb entry. I posted a working 
prototype of Shimizu superpages working on ia64 using short format vhpt's to the 
linux kernel list a while back.
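
The marker trick John describes might look something like this sketch (the field layout and encodings are invented, not the real ia64 PTE format):

```c
/* Hypothetical sketch: a superpage leaf PTE carries a memory-attribute
 * encoding that the hardware walker rejects, so it faults instead of
 * inserting a base-page translation, and the software miss handler
 * installs the superpage TLB entry itself.  All encodings invented. */
#define PTE_MA_MASK    0x7UL    /* hypothetical attribute field */
#define PTE_MA_ILLEGAL 0x3UL    /* a reserved/illegal encoding  */

enum miss_action { INSERT_BASE_PAGE, INSERT_SUPERPAGE };

static enum miss_action classify_miss(unsigned long pte)
{
	/* the hardware walker only punts to software for entries it
	 * could not insert; the marker tells the handler why */
	if ((pte & PTE_MA_MASK) == PTE_MA_ILLEGAL)
		return INSERT_SUPERPAGE;
	return INSERT_BASE_PAGE;
}
```

The appeal is that the hardware walker stays enabled for the common base-page case and only the (rare) superpage miss takes the software path.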

> 
> If you use the long format VHPT, you have a choice:  load the
> hash table with just the translation that caused the miss, load all
> possible hash entries that could have caused the miss for the page, or
> preload the hash table when the page is instantiated, with all
> possible entries that could hash to the huge page.  I don't remember
> the details, but I seem to remember all these being bad choices for
> one reason or other ... Ian, can you elaborate?

When I was doing measurements of long format vs. short format, the two main 
problems with long format (and why I eventually chose to stick with short 
format) were:

1) There was no easy way of determining what size the long format vhpt cache 
should be automatically, and changing it dynamically would be too painful. 
Different workloads performed better with different size vhpt caches.

2) Regardless of the size, the vhpt cache is duplicated information. Using long 
format vhpt's significantly increased the number of cache misses for some 
workloads. Theoretically there should have been some cases where the long format 
solution would have performed better than the short format solution, but I was 
never able to create such a case. In many cases the performance difference 
between the long format solution and the short format solution was essentially 
the same. In other cases the short format vhpt solution outperformed the long 
format solution, and in those cases there was a significant difference in cache 
misses that I believe explained the performance difference.

John

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  4:16                             ` John Marvin
  (?)
@ 2008-03-26  4:36                               ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-26  4:36 UTC (permalink / raw)
  To: jsm; +Cc: linux-ia64, linux-mm, linux-kernel

From: John Marvin <jsm@fc.hp.com>
Date: Tue, 25 Mar 2008 22:16:00 -0600

> 1) There was no easy way of determining what size the long format vhpt cache 
> should be automatically, and changing it dynamically would be too painful. 
> Different workloads performed better with different size vhpt caches.

This is exactly what sparc64 does, btw: dynamic TLB miss hash table
sizing based upon task RSS.
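
An RSS-driven sizing policy in that spirit could be sketched like this (a hypothetical illustration loosely modeled on sparc64's TSB resizing; the constants and names are invented):

```c
/* Hypothetical sketch: grow the per-task TLB-miss hash table to the
 * smallest power of two giving roughly one entry per resident page,
 * within fixed bounds.  Constants are invented for illustration. */
#define MIN_ENTRIES 512UL
#define MAX_ENTRIES (1UL << 20)

static unsigned long hash_entries_for_rss(unsigned long rss_pages)
{
	unsigned long entries = MIN_ENTRIES;

	/* double until one entry per resident page, or the cap */
	while (entries < rss_pages && entries < MAX_ENTRIES)
		entries <<= 1;
	return entries;
}
```

Tying the table size to RSS sidesteps the problem John describes: no single static size fits all workloads, so the table tracks each task's actual footprint instead.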

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 12:05                 ` Andi Kleen
  (?)
@ 2008-03-26  5:24                   ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-26  5:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Andi Kleen writes:

> Paul Mackerras <paulus@samba.org> writes:
> > 
> > 4kB pages:	444.051s user + 34.406s system time
> > 64kB pages:	419.963s user + 16.869s system time
> > 
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
> 
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in kernel overhead? Ok I assume both
> play together but which part of the equation is more important?

With the kernel configured for a 64k page size, but using 4k pages in
the hardware page table, I get:

64k/4k: 441.723s user + 27.258s system time

So the improvement in the user time is almost all due to the reduced
TLB misses (as one would expect).  For the system time, using 64k
pages in the VM reduces it by about 21%, and using 64k hardware pages
reduces it by another 30%.  So the reduction in kernel overhead is
significant but not as large as the impact of reducing TLB misses.

Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49                       ` Luck, Tony
  (?)
@ 2008-03-26 15:54                         ` Nish Aravamudan
  -1 siblings, 0 replies; 212+ messages in thread
From: Nish Aravamudan @ 2008-03-26 15:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Miller, paulus, clameter, linux-mm, linux-kernel,
	linux-ia64, torvalds, agl, Mel Gorman

On 3/25/08, Luck, Tony <tony.luck@intel.com> wrote:
> > > How do I get gcc to use hugepages, for instance?
>  >
>  > Implementing transparent automatic usage of hugepages has been
>  > discussed many times, it's definitely doable and other OSs have
>  > implemented this for years.
>  >
>  > This is what I was implying.
>
>
> "large" pages, or "super" pages perhaps ... but Linux "huge" pages
>  seem pretty hard to adapt for generic use by applications.  They
>  are generally somewhere between a bit too big (2MB on X86) and
>  way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.
>
>  Right now they also suffer from making the sysadmin pick at
>  boot time how much memory to allocate as huge pages (while it
>  is possible to break huge pages into normal pages, going in
>  the reverse direction requires a memory defragmenter that
>  doesn't exist).

That's not entirely true. We have a dynamic pool now, thanks to Adam
Litke [added to Cc], which can be treated as a high watermark for the
hugetlb pool (and the static pool value serves as a low watermark).
Unless by hugepages you mean something other than what I think (but
referring to a 2M size on x86 implies you are not). And with the
antifragmentation improvements, hugepage pool changes at run-time are
more likely to succeed [added Mel to Cc].
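
The two-watermark pool behaviour described above amounts to a simple clamp; sketched hypothetically (the parameter names are invented for illustration, not the actual hugetlb sysctl interface):

```c
/* Hypothetical sketch of the dynamic hugetlb pool policy described
 * above: the pool may grow on demand up to the dynamic high watermark
 * and shrink again when pages are freed, but never drops below the
 * statically configured low watermark.  Names are invented. */
static unsigned long clamp_pool_size(unsigned long want,
				     unsigned long low_watermark,
				     unsigned long high_watermark)
{
	if (want < low_watermark)
		return low_watermark;	/* static pool is reserved   */
	if (want > high_watermark)
		return high_watermark;	/* never overcommit past cap */
	return want;
}
```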

>  Making an application use huge pages as heap may be simple
>  (just link with a different library to provide a different
>  version of malloc()) ... code, stack, mmap'd files are all
>  a lot harder to do transparently.

I feel like I should promote libhugetlbfs here. We're trying to make
things easier for applications to use. You can back the heap by
hugepages via LD_PRELOAD. But even that isn't always simple (what
happens when something is already allocated on the heap?, which we've
seen happen even in our constructor in the library, for instance).
We're working on hugepage stack support. Text/BSS/Data segment
remapping exists now, too, but does require relinking to be more
successful. We have a mode that allows libhugetlbfs to try to fit the
segments into hugepages, or even just those parts that might fit --
but we have limitations on power and IA64, for instance, where
hugepages are restricted in their placement (either depending on the
process' existing mappings or generally). libhugetlbfs has, at least,
been tested a bit on IA64 to validate the heap backing (IIRC) and the
various kernel tests. We also have basic sparc support -- however, I
don't have any boxes handy to test on (working on getting them added
to our testing grid and then will revisit them), and then one box I
used before gave me semi-spurious soft-lockups (old bug, unclear if it
is software or just buggy hardware).

In any case, my point is people are trying to work on this from
various angles. Both making hugepages more available at run-time (in a
dynamic fashion, based upon need) and making them easier to use for
applications. Is it easy? Not necessarily. Is it guaranteed to work? I
like to think we make a best effort. But as others have pointed out,
it doesn't seem like we're going to get mainline transparent hugepage
support anytime soon.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
@ 2008-03-26 15:54                         ` Nish Aravamudan
  0 siblings, 0 replies; 212+ messages in thread
From: Nish Aravamudan @ 2008-03-26 15:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Miller, paulus, clameter, linux-mm, linux-kernel,
	linux-ia64, torvalds, agl, Mel Gorman

On 3/25/08, Luck, Tony <tony.luck@intel.com> wrote:
> > > How do I get gcc to use hugepages, for instance?
>  >
>  > Implementing transparent automatic usage of hugepages has been
>  > discussed many times, it's definitely doable and other OSs have
>  > implemented this for years.
>  >
>  > This is what I was implying.
>
>
> "large" pages, or "super" pages perhaps ... but Linux "huge" pages
>  seem pretty hard to adapt for generic use by applications.  They
>  are generally a somewhere between a bit too big (2MB on X86) to
>  way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.
>
>  Right now they also suffer from making the sysadmin pick at
>  boot time how much memory to allocate as huge pages (while it
>  is possible to break huge pages into normal pages, going in
>  the reverse direction requires a memory defragmenter that
>  doesn't exist).

That's not entirely true. We have a dynamic pool now, thanks to Adam
Litke [added to Cc], which can be treated as a high watermark for the
hugetlb pool (and the static pool value serves as a low watermark).
Unless by hugepages you mean something other than what I think (but
referring to a 2M size on x86 imples you are not). And with the
antifragmentation improvements, hugepage pool changes at run-time are
more likely to succeed [added Mel to Cc].

>  Making an application use huge pages as heap may be simple
>  (just link with a different library to provide with a different
>  version of malloc()) ... code, stack, mmap'd files are all
>  a lot harder to do transparently.

I feel like I should promote libhugetlbfs here. We're trying to make
things easier for applications to use. You can back the heap by
hugepages via LD_PRELOAD. But even that isn't always simple (what
happens when something is already allocated on the heap?, which we've
seen happen even in our constructor in the library, for instance).
We're working on hugepage stack support. Text/BSS/Data segment
remapping exists now, too, but does require relinking to be more
successful. We have a mode that allows libhugetlbfs to try to fit the
segments into hugepages, or even just those parts that might fit --
but we have limitations on power and IA64, for instance, where
hugepages are restricted in their placement (either depending on the
process' existing mappings or generally). libhugetlbfs has, at least,
been tested a bit on IA64 to validate the heap backing (IIRC) and the
various kernel tests. We also have basic sparc support -- however, I
don't have any boxes handy to test on (working on getting them added
to our testing grid and then will revisit them), and then one box I
used before gave me semi-spurious soft-lockups (old bug, unclear if it
is software or just buggy hardware).

In any case, my point is people are trying to work on this from
various angles. Both making hugepages more available at run-time (in a
dynamic fashion, based upon need) and making them easier to use for
applications. Is it easy? Not necessarily. Is it guaranteed to work? I
like to think we make a best effort. But as others have pointed out,
it doesn't seem like we're going to get mainline transparent hugepage
support anytime soon.

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  5:24                   ` Paul Mackerras
  (?)
@ 2008-03-26 15:59                     ` Linus Torvalds
  -1 siblings, 0 replies; 212+ messages in thread
From: Linus Torvalds @ 2008-03-26 15:59 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, David Miller, clameter, linux-mm, linux-kernel, linux-ia64



On Wed, 26 Mar 2008, Paul Mackerras wrote:
> 
> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect).  For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%.  So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

I realize that getting the POWER people to accept that they have been 
total morons when it comes to VM for the last three decades is hard, but 
somebody in the POWER hardware design camp should (a) be told and (b) be 
really ashamed of themselves.

Is this a POWER6 or what? Because 21% overhead from TLB handling on
something like gcc shows that some piece of hardware is absolute crap.

May I suggest people inside IBM try to fix this some day, and in the 
meantime people outside should probably continue to buy Intel/AMD CPU's 
until the others can get their act together.

			Linus

^ permalink raw reply	[flat|nested] 212+ messages in thread

* RE: larger default page sizes...
  2008-03-26 15:54                         ` Nish Aravamudan
  (?)
@ 2008-03-26 17:05                           ` Luck, Tony
  -1 siblings, 0 replies; 212+ messages in thread
From: Luck, Tony @ 2008-03-26 17:05 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: David Miller, paulus, clameter, linux-mm, linux-kernel,
	linux-ia64, torvalds, agl, Mel Gorman

> That's not entirely true. We have a dynamic pool now, thanks to Adam
> Litke [added to Cc], which can be treated as a high watermark for the
> hugetlb pool (and the static pool value serves as a low watermark).
> Unless by hugepages you mean something other than what I think (but
> referring to a 2M size on x86 implies you are not). And with the
> antifragmentation improvements, hugepage pool changes at run-time are
> more likely to succeed [added Mel to Cc].

Things are better than I thought ... though the phrase "more likely
to succeed" doesn't fill me with confidence.  Instead I imagine a
system where an occasional spike in memory load causes some memory
fragmentation that can't be handled, and so from that point many of
the applications that relied on huge pages take a 10% performance
hit.  This results in sysadmins scheduling regular reboots to unjam
things. [Reminds me of the instructions that came with my first
flatbed scanner that recommended rebooting the system before and
after each use :-( ]

> I feel like I should promote libhugetlbfs here.

This is also better than I thought ... sounds like some really
good things have already happened here.

-Tony

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26  5:24                   ` Paul Mackerras
  (?)
@ 2008-03-26 17:56                     ` Christoph Lameter
  -1 siblings, 0 replies; 212+ messages in thread
From: Christoph Lameter @ 2008-03-26 17:56 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, David Miller, linux-mm, linux-kernel, linux-ia64, torvalds

On Wed, 26 Mar 2008, Paul Mackerras wrote:

> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect).  For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%.  So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

One should emphasize that this test was a kernel compile which is not 
a load that gains much from larger pages. 4k pages are mostly okay for 
loads that use large amounts of small files.



^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:05                           ` Luck, Tony
@ 2008-03-26 18:54                             ` Mel Gorman
  -1 siblings, 0 replies; 212+ messages in thread
From: Mel Gorman @ 2008-03-26 18:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Nish Aravamudan, David Miller, paulus, clameter, linux-mm,
	linux-kernel, linux-ia64, torvalds, agl

On (26/03/08 10:05), Luck, Tony didst pronounce:
> > That's not entirely true. We have a dynamic pool now, thanks to Adam
> > Litke [added to Cc], which can be treated as a high watermark for the
> > hugetlb pool (and the static pool value serves as a low watermark).
> > Unless by hugepages you mean something other than what I think (but
> > referring to a 2M size on x86 implies you are not). And with the
> > antifragmentation improvements, hugepage pool changes at run-time are
> > more likely to succeed [added Mel to Cc].
> 
> Things are better than I thought ... though the phrase "more likely
> to succeed" doesn't fill me with confidence. 

It's a lot more likely to succeed since 2.6.24 than it has been in the past.
On workloads where it is mainly user data that is occupying memory, the
chances are even better. If min_free_kbytes is hugepage_size*num_online_nodes(),
it becomes harder again to fragment memory.

> Instead I imagine a
> system where an occasional spike in memory load causes some memory
> fragmentation that can't be handled, and so from that point many of
> the applications that relied on huge pages take a 10% performance
> hit. 

If it were found to be a problem and normal anti-frag were not coping with
hugepage pool resizes, then specify movablecore=MAX_POSSIBLE_POOL_SIZE_YOU_WOULD_NEED
on the command-line and the hugepage pool will be able to expand to that
size independent of workload. This would avoid the need to schedule regular
reboots.
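A sketch of the knobs being discussed (the values are illustrative; the overcommit sysctl is my understanding of the dynamic-pool interface from Adam Litke's work):

```shell
# Hugepage pool tuning, sketched (sizes are examples only).

# Static pool = low watermark; overcommit/dynamic pool = high watermark:
echo 64  > /proc/sys/vm/nr_hugepages            # guaranteed hugepages
echo 256 > /proc/sys/vm/nr_overcommit_hugepages # may be grown at run-time

# Boot-time option keeping memory in ZONE_MOVABLE, so the pool can
# always grow into it regardless of fragmentation:
#   kernel command line:  movablecore=2G
```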

> This results in sysadmins scheduling regular reboots to unjam
> things. [Reminds me of the instructions that came with my first
> flatbed scanner that recommended rebooting the system before and
> after each use :-( ]
> 
> > I feel like I should promote libhugetlbfs here.
> 
> This is also better than I thought ... sounds like some really
> good things have already happened here.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:56                     ` Christoph Lameter
  (?)
@ 2008-03-26 23:21                       ` David Miller
  -1 siblings, 0 replies; 212+ messages in thread
From: David Miller @ 2008-03-26 23:21 UTC (permalink / raw)
  To: clameter; +Cc: paulus, andi, linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 26 Mar 2008 10:56:17 -0700 (PDT)

> One should emphasize that this test was a kernel compile which is not 
> a load that gains much from larger pages.

Actually, ever since gcc went to a garbage collecting allocator, I've
found it to be a TLB thrasher.

It will repeatedly walk randomly over a GC pool of at least 8MB in
size; to fit that fully in the TLB with 4K pages requires a TLB with
2048 entries, assuming gcc touches no other data, which is of course
a false assumption.

For some compiles this GC pool is more than 100MB in size.

GCC does not fit into any modern TLB using its base page size.
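The sizing argument above can be checked with a little arithmetic (pool sizes are the ones quoted):

```shell
# TLB entries needed to cover gcc's GC pool at a given page size.
echo $(( 8 * 1024 * 1024 / 4096 ))      # 8MB pool,  4K pages  -> 2048
echo $(( 100 * 1024 * 1024 / 4096 ))    # 100MB pool, 4K pages -> 25600
echo $(( 100 * 1024 * 1024 / 65536 ))   # 100MB pool, 64K pages -> 1600
```

Even the 64K-page figure exceeds typical TLB sizes, which is the point being made.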

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26 15:59                     ` Linus Torvalds
  (?)
@ 2008-03-27  1:08                       ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-27  1:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, David Miller, clameter, linux-mm, linux-kernel, linux-ia64

Linus Torvalds writes:

> On Wed, 26 Mar 2008, Paul Mackerras wrote:
> > 
> > So the improvement in the user time is almost all due to the reduced
> > TLB misses (as one would expect).  For the system time, using 64k
> > pages in the VM reduces it by about 21%, and using 64k hardware pages
> > reduces it by another 30%.  So the reduction in kernel overhead is
> > significant but not as large as the impact of reducing TLB misses.
> 
> I realize that getting the POWER people to accept that they have been 
> total morons when it comes to VM for the last three decades is hard, but 
> somebody in the POWER hardware design camp should (a) be told and (b) be 
> really ashamed of themselves.
> 
> Is this a POWER6 or what? Becasue 21% overhead from TLB handling on 
> something like gcc shows that some piece of hardware is absolute crap. 

You have misunderstood the 21% number.  That number has *nothing* to
do with hardware TLB miss handling, and everything to do with how long
the generic Linux virtual memory code spends doing its thing (page
faults, setting up and tearing down Linux page tables, etc.).  It
doesn't even have anything to do with the hash table (hardware page
table), because both cases are using 4k hardware pages.  Thus in both
cases the TLB misses and hash-table misses would have been the same.

The *only* difference between the cases is the page size that the
generic Linux virtual memory code is using.  With the 64k page size
our architecture-independent kernel code runs 21% faster.

Thus the 21% is not about the TLB or any hardware thing at all, it's
about the larger per-byte overhead of our kernel code when using the
smaller page size.

The thing you were ranting about -- hardware TLB handling overhead --
comes in at 5%, comparing 4k hardware pages to 64k hardware pages (444
seconds vs. 420 seconds user time for the kernel compile).  And yes,
it's a POWER6.
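The 5% figure follows directly from the quoted user times; a quick check:

```shell
# 444s user time (4K hardware pages) vs 420s (64K hardware pages):
echo $(( (444 - 420) * 100 / 444 ))   # integer percent saved -> 5
```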

Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:56                     ` Christoph Lameter
  (?)
@ 2008-03-27  3:00                       ` Paul Mackerras
  -1 siblings, 0 replies; 212+ messages in thread
From: Paul Mackerras @ 2008-03-27  3:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, David Miller, linux-mm, linux-kernel, linux-ia64, torvalds

Christoph Lameter writes:

> One should emphasize that this test was a kernel compile, which is not 
> a load that gains much from larger pages. 4k pages are mostly okay for 
> loads that use large numbers of small files.

It's also worth emphasizing that 1.5% of the total time, or 21% of the
system time, is pure software overhead in the Linux kernel that has
nothing to do with the TLB or with gcc's memory access patterns.

That's the cost of handling memory in small (i.e. 4kB) chunks inside
the generic Linux VM code, rather than bigger chunks.
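
For scale, the two percentages quoted above also pin down how much of the total compile was system time: if 21% of system time equals 1.5% of total time, system time must be about 1.5/21 of the total. A quick check:

```python
# 21% of system time == 1.5% of total time (figures quoted above),
# so system time as a share of total time is 1.5 / 21.
system_share_pct = 1.5 / 21 * 100

print(f"{system_share_pct:.1f}%")  # about 7% of total time
```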

Paul.

^ permalink raw reply	[flat|nested] 212+ messages in thread

end of thread, other threads:[~2008-03-27  3:00 UTC | newest]

Thread overview: 212+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-21  6:17 [00/14] Virtual Compound Page Support V3 Christoph Lameter
2008-03-21  6:17 ` Christoph Lameter
2008-03-21  6:17 ` [01/14] vcompound: Return page array on vunmap Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [02/14] vcompound: pageflags: Add PageVcompound() Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [03/14] vmallocinfo: Support display of vcompound for a virtual compound page Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  7:55   ` Eric Dumazet
2008-03-21  7:55     ` Eric Dumazet
2008-03-21 17:32     ` Christoph Lameter
2008-03-21 17:32       ` Christoph Lameter
2008-03-21  6:17 ` [04/14] vcompound: Core piece Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-22 12:10   ` KOSAKI Motohiro
2008-03-22 12:10     ` KOSAKI Motohiro
2008-03-24 18:28     ` Christoph Lameter
2008-03-24 18:28       ` Christoph Lameter
2008-03-21  6:17 ` [05/14] vcompound: Debugging aid Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [06/14] vcompound: Virtual fallback for sparsemem Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [07/14] vcompound: bit waitqueue support Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [08/14] vcompound: Fallback for zone wait table Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [09/14] vcompound: crypto: Fallback for temporary order 2 allocation Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [10/14] vcompound: slub: Use for buffer to correlate allocation addresses Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  6:17 ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  7:25   ` David Miller
2008-03-21  7:25     ` David Miller, Christoph Lameter
2008-03-21  8:39     ` Ingo Molnar
2008-03-21  8:39       ` Ingo Molnar
2008-03-21 17:33       ` Christoph Lameter
2008-03-21 17:33         ` Christoph Lameter
2008-03-21 19:02         ` Ingo Molnar
2008-03-21 19:02           ` Ingo Molnar
2008-03-21 19:04           ` Christoph Lameter
2008-03-21 19:04             ` Christoph Lameter
2008-03-21 17:40     ` Christoph Lameter
2008-03-21 17:40       ` Christoph Lameter
2008-03-21 21:57       ` David Miller
2008-03-21 21:57         ` David Miller, Christoph Lameter
2008-03-24 18:27         ` Christoph Lameter
2008-03-24 18:27           ` [11/14] vcompound: Fallbacks for order 1 stack allocations on Christoph Lameter
2008-03-24 18:27           ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
2008-03-24 20:37           ` larger default page sizes David Miller
2008-03-24 20:37             ` David Miller
2008-03-24 20:37             ` David Miller, Christoph Lameter
2008-03-24 21:05             ` Christoph Lameter
2008-03-24 21:05               ` Christoph Lameter
2008-03-24 21:05               ` Christoph Lameter
2008-03-24 21:43               ` David Miller
2008-03-24 21:43                 ` David Miller
2008-03-24 21:43                 ` David Miller, Christoph Lameter
2008-03-25 17:48                 ` Christoph Lameter
2008-03-25 17:48                   ` Christoph Lameter
2008-03-25 17:48                   ` Christoph Lameter
2008-03-25 23:22                   ` David Miller
2008-03-25 23:22                     ` David Miller
2008-03-25 23:22                     ` David Miller, Christoph Lameter
2008-03-25 23:41                     ` Peter Chubb
2008-03-25 23:41                       ` Peter Chubb
2008-03-25 23:41                       ` Peter Chubb
2008-03-25 23:49                       ` David Miller
2008-03-25 23:49                         ` David Miller
2008-03-25 23:49                         ` David Miller, Peter Chubb
2008-03-26  0:25                         ` Peter Chubb
2008-03-26  0:25                           ` Peter Chubb
2008-03-26  0:25                           ` Peter Chubb
2008-03-26  0:31                           ` David Miller
2008-03-26  0:31                             ` David Miller
2008-03-26  0:31                             ` David Miller, Peter Chubb
2008-03-26  0:34                       ` David Mosberger-Tang
2008-03-26  0:34                         ` David Mosberger-Tang
2008-03-26  0:34                         ` David Mosberger-Tang
2008-03-26  0:39                         ` David Miller
2008-03-26  0:39                           ` David Miller
2008-03-26  0:39                           ` David Miller, David Mosberger-Tang
2008-03-26  0:57                         ` Peter Chubb
2008-03-26  0:57                           ` Peter Chubb
2008-03-26  0:57                           ` Peter Chubb
2008-03-26  4:16                           ` John Marvin
2008-03-26  4:16                             ` John Marvin
2008-03-26  4:16                             ` John Marvin
2008-03-26  4:36                             ` David Miller
2008-03-26  4:36                               ` David Miller
2008-03-26  4:36                               ` David Miller, John Marvin
2008-03-24 21:25             ` Luck, Tony
2008-03-24 21:25               ` Luck, Tony
2008-03-24 21:25               ` Luck, Tony
2008-03-24 21:46               ` David Miller
2008-03-24 21:46                 ` David Miller
2008-03-24 21:46                 ` David Miller, Luck, Tony
2008-03-25  3:29             ` Paul Mackerras
2008-03-25  3:29               ` Paul Mackerras
2008-03-25  3:29               ` Paul Mackerras
2008-03-25  4:15               ` David Miller
2008-03-25  4:15                 ` David Miller
2008-03-25  4:15                 ` David Miller, Paul Mackerras
2008-03-25 11:50                 ` Paul Mackerras
2008-03-25 11:50                   ` Paul Mackerras
2008-03-25 11:50                   ` Paul Mackerras
2008-03-25 23:32                   ` David Miller
2008-03-25 23:32                     ` David Miller
2008-03-25 23:32                     ` David Miller, Paul Mackerras
2008-03-25 23:49                     ` Luck, Tony
2008-03-25 23:49                       ` Luck, Tony
2008-03-25 23:49                       ` Luck, Tony
2008-03-26  0:16                       ` David Miller
2008-03-26  0:16                         ` David Miller
2008-03-26  0:16                         ` David Miller, Luck, Tony
2008-03-26 15:54                       ` Nish Aravamudan
2008-03-26 15:54                         ` Nish Aravamudan
2008-03-26 15:54                         ` Nish Aravamudan
2008-03-26 17:05                         ` Luck, Tony
2008-03-26 17:05                           ` Luck, Tony
2008-03-26 17:05                           ` Luck, Tony
2008-03-26 18:54                           ` Mel Gorman
2008-03-26 18:54                             ` Mel Gorman
2008-03-25 12:05               ` Andi Kleen
2008-03-25 12:05                 ` Andi Kleen
2008-03-25 12:05                 ` Andi Kleen
2008-03-25 21:27                 ` Paul Mackerras
2008-03-25 21:27                   ` Paul Mackerras
2008-03-25 21:27                   ` Paul Mackerras
2008-03-26  5:24                 ` Paul Mackerras
2008-03-26  5:24                   ` Paul Mackerras
2008-03-26  5:24                   ` Paul Mackerras
2008-03-26 15:59                   ` Linus Torvalds
2008-03-26 15:59                     ` Linus Torvalds
2008-03-26 15:59                     ` Linus Torvalds
2008-03-27  1:08                     ` Paul Mackerras
2008-03-27  1:08                       ` Paul Mackerras
2008-03-27  1:08                       ` Paul Mackerras
2008-03-26 17:56                   ` Christoph Lameter
2008-03-26 17:56                     ` Christoph Lameter
2008-03-26 17:56                     ` Christoph Lameter
2008-03-26 23:21                     ` David Miller
2008-03-26 23:21                       ` David Miller
2008-03-26 23:21                       ` David Miller, Christoph Lameter
2008-03-27  3:00                     ` Paul Mackerras
2008-03-27  3:00                       ` Paul Mackerras
2008-03-27  3:00                       ` Paul Mackerras
2008-03-25 18:27               ` Dave Hansen
2008-03-25 18:27                 ` Dave Hansen
2008-03-25 18:27                 ` Dave Hansen
2008-03-24 21:13           ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Luck, Tony
2008-03-24 21:13             ` Luck, Tony
2008-03-24 21:13             ` Luck, Tony
2008-03-25 17:42             ` Christoph Lameter
2008-03-25 17:42               ` [11/14] vcompound: Fallbacks for order 1 stack allocations on Christoph Lameter
2008-03-25 17:42               ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
2008-03-25 19:09               ` Luck, Tony
2008-03-25 19:09                 ` Luck, Tony
2008-03-25 19:09                 ` Luck, Tony
2008-03-25 19:25                 ` Christoph Lameter
2008-03-25 19:25                   ` [11/14] vcompound: Fallbacks for order 1 stack allocations on Christoph Lameter
2008-03-25 19:25                   ` [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
2008-03-21 22:30   ` Andi Kleen
2008-03-21 22:30     ` Andi Kleen
2008-03-24 19:53     ` Christoph Lameter
2008-03-24 19:53       ` Christoph Lameter
2008-03-25  7:51       ` Andi Kleen
2008-03-25  7:51         ` Andi Kleen
2008-03-25 17:55         ` Christoph Lameter
2008-03-25 17:55           ` Christoph Lameter
2008-03-25 18:07           ` Andi Kleen
2008-03-25 18:07             ` Andi Kleen
2008-03-21  6:17 ` [12/14] vcompound: Avoid vmalloc in e1000 driver Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21 17:27   ` Kok, Auke
2008-03-21 17:27     ` Kok, Auke
2008-03-21  6:17 ` [13/14] vcompound: Use vcompound for swap_map Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21 21:25   ` Andi Kleen
2008-03-21 21:25     ` Andi Kleen
2008-03-21 21:33     ` Christoph Lameter
2008-03-21 21:33       ` Christoph Lameter
2008-03-24 19:54     ` Christoph Lameter
2008-03-24 19:54       ` Christoph Lameter
2008-03-25  7:52       ` Andi Kleen
2008-03-25  7:52         ` Andi Kleen
2008-03-25 17:45         ` Christoph Lameter
2008-03-25 17:45           ` Christoph Lameter
2008-03-25 17:55           ` Andi Kleen
2008-03-25 17:55             ` Andi Kleen
2008-03-25 17:51             ` Christoph Lameter
2008-03-25 17:51               ` Christoph Lameter
2008-03-21  6:17 ` [14/14] vcompound: Avoid vmalloc for ehash_locks Christoph Lameter
2008-03-21  6:17   ` Christoph Lameter
2008-03-21  7:02   ` Eric Dumazet
2008-03-21  7:02     ` Eric Dumazet
2008-03-21  7:03     ` Christoph Lameter
2008-03-21  7:03       ` Christoph Lameter
2008-03-21  7:31       ` David Miller
2008-03-21  7:31         ` David Miller, Christoph Lameter
2008-03-21  7:42         ` Eric Dumazet
2008-03-21  7:42           ` Eric Dumazet
2008-03-21  7:31     ` David Miller
2008-03-21  7:31       ` David Miller, Eric Dumazet
2008-03-21 17:31       ` Christoph Lameter
2008-03-21 17:31         ` Christoph Lameter
2008-03-22 18:40 ` [00/14] Virtual Compound Page Support V3 Arjan van de Ven
2008-03-22 18:40   ` Arjan van de Ven
2008-03-24 18:31   ` Christoph Lameter
2008-03-24 18:31     ` Christoph Lameter
2008-03-24 19:29     ` Christoph Lameter
2008-03-24 19:29       ` Christoph Lameter
