linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Reducing fragmentation using zones
@ 2006-01-19 19:08 Mel Gorman
  2006-01-19 19:08 ` [PATCH 1/5] Add __GFP_EASYRCLM flag and update callers Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:08 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel

This is a zone-based approach to fragmentation reduction. This is posted
in light of the discussions related to the list-based (sometimes dubbed
sub-zones) approach, where the prevailing opinion was that zones were
the answer. The patches are based on linux-2.6.16-rc1-mm1 and have been
successfully tested on x86 and ppc64. The patches are as follows:

Patches 1-4: These patches add the new zone and set up the callers to use it.

Patch 5: This is only for testing. It stops the OOM killer hitting everything
in sight while stress-testing high-order allocations. To have comparable
results during the high-order stress test allocation, this patch is applied
to both the stock -mm kernel and the kernel using the zone-based approach
to anti-fragmentation.

The usage scenario I set up to test the patches is:

1. Test machine: 4-way x86 machine with 1.5GiB physical RAM
2. Boot with kernelcore=512MB. This gives the kernel 512MB to work with and
   places the rest in ZONE_EASYRCLM (see patch 3 for more comments about
   the value of kernelcore)
3. Benchmark kbuild, aim9 and high order allocations

An alternative scenario has been tested that produces similar figures. The
scenario is:

1. Test machine: 4-way x86 machine with 1.5GiB physical RAM
2. Boot with mem=512MB
3. Hot-add the remaining memory
4. Benchmark kbuild, aim9 and high order allocations

The alternative scenario requires two more patches related to hot-adding on
the x86. I can post them if people want to take a look or experiment with
hot-add instead of using kernelcore=.

With zone-based anti-fragmentation, the usage of zones changes slightly on
the x86. The HIGHMEM zone is effectively split into two, with allocations
destined for this area split between HIGHMEM and EASYRCLM.  GFP_HIGHUSER pages
such as PTEs are passed to HIGHMEM and the remainder (mostly user pages)
are passed to EASYRCLM. Note that if kernelcore is less than the maximum size
of ZONE_NORMAL, GFP_HIGHMEM allocations will use ZONE_NORMAL, not the reachable
portion of ZONE_EASYRCLM.
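
As a rough illustration, the HIGHMEM-to-EASYRCLM redirection (the shmem hunks
in patch 1 apply exactly this transformation) can be sketched in userspace C.
The zone-modifier bit values mirror the patches (__GFP_HIGHMEM=0x02,
__GFP_EASYRCLM=0x08); this is a sketch, not the kernel code itself:

```c
#include <assert.h>

/* Sketch of the gfp redirection: user allocations that would have gone
 * to ZONE_HIGHMEM are retargeted at ZONE_EASYRCLM by swapping the
 * zone-modifier bit.  Bit values mirror the patches. */
typedef unsigned int gfp_t;
#define __GFP_HIGHMEM  0x02u
#define __GFP_EASYRCLM 0x08u

static gfp_t redirect_highmem(gfp_t gfp)
{
	if (gfp & __GFP_HIGHMEM)
		gfp = (gfp & ~__GFP_HIGHMEM) | __GFP_EASYRCLM;
	return gfp;
}
```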

I have tested booting a kernel with no mem= or kernelcore= to make sure
there are no performance regressions in the normal case.  On ppc64, a 2GiB system was
booted with kernelcore=896MB and dbench run as a regression test. It was
confirmed that ZONE_EASYRCLM was created and was being used.

Benchmark comparison between -mm+NoOOM tree and with the new zones

KBuild
                               2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Time taken to extract kernel:                    14                        14
Time taken to build kernel:                     741                       738

(Performance is about the same, what you would expect really. To see a
regression, you would have to have kernelcore=TooSmallANumber)

Aim9
                 2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
 1 creat-clo                 12273.11                  12235.72     -37.39 -0.30% File Creations and Closes/second
 2 page_test                131762.75                 132946.18    1183.43  0.90% System Allocations & Pages/second
 3 brk_test                 586206.90                 603298.90   17092.00  2.92% System Memory Allocations/second
 4 jmp_test                4375520.75                4376557.81    1037.06  0.02% Non-local gotos/second
 5 signal_test               79436.76                  81086.49    1649.73  2.08% Signal Traps/second
 6 exec_test                    62.90                     62.81      -0.09 -0.14% Program Loads/second
 7 fork_test                  1211.92                   1212.52       0.60  0.05% Task Creations/second
 8 link_test                  4332.30                   4346.60      14.30  0.33% Link/Unlink Pairs/second

(Again, performance is about the same. The differences are about the same
as what you would see between runs)

High order allocations under load
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3 
Order                                        10                        10 
Allocation type                         HighMem                   HighMem 
Attempted allocations                       275                       275 
Success allocs                               60                       106 
Failed allocs                               215                       169 
DMA zone allocs                               1                         1 
Normal zone allocs                            5                         8 
HighMem zone allocs                          54                         0 
EasyRclm zone allocs                          0                        97 
% Success                                    21                        38 

HighAlloc Under Load Test Results Pass 2
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3 
Order                                        10                        10 
Allocation type                         HighMem                   HighMem 
Attempted allocations                       275                       275 
Success allocs                              101                       154 
Failed allocs                               174                       121 
DMA zone allocs                               1                         1 
Normal zone allocs                            5                         8 
HighMem zone allocs                          95                         0 
EasyRclm zone allocs                          0                       145 
% Success                                    36                        56 

HighAlloc Test Results while Rested
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3 
Order                                        10                        10 
Allocation type                         HighMem                   HighMem 
Attempted allocations                       275                       275 
Success allocs                              141                       212 
Failed allocs                               134                        63 
DMA zone allocs                               1                         1 
Normal zone allocs                           16                         8 
HighMem zone allocs                         124                         0 
EasyRclm zone allocs                          0                       203 
% Success                                    51                        77 

The use of ZONE_EASYRCLM pushes up the success rate for HugeTLB-sized
allocations by 46 huge pages, which is a big improvement.  To compare, the
list-based approach gave an additional 19. At rest, an additional 71 pages
were available, although this varies depending on the location of per-cpu pages
(patch available that drains them).  To compare, at rest, the list-based
approach was able to allocate an additional 192 huge pages. It is important
to note that the value of kernelcore at boot time can have a big impact on
these stress tests. Again, to compare, list-based anti-fragmentation had
no tunables.

In terms of performance, the kernel with the additional zone performs as
well as the standard kernel with variances between runs typically around
+/- 2% on each test in aim9. If the zone is not sized at all, there is no
measurable performance difference between the standard and patched kernels. The
zone-based approach is a lot less invasive of the core paths than the
list-based approach was. The final diffstat is:

 arch/i386/kernel/setup.c |   28 +++++++++++++++++++++++++++-
 arch/powerpc/mm/numa.c   |   37 ++++++++++++++++++++++++++++++++++---
 fs/compat.c              |    2 +-
 fs/exec.c                |    2 +-
 fs/inode.c               |    2 +-
 include/asm-i386/page.h  |    3 ++-
 include/linux/gfp.h      |    3 +++
 include/linux/highmem.h  |    2 +-
 include/linux/mmzone.h   |   14 ++++++++------
 mm/memory.c              |    4 ++--
 mm/page_alloc.c          |   27 +++++++++++++++++++--------
 mm/shmem.c               |    4 ++++
 mm/swap_state.c          |    2 +-
 13 files changed, 104 insertions(+), 26 deletions(-)

Unlike the list-based (or sub-zones if you prefer) approach, the zone-based
approach does not help high-order kernel allocations, but it can help
huge pages. Huge pages are currently allocated from ZONE_HIGHMEM as they
are not "easily reclaimable". However, if the HugeTLB page is the same size
as a sparsemem section (the smallest unit that can be hot-removed),
we could use ZONE_EASYRCLM. If huge pages are the same size as a sparsemem
section they cause no fragmentation with that section.  On ppc64 this is
typically the case, but not on x86. One possibility is to have an
architecture-specific option that determines if ZONE_EASYRCLM is used or not.

Comments?

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] Add __GFP_EASYRCLM flag and update callers
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
@ 2006-01-19 19:08 ` Mel Gorman
  2006-01-19 19:08 ` [PATCH 2/5] Create the ZONE_EASYRCLM zone Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:08 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel


This creates a zone modifier __GFP_EASYRCLM and a set of GFP flags called
GFP_RCLMUSER. The only difference between GFP_HIGHUSER and GFP_RCLMUSER is the
zone that is used. Callers for which ZONE_EASYRCLM is appropriate are changed
to use it.
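
The relationship can be checked with a small userspace sketch. The zone bits
(__GFP_HIGHMEM=0x02, the new __GFP_EASYRCLM=0x08) are taken from the patch;
the action-modifier bit values below are illustrative only:

```c
#include <assert.h>

/* Userspace sketch of the GFP flag composition; not linux/gfp.h itself */
#define __GFP_HIGHMEM  0x02u
#define __GFP_EASYRCLM 0x08u
#define __GFP_WAIT     0x10u     /* illustrative values from here down */
#define __GFP_IO       0x40u
#define __GFP_FS       0x80u
#define __GFP_HARDWALL 0x20000u

#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
		      __GFP_HIGHMEM)
#define GFP_RCLMUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
		      __GFP_EASYRCLM)
```

The XOR of the two masks is exactly the pair of zone bits, confirming that
the zone is the only difference between the flag sets.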

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/fs/compat.c linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/compat.c
--- linux-2.6.16-rc1-mm1-clean/fs/compat.c	2006-01-19 11:21:58.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/compat.c	2006-01-19 11:37:10.000000000 +0000
@@ -1397,7 +1397,7 @@ static int compat_copy_strings(int argc,
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_RCLMUSER);
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/fs/exec.c linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/exec.c
--- linux-2.6.16-rc1-mm1-clean/fs/exec.c	2006-01-19 11:21:58.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/exec.c	2006-01-19 11:37:10.000000000 +0000
@@ -238,7 +238,7 @@ static int copy_strings(int argc, char _
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_RCLMUSER);
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/fs/inode.c linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/inode.c
--- linux-2.6.16-rc1-mm1-clean/fs/inode.c	2006-01-19 11:21:58.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/fs/inode.c	2006-01-19 11:37:10.000000000 +0000
@@ -147,7 +147,7 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
-		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+		mapping_set_gfp_mask(mapping, GFP_RCLMUSER);
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/include/asm-i386/page.h linux-2.6.16-rc1-mm1-101_antifrag_flags/include/asm-i386/page.h
--- linux-2.6.16-rc1-mm1-clean/include/asm-i386/page.h	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/include/asm-i386/page.h	2006-01-19 11:37:10.000000000 +0000
@@ -36,7 +36,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+	alloc_page_vma(GFP_RCLMUSER | __GFP_ZERO, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 /*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/include/linux/gfp.h linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/gfp.h
--- linux-2.6.16-rc1-mm1-clean/include/linux/gfp.h	2006-01-17 07:44:47.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/gfp.h	2006-01-19 11:37:10.000000000 +0000
@@ -21,6 +21,7 @@ struct vm_area_struct;
 #else
 #define __GFP_DMA32	((__force gfp_t)0x04)	/* Has own ZONE_DMA32 */
 #endif
+#define __GFP_EASYRCLM  ((__force gfp_t)0x08u)
 
 /*
  * Action modifiers - doesn't change the zoning
@@ -65,6 +66,8 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
 			 __GFP_HIGHMEM)
+#define GFP_RCLMUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
+			__GFP_EASYRCLM)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/include/linux/highmem.h linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/highmem.h
--- linux-2.6.16-rc1-mm1-clean/include/linux/highmem.h	2006-01-17 07:44:47.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/highmem.h	2006-01-19 11:37:10.000000000 +0000
@@ -47,7 +47,7 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+	struct page *page = alloc_page_vma(GFP_RCLMUSER, vma, vaddr);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/mm/memory.c linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/memory.c
--- linux-2.6.16-rc1-mm1-clean/mm/memory.c	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/memory.c	2006-01-19 11:37:10.000000000 +0000
@@ -1472,7 +1472,7 @@ gotten:
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		new_page = alloc_page_vma(GFP_RCLMUSER, vma, address);
 		if (!new_page)
 			goto oom;
 		cow_user_page(new_page, old_page, address);
@@ -2071,7 +2071,7 @@ retry:
 
 		if (unlikely(anon_vma_prepare(vma)))
 			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		page = alloc_page_vma(GFP_RCLMUSER, vma, address);
 		if (!page)
 			goto oom;
 		copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/mm/shmem.c linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/shmem.c
--- linux-2.6.16-rc1-mm1-clean/mm/shmem.c	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/shmem.c	2006-01-19 11:37:10.000000000 +0000
@@ -921,6 +921,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
+	if (gfp & __GFP_HIGHMEM)
+		gfp = (gfp & ~__GFP_HIGHMEM) | __GFP_EASYRCLM;
 	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
@@ -936,6 +938,8 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
+	if (gfp & __GFP_HIGHMEM)
+		gfp = (gfp & ~__GFP_HIGHMEM) | __GFP_EASYRCLM;
 	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-clean/mm/swap_state.c linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/swap_state.c
--- linux-2.6.16-rc1-mm1-clean/mm/swap_state.c	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/swap_state.c	2006-01-19 11:37:10.000000000 +0000
@@ -334,7 +334,7 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+			new_page = alloc_page_vma(GFP_RCLMUSER, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}


* [PATCH 2/5] Create the ZONE_EASYRCLM zone
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
  2006-01-19 19:08 ` [PATCH 1/5] Add __GFP_EASYRCLM flag and update callers Mel Gorman
@ 2006-01-19 19:08 ` Mel Gorman
  2006-01-19 19:09 ` [PATCH 3/5] x86 - Specify amount of kernel memory at boot time Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:08 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel


This patch adds the ZONE_EASYRCLM zone and updates the relevant constants and
helper functions. After this patch is applied, memory that is hot-added on
the x86 will be placed in ZONE_EASYRCLM. Memory hot-added on the ppc64 still
goes to ZONE_DMA.

The value of GFP_ZONETYPES is debatable. It should reflect all possible
combinations of the zone modifiers, which implies a value of 16. However, the
existing value does not reflect the ability to use zone bits in combination.
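
The zone selection added by this patch can be sketched in userspace C,
mirroring the highest_zone() hunk below (note that __GFP_HIGHMEM still takes
precedence if both zone bits are somehow set):

```c
#include <assert.h>

/* Userspace sketch of highest_zone() as modified by this patch */
enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_EASYRCLM };
#define __GFP_DMA      0x01
#define __GFP_HIGHMEM  0x02
#define __GFP_DMA32    0x04
#define __GFP_EASYRCLM 0x08

static int highest_zone(int zone_bits)
{
	int res = ZONE_NORMAL;
	if (zone_bits & __GFP_EASYRCLM)
		res = ZONE_EASYRCLM;
	if (zone_bits & __GFP_HIGHMEM)
		res = ZONE_HIGHMEM;
	if (zone_bits & __GFP_DMA32)
		res = ZONE_DMA32;
	if (zone_bits & __GFP_DMA)
		res = ZONE_DMA;
	return res;
}
```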

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/mmzone.h linux-2.6.16-rc1-mm1-102_addzone/include/linux/mmzone.h
--- linux-2.6.16-rc1-mm1-101_antifrag_flags/include/linux/mmzone.h	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-102_addzone/include/linux/mmzone.h	2006-01-19 11:37:52.000000000 +0000
@@ -73,9 +73,10 @@ struct per_cpu_pageset {
 #define ZONE_DMA32		1
 #define ZONE_NORMAL		2
 #define ZONE_HIGHMEM		3
+#define ZONE_EASYRCLM		4
 
-#define MAX_NR_ZONES		4	/* Sync this with ZONES_SHIFT */
-#define ZONES_SHIFT		2	/* ceil(log2(MAX_NR_ZONES)) */
+#define MAX_NR_ZONES		5	/* Sync this with ZONES_SHIFT */
+#define ZONES_SHIFT		3	/* ceil(log2(MAX_NR_ZONES)) */
 
 
 /*
@@ -93,8 +94,8 @@ struct per_cpu_pageset {
  *
  * NOTE! Make sure this matches the zones in <linux/gfp.h>
  */
-#define GFP_ZONEMASK	0x07
-#define GFP_ZONETYPES	5
+#define GFP_ZONEMASK	0x0f
+#define GFP_ZONETYPES	9
 
 /*
  * On machines where it is needed (eg PCs) we divide physical memory
@@ -397,7 +398,7 @@ static inline int populated_zone(struct 
 
 static inline int is_highmem_idx(int idx)
 {
-	return (idx == ZONE_HIGHMEM);
+	return (idx == ZONE_HIGHMEM || idx == ZONE_EASYRCLM);
 }
 
 static inline int is_normal_idx(int idx)
@@ -413,7 +414,8 @@ static inline int is_normal_idx(int idx)
  */
 static inline int is_highmem(struct zone *zone)
 {
-	return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM;
+	return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM ||
+		zone == zone->zone_pgdat->node_zones + ZONE_EASYRCLM;
 }
 
 static inline int is_normal(struct zone *zone)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/page_alloc.c linux-2.6.16-rc1-mm1-102_addzone/mm/page_alloc.c
--- linux-2.6.16-rc1-mm1-101_antifrag_flags/mm/page_alloc.c	2006-01-19 11:21:59.000000000 +0000
+++ linux-2.6.16-rc1-mm1-102_addzone/mm/page_alloc.c	2006-01-19 11:37:52.000000000 +0000
@@ -68,7 +68,7 @@ static void fastcall free_hot_cold_page(
  * TBD: should special case ZONE_DMA32 machines here - in those we normally
  * don't need any ZONE_NORMAL reservation
  */
-int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 256, 32 };
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 256, 32, 32 };
 
 EXPORT_SYMBOL(totalram_pages);
 
@@ -79,7 +79,8 @@ EXPORT_SYMBOL(totalram_pages);
 struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
 EXPORT_SYMBOL(zone_table);
 
-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal",
+						"HighMem", "EasyRclm" };
 int min_free_kbytes = 1024;
 
 unsigned long __initdata nr_kernel_pages;
@@ -760,6 +761,7 @@ static inline void prep_zero_page(struct
 	int i;
 
 	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_EASYRCLM)) == __GFP_EASYRCLM);
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
@@ -1245,7 +1247,7 @@ unsigned int nr_free_buffer_pages(void)
  */
 unsigned int nr_free_pagecache_pages(void)
 {
-	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER));
+	return nr_free_zone_pages(gfp_zone(GFP_RCLMUSER));
 }
 
 #ifdef CONFIG_HIGHMEM
@@ -1255,7 +1257,7 @@ unsigned int nr_free_highpages (void)
 	unsigned int pages = 0;
 
 	for_each_pgdat(pgdat)
-		pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+		pages += pgdat->node_zones[ZONE_EASYRCLM].free_pages;
 
 	return pages;
 }
@@ -1560,7 +1562,7 @@ static int __init build_zonelists_node(p
 {
 	struct zone *zone;
 
-	BUG_ON(zone_type > ZONE_HIGHMEM);
+	BUG_ON(zone_type > ZONE_EASYRCLM);
 
 	do {
 		zone = pgdat->node_zones + zone_type;
@@ -1580,6 +1582,8 @@ static int __init build_zonelists_node(p
 static inline int highest_zone(int zone_bits)
 {
 	int res = ZONE_NORMAL;
+	if (zone_bits & (__force int)__GFP_EASYRCLM)
+		res = ZONE_EASYRCLM;
 	if (zone_bits & (__force int)__GFP_HIGHMEM)
 		res = ZONE_HIGHMEM;
 	if (zone_bits & (__force int)__GFP_DMA32)


* [PATCH 3/5] x86 - Specify amount of kernel memory at boot time
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
  2006-01-19 19:08 ` [PATCH 1/5] Add __GFP_EASYRCLM flag and update callers Mel Gorman
  2006-01-19 19:08 ` [PATCH 2/5] Create the ZONE_EASYRCLM zone Mel Gorman
@ 2006-01-19 19:09 ` Mel Gorman
  2006-01-19 19:09 ` [PATCH 4/5] ppc64 " Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel


This patch was originally written by Kamezawa Hiroyuki.

It should be possible for the administrator to specify at boot-time how much
memory should be used for the kernel and how much should go to ZONE_EASYRCLM.
After this patch is applied, the boot option kernelcore= can be used to
specify how much memory should be used by the kernel.

(Note that Kamezawa called this parameter coremem=. It was renamed because
ppc64's command-line parsing would confuse coremem= with mem=. A name was
chosen that could be used across architectures.)

The value of kernelcore is important. If it is too small, there will be more
pressure on ZONE_NORMAL and a potential loss of performance. If it is about
896MB, it means that ZONE_HIGHMEM will have a size of zero. Any differences in
tests will depend on whether CONFIG_HIGHPTE is set in the standard kernel or
not. With lots of memory, the ideal is to specify a kernelcore that gives
ZONE_NORMAL its full size and a ZONE_HIGHMEM for PTEs. The right value
depends, like any tunable, on the workload.

It is also important to note that if kernelcore is less than the maximum
size of ZONE_NORMAL, GFP_HIGHMEM allocations will use ZONE_NORMAL, not the
reachable portion of ZONE_EASYRCLM.
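
The accounting can be sketched in userspace C. This parametrises the globals
that calculate_core_memory() in the patch reads; all quantities are in 4KiB
page frames, and the example uses the 1.5GiB/896MB-lowmem layout from the
cover letter. It is a sketch rather than the kernel code:

```c
#include <assert.h>

/* Userspace sketch of calculate_core_memory() from this patch, with the
 * globals made explicit.  All quantities are in page frames. */
static unsigned long highmem_pages;   /* pages above lowmem */
static unsigned long easyrclm_pages;  /* pages handed to ZONE_EASYRCLM */

static unsigned long calculate_core_memory(unsigned long max_low_pfn,
					   unsigned long max_pfn,
					   unsigned long core_mem_pages)
{
	if (max_low_pfn < core_mem_pages) {
		/* kernelcore spills into highmem: shrink ZONE_HIGHMEM */
		highmem_pages -= (core_mem_pages - max_low_pfn);
	} else {
		/* kernelcore fits in lowmem: cap lowmem, no ZONE_HIGHMEM */
		max_low_pfn = core_mem_pages;
		highmem_pages = 0;
	}
	easyrclm_pages = max_pfn - core_mem_pages;
	return max_low_pfn;
}
```

For the 1.5GiB test machine (393216 pfns, 229376 of them lowmem),
kernelcore=512MB (131072 pfns) caps lowmem at 512MB, zeroes ZONE_HIGHMEM and
leaves 1GiB for ZONE_EASYRCLM.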

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-102_addzone/arch/i386/kernel/setup.c linux-2.6.16-rc1-mm1-103_x86coremem/arch/i386/kernel/setup.c
--- linux-2.6.16-rc1-mm1-102_addzone/arch/i386/kernel/setup.c	2006-01-19 11:21:57.000000000 +0000
+++ linux-2.6.16-rc1-mm1-103_x86coremem/arch/i386/kernel/setup.c	2006-01-19 11:44:49.000000000 +0000
@@ -121,6 +121,9 @@ int bootloader_type;
 /* user-defined highmem size */
 static unsigned int highmem_pages = -1;
 
+/* user-defined easy-reclaim-size */
+static unsigned int core_mem_pages = -1;
+static unsigned int easyrclm_pages = 0;
 /*
  * Setup options
  */
@@ -921,6 +924,15 @@ static void __init parse_cmdline_early (
 		 */
 		else if (!memcmp(from, "vmalloc=", 8))
 			__VMALLOC_RESERVE = memparse(from+8, &from);
+		 /*
+		  * kernelcore=size sets the amount of memory for use for
+		  * kernel allocations that cannot be reclaimed easily.
+		  * The remaining memory is set aside for easy reclaim
+	          * for features like memory remove or huge page allocations
+		  */
+		else if (!memcmp(from, "kernelcore=",11)) {
+			core_mem_pages = memparse(from + 11, &from) >> PAGE_SHIFT;
+		}
 
 	next_char:
 		c = *(from++);
@@ -990,6 +1002,17 @@ void __init find_max_pfn(void)
 	}
 }
 
+unsigned long  __init calculate_core_memory(unsigned long max_low_pfn)
+{
+	if (max_low_pfn < core_mem_pages) {
+		highmem_pages -= (core_mem_pages - max_low_pfn);
+	} else {
+		max_low_pfn = core_mem_pages;
+		highmem_pages = 0;
+	}
+	easyrclm_pages = max_pfn - core_mem_pages;
+	return max_low_pfn;
+}
 /*
  * Determine low and high memory ranges:
  */
@@ -1046,6 +1069,8 @@ unsigned long __init find_max_low_pfn(vo
 			printk(KERN_ERR "ignoring highmem size on non-highmem kernel!\n");
 #endif
 	}
+	if (core_mem_pages != -1)
+		max_low_pfn = calculate_core_memory(max_low_pfn);
 	return max_low_pfn;
 }
 
@@ -1166,7 +1191,8 @@ void __init zone_sizes_init(void)
 		zones_size[ZONE_DMA] = max_dma;
 		zones_size[ZONE_NORMAL] = low - max_dma;
 #ifdef CONFIG_HIGHMEM
-		zones_size[ZONE_HIGHMEM] = highend_pfn - low;
+		zones_size[ZONE_HIGHMEM] = highend_pfn - low - easyrclm_pages;
+		zones_size[ZONE_EASYRCLM] = easyrclm_pages;
 #endif
 	}
 	free_area_init(zones_size);


* [PATCH 4/5] ppc64 - Specify amount of kernel memory at boot time
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
                   ` (2 preceding siblings ...)
  2006-01-19 19:09 ` [PATCH 3/5] x86 - Specify amount of kernel memory at boot time Mel Gorman
@ 2006-01-19 19:09 ` Mel Gorman
  2006-01-19 19:09 ` [PATCH 5/5] ForTesting - Prevent OOM killer firing for high-order allocations Mel Gorman
  2006-01-19 19:24 ` [PATCH 0/5] Reducing fragmentation using zones Joel Schopp
  5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel


This patch adds the kernelcore= parameter for ppc64.

The requested amount of memory is not reserved on every node. The first node
found that can accommodate the requested amount of memory, with some left
over for ZONE_EASYRCLM, is used. A node with memory holes will also not
be used.
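
The per-node decision can be sketched in userspace C (a sketch of the
paging_init() hunk below; node_gets_easyrclm is a name invented here for
illustration, and the unsigned countdown of core_mem_pfn mirrors the patch's
arithmetic):

```c
#include <assert.h>

/* Userspace sketch of the per-node decision in paging_init() from this
 * patch.  Returns 1 if the node is split between ZONE_DMA and
 * ZONE_EASYRCLM, 0 if the whole node stays in ZONE_DMA. */
static int node_gets_easyrclm(unsigned long start_pfn, unsigned long end_pfn,
			      unsigned long pages_present,
			      unsigned long *core_mem_pfn)
{
	unsigned long span = end_pfn - start_pfn;

	/* Node too small for the remaining kernelcore, or it has memory
	 * holes: keep it all in ZONE_DMA and consume kernelcore budget */
	if (span <= *core_mem_pfn || span != pages_present) {
		*core_mem_pfn -= span;
		return 0;
	}
	/* kernelcore satisfied here; the remainder becomes ZONE_EASYRCLM */
	*core_mem_pfn = 0;
	return 1;
}
```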

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc1-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
--- linux-2.6.16-rc1-mm1-103_x86coremem/arch/powerpc/mm/numa.c	2006-01-17 07:44:47.000000000 +0000
+++ linux-2.6.16-rc1-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c	2006-01-19 11:39:36.000000000 +0000
@@ -21,6 +21,7 @@
 #include <asm/lmb.h>
 #include <asm/system.h>
 #include <asm/smp.h>
+#include <asm/machdep.h>
 
 static int numa_enabled = 1;
 
@@ -722,20 +723,50 @@ void __init paging_init(void)
 	unsigned long zones_size[MAX_NR_ZONES];
 	unsigned long zholes_size[MAX_NR_ZONES];
 	int nid;
+	unsigned long core_mem_size = 0;
+	unsigned long core_mem_pfn = 0;
+	char *opt;
 
 	memset(zones_size, 0, sizeof(zones_size));
 	memset(zholes_size, 0, sizeof(zholes_size));
 
+	/* Check if ZONE_EASYRCLM should be populated */
+	opt = strstr(cmd_line, "kernelcore=");
+	if (opt) {
+		opt += 11;
+		core_mem_size = memparse(opt, &opt);
+		core_mem_pfn = core_mem_size >> PAGE_SHIFT;
+	}
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn, pages_present;
 
 		get_region(nid, &start_pfn, &end_pfn, &pages_present);
 
-		zones_size[ZONE_DMA] = end_pfn - start_pfn;
-		zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
+		/*
+		 * Set up a zone for EASYRCLM as long as this node is large
+		 * enough to accomodate the requested size and that there
+		 * are no memory holes
+		 */
+		if (end_pfn - start_pfn <= core_mem_pfn ||
+				end_pfn - start_pfn != pages_present) {
+			zones_size[ZONE_DMA] = end_pfn - start_pfn;
+			zholes_size[ZONE_DMA] =
+				zones_size[ZONE_DMA] - pages_present;
+			core_mem_pfn -= (end_pfn - start_pfn);
+		} else {
+			zones_size[ZONE_DMA] = core_mem_pfn;
+			zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
+			zholes_size[ZONE_DMA] = 0;
+			zholes_size[ZONE_EASYRCLM] = 0;
+			core_mem_pfn = 0;
+		}
 
-		dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
+		dbg("free_area_init DMA node %d %lx %lx (hole: %lx)\n", nid,
 		    zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
+		dbg("free_area_init EASYRCLM node %d %lx %lx (hole: %lx)\n",
+		    nid, zones_size[ZONE_EASYRCLM], start_pfn,
+		    zholes_size[ZONE_EASYRCLM]);
 
 		free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
 				    zholes_size);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-103_x86coremem/mm/page_alloc.c linux-2.6.16-rc1-mm1-104_ppc64coremem/mm/page_alloc.c
--- linux-2.6.16-rc1-mm1-103_x86coremem/mm/page_alloc.c	2006-01-19 11:37:52.000000000 +0000
+++ linux-2.6.16-rc1-mm1-104_ppc64coremem/mm/page_alloc.c	2006-01-19 11:39:36.000000000 +0000
@@ -1568,7 +1568,11 @@ static int __init build_zonelists_node(p
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
 #ifndef CONFIG_HIGHMEM
-			BUG_ON(zone_type > ZONE_NORMAL);
+			/*
+			 * On architectures with only ZONE_DMA, it is still
+			 * valid to have a ZONE_EASYRCLM
+			 */
+			BUG_ON(zone_type == ZONE_HIGHMEM);
 #endif
 			zonelist->zones[nr_zones++] = zone;
 			check_highest_zone(zone_type);


* [PATCH 5/5] ForTesting - Prevent OOM killer firing for high-order allocations
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
                   ` (3 preceding siblings ...)
  2006-01-19 19:09 ` [PATCH 4/5] ppc64 " Mel Gorman
@ 2006-01-19 19:09 ` Mel Gorman
  2006-01-19 19:24 ` [PATCH 0/5] Reducing fragmentation using zones Joel Schopp
  5 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-19 19:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Mel Gorman, linux-kernel, lhms-devel


Stop going OOM for high-order allocations. During testing of high-order
allocations, we do not want the OOM killer taking out everything in sight.

For comparison between kernels during the high-order allocation stress
test, this patch is applied to both the stock -mm kernel and the kernel
using ZONE_EASYRCLM.
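
The change reduces to a single predicate, sketched here in userspace C (the
order-3 threshold, i.e. blocks of up to 8 pages, is hardcoded by the patch):

```c
#include <assert.h>

/* Sketch of the testing-only gate from the diff below: the OOM killer
 * is invoked only when a low-order (order <= 3) allocation fails;
 * higher-order allocations simply fail without triggering it */
static int should_invoke_oom(unsigned int order)
{
	return order <= 3;
}
```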

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc1-mm1-104_ppc64coremem/mm/page_alloc.c linux-2.6.16-rc1-mm1-902_highorderoom/mm/page_alloc.c
--- linux-2.6.16-rc1-mm1-104_ppc64coremem/mm/page_alloc.c	2006-01-19 16:43:20.000000000 +0000
+++ linux-2.6.16-rc1-mm1-902_highorderoom/mm/page_alloc.c	2006-01-19 16:44:03.000000000 +0000
@@ -1080,8 +1080,11 @@ rebalance:
 		if (page)
 			goto got_pg;
 
-		out_of_memory(gfp_mask, order);
-		goto restart;
+		/* Only go OOM for low-order allocations */
+		if (order <= 3) {
+			out_of_memory(gfp_mask, order);
+			goto restart;
+		}
 	}
 
 	/*


* Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
                   ` (4 preceding siblings ...)
  2006-01-19 19:09 ` [PATCH 5/5] ForTesting - Prevent OOM killer firing for high-order allocations Mel Gorman
@ 2006-01-19 19:24 ` Joel Schopp
  2006-01-20  0:13   ` [Lhms-devel] " KAMEZAWA Hiroyuki
  2006-01-20  0:42   ` Mel Gorman
  5 siblings, 2 replies; 22+ messages in thread
From: Joel Schopp @ 2006-01-19 19:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, lhms-devel

> Benchmark comparison between -mm+NoOOM tree and with the new zones

I know you had also previously posted a very simplified version of your real 
fragmentation avoidance patches.  I was curious if you could repost those with 
the other benchmarks for a 3-way comparison.  The simplified version got rid of 
a lot of the complexity people were complaining about and in my mind still seems 
like the preferable direction.

Zone-based approaches are runtime-inflexible and require boot-time tuning by the 
sysadmin.  There are lots of workloads for which "reasonable" defaults for a 
zone-based approach would cause the system to regress terribly.

-Joel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-19 19:24 ` [PATCH 0/5] Reducing fragmentation using zones Joel Schopp
@ 2006-01-20  0:13   ` KAMEZAWA Hiroyuki
  2006-01-20  1:09     ` Mel Gorman
  2006-01-20  0:42   ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20  0:13 UTC (permalink / raw)
  To: Joel Schopp; +Cc: Mel Gorman, linux-mm, linux-kernel, lhms-devel

Joel Schopp wrote:
>> Benchmark comparison between -mm+NoOOM tree and with the new zones
> 
> I know you had also previously posted a very simplified version of your 
> real fragmentation avoidance patches.  I was curious if you could repost 
> those with the other benchmarks for a 3 way comparison.  The simplified 
> version got rid of a lot of the complexity people were complaining about 
> and in my mind still seems like preferable direction.
> 
I agree. I think you should try the simplified version again.
Then we can discuss.

  I don't like using the bitmap, which I removed (T.T)

> Zone based approaches are runtime inflexible and require boot time 
> tuning by the sysadmin.  There are lots of workloads that "reasonable" 
> defaults for a zone based approach would cause the system to regress 
> terribly.
> 
IMHO, I don't like the automatic runtime tuning you call 'flexible' here.
I think that flexibility allows fragmentation of size 2^(MAX_ORDER - 1).
When SECTION_SIZE > MAX_ORDER, this is terrible.

I love the certainty that a sysadmin can grasp his system at boot time.
And, for people who want to remove a range of memory, the list-based approach
will need some other hook, and its flexibility is of no use.
(If the list-based approach goes in, I or someone will do it.)

I know zone->zone_start_pfn can be removed very easily.
This means there is a possibility of reconfiguring zones on demand, and the
zone-based approach can be a bit more flexible.


- Kame

> -Joel
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log 
> files
> for problems?  Stop!  Download the new AJAX search engine that makes
> searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> Lhms-devel mailing list
> Lhms-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lhms-devel
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-19 19:24 ` [PATCH 0/5] Reducing fragmentation using zones Joel Schopp
  2006-01-20  0:13   ` [Lhms-devel] " KAMEZAWA Hiroyuki
@ 2006-01-20  0:42   ` Mel Gorman
  2006-01-20  1:18     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2006-01-20  0:42 UTC (permalink / raw)
  To: Joel Schopp; +Cc: linux-mm, linux-kernel, lhms-devel

On Thu, 19 Jan 2006, Joel Schopp wrote:

> > Benchmark comparison between -mm+NoOOM tree and with the new zones
>
> I know you had also previously posted a very simplified version of your real
> fragmentation avoidance patches.  I was curious if you could repost those with
> the other benchmarks for a 3 way comparison.  The simplified version got rid
> of a lot of the complexity people were complaining about and in my mind still
> seems like preferable direction.
>

To satisfy this request, I did a quick rebase of the list-based approach
against 2.6.16-rc1-mm1 to have a comparable set of benchmarks. I will post
the patches in the morning after a re-read.

The results here are in three sets

Set 1: -mm3 Vs list-based anti-frag
Set 2: -mm3 Vs zone-based anti-frag
Set 3: list-based anti-frag Vs zone-based anti-frag

In the headers, list-based is called mbuddy-v22 and zone-based is called
zbuddy-v3 (versions 1 and 2 were only posted to lhms-devel).

>>> BEGIN SET 1: -clean Vs mbuddy-v22 <<<
                               2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-mbuddy-v22
Time taken to extract kernel:                    14                         15
Time taken to build kernel:                     741                        741

                 2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-mbuddy-v22
 1 creat-clo                 12273.11                   12239.80     -33.31 -0.27% File Creations and Closes/second
 2 page_test                131762.75                  134311.90    2549.15  1.93% System Allocations & Pages/second
 3 brk_test                 586206.90                  597167.14   10960.24  1.87% System Memory Allocations/second
 4 jmp_test                4375520.75                 4373004.50   -2516.25 -0.06% Non-local gotos/second
 5 signal_test               79436.76                   77307.56   -2129.20 -2.68% Signal Traps/second
 6 exec_test                    62.90                      62.93       0.03  0.05% Program Loads/second
 7 fork_test                  1211.92                    1218.13       6.21  0.51% Task Creations/second
 8 link_test                  4332.30                    4324.56      -7.74 -0.18% Link/Unlink Pairs/second

HighAlloc Under Load Test Results Pass 1
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-mbuddy-v22
Order                                        10                         10
Allocation type                         HighMem                    HighMem
Attempted allocations                       275                        275
Success allocs                               60                         86
Failed allocs                               215                        189
DMA zone allocs                               1                          1
Normal zone allocs                            5                          0
HighMem zone allocs                          54                         85
EasyRclm zone allocs                          0                          0
% Success                                    21                         31
HighAlloc Under Load Test Results Pass 2
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-mbuddy-v22
Order                                        10                         10
Allocation type                         HighMem                    HighMem
Attempted allocations                       275                        275
Success allocs                              101                        103
Failed allocs                               174                        172
DMA zone allocs                               1                          1
Normal zone allocs                            5                          0
HighMem zone allocs                          95                        102
EasyRclm zone allocs                          0                          0
% Success                                    36                         37
HighAlloc Test Results while Rested
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-mbuddy-v22
Order                                        10                         10
Allocation type                         HighMem                    HighMem
Attempted allocations                       275                        275
Success allocs                              141                        242
Failed allocs                               134                         33
DMA zone allocs                               1                          1
Normal zone allocs                           16                         83
HighMem zone allocs                         124                        158
EasyRclm zone allocs                          0                          0
% Success                                    51                         88
>>> END SET 1: -clean Vs mbuddy-v22 <<<

>>> BEGIN SET 2: -clean Vs zbuddy-v3 <<<
                               2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Time taken to extract kernel:                    14                        14
Time taken to build kernel:                     741                       738

                 2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
 1 creat-clo                 12273.11                  12235.72     -37.39 -0.30% File Creations and Closes/second
 2 page_test                131762.75                 132946.18    1183.43  0.90% System Allocations & Pages/second
 3 brk_test                 586206.90                 603298.90   17092.00  2.92% System Memory Allocations/second
 4 jmp_test                4375520.75                4376557.81    1037.06  0.02% Non-local gotos/second
 5 signal_test               79436.76                  81086.49    1649.73  2.08% Signal Traps/second
 6 exec_test                    62.90                     62.81      -0.09 -0.14% Program Loads/second
 7 fork_test                  1211.92                   1212.52       0.60  0.05% Task Creations/second
 8 link_test                  4332.30                   4346.60      14.30  0.33% Link/Unlink Pairs/second

HighAlloc Under Load Test Results Pass 1
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                        10                        10
Allocation type                         HighMem                   HighMem
Attempted allocations                       275                       275
Success allocs                               60                       106
Failed allocs                               215                       169
DMA zone allocs                               1                         1
Normal zone allocs                            5                         8
HighMem zone allocs                          54                         0
EasyRclm zone allocs                          0                        97
% Success                                    21                        38
HighAlloc Under Load Test Results Pass 2
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                        10                        10
Allocation type                         HighMem                   HighMem
Attempted allocations                       275                       275
Success allocs                              101                       154
Failed allocs                               174                       121
DMA zone allocs                               1                         1
Normal zone allocs                            5                         8
HighMem zone allocs                          95                         0
EasyRclm zone allocs                          0                       145
% Success                                    36                        56
HighAlloc Test Results while Rested
                           2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                        10                        10
Allocation type                         HighMem                   HighMem
Attempted allocations                       275                       275
Success allocs                              141                       212
Failed allocs                               134                        63
DMA zone allocs                               1                         1
Normal zone allocs                           16                         8
HighMem zone allocs                         124                         0
EasyRclm zone allocs                          0                       203
% Success                                    51                        77

>>> END SET 2: -clean Vs zbuddy-v3 <<<

>>> BEGIN SET 3: -mbuddy-v22 Vs zbuddy-v3 <<<
                               2.6.16-rc1-mm1-mbuddy-v22  2.6.16-rc1-mm1-zbuddy-v3
Time taken to extract kernel:                         15                        14
Time taken to build kernel:                          741                       738

                 2.6.16-rc1-mm1-mbuddy-v22  2.6.16-rc1-mm1-zbuddy-v3
 1 creat-clo                      12239.80                  12235.72      -4.08 -0.03% File Creations and Closes/second
 2 page_test                     134311.90                 132946.18   -1365.72 -1.02% System Allocations & Pages/second
 3 brk_test                      597167.14                 603298.90    6131.76  1.03% System Memory Allocations/second
 4 jmp_test                     4373004.50                4376557.81    3553.31  0.08% Non-local gotos/second
 5 signal_test                    77307.56                  81086.49    3778.93  4.89% Signal Traps/second
 6 exec_test                         62.93                     62.81      -0.12 -0.19% Program Loads/second
 7 fork_test                       1218.13                   1212.52      -5.61 -0.46% Task Creations/second
 8 link_test                       4324.56                   4346.60      22.04  0.51% Link/Unlink Pairs/second

HighAlloc Under Load Test Results Pass 1
                           2.6.16-rc1-mm1-mbuddy-v22  2.6.16-rc1-mm1-zbuddy-v3
Order                                             10                        10
Allocation type                              HighMem                   HighMem
Attempted allocations                            275                       275
Success allocs                                    86                       106
Failed allocs                                    189                       169
DMA zone allocs                                    1                         1
Normal zone allocs                                 0                         8
HighMem zone allocs                               85                         0
EasyRclm zone allocs                               0                        97
% Success                                         31                        38
HighAlloc Under Load Test Results Pass 2
                           2.6.16-rc1-mm1-mbuddy-v22  2.6.16-rc1-mm1-zbuddy-v3
Order                                             10                        10
Allocation type                              HighMem                   HighMem
Attempted allocations                            275                       275
Success allocs                                   103                       154
Failed allocs                                    172                       121
DMA zone allocs                                    1                         1
Normal zone allocs                                 0                         8
HighMem zone allocs                              102                         0
EasyRclm zone allocs                               0                       145
% Success                                         37                        56
HighAlloc Test Results while Rested
                           2.6.16-rc1-mm1-mbuddy-v22  2.6.16-rc1-mm1-zbuddy-v3
Order                                             10                        10
Allocation type                              HighMem                   HighMem
Attempted allocations                            275                       275
Success allocs                                   242                       212
Failed allocs                                     33                        63
DMA zone allocs                                    1                         1
Normal zone allocs                                83                         8
HighMem zone allocs                              158                         0
EasyRclm zone allocs                               0                       203
% Success                                         88                        77

>>> END SET 3: -mbuddy-v22 Vs zbuddy-v3 <<<

So, in terms of performance on this set of tests, both approaches perform
roughly the same as the stock kernel in terms of absolute performance. In
terms of high-order allocations, zone-based appears to do better under
load. However, if you look at the zones that are used, you will see that
zone-based appears to do as well as list-based *only* because it has the
EASYRCLM zone to play with. List-based was way better at keeping the
normal zone defragmented, as well as highmem, which is especially obvious
when tested at rest.  List-based was able to allocate 83 huge pages from
ZONE_NORMAL at rest while zone-based only managed 8.

Secondly, zone-based requires careful configuration to be successful.  If
booted with kernelcore=896MB for example, it only performs slightly better
than the standard kernel. If booted with kernelcore=1024MB, it tends to
perform slightly worse (more zone fallbacks, I guess) and still only
manages slightly better satisfaction of high-order pages.

On the flip side, the zone-based code changes are easier to understand than
the list-based ones (at least in terms of volume of code changes).
Zone-based gives guarantees about what will happen in the future while
list-based is best-effort.

In terms of fragmentation, I still think that list-based is better overall
without configuration. The results above also represent the best possible
configuration of zone-based versus no configuration at all for
list-based. In an environment where changing workloads are a constant
reality, I bet that list-based would win overall.

> Zone based approaches are runtime inflexible and require boot time tuning by
> the sysadmin.  There are lots of workloads that "reasonable" defaults for a
> zone based approach would cause the system to regress terribly.
>
> -Joel
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  0:13   ` [Lhms-devel] " KAMEZAWA Hiroyuki
@ 2006-01-20  1:09     ` Mel Gorman
  2006-01-20  1:25       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2006-01-20  1:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

On Fri, 20 Jan 2006, KAMEZAWA Hiroyuki wrote:

> Joel Schopp wrote:
> > > Benchmark comparison between -mm+NoOOM tree and with the new zones
> >
> > I know you had also previously posted a very simplified version of your real
> > fragmentation avoidance patches.  I was curious if you could repost those
> > with the other benchmarks for a 3 way comparison.  The simplified version
> > got rid of a lot of the complexity people were complaining about and in my
> > mind still seems like preferable direction.
> >
> I agree. I think you should try with simplified version again.
> Then, we can discuss.
>

Results from list-based have been posted. The actual patches will be
posted tomorrow (local time; that is, in about 12 hours).

>  I don't like using bitmap which I removed (T.T
>
> > Zone based approaches are runtime inflexible and require boot time tuning by
> > the sysadmin.  There are lots of workloads that "reasonable" defaults for a
> > zone based approach would cause the system to regress terribly.
> >
> IMHO, I don't like automatic runtime tuning, you say 'flexible' here.
> I think flexibility allows 2^(MAX_ORDER - 1) size fragmentaion.
> When SECTION_SIZE > MAX_ORDER, this is terrible.
>

In an ideal world, we would have both. Zone-based would give guarantees on
the availability of reclaimed pages and list-based would give best-effort
everywhere.

> I love certainty that sysadmin can grap his system at boot-time.

It requires careful tuning. If the workload suddenly changes, things may
go wrong. As with everything else, testing is required with workloads
defined by multiple people.

> And, for people who want to remove range of memory, list-based approach will
> need some other hook and its flexibility is of no use.
> (If list-based approach goes, I or someone will do.)
>

Will do what?

> I know zone->zone_start_pfn can be removed very easily.
> This means there is possiblity to reconfigure zone on demand and
> zone-based approach can be a bit more fliexible.
>

The obvious concern is that it is very easy to grow ZONE_NORMAL or
ZONE_HIGHMEM into the ZONE_EASYRCLM zone but it is hard to do the opposite
because you must be able to reclaim the pages at the end of the "awkward"
zone.

Linus has also stated that he does not mind the zone the kernel is using
(be it normal or highmem) growing, but he takes a dim view of it being
shrunk again. Either way, to shrink it again, a page migration mechanism is
likely a requirement because there is no way to be sure that easily
reclaimed pages are at the end of the zone.

>
> - Kame
>
> > -Joel
> >
> >
>
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  0:42   ` Mel Gorman
@ 2006-01-20  1:18     ` KAMEZAWA Hiroyuki
  2006-01-20 12:03       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20  1:18 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

Mel Gorman wrote:
> To satisfy this request, I did a quick rebase of the list-based approach
> against 2.6.16-rc1-mm1 to have a comparable set of benchmarks. I will post
> the patches in the morning after a re-read.
> 
Thank you.


> So, in terms of performance on this set of tests, both approachs perform
> roughly the same as the stock kernel in terms of absolute performance. In
> terms of high-order allocations, zone-based appears to do better under
> load. However, if you look at the zones that are used, you will see that
> zone-based appears to do as well as list-based *only* because it has the
> EASYRCLM zone to play with. list-based was way better at keeping the
> normal zone defragmented as well as highmem which is especially obvious
> when tested at rest.  list-based was able to allocate 83 huge pages from
> ZONE_NORMAL at rest while zone-based only managed 8.
> 
Yes, this is an interesting point :)
The list-based one can defrag the NORMAL zone.
The point will be "do we need to defrag NORMAL?", I think.
IMHO, I don't like to use the NORMAL zone to alloc higher-order pages...

> Secondly, zone-based requires careful configuration to be successful.  If
> booted with kernelcore=896MB for example, it only performs slightly better
> than the standard kernel. If booted with kernelcore=1024MB, it tends to
> perform slightly worse (more zone fallbacks I guess) and still only
> manages slighly better satisfaction of high order pages.
This is because HIGHMEM is too small, right?


> On the flip side, zone-based code changes are easier to understand than
> the list-based ones (at least in terms of volume of code changes). The
> zone-based gives guarantees on what will happen in the future while
> list-based is best-effort.
> 
> In terms of fragmentation, I still think that list-based is better overall
> without configuration. 
I agree here.

>The results above also represent the best possible
> configuration with zone-based versus no configuration at all against
> list-based. In an environment with changing workloads a constant reality,
> I bet that list-based would win overall.
> 
On x86, NORMAL is only 896M anyway, so there is no discussion.


Honestly, I don't have enough experience with machines that don't have HighMem.
How large should kernelcore be?
It looks like using list-based and zone-based at the same time would make
everyone happy...

-- Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  1:09     ` Mel Gorman
@ 2006-01-20  1:25       ` KAMEZAWA Hiroyuki
  2006-01-20  9:44         ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20  1:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

Mel Gorman wrote:
>> Joel Schopp wrote:
>>>> Benchmark comparison between -mm+NoOOM tree and with the new zones
>>> I know you had also previously posted a very simplified version of your real
>>> fragmentation avoidance patches.  I was curious if you could repost those
>>> with the other benchmarks for a 3 way comparison.  The simplified version
>>> got rid of a lot of the complexity people were complaining about and in my
>>> mind still seems like preferable direction.
>>>
>> I agree. I think you should try with simplified version again.
>> Then, we can discuss.
>>
> 
> Results from list-based have been posted. The actual patches will be
> posted tomorrow (in local time, that is in about 12 hours time)
> 
Thank you.


>>  I don't like using bitmap which I removed (T.T
>>
>>> Zone based approaches are runtime inflexible and require boot time tuning by
>>> the sysadmin.  There are lots of workloads that "reasonable" defaults for a
>>> zone based approach would cause the system to regress terribly.
>>>
>> IMHO, I don't like automatic runtime tuning, you say 'flexible' here.
>> I think flexibility allows 2^(MAX_ORDER - 1) size fragmentaion.
>> When SECTION_SIZE > MAX_ORDER, this is terrible.
>>
> 
> In an ideal world, we would have both. Zone-based would give guarantees on
> the availability of reclaimed pages and list-based would give best-effort
> everywhere.
> 
>> I love certainty that sysadmin can grap his system at boot-time.
> 
> It requires careful tuning. For suddenly different workloads, things may
> go wrong. As with everything else, testing is required from workloads
> defined by multiple people.
> 
Yes, we need more tests.


>> And, for people who want to remove range of memory, list-based approach will
>> need some other hook and its flexibility is of no use.
>> (If list-based approach goes, I or someone will do.)
>>
> 
> Will do what?
> 
Add a kernelcore= boot option and so on :)
As you say, "In an ideal world, we would have both".

>> I know zone->zone_start_pfn can be removed very easily.
>> This means there is possiblity to reconfigure zone on demand and
>> zone-based approach can be a bit more fliexible.
>>
> 
> The obvious concern is that it is very easy to grow ZONE_NORMAL or
> ZONE_HIGHMEM into the ZONE_EASYRCLM zone but it is hard to do the opposite
> because you must be able to reclaim the pages at the end of the "awkward"
> zone.
Yes, this is a weak point of ZONE_EASYRCLM.

By the way, please test this in the list-based approach.
==
%ls -lR / (and some commands that use many slabs)
%do high order test
==

-- Kame.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  1:25       ` KAMEZAWA Hiroyuki
@ 2006-01-20  9:44         ` Mel Gorman
  2006-01-20 10:40           ` KAMEZAWA Hiroyuki
  2006-01-20 12:08           ` Yasunori Goto
  0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-20  9:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

On Fri, 20 Jan 2006, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > > Joel Schopp wrote:
> > > > > Benchmark comparison between -mm+NoOOM tree and with the new zones
> > > > I know you had also previously posted a very simplified version of your
> > > > real
> > > > fragmentation avoidance patches.  I was curious if you could repost
> > > > those
> > > > with the other benchmarks for a 3 way comparison.  The simplified
> > > > version
> > > > got rid of a lot of the complexity people were complaining about and in
> > > > my
> > > > mind still seems like preferable direction.
> > > >
> > > I agree. I think you should try with simplified version again.
> > > Then, we can discuss.
> > >
> >
> > Results from list-based have been posted. The actual patches will be
> > posted tomorrow (in local time, that is in about 12 hours time)
> >
> Thank you.
>
>
> > >  I don't like using bitmap which I removed (T.T
> > >
> > > > Zone based approaches are runtime inflexible and require boot time
> > > > tuning by
> > > > the sysadmin.  There are lots of workloads that "reasonable" defaults
> > > > for a
> > > > zone based approach would cause the system to regress terribly.
> > > >
> > > IMHO, I don't like automatic runtime tuning, you say 'flexible' here.
> > > I think flexibility allows 2^(MAX_ORDER - 1) size fragmentaion.
> > > When SECTION_SIZE > MAX_ORDER, this is terrible.
> > >
> >
> > In an ideal world, we would have both. Zone-based would give guarantees on
> > the availability of reclaimed pages and list-based would give best-effort
> > everywhere.
> >
> > > I love certainty that sysadmin can grap his system at boot-time.
> >
> > It requires careful tuning. For suddenly different workloads, things may
> > go wrong. As with everything else, testing is required from workloads
> > defined by multiple people.
> >
> Yes, we need more test.
>

What sort of tests would you suggest? The tests I have been running to
date are

"kbuild + aim9" for regression testing

"updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
easy it was to reclaim contiguous blocks

What tests could be run that would be representative of real-world
workloads?

>
> > > And, for people who want to remove range of memory, list-based approach
> > > will
> > > need some other hook and its flexibility is of no use.
> > > (If list-based approach goes, I or someone will do.)
> > >
> >
> > Will do what?
> >
> add kernelcore= boot option and so on :)
> As you say, "In an ideal world, we would have both".
>

List-based was frowned upon for adding complexity to the main path, so we
may not get list-based built on top of zone-based even though it is
certainly possible. One reason to do zone-based was to do a comparison
between them in terms of complexity. Hopefully, Nick Piggin (as the first
big objector to the list-based approach) will make some sort of comment on
what he thinks of zone-based in comparison to list-based.

> > > I know zone->zone_start_pfn can be removed very easily.
> > > This means there is possiblity to reconfigure zone on demand and
> > > zone-based approach can be a bit more fliexible.
> > >
> >
> > The obvious concern is that it is very easy to grow ZONE_NORMAL or
> > ZONE_HIGHMEM into the ZONE_EASYRCLM zone but it is hard to do the opposite
> > because you must be able to reclaim the pages at the end of the "awkward"
> > zone.
> Yes, this is a weak point of ZONE_EASYRCLM.
>
> By the way, please test this with the list-based approach.
> ==
> %ls -lR / (and some commands which use many slabs)
> %do high order test
> ==
>

Will set this up and post results after I post the patches. The high-order
stress tests already run updatedb, which should have had a similar
effect. However, I never checked when updatedb finished, so maybe it
finished early in the test.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  9:44         ` Mel Gorman
@ 2006-01-20 10:40           ` KAMEZAWA Hiroyuki
  2006-01-20 14:53             ` Mel Gorman
  2006-01-20 12:08           ` Yasunori Goto
  1 sibling, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-01-20 10:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

Mel Gorman wrote:
> What sort of tests would you suggest? 
> The tests I have been running to date are
> 
> "kbuild + aim9" for regression testing
> 
> "updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
> easy it was to reclaim contiguous blocks
> 
> What tests could be run that would be representative of real-world
> workloads?
> 

1. Using 1000+ processes (threads) at once
2. Heavy network load
3. Running NFS
are maybe good.

>>>> And, for people who want to remove a range of memory, the list-based
>>>> approach will need some other hook and its flexibility is of no use.
>>>> (If the list-based approach goes in, I or someone will do.)
>>>>
>>> Will do what?
>>>
>> add kernelcore= boot option and so on :)
>> As you say, "In an ideal world, we would have both".
>>
> 
> List-based was frowned upon for adding complexity to the main path so we may
> not get list-based built on top of zone-based even though it is certainly
> possible. One reason to do zone-based was to do a comparison between them
> in terms of complexity. Hopefully, Nick Piggin (as the first big objector
> to the list-based approach) will make some sort of comment on what he
> thinks of zone-based in comparison to list-based.
> 
I think there is another point.

What I am concerned about is this comment of Linus's:
> My point is that regardless of what you _want_, defragmentation is 
> _useless_. It's useless simply because for big areas it is so expensive as 
> to be impractical.

You should have your own answer to this before posting.

From the old threads (very long!), I think one of the points was:
to use hugepages, the sysadmin can specify what he wants at boot time.
This guarantees 100% allocation of the needed huge pages.
Why can't memory hotplug specify "how much can be removed" before booting?
This would guarantee 100% memory hot-remove.

I think hugetlb and memory hotplug cannot be a good reason for defragmentation.

Finding the reason for defragmentation is good.
Unfortunately, I don't know of cases of memory allocation failure
caused by fragmentation with recent kernels.

-- Kame




* Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  1:18     ` KAMEZAWA Hiroyuki
@ 2006-01-20 12:03       ` Mel Gorman
  2006-01-20 13:28         ` [Lhms-devel] " Yasunori Goto
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2006-01-20 12:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Joel Schopp, linux-mm, linux-kernel, lhms-devel

On Fri, 20 Jan 2006, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > To satisfy this request, I did a quick rebase of the list-based approach
> > against 2.6.16-rc1-mm1 to have a comparable set of benchmarks. I will post
> > the patches in the morning after a re-read.
> >
> Thank you.
>
>
> > So, in terms of performance on this set of tests, both approaches perform
> > roughly the same as the stock kernel in terms of absolute performance. In
> > terms of high-order allocations, zone-based appears to do better under
> > load. However, if you look at the zones that are used, you will see that
> > zone-based appears to do as well as list-based *only* because it has the
> > EASYRCLM zone to play with. list-based was way better at keeping the
> > normal zone defragmented as well as highmem which is especially obvious
> > when tested at rest.  list-based was able to allocate 83 huge pages from
> > ZONE_NORMAL at rest while zone-based only managed 8.
> >
> yes, this is an interesting point :)
> the list-based one can defrag the NORMAL zone.
> The point will be "do we need to defrag NORMAL?", I think.

The original intention was two-fold. One, it helps HugeTLB in situations
where it was not configured correctly at boot-time. This is the case for a
number of sites running HPC-related jobs. The second objective was to help
high-order kernel allocations to potentially reduce things like
scatter-gather IO.
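
To put a rough number on the scatter-gather point, here is a back-of-the-envelope sketch. The 4KiB page size and the 1MiB buffer size are assumed for illustration; they are not figures from this thread:

```python
# Scatter-gather entries needed for a 1MiB I/O buffer, depending on the
# largest contiguous block the allocator can hand back (4KiB base pages).
PAGE_SIZE = 4 * 1024
IO_BYTES = 1 * 1024 ** 2

for order in (0, 4, 8):                 # order-0 pages up to order-8 blocks
    block = PAGE_SIZE << order          # bytes per contiguous block
    entries = -(-IO_BYTES // block)     # ceiling division
    print(f"order-{order}: {entries} scatter-gather entries")
```

With order-0 pages the 1MiB transfer needs 256 entries; a single order-8 block covers it in one, which is the kind of reduction high-order allocations were hoped to buy.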

> IMHO, I don't like to use NORMAL zone to alloc higher-order pages...
>

Neither do a lot of people apparently.

> > Secondly, zone-based requires careful configuration to be successful.  If
> > booted with kernelcore=896MB for example, it only performs slightly better
> > than the standard kernel. If booted with kernelcore=1024MB, it tends to
> > perform slightly worse (more zone fallbacks I guess) and still only
> manages slightly better satisfaction of high order pages.
> This is because HIGHMEM is too small, right ?
>

Yes and it ends up falling back more to ZONE_NORMAL.

>
> > On the flip side, zone-based code changes are easier to understand than
> > the list-based ones (at least in terms of volume of code changes). The
> > zone-based gives guarantees on what will happen in the future while
> > list-based is best-effort.
> >
> > In terms of fragmentation, I still think that list-based is better overall
> > without configuration.
> I agree here.
>
> > The results above also represent the best possible
> > configuration with zone-based versus no configuration at all against
> > list-based. In an environment with changing workloads a constant reality,
> > I bet that list-based would win overall.
> >
> On x86, NORMAL is only 896M anyway. There is no discussion.
>

There is a discussion with architectures like ppc64 which do not have a
normal zone (only ZONE_DMA) and 64 bit architectures that have very large
normal zones.

Take ppc64 as an example. Today, when memory is hot-added, it is available
for use by the kernel and userspace applications. Right now, hot-added
memory goes to ZONE_DMA but it should be going to ZONE_EASYRCLM. In this
case, the size of the kernel at the beginning is fixed. If you allow the
kernel zone to grow, it cannot be shrunk again and worse, if the kernel
expands to take up available memory, it loses all advantages.

>
> Honestly, I don't have enough experience with machines which don't
> have Highmem. How large should kernelcore be? It looks like using list-based
> and zone-based at the same time would make all people happy...
>

How large kernelcore should be is the million dollar question. The
administrator needs to know how much memory the kernel will require for
the workload. There is no universal answer to this question. That was one
reason we liked the list-based approach to anti-fragmentation: it could
grow or shrink the regions used by user and kernel allocations as
required. To do the same with zones is quite complex.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20  9:44         ` Mel Gorman
  2006-01-20 10:40           ` KAMEZAWA Hiroyuki
@ 2006-01-20 12:08           ` Yasunori Goto
  2006-01-20 12:25             ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Yasunori Goto @ 2006-01-20 12:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Joel Schopp, linux-mm, linux-kernel, lhms-devel

> What sort of tests would you suggest? The tests I have been running to
> date are
> 
> "kbuild + aim9" for regression testing
> 
> "updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
> easy it was to reclaim contiguous blocks

BTW, is the "highorder allocation test" your original test code?
If so, just out of curiosity, I would like to see it too. ;-)

Bye.
-- 
Yasunori Goto 





* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 12:08           ` Yasunori Goto
@ 2006-01-20 12:25             ` Mel Gorman
  2006-01-20 13:22               ` Yasunori Goto
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2006-01-20 12:25 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: KAMEZAWA Hiroyuki, Joel Schopp, linux-mm, linux-kernel, lhms-devel

On Fri, 20 Jan 2006, Yasunori Goto wrote:

> > What sort of tests would you suggest? The tests I have been running to
> > date are
> >
> > "kbuild + aim9" for regression testing
> >
> > "updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
> > easy it was to reclaim contiguous blocks
>
> > BTW, is the "highorder allocation test" your original test code?
> > If so, just out of curiosity, I would like to see it too. ;-)
>

1. Download http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.20.tar.gz
2. Extract it to /usr/src/vmregress (i.e. there should be a
   /usr/src/vmregress/bin directory)
3. Download linux-2.6.11.tar.gz to /usr/src
4. Make a directory /usr/src/bench-stresshighalloc-test
5. cd to /usr/src/vmregress and run
   ./configure --with-linux=/path/to/running/kernel
   make

6. Run the test
   bench-stresshighalloc.sh -z -k 6 --oprofile

   -z Will test using high memory
   -k 6 will build 1 kernel + 6 additional ones
   By default, it will try and allocate 275 order-10 pages. Specify the
   number of pages with -c and the order with -s

The paths above are default paths. They can all be overridden with command
line parameters like -t to specify a different kernel to use and -b to
specify a different path to build all the kernels in.

By default, the results will be logged to a directory whose name is based
on the kernel being tested. For example, one result directory is
~/vmregressbench-2.6.16-rc1-mm1-clean/highalloc-heavy/log.txt

Comparisons between different runs can be analysed using
diff-highalloc.sh, e.g.

diff-highalloc.sh vmregressbench-2.6.16-rc1-mm1-clean vmregressbench-2.6.16-rc1-mm1-mbuddy-v22

If you want to test just high-order allocations while some other workload
is running, use bench-plainhighalloc.sh. See --help for a list of
available options.

If you want to use bench-aim9.sh, download and build aim9 in /usr/src/aim9
and edit the s9workfile to specify the tests you are interested in. Use
diff-aim9.sh to compare different runs of aim9.
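
For scale, a quick sketch of what the default target of 275 order-10 pages amounts to (the 4KiB base page is an assumption, as on x86):

```python
# Memory asked for by bench-stresshighalloc's defaults: 275 order-10
# allocations, assuming a 4KiB base page.
PAGE_SIZE = 4 * 1024
ORDER = 10
COUNT = 275

block = PAGE_SIZE << ORDER                   # one order-10 block in bytes
total_mib = COUNT * block // (1024 ** 2)     # total contiguous memory requested

print(f"each block: {block // (1024 ** 2)}MiB, total: {total_mib}MiB")
```

That is roughly 1100MiB of contiguous memory, most of the 1.5GiB test machine described earlier, which is why full satisfaction of the allocations is not expected and the test is a stress test.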

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 12:25             ` Mel Gorman
@ 2006-01-20 13:22               ` Yasunori Goto
  0 siblings, 0 replies; 22+ messages in thread
From: Yasunori Goto @ 2006-01-20 13:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Joel Schopp, linux-mm, linux-kernel, lhms-devel

Thanks! I'll try it next week. :-)

> On Fri, 20 Jan 2006, Yasunori Goto wrote:
> 
> > > What sort of tests would you suggest? The tests I have been running to
> > > date are
> > >
> > > "kbuild + aim9" for regression testing
> > >
> > > "updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
> > > easy it was to reclaim contiguous blocks
> >
> > BTW, is the "highorder allocation test" your original test code?
> > If so, just out of curiosity, I would like to see it too. ;-)
> >
> 
> 1. Download http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.20.tar.gz
> 2. Extract it to /usr/src/vmregress (i.e. there should be a
>    /usr/src/vmregress/bin directory)
> 3. Download linux-2.6.11.tar.gz to /usr/src
> 4. Make a directory /usr/src/bench-stresshighalloc-test
> 5. cd to /usr/src/vmregress and run
>    ./configure --with-linux=/path/to/running/kernel
>    make
> 
> 6. Run the test
>    bench-stresshighalloc.sh -z -k 6 --oprofile
> 
>    -z Will test using high memory
>    -k 6 will build 1 kernel + 6 additional ones
>    By default, it will try and allocate 275 order-10 pages. Specify the
>    number of pages with -c and the order with -s
> 
> The paths above are default paths. They can all be overridden with command
> line parameters like -t to specify a different kernel to use and -b to
> specify a different path to build all the kernels in.
> 
> By default, the results will be logged to a directory whose name is based
> on the kernel being tested. For example, one result directory is
> ~/vmregressbench-2.6.16-rc1-mm1-clean/highalloc-heavy/log.txt
> 
> Comparisons between different runs can be analysed using
> diff-highalloc.sh, e.g.
> 
> diff-highalloc.sh vmregressbench-2.6.16-rc1-mm1-clean vmregressbench-2.6.16-rc1-mm1-mbuddy-v22
> 
> If you want to test just high-order allocations while some other workload
> is running, use bench-plainhighalloc.sh. See --help for a list of
> available options.
> 
> If you want to use bench-aim9.sh, download and build aim9 in /usr/src/aim9
> and edit the s9workfile to specify the tests you are interested in. Use
> diff-aim9.sh to compare different runs of aim9.
> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
> 
> 

-- 
Yasunori Goto 





* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 12:03       ` Mel Gorman
@ 2006-01-20 13:28         ` Yasunori Goto
  2006-01-20 14:02           ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Yasunori Goto @ 2006-01-20 13:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Joel Schopp, linux-mm, linux-kernel, lhms-devel

> > > So, in terms of performance on this set of tests, both approaches perform
> > > roughly the same as the stock kernel in terms of absolute performance. In
> > > terms of high-order allocations, zone-based appears to do better under
> > > load. However, if you look at the zones that are used, you will see that
> > > zone-based appears to do as well as list-based *only* because it has the
> > > EASYRCLM zone to play with. list-based was way better at keeping the
> > > normal zone defragmented as well as highmem which is especially obvious
> > > when tested at rest.  list-based was able to allocate 83 huge pages from
> > > ZONE_NORMAL at rest while zone-based only managed 8.
> > >
> > yes, this is an interesting point :)
> > the list-based one can defrag the NORMAL zone.
> > The point will be "do we need to defrag NORMAL?", I think.
> 
> The original intention was two-fold. One, it helps HugeTLB in situations
> where it was not configured correctly at boot-time. This is the case for a
> number of sites running HPC-related jobs. The second objective was to help
> high-order kernel allocations to potentially reduce things like
> scatter-gather IO.

Probably, Linus-san's wish is to reduce high-order kernel allocation
to avoid fragmentation. (He did say defragmentation is meaningless, right?)
If there is a driver/kernel component which requires a high-order
allocation though physically contiguous memory is not necessary,
it should be modified to collect individual pages instead.
(I guess there are some components like that. But I'm not sure....)
If scatter-gather IO is the cause of bad performance,
it might be desirable to try a high-order allocation first,
then fall back to collecting pieces of pages which can be allocated.

It is just my guess.
But some components might not be able to do it.
If there are such components, that is a good reason for
defragmentation....

> > > On the flip side, zone-based code changes are easier to understand than
> > > the list-based ones (at least in terms of volume of code changes). The
> > > zone-based gives guarantees on what will happen in the future while
> > > list-based is best-effort.
> > >
> > > In terms of fragmentation, I still think that list-based is better overall
> > > without configuration.
> > I agree here.
> >
> > > The results above also represent the best possible
> > > configuration with zone-based versus no configuration at all against
> > > list-based. In an environment with changing workloads a constant reality,
> > > I bet that list-based would win overall.
> > >
> > On x86, NORMAL is only 896M anyway. there is no discussion.
> >
> 
> There is a discussion with architectures like ppc64 which do not have a
> normal zone (only ZONE_DMA) and 64 bit architectures that have very large
> normal zones.
> 
> Take ppc64 as an example. Today, when memory is hot-added, it is available
> for use by the kernel and userspace applications. Right now, hot-added
> memory goes to ZONE_DMA but it should be going to ZONE_EASYRCLM. In this
> case, the size of the kernel at the beginning is fixed. If you allow the
> kernel zone to grow, it cannot be shrunk again and worse, if the kernel
> expands to take up available memory, it loses all advantages.

Just for correction, ZONE_EASYRCLM is useful only for hot-remove.
So, if the kernel would like to have more memory, hot-adding to ZONE_DMA (if its
address is in the DMA area) or ZONE_NORMAL should be OK.
It is only that the new memory will not be able to be removed.

Thanks.

-- 
Yasunori Goto 





* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 13:28         ` [Lhms-devel] " Yasunori Goto
@ 2006-01-20 14:02           ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2006-01-20 14:02 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: KAMEZAWA Hiroyuki, Joel Schopp, linux-mm, linux-kernel, lhms-devel

On Fri, 20 Jan 2006, Yasunori Goto wrote:

> > > > So, in terms of performance on this set of tests, both approaches perform
> > > > roughly the same as the stock kernel in terms of absolute performance. In
> > > > terms of high-order allocations, zone-based appears to do better under
> > > > load. However, if you look at the zones that are used, you will see that
> > > > zone-based appears to do as well as list-based *only* because it has the
> > > > EASYRCLM zone to play with. list-based was way better at keeping the
> > > > normal zone defragmented as well as highmem which is especially obvious
> > > > when tested at rest.  list-based was able to allocate 83 huge pages from
> > > > ZONE_NORMAL at rest while zone-based only managed 8.
> > > >
> > > yes, this is an interesting point :)
> > > the list-based one can defrag the NORMAL zone.
> > > The point will be "do we need to defrag NORMAL?", I think.
> >
> > The original intention was two-fold. One, it helps HugeTLB in situations
> > where it was not configured correctly at boot-time. This is the case for a
> > number of sites running HPC-related jobs. The second objective was to help
> > high-order kernel allocations to potentially reduce things like
> > scatter-gather IO.
>
> Probably, Linus-san's wish is to reduce high-order kernel allocation
> to avoid fragmentation. (He did say defragmentation is meaningless, right?)

Right.

> If there is a driver/kernel component which requires a high-order
> allocation though physically contiguous memory is not necessary,
> it should be modified to collect individual pages instead.

Yes.

> (I guess there are some components like that. But I'm not sure....)
> If scatter-gather IO is the cause of bad performance,
> it might be desirable to try a high-order allocation first,
> then fall back to collecting pieces of pages which can be allocated.
>

Figures have never been produced to show that high-order allocations would
help performance for something like scatter/gather IO.

> It is just my guess.
> But some components might not be able to do it.
> If there are such components, that is a good reason for
> defragmentation....
>
> > > > On the flip side, zone-based code changes are easier to understand than
> > > > the list-based ones (at least in terms of volume of code changes). The
> > > > zone-based gives guarantees on what will happen in the future while
> > > > list-based is best-effort.
> > > >
> > > > In terms of fragmentation, I still think that list-based is better overall
> > > > without configuration.
> > > I agree here.
> > >
> > > > The results above also represent the best possible
> > > > configuration with zone-based versus no configuration at all against
> > > > list-based. In an environment with changing workloads a constant reality,
> > > > I bet that list-based would win overall.
> > > >
> > > On x86, NORMAL is only 896M anyway. there is no discussion.
> > >
> >
> > There is a discussion with architecutes like ppc64 which do not have a
> > normal zone (only ZONE_DMA) and 64 bit architectures that have very large
> > normal zones.
> >
> > Take ppc64 as an example. Today, when memory is hot-added, it is available
> > for use by the kernel and userspace applications. Right now, hot-added
> > memory goes to ZONE_DMA but it should be going to ZONE_EASYRCLM. In this
> > case, the size of the kernel at the beginning is fixed. If you allow the
> > kernel zone to grow, it cannot be shrunk again and worse, if the kernel
> > expands to take up available memory, it loses all advantages.
>
> Just for correction, ZONE_EASYRCLM is useful only for hot-remove.
> So, if the kernel would like to have more memory, hot-adding to ZONE_DMA (if its
> address is in the DMA area) or ZONE_NORMAL should be OK.
> It is only that the new memory will not be able to be removed.
>

My understanding is that choosing what zone to add memory to is not an
option. The main case where memory is hot-added and hot-removed is to meet
changing demands of the workload. The memory is hot-added and removed by
an automated system which, no matter how well written, will end up adding
memory to the wrong zone some of the time.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 10:40           ` KAMEZAWA Hiroyuki
@ 2006-01-20 14:53             ` Mel Gorman
  2006-01-20 18:10               ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2006-01-20 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Joel Schopp, Linux Memory Management List,
	Linux Kernel Mailing List, lhms-devel

On Fri, 20 Jan 2006, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > What sort of tests would you suggest? The tests I have been running to date
> > are
> >
> > "kbuild + aim9" for regression testing
> >
> > "updatedb + 7 -j1 kernel compiles + highorder allocation" for seeing how
> > easy it was to reclaim contiguous blocks
> >
> > What tests could be run that would be representative of real-world
> > workloads?
> >
>

Before I get writing, I want to be clear on what tests are considered
useful.

> 1. Using 1000+ processes(threads) at once

Would tiobench --threads be suitable or would the IO skew what you are
looking for? If the IO is a problem, what would you recommend instead?

> 2. heavy network load.

Would iperf be suitable?

> 3. running NFS

Is running a kernel build over NFS reasonable? Should it be a remote NFS
server or could I setup a NFS share and mount it locally? If a kernel
build is not suitable, would tiobench over NFS be a better plan?

> is maybe good.
>
> > > > > And, for people who want to remove a range of memory, the list-based
> > > > > approach will need some other hook and its flexibility is of no use.
> > > > > (If the list-based approach goes in, I or someone will do.)
> > > > >
> > > > Will do what?
> > > >
> > > add kernelcore= boot option and so on :)
> > > As you say, "In an ideal world, we would have both".
> > >
> >
> > List-based was frowned upon for adding complexity to the main path so we may
> > not get list-based built on top of zone-based even though it is certainly
> > possible. One reason to do zone-based was to do a comparison between them
> > in terms of complexity. Hopefully, Nick Piggin (as the first big objector
> > to the list-based approach) will make some sort of comment on what he
> > thinks of zone-based in comparison to list-based.
> >
> I think there is another point.
>
> What I am concerned about is this comment of Linus's:
> > My point is that regardless of what you _want_, defragmentation is
> > _useless_. It's useless simply because for big areas it is so expensive as
> > to be impractical.
>
> You should have your own answer to this before posting.
>

If it was expensive in terms of absolute performance, then the aim9 figures
would have suffered badly and kbuild would also be hit. It has never been
shown that the list-based approach incurred a serious performance loss.
Similarly, I have not been able to find a performance loss with zone-based
unless kernelcore was a small number. As both approaches reduce
fragmentation without a measurable performance loss, I disagree that
defragmentation is "useless simply because for big areas it is so
expensive as to be impractical".

For "big areas", the issue is how big. When list-based was last released,
people's view of "big" was the size of a bank of memory that was about to
be physically removed. AFAIK, that scenario is not as important as it was
because it comes with a host of difficult problems.

The scenario people really care about (someone correct me if I'm wrong
here) for hot-remove is giving virtual machines more or less memory as
demand requires. In this case, the "big" area of memory required is the
same size as a sparsemem section - 16MiB on the ppc64 and 64MiB on the x86
(I think). Also, for hot-remove, it does not really matter where in the
zone the chunk is, as long as it is free. For ppc64, 16MiB of contiguous
memory is reasonably easy to get with the list-based approach and the case
would likely be the same for x86 if the value of MAX_ORDER was increased.
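
As a sanity check on the MAX_ORDER remark, the arithmetic works out as follows. The 4KiB page size and the default MAX_ORDER of 11 (largest buddy block of 2^10 pages) are assumptions about the kernels of the time, not figures stated in the mail:

```python
import math

# Buddy order needed to cover one sparsemem section in a single
# allocation, assuming 4KiB pages. With MAX_ORDER = 11 the largest
# buddy block is 2^(MAX_ORDER - 1) pages.
PAGE_SIZE = 4 * 1024
MAX_ORDER = 11

def order_for(section_bytes):
    return int(math.log2(section_bytes // PAGE_SIZE))

largest_block = PAGE_SIZE << (MAX_ORDER - 1)
for mib in (16, 64):                   # section sizes mentioned above
    order = order_for(mib * 1024 ** 2)
    print(f"{mib}MiB section needs order {order}; "
          f"largest buddy block is {largest_block // 1024 ** 2}MiB")
```

A 16MiB section needs an order-12 block while the buddy allocator tops out at order-10 (4MiB), which is why MAX_ORDER would have to be raised for a section-sized chunk to come from a single allocation.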

Both list-based and zone-based give the large chunks that could be
removed, but with list-based, hot-added memory is usable by the
kernel, while zone-based only gives more memory to userspace processes.
If the workload requires more kernel memory, you are out of luck.

The ideal would be for bits of both to be available, but I find it difficult
to believe that will ever make it in. I believe that list-based is better
overall because it's flexible and it helps both kernel and userspace.
However, it affects the main allocator code paths and it was greeted with
a severe kicking. Zone-based is less invasive, more predictable, probably
has a better chance of getting in, but it requires careful tuning and the
kernel cannot use hot-added memory at run-time.

I'd prefer the list-based approach to make it in, but zone-based is better
than nothing.

> From the old threads (very long!), I think one of the points was:
> to use hugepages, the sysadmin can specify what he wants at boot time.
> This guarantees 100% allocation of the needed huge pages.
> Why can't memory hotplug specify "how much can be removed" before booting?
> This would guarantee 100% memory hot-remove.
>

One distinction is that memory reserved for huge pages is not the same as
memory placed in ZONE_EASYRCLM. Critically, pages in ZONE_EASYRCLM cannot
be used by the kernel for slabs, buffer pages etc. On the x86, this is not
a big issue because ZONE_HIGHMEM cannot be used by the kernel anyway, but on
architectures that can use all memory (or at least large portions of it)
as ZONE_NORMAL, it is a big problem.

> I think hugetlb and memory hotplug cannot be a good reason for defragmentation.
>

Right now, they are the main reasons for needing low fragmentation.

> Finding the reason for defragmentation is good.
> Unfortunately, I don't know of cases of memory allocation failure
> caused by fragmentation with recent kernels.
>
> -- Kame
>
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Lhms-devel] Re: [PATCH 0/5] Reducing fragmentation using zones
  2006-01-20 14:53             ` Mel Gorman
@ 2006-01-20 18:10               ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2006-01-20 18:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Joel Schopp, Linux Memory Management List,
	Linux Kernel Mailing List, lhms-devel

Mel Gorman wrote:
> On Fri, 20 Jan 2006, KAMEZAWA Hiroyuki wrote:
>>1. Using 1000+ processes(threads) at once
> 
> 
> Would tiobench --threads be suitable or would the IO skew what you are
> looking for? If the IO is a problem, what would you recommend instead?
> 
What I'm looking for is the slab usage that comes with many threads/procs.

> 
>>2. heavy network load.
> 
> 
> Would iperf be suitable?
> 
maybe
> 
>>3. running NFS
> 
> 
> Is running a kernel build over NFS reasonable? Should it be a remote NFS
> server or could I setup a NFS share and mount it locally? If a kernel
> build is not suitable, would tiobench over NFS be a better plan?
> 
I considered doing a kernel build on NFS mounted locally.


> The scenario people really care about (someone correct me if I'm wrong
> here) for hot-remove is giving virtual machines more or less memory as
> demand requires. In this case, the "big"  area of memory required is the
> same size as a sparsemem section - 16MiB on the ppc64 and 64MiB on the x86
> (I think). Also, for hot-remove, it does not really matter where in the
> zone the chunk is, as long as it is free. For ppc64, 16MiB of contiguous
> memory is reasonably easy to get with the list-based approach and the case
> would likely be the same for x86 if the value of MAX_ORDER was increased.
> 
What I want is just node hotplug on NUMA, removing a physical range of memory.
So I'll need, and will push for, dividing memory into removable zones or
pgdats anyway. For people who just want resizing, what you say is the main
reason for hotplug.

-- Kame



end of thread, other threads:[~2006-01-20 18:12 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-19 19:08 [PATCH 0/5] Reducing fragmentation using zones Mel Gorman
2006-01-19 19:08 ` [PATCH 1/5] Add __GFP_EASYRCLM flag and update callers Mel Gorman
2006-01-19 19:08 ` [PATCH 2/5] Create the ZONE_EASYRCLM zone Mel Gorman
2006-01-19 19:09 ` [PATCH 3/5] x86 - Specify amount of kernel memory at boot time Mel Gorman
2006-01-19 19:09 ` [PATCH 4/5] ppc64 " Mel Gorman
2006-01-19 19:09 ` [PATCH 5/5] ForTesting - Prevent OOM killer firing for high-order allocations Mel Gorman
2006-01-19 19:24 ` [PATCH 0/5] Reducing fragmentation using zones Joel Schopp
2006-01-20  0:13   ` [Lhms-devel] " KAMEZAWA Hiroyuki
2006-01-20  1:09     ` Mel Gorman
2006-01-20  1:25       ` KAMEZAWA Hiroyuki
2006-01-20  9:44         ` Mel Gorman
2006-01-20 10:40           ` KAMEZAWA Hiroyuki
2006-01-20 14:53             ` Mel Gorman
2006-01-20 18:10               ` Kamezawa Hiroyuki
2006-01-20 12:08           ` Yasunori Goto
2006-01-20 12:25             ` Mel Gorman
2006-01-20 13:22               ` Yasunori Goto
2006-01-20  0:42   ` Mel Gorman
2006-01-20  1:18     ` KAMEZAWA Hiroyuki
2006-01-20 12:03       ` Mel Gorman
2006-01-20 13:28         ` [Lhms-devel] " Yasunori Goto
2006-01-20 14:02           ` Mel Gorman
