* [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
@ 2007-08-08 16:15 Mel Gorman
  2007-08-08 16:15 ` [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 16:15 UTC (permalink / raw)
  To: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace the multiple zonelists per node with one
zonelist that is filtered based on the GFP flags. The patches as a set fix
a bug in the interaction between MPOL_BIND and ZONE_MOVABLE. With this
patchset, MPOL_BIND applies to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered an alternative fix for the
MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack
that filters only custom zonelists. As a bonus, the patchset reduces the
cache footprint of the kernel and should improve performance in a number
of cases.
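
As a rough illustration of what the end result looks like for callers (a
sketch using the iterator introduced in patch 2, not a literal excerpt;
nid and gfp_mask stand in for whatever the caller has at hand):

	struct zonelist *zonelist = &NODE_DATA(nid)->node_zonelist;
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone **z;
	struct zone *zone;

	/* only zones with zone_idx(zone) <= high_zoneidx are visited */
	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		/* ... examine or allocate from zone ... */
	}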

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other callers use the zonelist itself.

The second patch replaces multiple zonelists with one zonelist that is
filtered.

The final patch is a fix that depends on the previous two patches. The
patch changes policy zone so that the MPOL_BIND policy gets applied
to the two highest populated zones when the highest populated zone is
ZONE_MOVABLE. Otherwise, MPOL_BIND only applies to the highest populated zone.

The set passed regression tests with numactltest. Performance results
varied depending on the machine configuration but usually showed small
gains. The new algorithm relies heavily on the implementation of
zone_idx(), which is currently pretty expensive. Experiments to optimise
this have shown significant improvements for this algorithm, but that is
beyond the scope of this patchset. Due to the nature of the change, the
results for other people are likely to vary - it'll usually win but
occasionally lose.

In real workloads the gain/loss will depend on how much the userspace
portion of the benchmark benefits from having more cache available due
to reduced referencing of zonelists. I expect it'll be more noticeable on
x86_64 with many zones than on IA64, which typically would only have one
active zonelist per node.

These are the range of performance losses/gains I found when running against
2.6.23-rc1-mm2. The machines tested were a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.20% to  3.70%
Elapsed   time on Kernbench: -0.32% to  3.62%
page_test from aim9:         -2.17% to 12.42%
brk_test  from aim9:         -6.03% to 11.49%
fork_test from aim9:         -2.30% to  5.42%
exec_test from aim9:         -0.68% to  3.39%
Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)

The TBench figures were too variable between runs to draw conclusions from,
but there didn't appear to be any regressions there. The hackbench results
for both sockets and pipes were within noise. I haven't gone through lmbench.

These three patches are a standalone set which addresses the MPOL_BIND
problem with ZONE_MOVABLE as well as reducing memory usage and, in many
cases, the cache footprint of the kernel. They should be considered a bug
fix due to the MPOL_BIND fixup.

If these patches are accepted, the follow-on work would entail:

o Encode zone_id in the zonelist pointers to avoid zone_idx() (Christoph's idea; see the sketch below)
o If zone_id works out, eliminate z_to_n from the zonelist cache as unnecessary
o Remove bind_zonelist() (Patch in progress, very messy right now)
o Eliminate policy_zone (Trickier)
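
As a rough illustration of the zone_id encoding idea (purely a hypothetical
sketch, nothing here is implemented; it assumes struct zone pointers have at
least three free low bits, i.e. at least 8-byte alignment):

/*
 * Hypothetical sketch: store the zone index in the low bits of the zone
 * pointers held in the zonelist so that iterators can filter on the index
 * without calling zone_idx() or touching struct zone at all.
 */
#define ZONELIST_ZONEIDX_BITS	3
#define ZONELIST_ZONEIDX_MASK	((1UL << ZONELIST_ZONEIDX_BITS) - 1)

static inline unsigned long encode_zonelist_entry(struct zone *zone)
{
	return (unsigned long)zone | zone_idx(zone);
}

static inline struct zone *zonelist_entry_zone(unsigned long entry)
{
	return (struct zone *)(entry & ~ZONELIST_ZONEIDX_MASK);
}

static inline int zonelist_entry_zoneidx(unsigned long entry)
{
	return entry & ZONELIST_ZONEIDX_MASK;
}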

Comments?
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages
  2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
@ 2007-08-08 16:15 ` Mel Gorman
  2007-08-08 17:38   ` Christoph Lameter
  2007-08-08 16:15 ` [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 16:15 UTC (permalink / raw)
  To: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm


The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |    4 +++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/include/linux/swap.h linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.23-rc1-mm2-clean/include/linux/swap.h	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/swap.h	2007-08-08 11:35:00.000000000 +0100
@@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/mm/page_alloc.c linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc1-mm2-clean/mm/page_alloc.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c	2007-08-08 11:35:00.000000000 +0100
@@ -1644,7 +1644,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-clean/mm/vmscan.c linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.23-rc1-mm2-clean/mm/vmscan.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c	2007-08-08 11:35:00.000000000 +0100
@@ -1127,7 +1127,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	int priority;
 	int ret = 0;
@@ -1135,6 +1136,7 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,


* [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists
  2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
  2007-08-08 16:15 ` [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
@ 2007-08-08 16:15 ` Mel Gorman
  2007-08-08 17:46   ` Christoph Lameter
  2007-08-08 16:16 ` [PATCH 3/3] Apply MPOL_BIND policy to two highest zones when highest is ZONE_MOVABLE Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 16:15 UTC (permalink / raw)
  To: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm


Currently a node has a number of zonelists, one for each zone type in
the system. Based on the zones allowed by a gfp mask, one of these zonelists
is selected. All of these zonelists occupy memory and consume cache lines.

This patch replaces the multiple zonelists in the node with a single
zonelist that contains all populated zones in the system. An iterator macro
called for_each_zone_zonelist() is introduced which iterates through each
zone in the zonelist that is allowed by the GFP flags.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 arch/parisc/mm/init.c     |   11 ++-
 drivers/char/sysrq.c      |    3 
 fs/buffer.c               |    5 -
 include/linux/gfp.h       |    3 
 include/linux/mempolicy.h |    2 
 include/linux/mmzone.h    |   33 +++++++++
 mm/mempolicy.c            |    6 -
 mm/oom_kill.c             |    8 +-
 mm/page_alloc.c           |  138 ++++++++++++++++++-----------------------
 mm/slab.c                 |   11 +--
 mm/slub.c                 |   11 +--
 mm/vmscan.c               |   14 ++--
 12 files changed, 133 insertions(+), 112 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc1-mm2-010_use_zonelist/arch/parisc/mm/init.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/arch/parisc/mm/init.c	2007-07-22 21:41:00.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/arch/parisc/mm/init.c	2007-08-08 11:35:09.000000000 +0100
@@ -599,15 +599,18 @@ void show_mem(void)
 #ifdef CONFIG_DISCONTIGMEM
 	{
 		struct zonelist *zl;
-		int i, j, k;
+		int i, j;
 
 		for (i = 0; i < npmem_ranges; i++) {
+			zl = &NODE_DATA(i)->node_zonelist;
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				zl = NODE_DATA(i)->node_zonelists + j;
+				struct zone **z;
+				struct zone *zone;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
-				for (k = 0; zl->zones[k] != NULL; k++) 
-					printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+				for_each_zone_zonelist(zone, z, zl, j)
+					printk("[%d/%s] ",
+						zone_to_nid(zone), zone->name);
 				printk("\n");
 			}
 		}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/drivers/char/sysrq.c linux-2.6.23-rc1-mm2-010_use_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/drivers/char/sysrq.c	2007-08-07 14:45:09.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/drivers/char/sysrq.c	2007-08-08 11:36:29.000000000 +0100
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
-			GFP_KERNEL, 0);
+	out_of_memory(&NODE_DATA(0)->node_zonelist, GFP_KERNEL, 0);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/fs/buffer.c linux-2.6.23-rc1-mm2-010_use_zonelist/fs/buffer.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/fs/buffer.c	2007-08-07 14:45:10.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/fs/buffer.c	2007-08-08 13:42:03.000000000 +0100
@@ -362,9 +362,10 @@ static void free_more_memory(void)
 	yield();
 
 	for_each_online_pgdat(pgdat) {
-		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+		zones = first_zones_zonelist(&pgdat->node_zonelist,
+			gfp_zone(GFP_NOFS));
 		if (*zones)
-			try_to_free_pages(zones, 0, GFP_NOFS);
+			try_to_free_pages(&pgdat->node_zonelist, 0, GFP_NOFS);
 	}
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/gfp.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/gfp.h	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/gfp.h	2007-08-08 11:35:09.000000000 +0100
@@ -175,8 +175,7 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order,
-		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+	return __alloc_pages(gfp_mask, order, &NODE_DATA(nid)->node_zonelist);
 }
 
 #ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mempolicy.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mempolicy.h	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h	2007-08-08 11:35:09.000000000 +0100
@@ -240,7 +240,7 @@ static inline void mpol_fix_fork_child_f
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr, gfp_t gfp_flags)
 {
-	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+	return &NODE_DATA(0)->node_zonelist;
 }
 
 static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mmzone.h linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/include/linux/mmzone.h	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mmzone.h	2007-08-08 13:45:55.000000000 +0100
@@ -469,7 +469,7 @@ extern struct page *mem_map;
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_NR_ZONES];
+	struct zonelist node_zonelist;
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 	struct page *node_mem_map;
@@ -670,6 +670,37 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+/* Returns the first zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+					enum zone_type highest_zoneidx)
+{
+	struct zone **z;
+	for (z = zonelist->zones; *z && zone_idx(*z) > highest_zoneidx; z++);
+	return z;
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **next_zones_zonelist(struct zone **z,
+					enum zone_type highest_zoneidx)
+{
+	for (++z; *z && zone_idx(*z) > highest_zoneidx; z++);
+	return z;
+}
+
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone: The current zone in the iterator
+ * @z: The current pointer within zonelist->zones being iterated
+ * @zlist: The zonelist being iterated
+ * @highidx: The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for (z = first_zones_zonelist(zlist, highidx), zone = *z;	\
+		zone;							\
+		z = next_zones_zonelist(z, highidx), zone = *z)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/mempolicy.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/mempolicy.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c	2007-08-08 11:35:09.000000000 +0100
@@ -1120,7 +1120,7 @@ static struct zonelist *zonelist_policy(
 		nd = 0;
 		BUG();
 	}
-	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+	return &NODE_DATA(nd)->node_zonelist;
 }
 
 /* Do dynamic interleaving for a process */
@@ -1216,7 +1216,7 @@ struct zonelist *huge_zonelist(struct vm
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
-		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+		return &NODE_DATA(nid)->node_zonelist;
 	}
 	return zonelist_policy(GFP_HIGHUSER, pol);
 }
@@ -1230,7 +1230,7 @@ static struct page *alloc_page_interleav
 	struct zonelist *zl;
 	struct page *page;
 
-	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+	zl = &NODE_DATA(nid)->node_zonelist;
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0])
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c	2007-08-08 11:35:09.000000000 +0100
@@ -177,8 +177,10 @@ static inline int constrained_alloc(stru
 {
 #ifdef CONFIG_NUMA
 	struct zone **z;
+	struct zone *zone;
 	nodemask_t nodes;
 	int node;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 
 	nodes_clear(nodes);
 	/* node has memory ? */
@@ -186,9 +188,9 @@ static inline int constrained_alloc(stru
 		if (NODE_DATA(node)->node_present_pages)
 			node_set(node, nodes);
 
-	for (z = zonelist->zones; *z; z++)
-		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
-			node_clear(zone_to_nid(*z), nodes);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
+			node_clear(zone_to_nid(zone), nodes);
 		else
 			return CONSTRAINT_CPUSET;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/page_alloc.c	2007-08-08 11:35:00.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/page_alloc.c	2007-08-08 14:36:42.000000000 +0100
@@ -1416,6 +1416,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	struct page *page = NULL;
 	int classzone_idx = zone_idx(zonelist->zones[0]);
 	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
@@ -1425,13 +1426,10 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	z = zonelist->zones;
-
-	do {
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		zone = *z;
 		if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
 			zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
 				break;
@@ -1468,7 +1466,7 @@ try_next_zone:
 			zlc_active = 1;
 			did_zlc_setup = 1;
 		}
-	} while (*(++z) != NULL);
+	}
 
 	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
 		/* Disable zlc cache for second zonelist scan */
@@ -1781,15 +1779,16 @@ EXPORT_SYMBOL(free_pages);
 
 static unsigned int nr_free_zone_pages(int offset)
 {
+	enum zone_type high_zoneidx = offset;
+	struct zone **z;
+	struct zone *zone;
+
 	/* Just pick one node, since fallback list is circular */
 	pg_data_t *pgdat = NODE_DATA(numa_node_id());
 	unsigned int sum = 0;
+	struct zonelist *zonelist = &pgdat->node_zonelist;
 
-	struct zonelist *zonelist = pgdat->node_zonelists + offset;
-	struct zone **zonep = zonelist->zones;
-	struct zone *zone;
-
-	for (zone = *zonep++; zone; zone = *zonep++) {
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		unsigned long size = zone->present_pages;
 		unsigned long high = zone->pages_high;
 		if (size > high)
@@ -2148,17 +2147,14 @@ static int find_next_best_node(int node,
  */
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		for (j = 0; zonelist->zones[j] != NULL; j++)
-			;
- 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelist;
+	for (j = 0; zonelist->zones[j] != NULL; j++)
+		;
+	j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -2171,27 +2167,24 @@ static int node_order[MAX_NUMNODES];
 
 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
-	enum zone_type i;
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
 	struct zone *z;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		pos = 0;
-		for (zone_type = i; zone_type >= 0; zone_type--) {
-			for (j = 0; j < nr_nodes; j++) {
-				node = node_order[j];
-				z = &NODE_DATA(node)->node_zones[zone_type];
-				if (populated_zone(z)) {
-					zonelist->zones[pos++] = z;
-					check_highest_zone(zone_type);
-				}
+	zonelist = &pgdat->node_zonelist;
+	pos = 0;
+	for (zone_type = MAX_NR_ZONES-1; zone_type >= 0; zone_type--) {
+		for (j = 0; j < nr_nodes; j++) {
+			node = node_order[j];
+			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				zonelist->zones[pos++] = z;
+				check_highest_zone(zone_type);
 			}
 		}
-		zonelist->zones[pos] = NULL;
 	}
+	zonelist->zones[pos] = NULL;
 }
 
 static int default_zonelist_order(void)
@@ -2258,17 +2251,14 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int j, node, load;
-	enum zone_type i;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
 	int order = current_zonelist_order;
 
-	/* initialize zonelists */
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
-	}
+	/* initialize zonelist */
+	zonelist = &pgdat->node_zonelist;
+	zonelist->zones[0] = NULL;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -2316,18 +2306,15 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	int i;
+	struct zonelist *zonelist;
+	struct zonelist_cache *zlc;
+	struct zone **z;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
-		struct zonelist_cache *zlc;
-		struct zone **z;
-
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
-		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-		for (z = zonelist->zones; *z; z++)
-			zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
-	}
+	zonelist = &pgdat->node_zonelist;
+	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+	for (z = zonelist->zones; *z; z++)
+		zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
 }
 
 
@@ -2341,45 +2328,42 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type i,j;
+	enum zone_type j;
+	struct zonelist *zonelist;
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
-
-		zonelist = pgdat->node_zonelists + i;
 
- 		j = build_zonelists_node(pgdat, zonelist, 0, i);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
+	zonelist = &pgdat->node_zonelist;
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES-1);
 
-		zonelist->zones[j] = NULL;
+	/*
+	 * Now we build the zonelist so that it contains the zones
+	 * of all the other nodes.
+	 * We don't want to pressure a particular node, so when
+	 * building the zones for node N, we make sure that the
+	 * zones coming right after the local ones are those from
+	 * node N+1 (modulo N)
+	 */
+	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+								MAX_NR_ZONES-1);
+	}
+	for (node = 0; node < local_node; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+								MAX_NR_ZONES-1);
 	}
+
+	zonelist->zones[j] = NULL;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		pgdat->node_zonelists[i].zlcache_ptr = NULL;
+	pgdat->node_zonelist.zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slab.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slab.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slab.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slab.c	2007-08-08 11:35:09.000000000 +0100
@@ -3239,14 +3239,15 @@ static void *fallback_alloc(struct kmem_
 	struct zonelist *zonelist;
 	gfp_t local_flags;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-			->node_zonelists[gfp_zone(flags)];
+	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
 	local_flags = (flags & GFP_LEVEL_MASK);
 
 retry:
@@ -3254,10 +3255,10 @@ retry:
 	 * Look through allowed nodes for objects available
 	 * from existing per node queues.
 	 */
-	for (z = zonelist->zones; *z && !obj; z++) {
-		nid = zone_to_nid(*z);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		nid = zone_to_nid(zone);
 
-		if (cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (cpuset_zone_allowed_hardwall(zone, flags) &&
 			cache->nodelists[nid] &&
 			cache->nodelists[nid]->free_objects)
 				obj = ____cache_alloc_node(cache,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slub.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slub.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/slub.c	2007-08-07 14:45:11.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/slub.c	2007-08-08 11:35:09.000000000 +0100
@@ -1285,6 +1285,8 @@ static struct page *get_any_partial(stru
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
 
 	/*
@@ -1308,14 +1310,13 @@ static struct page *get_any_partial(stru
 	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-					->node_zonelists[gfp_zone(flags)];
-	for (z = zonelist->zones; *z; z++) {
+	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		struct kmem_cache_node *n;
 
-		n = get_node(s, zone_to_nid(*z));
+		n = get_node(s, zone_to_nid(zone));
 
-		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > MIN_PARTIAL) {
 			page = get_partial_node(n);
 			if (page)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmscan.c
--- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/vmscan.c	2007-08-08 11:35:00.000000000 +0100
+++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/vmscan.c	2007-08-08 11:35:09.000000000 +0100
@@ -1136,8 +1136,10 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
-	struct zone **zones = zonelist->zones;
+	struct zone **z;
+	struct zone *zone;
 	int i;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1150,9 +1152,7 @@ unsigned long try_to_free_pages(struct z
 	delay_swap_prefetch();
 	count_vm_event(ALLOCSTALL);
 
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
 
@@ -1164,7 +1164,7 @@ unsigned long try_to_free_pages(struct z
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed += shrink_zones(priority, zonelist->zones, &sc);
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
 		if (reclaim_state) {
 			nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1206,8 +1206,8 @@ out:
 	 */
 	if (priority < 0)
 		priority = 0;
-	for (i = 0; zones[i] != 0; i++) {
-		struct zone *zone = zones[i];
+	for (i = 0; zonelist->zones[i] != 0; i++) {
+		struct zone *zone = zonelist->zones[i];
 
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;


* [PATCH 3/3] Apply MPOL_BIND policy to two highest zones when highest is ZONE_MOVABLE
  2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
  2007-08-08 16:15 ` [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
  2007-08-08 16:15 ` [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
@ 2007-08-08 16:16 ` Mel Gorman
  2007-08-08 17:36 ` [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Christoph Lameter
  2007-08-09 20:19 ` Andrew Morton
  4 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 16:16 UTC (permalink / raw)
  To: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm


The NUMA layer only supports the MPOL_BIND NUMA policy for the highest
zone. When ZONE_MOVABLE is configured with kernelcore=, the highest
zone becomes ZONE_MOVABLE. The result is that the bind policy is only
applied to allocations like anonymous pages and page cache that are
allocated from ZONE_MOVABLE when that zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, these two zones are equivalent in policy terms.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h linux-2.6.23-rc1-mm2-015_treat_movable_highest/include/linux/mempolicy.h
--- linux-2.6.23-rc1-mm2-010_use_zonelist/include/linux/mempolicy.h	2007-08-08 11:35:09.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_treat_movable_highest/include/linux/mempolicy.h	2007-08-08 14:36:52.000000000 +0100
@@ -157,7 +157,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone)
+	if (k > policy_zone && k != ZONE_MOVABLE)
 		policy_zone = k;
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c linux-2.6.23-rc1-mm2-015_treat_movable_highest/mm/mempolicy.c
--- linux-2.6.23-rc1-mm2-010_use_zonelist/mm/mempolicy.c	2007-08-08 11:35:09.000000000 +0100
+++ linux-2.6.23-rc1-mm2-015_treat_movable_highest/mm/mempolicy.c	2007-08-08 14:36:52.000000000 +0100
@@ -151,7 +151,7 @@ static struct zonelist *bind_zonelist(no
 	   lower zones etc. Avoid empty zones because the memory allocator
 	   doesn't like them. If you implement node hot removal you
 	   have to fix that. */
-	k = policy_zone;
+	k = MAX_NR_ZONES - 1;
 	while (1) {
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];


* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2007-08-08 16:16 ` [PATCH 3/3] Apply MPOL_BIND policy to two highest zones when highest is ZONE_MOVABLE Mel Gorman
@ 2007-08-08 17:36 ` Christoph Lameter
  2007-08-08 18:30   ` Lee Schermerhorn
  2007-08-08 21:04   ` Mel Gorman
  2007-08-09 20:19 ` Andrew Morton
  4 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 17:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Mel Gorman wrote:

> These are the range of performance losses/gains I found when running against
> 2.6.23-rc1-mm2. The set and these machines are a mix of i386, x86_64 and
> ppc64 both NUMA and non-NUMA.
> 
> Total CPU time on Kernbench: -0.20% to  3.70%
> Elapsed   time on Kernbench: -0.32% to  3.62%
> page_test from aim9:         -2.17% to 12.42%
> brk_test  from aim9:         -6.03% to 11.49%
> fork_test from aim9:         -2.30% to  5.42%
> exec_test from aim9:         -0.68% to  3.39%
> Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)

Looks good.

> o Remove bind_zonelist() (Patch in progress, very messy right now)

Will this also allow us to avoid always hitting the first node of an 
MPOL_BIND first?

> o Eliminate policy_zone (Trickier)

I doubt that this is possible given

1. We need lower zones (DMA) in various contexts

2. Those DMA zones are only available on particular nodes.

Policy_zone could be made to only control allocations from the highest (and,
with ZONE_MOVABLE, the second highest) zone on a node?

Think about the 8GB x86_64 configuration I mentioned earlier

node 0  up to 2 GB 		ZONE_DMA and ZONE_DMA32
node 1  up to 4 GB		ZONE_DMA32
node 2  up to 6 GB		ZONE_NORMAL
node 3  up to 8 GB		ZONE_NORMAL

If one wants the node restrictions to work on all nodes then we need to 
apply policy depending on the highest zone of the node.

Current MPOL_BIND would only apply policy to allocations on nodes 2 and 3.

With ZONE_MOVABLE splitting the highest zone (We will likely need that):

node 0  up to 2 GB              ZONE_DMA and ZONE_DMA32, ZONE_MOVABLE
node 1  up to 4 GB              ZONE_DMA32, ZONE_MOVABLE
node 2  up to 6 GB              ZONE_NORMAL, ZONE_MOVABLE
node 3  up to 8 GB              ZONE_NORMAL, ZONE_MOVABLE

So then the two highest zones on each node would need to be subject to 
policy control.

Another thing we may want to think about is evolving ZONE_MOVABLE to be
more like the antifrag sections. That way we may be able to avoid the
multiple types of pages on the pcp lists. That would work if we only had
two page types: movable and unmovable (fold reclaimable into movable after
slab defrag).

Blocks of memory would then be movable between ZONE_MOVABLE and the other
zones. At that point we are almost at the functionality that antifrag
offers and we may have simplified things a bit.





* Re: [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages
  2007-08-08 16:15 ` [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
@ 2007-08-08 17:38   ` Christoph Lameter
  2007-08-08 21:06     ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 17:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

Good idea. Maybe it could be taken further by passing the zonelist also 
into shrink_zones()?




* Re: [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists
  2007-08-08 16:15 ` [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
@ 2007-08-08 17:46   ` Christoph Lameter
  2007-08-08 21:10     ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 17:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Mel Gorman wrote:

>  		for (i = 0; i < npmem_ranges; i++) {
> +			zl = &NODE_DATA(i)->node_zonelist;

The above shows up again and again. Maybe add a new inline function?

struct zonelist *zonelist_node(int node)


>  {
> -	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> +	return &NODE_DATA(0)->node_zonelist;

How many callers of gfp_zone are left? Do we still need the function?

Note that the memoryless_node patchset modifies gfp_zone and adds some 
more zonelists (sigh).

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c
> --- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c	2007-08-07 14:45:11.000000000 +0100
> +++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c	2007-08-08 11:35:09.000000000 +0100
> @@ -177,8 +177,10 @@ static inline int constrained_alloc(stru
>  {
>  #ifdef CONFIG_NUMA
>  	struct zone **z;
> +	struct zone *zone;
>  	nodemask_t nodes;
>  	int node;
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>  
>  	nodes_clear(nodes);
>  	/* node has memory ? */
> @@ -186,9 +188,9 @@ static inline int constrained_alloc(stru
>  		if (NODE_DATA(node)->node_present_pages)
>  			node_set(node, nodes);
>  
> -	for (z = zonelist->zones; *z; z++)
> -		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
> -			node_clear(zone_to_nid(*z), nodes);
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> +		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> +			node_clear(zone_to_nid(zone), nodes);
>  		else
>  			return CONSTRAINT_CPUSET;
>  

The above portion has already been changed to no longer use a zonelist by 
the memoryless_node patchset in mm.

> +	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>  		struct kmem_cache_node *n;
>  
> -		n = get_node(s, zone_to_nid(*z));
> +		n = get_node(s, zone_to_nid(zone));

Encoding the node in the zonelist pointer would help these loops but they 
are fallback lists and not on the critical path.




* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 17:36 ` [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Christoph Lameter
@ 2007-08-08 18:30   ` Lee Schermerhorn
  2007-08-08 21:44     ` Mel Gorman
  2007-08-08 21:04   ` Mel Gorman
  1 sibling, 1 reply; 23+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 18:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 2007-08-08 at 10:36 -0700, Christoph Lameter wrote:
> On Wed, 8 Aug 2007, Mel Gorman wrote:
> 
> > These are the range of performance losses/gains I found when running against
> > 2.6.23-rc1-mm2. The set and these machines are a mix of i386, x86_64 and
> > ppc64 both NUMA and non-NUMA.
> > 
> > Total CPU time on Kernbench: -0.20% to  3.70%
> > Elapsed   time on Kernbench: -0.32% to  3.62%
> > page_test from aim9:         -2.17% to 12.42%
> > brk_test  from aim9:         -6.03% to 11.49%
> > fork_test from aim9:         -2.30% to  5.42%
> > exec_test from aim9:         -0.68% to  3.39%
> > Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)
> 
> Looks good.
> 
> > o Remove bind_zonelist() (Patch in progress, very messy right now)
> 
> Will this also allow us to avoid always hitting the first node of an 
> MPOL_BIND first?

An idea:

Apologies if someone already suggested this and I missed it.  Too much
traffic...

instead of passing a zonelist for BIND policy, how about passing [to
__alloc_pages(), I think] a starting node, a nodemask, and gfp flags for
zone and modifiers.  For various policies, the arguments would look like
this:

Policy		start node	nodemask

default		local node	cpuset_current_mems_allowed

preferred	preferred_node	cpuset_current_mems_allowed

interleave	computed node	cpuset_current_mems_allowed

bind		local node	policy nodemask [replaces bind
				zonelist in mempolicy]

Then, just walk the zonelist for the starting node--already ordered by
distance--filtering by gfp_zone() and nodemask.  Done "right", this
should always return memory from the closest allowed node [based on the
nodemask argument] to the starting node.  And, it would eliminate the
custom zonelists for bind policy.  Can also eliminate cpuset checks in
the allocation loop because that constraint would already be applied to
the nodemask argument.
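
Something like this, purely as an illustrative sketch (start_node, gfp_mask
and nodemask stand in for the arguments described above):

	struct zonelist *zonelist = &NODE_DATA(start_node)->node_zonelist;
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone **z;
	struct zone *zone;

	/*
	 * Walk the distance-ordered zonelist, skipping zones that are too
	 * high for the gfp mask or on nodes outside the nodemask.
	 */
	for (z = zonelist->zones; (zone = *z) != NULL; z++) {
		if (zone_idx(zone) > high_zoneidx)
			continue;
		if (!node_isset(zone_to_nid(zone), *nodemask))
			continue;
		/* zone is a candidate, closest allowed node first */
	}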

The fast path--when we hit in the target zone on the starting
node--might be faster.  Once we have to start falling back to other
nodes/zones, we've pretty much fallen off the fast path anyway, I think.

Bind policy would suffer a hit when the nodemask does not include the
local node from which the allocation occurs.  I.e., this would always be
a fallback case.

Too backed up to investigate further right now.  

I will add Mel's patches to my test tree, tho'.

Lee



* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 17:36 ` [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Christoph Lameter
  2007-08-08 18:30   ` Lee Schermerhorn
@ 2007-08-08 21:04   ` Mel Gorman
  2007-08-08 23:26     ` Christoph Lameter
  1 sibling, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 21:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On (08/08/07 10:36), Christoph Lameter didst pronounce:
> On Wed, 8 Aug 2007, Mel Gorman wrote:
> 
> > These are the range of performance losses/gains I found when running against
> > 2.6.23-rc1-mm2. The set and these machines are a mix of i386, x86_64 and
> > ppc64 both NUMA and non-NUMA.
> > 
> > Total CPU time on Kernbench: -0.20% to  3.70%
> > Elapsed   time on Kernbench: -0.32% to  3.62%
> > page_test from aim9:         -2.17% to 12.42%
> > brk_test  from aim9:         -6.03% to 11.49%
> > fork_test from aim9:         -2.30% to  5.42%
> > exec_test from aim9:         -0.68% to  3.39%
> > Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)
> 
> Looks good.
> 

Indeed.

> > o Remove bind_zonelist() (Patch in progress, very messy right now)
> 
> Will this also allow us to avoid always hitting the first node of an 
> MPOL_BIND first?
> 

If by first node you mean avoid hitting nodes in numerical order, then
yes. The patch changes __alloc_pages to be __alloc_pages_nodemask() with
a wrapper __alloc_pages that passes in NULL for nodemask. The nodemask
is then filtered similar to how zones are filtered in this patch. The
patch is ugly right now and untested but it deletes policy-specific code
and perhaps some of the cpuset code could be expressed in those terms as
well.

> > o Eliminate policy_zone (Trickier)
> 
> I doubt that this is possible given
> 
> 1. We need lower zones (DMA) in various context
> 
> 2. Those DMA zones are only available on particular nodes.
> 

Right.

> Policy_zone could be made to only control allows of the highest (and with 
> ZONE_MOVABLE) second highest zone on a node?
> 
> Think about the 8GB x86_64 configuration I mentioned earlier
> 
> node 0  up to 2 GB 		ZONE_DMA and ZONE_DMA32
> node 1  up to 4 GB		ZONE_DMA32
> node 2  up to 6 GB		ZONE_NORMAL
> node 3  up to 8 GB		ZONE_NORMAL
> 
> If one wants the node restrictions to work on all nodes then we need to 
> apply policy depending on the highest zone of the node.
> 
> Current MPOL_BIND would only apply policy to allocations on node 2 and 3.
> 
> With ZONE_MOVABLE splitting the highest zone (We will likely need that):
> 
> node 0  up to 2 GB              ZONE_DMA and ZONE_DMA32, ZONE_MOVABLE
> node 1  up to 4 GB              ZONE_DMA32, ZONE_MOVABLE
> node 2  up to 6 GB              ZONE_NORMAL, ZONE_MOVABLE
> node 3  up to 8 GB              ZONE_NORMAL, ZONE_MOVABLE
> 
> So then the two highest zones on each node would need to be subject to 
> policy control.
> 

One option would be to force a node with ZONE_DMA to be bound so that
policies will get applied as much as possible, but that would lead to an
unfair use of one node for ZONE_DMA allocations, for example.

An alternative may be to work out at policy creation time what the lowest
zone common to all nodes in the list is and apply the MPOL_BIND policy if
the current allocation can use that zone. It's an improvement on the global
policy_zone at least but depends on this one-zonelist-per-node patchset
which we need to agree/disagree on first.
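
As a rough, untested sketch of that alternative (the helper name is made up):

/*
 * Find the lowest "highest populated zone" over the nodes the policy is
 * bound to. MPOL_BIND would then be applied whenever gfp_zone() of the
 * current allocation allows that zone. Untested illustration only.
 */
static enum zone_type lowest_common_policy_zone(nodemask_t *nodes)
{
	enum zone_type lowest = MAX_NR_ZONES - 1;
	int nd;

	for_each_node_mask(nd, *nodes) {
		enum zone_type k = MAX_NR_ZONES - 1;

		while (k > 0 && !populated_zone(&NODE_DATA(nd)->node_zones[k]))
			k--;
		if (k < lowest)
			lowest = k;
	}

	return lowest;
}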

> Another thing is that we may want to think about is maybe to evolve 
> ZONE_MOVABLE to be more like the antifrag sections. That way we may be 
> able to avoid the multiple types of pages on the pcp lists. That would 
> work if we would only work with two page types: Movable and unmovable 
> (fold reclaimable into movable after slab defrag)
> 

I'll keep it in mind. It's been suggested before so I revisit it every
so often. The details were messy each time though and inferior to
grouping pages by mobility in a number of respects.

> Then would make blocks of memory movable between ZONE_MOVABLE and others. 
> At that point we are almost at the functionality that antifrag offers and 
> we may have simplified things a bit.
> 

It gets hard when the zone for unmovable pages is full, the zone with movable
pages doesn't have a fully free block and the allocator cannot reclaim. Even
though the blocks in the movable portion may contain free pages, there is
no easy way to access them. At that point, we are in a similar situation to
the one grouping pages by mobility deals with, except it's harder to work out.

I'll revisit it again just in case but for now I'd rather not get
sidetracked from the patchset at hand.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages
  2007-08-08 17:38   ` Christoph Lameter
@ 2007-08-08 21:06     ` Mel Gorman
  0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 21:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On (08/08/07 10:38), Christoph Lameter didst pronounce:
> Good idea. Maybe it could be taken further by passing the zonelist also 
> into shrink_zones()?
> 

Yeah, it could. It's a cleanup even though it doesn't simplify the
iterator. When the patch was done first, it was because multiple
iterators were needed to go through the zones in the zonelist. With this
cleanup, only one iterator was needed but it could be carried through
for consistency.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists
  2007-08-08 17:46   ` Christoph Lameter
@ 2007-08-08 21:10     ` Mel Gorman
  2007-08-08 23:28       ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 21:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On (08/08/07 10:46), Christoph Lameter didst pronounce:
> On Wed, 8 Aug 2007, Mel Gorman wrote:
> 
> >  		for (i = 0; i < npmem_ranges; i++) {
> > +			zl = &NODE_DATA(i)->node_zonelist;
> 
> The above shows up again and again. Maybe add a new inline function?
> 
> struct zonelist *zonelist_node(int node)
> 

Not a bad plan. There are at least 5 instances of calls like this in mm/
alone so might as well.
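
Something like this, presumably (untested):

static inline struct zonelist *zonelist_node(int node)
{
	return &NODE_DATA(node)->node_zonelist;
}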

> 
> >  {
> > -	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> > +	return &NODE_DATA(0)->node_zonelist;
> 
> How many callers of gfp_zone are left? Do we still need the function?
> 

Well, the iterator needs it because it expects a high_zoneidx parameter
which is == gfp_zone(gfp_flags). It also appears quite a bit in the code
according to grep. I don't think it can be got rid of now anyway.

> Note that the memoryless_node patchset modifies gfp_zone and adds some 
> more zonelists (sigh).
> 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c
> > --- linux-2.6.23-rc1-mm2-005_freepages_zonelist/mm/oom_kill.c	2007-08-07 14:45:11.000000000 +0100
> > +++ linux-2.6.23-rc1-mm2-010_use_zonelist/mm/oom_kill.c	2007-08-08 11:35:09.000000000 +0100
> > @@ -177,8 +177,10 @@ static inline int constrained_alloc(stru
> >  {
> >  #ifdef CONFIG_NUMA
> >  	struct zone **z;
> > +	struct zone *zone;
> >  	nodemask_t nodes;
> >  	int node;
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> >  
> >  	nodes_clear(nodes);
> >  	/* node has memory ? */
> > @@ -186,9 +188,9 @@ static inline int constrained_alloc(stru
> >  		if (NODE_DATA(node)->node_present_pages)
> >  			node_set(node, nodes);
> >  
> > -	for (z = zonelist->zones; *z; z++)
> > -		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
> > -			node_clear(zone_to_nid(*z), nodes);
> > +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > +		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +			node_clear(zone_to_nid(zone), nodes);
> >  		else
> >  			return CONSTRAINT_CPUSET;
> >  
> 
> The above portion has already been changed to no longer use a zonelist by 
> the memoryless_node patchset in mm.
> 

Oh, well that's good. One less thing to worry about even though we'll be
colliding a bit. I'll split the patch in two to have the second part that
collides with memoryless_node as a standalone patch.

> > +	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
> > +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> >  		struct kmem_cache_node *n;
> >  
> > -		n = get_node(s, zone_to_nid(*z));
> > +		n = get_node(s, zone_to_nid(zone));
> 
> Encoding the node in the zonelist pointer would help these loops but they 
> are fallback lists and not on the critical path.
> 

The zone_id is the one I'm really interested in. It looks like the most
promising optimisation for avoiding zone_idx in the hotpath.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 18:30   ` Lee Schermerhorn
@ 2007-08-08 21:44     ` Mel Gorman
  2007-08-08 22:40       ` Lee Schermerhorn
  2007-08-08 23:35       ` Christoph Lameter
  0 siblings, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-08 21:44 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On (08/08/07 14:30), Lee Schermerhorn didst pronounce:
> On Wed, 2007-08-08 at 10:36 -0700, Christoph Lameter wrote:
> > On Wed, 8 Aug 2007, Mel Gorman wrote:
> > 
> > > These are the range of performance losses/gains I found when running against
> > > 2.6.23-rc1-mm2. The set and these machines are a mix of i386, x86_64 and
> > > ppc64 both NUMA and non-NUMA.
> > > 
> > > Total CPU time on Kernbench: -0.20% to  3.70%
> > > Elapsed   time on Kernbench: -0.32% to  3.62%
> > > page_test from aim9:         -2.17% to 12.42%
> > > brk_test  from aim9:         -6.03% to 11.49%
> > > fork_test from aim9:         -2.30% to  5.42%
> > > exec_test from aim9:         -0.68% to  3.39%
> > > Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)
> > 
> > Looks good.
> > 
> > > o Remove bind_zonelist() (Patch in progress, very messy right now)
> > 
> > Will this also allow us to avoid always hitting the first node of an 
> > MPOL_BIND first?
> 
> An idea:
> 
> Apologies if someone already suggested this and I missed it.  Too much
> traffic...
> 
> instead of passing a zonelist for BIND policy, how about passing [to
> __alloc_pages(), I think] a starting node, a nodemask, and gfp flags for
> zone and modifiers. 

Yes, this has come up before although it wasn't my initial suggestion. I
thought maybe it was yours but I'm not sure anymore. I'm working through
it at the moment. With the patch currently, a nodemask is passed in for
filtering, which should be enough as the zonelist being used provides enough
information to indicate the starting node.

The signature of __alloc_pages() becomes

static struct page * fastcall
__alloc_pages_nodemask(gfp_t gfp_mask, nodemask_t *nodemask,
               unsigned int order, struct zonelist *zonelist)
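
and __alloc_pages() itself becomes a trivial wrapper that passes NULL for
the nodemask, roughly:

struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist)
{
	return __alloc_pages_nodemask(gfp_mask, NULL, order, zonelist);
}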

>  For various policies, the arguments would look like this:
> Policy		start node	nodemask
> 
> default		local node	cpuset_current_mems_allowed
> 
> preferred	preferred_node	cpuset_current_mems_allowed
> 
> interleave	computed node	cpuset_current_mems_allowed
> 
> bind		local node	policy nodemask [replaces bind
> 				zonelist in mempolicy]
> 

The last one is the most interesting. Much of the patch in development
involves deleting the custom node stuff. I've included the patch below if
you're curious. I wanted to get one-zonelist out first to see if we could
agree on that before going further with it.

> Then, just walk the zonelist for the starting node--already ordered by
> distance--filtering by gfp_zone() and nodemask.  Done "right", this
> should always return memory from the closest allowed node [based on the
> nodemask argument] to the starting node.  And, it would eliminate the
> custom zonelists for bind policy.  Can also eliminate cpuset checks in
> the allocation loop because that constraint would already be applied to
> the nodemask argument.
> 

This is what I'm hoping. I haven't looked closely enough to be sure this will
work but currently I see no reason why it couldn't and it might eliminate
some of the NUMA-specific paths in the allocator.

> The fast path--when we hit in the target zone on the starting
> node--might be faster.  Once we have to start falling back to other
> nodes/zones, we've pretty much fallen off the fast path anyway, I think.
> 

Well, if you haven't hit it in the allocator, you are going to slow up
soon anyway.

> Bind policy would suffer a hit when the nodemask does not include the
> local node from which the allocation occurs.  I.e., this would always be
> a fallback case.
> 
> Too backed up to investigate further right now.  
> 
> I will add Mel's patches to my test tree, tho'.
> 

This is what the patch currently looks like for altering how MPOL_BIND
builds a zonelist. It hasn't been compile-tested or boot-tested, or even
had much thought yet, so it should not be run anywhere. Expecting it to
apply cleanly is probably optimistic :)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/fs/buffer.c linux-2.6.23-rc2-020_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc2-015_treat_movable_highest/fs/buffer.c	2007-08-08 17:51:13.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/fs/buffer.c	2007-08-08 22:35:30.000000000 +0100
@@ -355,7 +355,7 @@ static void free_more_memory(void)
 
 	for_each_online_pgdat(pgdat) {
 		zones = first_zones_zonelist(&pgdat->node_zonelist,
-			gfp_zone(GFP_NOFS));
+			NULL, gfp_zone(GFP_NOFS));
 		if (*zones)
 			try_to_free_pages(&pgdat->node_zonelist, 0, GFP_NOFS);
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/include/linux/cpuset.h linux-2.6.23-rc2-020_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc2-015_treat_movable_highest/include/linux/cpuset.h	2007-08-04 03:49:55.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/include/linux/cpuset.h	2007-08-08 22:18:09.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -98,7 +98,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/include/linux/gfp.h linux-2.6.23-rc2-020_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc2-015_treat_movable_highest/include/linux/gfp.h	2007-08-08 17:51:13.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/include/linux/gfp.h	2007-08-08 22:18:09.000000000 +0100
@@ -141,6 +141,9 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, nodemask_t *, unsigned int, struct zonelist *));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/include/linux/mempolicy.h linux-2.6.23-rc2-020_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc2-015_treat_movable_highest/include/linux/mempolicy.h	2007-08-08 17:51:21.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/include/linux/mempolicy.h	2007-08-08 22:18:09.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/include/linux/mmzone.h linux-2.6.23-rc2-020_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc2-015_treat_movable_highest/include/linux/mmzone.h	2007-08-08 17:51:13.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/include/linux/mmzone.h	2007-08-08 22:28:43.000000000 +0100
@@ -637,20 +637,39 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+static inline int zone_allowed_by_nodemask(struct zone *zone, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	if (likely(nodes == NULL))
+		return 1;
+
+	/* zone_to_nid not available in this context */
+	return node_isset(zone->node, *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
 	struct zone **z;
-	for (z = zonelist->zones; *z && zone_idx(*z) > highest_zoneidx; z++);
+	for (z = zonelist->zones;
+		*z && (zone_idx(*z) > highest_zoneidx || !zone_allowed_by_nodemask(*z, nodes));
+		z++);
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zone **next_zones_zonelist(struct zone **z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	for (++z; *z && zone_idx(*z) > highest_zoneidx; z++);
+	for (++z;
+		*z && (zone_idx(*z) > highest_zoneidx || !zone_allowed_by_nodemask(*z, nodes));
+		z++);
 	return z;
 }
 
@@ -664,9 +683,25 @@ static inline struct zone **next_zones_z
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx), zone = *z;	\
+	for (z = first_zones_zonelist(zlist, NULL, highidx), zone = *z;	\
+		zone;							\
+		z = next_zones_zonelist(z, NULL, highidx), zone = *z)
+
+/**
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
+ *
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
+ */
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx), zone = *z;	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx), zone = *z)
+		z = next_zones_zonelist(z, nodemask, highidx), zone = *z)
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/kernel/cpuset.c linux-2.6.23-rc2-020_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc2-015_treat_movable_highest/kernel/cpuset.c	2007-08-04 03:49:55.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/kernel/cpuset.c	2007-08-08 22:18:09.000000000 +0100
@@ -2327,21 +2327,19 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
  * @zl: the zonelist to be checked
  *
  * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->zones[i]; i++) {
-		int nid = zone_to_nid(zl->zones[i]);
+	int nid;
 
+	for_each_node_mask(nid, *nodemask)
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
-	}
+
 	return 0;
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/mm/mempolicy.c linux-2.6.23-rc2-020_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc2-015_treat_movable_highest/mm/mempolicy.c	2007-08-08 17:51:21.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/mm/mempolicy.c	2007-08-08 22:31:00.000000000 +0100
@@ -131,41 +131,20 @@ static int mpol_check_policy(int mode, n
 	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				zl->zones[num++] = z;
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->zones[num] = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -196,12 +175,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -479,14 +457,10 @@ long do_set_mempolicy(int mode, nodemask
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
 	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->zones[i]; i++)
-			node_set(zone_to_nid(p->v.zonelist->zones[i]),
-				*nodes);
+		*nodes = p->v.nodes;
 		break;
 	case MPOL_DEFAULT:
 		break;
@@ -1090,6 +1064,17 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied  for MPOL_BIND */
+	if (policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+		return &policy->v.nodes;
+
+	return NULL;
+}
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1102,11 +1087,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1145,12 +1125,17 @@ unsigned slab_node(struct mempolicy *pol
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND:
+	case MPOL_BIND: {
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zone_to_nid(policy->v.zonelist->zones[0]);
+		struct zonelist *zonelist;
+		struct zone **z;
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
+		z = first_zones_zonelist(zonelist, &policy->v.nodes, gfp_zone(GFP_KERNEL));
+		return zone_to_nid(*z);
+	}
 
 	case MPOL_PREFERRED:
 		if (policy->v.preferred_node >= 0)
@@ -1268,7 +1253,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, nodemask_policy(gfp, pol), 0,
+						zonelist_policy(gfp, pol));
 }
 
 /**
@@ -1326,14 +1312,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1347,17 +1325,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fallthrough */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->zones[i]; i++)
-			if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
-				return 0;
-		return b->v.zonelist->zones[i] == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1369,8 +1342,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1660,6 +1631,8 @@ void mpol_rebind_policy(struct mempolicy
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1672,32 +1645,6 @@ void mpol_rebind_policy(struct mempolicy
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zone **z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->zones; *z; z++)
-			node_set(zone_to_nid(*z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/mm/mmzone.c linux-2.6.23-rc2-020_filter_nodemask/mm/mmzone.c
--- linux-2.6.23-rc2-015_treat_movable_highest/mm/mmzone.c	2007-08-04 03:49:55.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/mm/mmzone.c	2007-08-08 22:18:09.000000000 +0100
@@ -8,6 +8,7 @@
 #include <linux/stddef.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 
 struct pglist_data *first_online_pgdat(void)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc2-015_treat_movable_highest/mm/page_alloc.c linux-2.6.23-rc2-020_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc2-015_treat_movable_highest/mm/page_alloc.c	2007-08-08 17:51:13.000000000 +0100
+++ linux-2.6.23-rc2-020_filter_nodemask/mm/page_alloc.c	2007-08-08 22:29:10.000000000 +0100
@@ -1147,7 +1147,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int alloc_flags)
 {
 	struct zone **z;
@@ -1164,7 +1164,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1218,8 +1219,8 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+__alloc_pages_nodemask(gfp_t gfp_mask, nodemask_t *nodemask,
+		unsigned int order, struct zonelist *zonelist)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct zone **z;
@@ -1243,7 +1244,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1288,7 +1289,8 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
+									alloc_flags);
 	if (page)
 		goto got_pg;
 
@@ -1300,8 +1302,8 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
-				zonelist, ALLOC_NO_WATERMARKS);
+			page = get_page_from_freelist(gfp_mask, nodemask,
+				order, zonelist, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
 			if (gfp_mask & __GFP_NOFAIL) {
@@ -1332,7 +1334,7 @@ nofail_alloc:
 	cond_resched();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 						zonelist, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1343,8 +1345,8 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+				order, zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1388,6 +1390,14 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_nodemask(gfp_mask, NULL, order, zonelist);
+}
+
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*
@@ -1980,7 +1990,6 @@ static void build_zonelists(pg_data_t *p
 /* Construct the zonelist performance cache - see further mmzone.h */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
 	struct zonelist *zonelist;
 	struct zonelist_cache *zlc;
 	struct zone **z;
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 21:44     ` Mel Gorman
@ 2007-08-08 22:40       ` Lee Schermerhorn
  2007-08-08 23:37         ` Christoph Lameter
  2007-08-09 14:47         ` Mel Gorman
  2007-08-08 23:35       ` Christoph Lameter
  1 sibling, 2 replies; 23+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 22:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 2007-08-08 at 22:44 +0100, Mel Gorman wrote:
> On (08/08/07 14:30), Lee Schermerhorn didst pronounce:
> > On Wed, 2007-08-08 at 10:36 -0700, Christoph Lameter wrote:
> > > On Wed, 8 Aug 2007, Mel Gorman wrote:
> > > 
<snip>
> > > > o Remove bind_zonelist() (Patch in progress, very messy right now)
> > > 
> > > Will this also allow us to avoid always hitting the first node of an 
> > > MPOL_BIND first?
> > 
> > An idea:
> > 
> > Apologies if someone already suggested this and I missed it.  Too much
> > traffic...
> > 
> > instead of passing a zonelist for BIND policy, how about passing [to
> > __alloc_pages(), I think] a starting node, a nodemask, and gfp flags for
> > zone and modifiers. 
> 
> Yes, this has come up before although it wasn't my initial suggestion. I
> thought maybe it was yours but I'm not sure anymore. I'm working through
> it at the moment. 

I've heard/seen Christoph mention passing a nodemask to alloc_pages a
few times, but hadn't seen any of the details.  Got me thinking..

> With the current patch, a nodemask is passed in for filtering, which
> should be enough, as the zonelist being used already indicates the
> starting node.

It'll take me a while to absorb the patch, so I'll just ask:  Where does
the zonelist for the argument come from?  If the bind policy
zonelist is removed, then does it come from a node?  There'll be only
one per node with your other patches, right?  So you had to have a node
id, to look up the zonelist?  Do you need the zonelist elsewhere,
outside of alloc_pages()?  If not, why not just let alloc_pages look it
up from a starting node [which I think can be determined from the
policy]?  

OK, that's a lot of questions.  No need to answer.  That's just what I'm
thinking re: all this.  I'll wait and see how the patch develops.
  
> 
> The signature of __alloc_pages() becomes
> 
> struct page * fastcall
> __alloc_pages_nodemask(gfp_t gfp_mask, nodemask_t *nodemask,
>                unsigned int order, struct zonelist *zonelist)
> 
> >  For various policies, the arguments would look like this:
> > Policy		start node	nodemask
> > 
> > default		local node	cpuset_current_mems_allowed
> > 
> > preferred	preferred_node	cpuset_current_mems_allowed
> > 
> > interleave	computed node	cpuset_current_mems_allowed
> > 
> > bind		local node	policy nodemask [replaces bind
> > 				zonelist in mempolicy]
> > 
> 
> The last one is the most interesting. Much of the patch in development
> involves deleting the custom node stuff. I've included the patch below if
> you're curious. I wanted to get one-zonelist out first to see if we could
> agree on that before going further with it.

Again, it'll be a while. 

Thanks,
Lee



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 21:04   ` Mel Gorman
@ 2007-08-08 23:26     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 23:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Mel Gorman wrote:

> I'll revisit it again just in case but for now I'd rather not get
> sidetracked from the patchset at hand.

Sure. This is more long term.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists
  2007-08-08 21:10     ` Mel Gorman
@ 2007-08-08 23:28       ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 23:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Mel Gorman wrote:

> The zone_id is the one I'm really interested in. It looks like the most
> promising optimisation for avoiding zone_idx in the hotpath.

Certainly the highest priority. However, if the nodeid could be coded in 
then we may be able to also avoid the GFP_THISNODE zonelists and improve 
the speed of matching a zonelist to a node mask.
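
Purely as a sketch of what coding the ids in might look like - keep the
zone index and node id beside each zonelist entry so the hot path needs
neither zone_idx() nor zone_to_nid() (hypothetical structure, nothing
like this exists in the patches yet):

	struct zonelist_entry {
		struct zone *zone;	/* the zone itself */
		int zone_idx;		/* cached zone_idx(zone) */
		int node;		/* cached zone_to_nid(zone) */
	};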


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 21:44     ` Mel Gorman
  2007-08-08 22:40       ` Lee Schermerhorn
@ 2007-08-08 23:35       ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 23:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Mel Gorman wrote:

> 
> >  For various policies, the arguments would look like this:
> > Policy		start node	nodemask
> > 
> > default		local node	cpuset_current_mems_allowed
> > 
> > preferred	preferred_node	cpuset_current_mems_allowed
> > 
> > interleave	computed node	cpuset_current_mems_allowed
> > 
> > bind		local node	policy nodemask [replaces bind
> > 				zonelist in mempolicy]
> > 

GFP_THISNODE could be realized by only setting the desired nodenumber in 
the nodemask.

> The last one is the most interesting. Much of the patch in development
> involves deleting the custom node stuff. I've included the patch below if
> you're curious. I wanted to get one-zonelist out first to see if we could
> agree on that before going further with it.

I think we do.

> > Then, just walk the zonelist for the starting node--already ordered by
> > distance--filtering by gfp_zone() and nodemask.  Done "right", this
> > should always return memory from the closest allowed node [based on the
> > nodemask argument] to the starting node.  And, it would eliminate the
> > custom zonelists for bind policy.  Can also eliminate cpuset checks in
> > the allocation loop because that constraint would already be applied to
> > the nodemask argument.
> > 
> 
> This is what I'm hoping. I haven't looked closely enough to be sure this will
> work but currently I see no reason why it couldn't and it might eliminate
> some of the NUMA-specific paths in the allocator.

Right. But let's first get the general case for the single zonelist
accepted (with the zoneid optimizations?)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 22:40       ` Lee Schermerhorn
@ 2007-08-08 23:37         ` Christoph Lameter
  2007-08-09 14:47         ` Mel Gorman
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-08 23:37 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On Wed, 8 Aug 2007, Lee Schermerhorn wrote:

> It'll take me a while to absorb the patch, so I'll just ask:  Where does
> the zonelist for the argument come from?  If the bind policy
> zonelist is removed, then does it come from a node?  There'll be only

Right.

> one per node with your other patches, right?  So you had to have a node
> id, to look up the zonelist?  Do you need the zonelist elsewhere,
> outside of alloc_pages()?  If not, why not just let alloc_pages look it
> up from a starting node [which I think can be determined from the
> policy]?  
 
Exactly. The starting node is passed to alloc_pages_nodemask. We could 
just pass -1 for numa_node_id().
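
i.e. something along these lines (sketch only, names made up):

	/* nid == -1 means "start from the local node" */
	if (nid < 0)
		nid = numa_node_id();
	zonelist = &NODE_DATA(nid)->node_zonelist;
	return __alloc_pages_nodemask(gfp_mask, nodemask, order, zonelist);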



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 22:40       ` Lee Schermerhorn
  2007-08-08 23:37         ` Christoph Lameter
@ 2007-08-09 14:47         ` Mel Gorman
  1 sibling, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-09 14:47 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, pj, ak, kamezawa.hiroyu, akpm, linux-kernel, linux-mm

On (08/08/07 18:40), Lee Schermerhorn didst pronounce:
> On Wed, 2007-08-08 at 22:44 +0100, Mel Gorman wrote:
>
> <SNIP>
>
> > With the current patch, a nodemask is passed in for filtering, which
> > should be enough, as the zonelist being used already indicates the
> > starting node.
> 
> It'll take me a while to absorb the patch, so I'll just ask:  Where does
> the zonelist for the argument come from? If the bind policy
> zonelist is removed, then does it come from a node? 

Yes, it gets the zonelist from the node and uses a nodemask to ignore
zones within it.

> There'll be only
> one per node with your other patches, right?  So you had to have a node
> id, to look up the zonelist? 

You have the local node id to look up the zonelist with. The policy then
provides a nodemask instead of a zonelist for filtering purposes.

> Do you need the zonelist elsewhere,
> outside of alloc_pages()?  If not, why not just let alloc_pages look it
> up from a starting node [which I think can be determined from the
> policy]?
> 

The starting node can be determined from the node we are currently
running on. Even if the local node is not in the nodemask, we'd still
filter it as normal.
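
Roughly, and only as an illustration using the helpers from the draft
patch (nodemask and gfp_mask are placeholders):

	struct zonelist *zl;
	struct zone **z;

	zl = &NODE_DATA(numa_node_id())->node_zonelist;
	z = first_zones_zonelist(zl, nodemask, gfp_zone(gfp_mask));
	/* *z is the first zone on the nearest allowed node, even when the
	 * local node itself is not in the mask */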

> OK, that's a lot of questions.  no need to answer.  That's just what I'm
> thinking re: all this.  I'll wait and see how the patch develops.
>   
> > 
> > The signature of __alloc_pages() becomes
> > 
> > struct page * fastcall
> > __alloc_pages_nodemask(gfp_t gfp_mask, nodemask_t *nodemask,
> >                unsigned int order, struct zonelist *zonelist)
> > 
> > >  For various policies, the arguments would look like this:
> > > Policy		start node	nodemask
> > > 
> > > default		local node	cpuset_current_mems_allowed
> > > 
> > > preferred	preferred_node	cpuset_current_mems_allowed
> > > 
> > > interleave	computed node	cpuset_current_mems_allowed
> > > 
> > > bind		local node	policy nodemask [replaces bind
> > > 				zonelist in mempolicy]
> > > 
> > 
> > The last one is the most interesting. Much of the patch in development
> > involves deleting the custom node stuff. I've included the patch below if
> > you're curious. I wanted to get one-zonelist out first to see if we could
> > agree on that before going further with it.
> 
> Again, it'll be a while. 
> 

Thanks anyway.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
                   ` (3 preceding siblings ...)
  2007-08-08 17:36 ` [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Christoph Lameter
@ 2007-08-09 20:19 ` Andrew Morton
  2007-08-09 20:33   ` Christoph Lameter
                     ` (2 more replies)
  4 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2007-08-09 20:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter,
	linux-kernel, linux-mm

On Wed,  8 Aug 2007 17:15:04 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:

> The following patches replace multiple zonelists per node with one zonelist
> that is filtered based on the GFP flags.

I think I'll duck this for now on im-trying-to-vaguely-stabilize-mm grounds.
Let's go with the horrible-hack for 2.6.23, then revert it and get this
new approach merged and stabilised over the next week or two?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-09 20:19 ` Andrew Morton
@ 2007-08-09 20:33   ` Christoph Lameter
  2007-08-09 20:51   ` Mel Gorman
  2007-08-09 21:20   ` Andi Kleen
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-09 20:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Lee.Schermerhorn, pj, ak, kamezawa.hiroyu,
	linux-kernel, linux-mm

On Thu, 9 Aug 2007, Andrew Morton wrote:

> On Wed,  8 Aug 2007 17:15:04 +0100 (IST)
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The following patches replace multiple zonelists per node with one zonelist
> > that is filtered based on the GFP flags.
> 
> I think I'll duck this for now on im-trying-to-vaguely-stabilize-mm grounds.
> Let's go with the horrible-hack for 2.6.23, then revert it and get this
> new approach merged and stabilised over the next week or two?

Right (I would not call it a horrible hack but a shadow of 
nice things to come). I think Mel is still working on getting more 
optimizations in. The next patchset may be ready to be merged.
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-09 20:19 ` Andrew Morton
  2007-08-09 20:33   ` Christoph Lameter
@ 2007-08-09 20:51   ` Mel Gorman
  2007-08-09 21:20   ` Andi Kleen
  2 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2007-08-09 20:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee.Schermerhorn, pj, ak, kamezawa.hiroyu, clameter,
	linux-kernel, linux-mm

On (09/08/07 13:19), Andrew Morton didst pronounce:
> On Wed,  8 Aug 2007 17:15:04 +0100 (IST)
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The following patches replace multiple zonelists per node with one zonelist
> > that is filtered based on the GFP flags.
> 
> I think I'll duck this for now on im-trying-to-vaguely-stabilize-mm grounds.
> Let's go with the horrible-hack for 2.6.23, then revert it and get this
> new approach merged and stabilised over the next week or two?
> 

I'm happy with that plan. I am about to release V3 of the one-zonelist
patchset that includes optimisations so I'm reasonably confident we can
get it to a state we like over the next few weeks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-09 20:19 ` Andrew Morton
  2007-08-09 20:33   ` Christoph Lameter
  2007-08-09 20:51   ` Mel Gorman
@ 2007-08-09 21:20   ` Andi Kleen
  2007-08-09 21:40     ` Christoph Lameter
  2 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2007-08-09 21:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Lee.Schermerhorn, pj, kamezawa.hiroyu, clameter,
	linux-kernel, linux-mm

On Thursday 09 August 2007 22:19:43 Andrew Morton wrote:
> On Wed,  8 Aug 2007 17:15:04 +0100 (IST)
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The following patches replace multiple zonelists per node with one zonelist
> > that is filtered based on the GFP flags.
> 
> I think I'll duck this for now on im-trying-to-vaguely-stabilize-mm grounds.
> Let's go with the horrible-hack for 2.6.23, then revert it and get this
> new approach merged and stabilised over the next week or two?

I would prefer not to have horrible hacks, even temporarily

-Andi




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2
  2007-08-09 21:20   ` Andi Kleen
@ 2007-08-09 21:40     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2007-08-09 21:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Mel Gorman, Lee.Schermerhorn, pj, kamezawa.hiroyu,
	linux-kernel, linux-mm

On Thu, 9 Aug 2007, Andi Kleen wrote:

> > I think I'll duck this for now on im-trying-to-vaguely-stabilize-mm grounds.
> > Let's go with the horrible-hack for 2.6.23, then revert it and get this
> > new approach merged and stabilised over the next week or two?
> 
> I would prefer to not have horrible hacks even temporary

The changes that we are considering for 2.6.24 will result in a single
zonelist per node that is filtered in alloc_pages. This leads to a
performance improvement in the page allocator.

What you call a hack is doing the same for the special policy zonelist in 
2.6.23 in order to be able to apply policies to the two highest zones. We 
apply a limited portion of the changes for 2.6.24 to .23 to fix the 
ZONE_MOVABLE issue.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2007-08-09 21:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-08 16:15 [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Mel Gorman
2007-08-08 16:15 ` [PATCH 1/3] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-08-08 17:38   ` Christoph Lameter
2007-08-08 21:06     ` Mel Gorman
2007-08-08 16:15 ` [PATCH 2/3] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
2007-08-08 17:46   ` Christoph Lameter
2007-08-08 21:10     ` Mel Gorman
2007-08-08 23:28       ` Christoph Lameter
2007-08-08 16:16 ` [PATCH 3/3] Apply MPOL_BIND policy to two highest zones when highest is ZONE_MOVABLE Mel Gorman
2007-08-08 17:36 ` [PATCH 0/3] Use one zonelist per node instead of multiple zonelists v2 Christoph Lameter
2007-08-08 18:30   ` Lee Schermerhorn
2007-08-08 21:44     ` Mel Gorman
2007-08-08 22:40       ` Lee Schermerhorn
2007-08-08 23:37         ` Christoph Lameter
2007-08-09 14:47         ` Mel Gorman
2007-08-08 23:35       ` Christoph Lameter
2007-08-08 21:04   ` Mel Gorman
2007-08-08 23:26     ` Christoph Lameter
2007-08-09 20:19 ` Andrew Morton
2007-08-09 20:33   ` Christoph Lameter
2007-08-09 20:51   ` Mel Gorman
2007-08-09 21:20   ` Andi Kleen
2007-08-09 21:40     ` Christoph Lameter
