* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4
@ 2007-08-17 20:16 Mel Gorman
  2007-08-17 20:17 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
  ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Mel Gorman @ 2007-08-17 20:16 UTC (permalink / raw)
To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The biggest changes are altering the embedding of zone IDs so that the type is
unsigned long instead of struct zone *, and the removal of the
MPOL_BIND-specific zonelists in favour of filtering based on node data instead.
The biggest concern is the last patch, where FASTCALL does not appear to do the
right thing in all cases.

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask.
  This removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. As a set, the patches fix a bug in
the interaction between MPOL_BIND and ZONE_MOVABLE. With this patchset,
MPOL_BIND applies to the two highest zones when the highest zone is
ZONE_MOVABLE.
This should be considered an alternative fix for the MPOL_BIND+ZONE_MOVABLE
problem in 2.6.23 to the previously discussed hack that filters only custom
zonelists. As a bonus, the patchset reduces the cache footprint of the kernel
and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch replaces the
multiple zonelists with one zonelist that is filtered. The final patch is a
fix that depends on the previous two patches: it changes policy_zone so that
the MPOL_BIND policy is applied to the two highest populated zones when the
highest populated zone is ZONE_MOVABLE. Otherwise, MPOL_BIND applies only to
the highest populated zone.

The patches passed regression tests with numactltest. Performance results
varied depending on the machine configuration, but there were usually small
performance gains. The new algorithm relies heavily on the implementation of
zone_idx(), which is currently fairly expensive. Experiments to optimise this
have shown significant improvements for this algorithm, but that is beyond the
scope of this patchset.

Due to the nature of the change, the results for other people are likely to
vary - it will usually win but occasionally lose. In real workloads, the
gain/loss will depend on how much the userspace portion of the benchmark
benefits from having more cache available due to reduced referencing of
zonelists. I expect it to be more noticeable on x86_64 with many zones than on
IA64, which typically would have only one active zonelist per node.

These are the ranges of performance losses/gains I found when running against
2.6.23-rc1-mm2. The machines used are a mix of i386, x86_64 and ppc64, both
NUMA and non-NUMA.
Total CPU time on Kernbench: -0.02% to  0.27%
Elapsed time on Kernbench:   -0.21% to  1.26%
page_test from aim9:         -3.41% to  3.90%
brk_test from aim9:          -0.20% to 40.94%
fork_test from aim9:         -0.42% to  4.59%
exec_test from aim9:         -0.78% to  1.95%
Size reduction of pg_data_t: 0 to 7808 bytes (depends on alignment)

The TBench figures were too variable between runs to draw conclusions from,
but there did not appear to be any regressions there. The hackbench results
for both sockets and pipes were within the noise. I have not gone through
lmbench.

These patches are a standalone set which addresses the MPOL_BIND problem with
ZONE_MOVABLE as well as reducing memory usage and, in many cases, the cache
footprint of the kernel. They should be considered a bug fix because of the
MPOL_BIND fixup. If these patches are accepted, the follow-on work would
entail:

o Encode zone_id in the zonelist pointers to avoid zone_idx() (Christoph's idea)
o If zone_id works out, eliminate z_to_n from the zonelist cache as unnecessary
o Remove bind_zonelist() (patch in progress, very messy right now)
o Eliminate policy_zone (trickier)

Comments?

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman
@ 2007-08-17 20:17 ` Mel Gorman
  2007-08-17 20:17 ` [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2007-08-17 20:17 UTC (permalink / raw)
To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---

 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |    9 ++++++---
 3 files changed, 8 insertions(+), 5 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-clean/include/linux/swap.h linux-2.6.23-rc3-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.23-rc3-clean/include/linux/swap.h	2007-08-13 05:25:24.000000000 +0100
+++ linux-2.6.23-rc3-005_freepages_zonelist/include/linux/swap.h	2007-08-17 16:35:48.000000000 +0100
@@ -188,7 +188,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-clean/mm/page_alloc.c linux-2.6.23-rc3-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc3-clean/mm/page_alloc.c	2007-08-13 05:25:24.000000000 +0100
+++ linux-2.6.23-rc3-005_freepages_zonelist/mm/page_alloc.c	2007-08-17 16:35:48.000000000 +0100
@@ -1326,7 +1326,7 @@ nofail_alloc:
 		reclaim_state.reclaimed_slab = 0;
 		p->reclaim_state = &reclaim_state;
 
-		did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+		did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 		p->reclaim_state = NULL;
 		p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-clean/mm/vmscan.c linux-2.6.23-rc3-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.23-rc3-clean/mm/vmscan.c	2007-08-13 05:25:24.000000000 +0100
+++ linux-2.6.23-rc3-005_freepages_zonelist/mm/vmscan.c	2007-08-17 16:35:48.000000000 +0100
@@ -1075,10 +1075,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zones **zones = zonelist->zones;
 	int i;
 
 	sc->all_unreclaimable = 1;
@@ -1116,7 +1117,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+						gfp_t gfp_mask)
 {
 	int priority;
 	int ret = 0;
@@ -1124,6 +1126,7 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1150,7 +1153,7 @@ unsigned long try_to_free_pages(struct z
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, &sc);
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
 		if (reclaim_state) {
 			nr_reclaimed += reclaim_state->reclaimed_slab;

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists
  2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman
  2007-08-17 20:17 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
@ 2007-08-17 20:17 ` Mel Gorman
  2007-08-17 20:59   ` Christoph Lameter
  2007-08-17 20:17 ` [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer Mel Gorman
  ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2007-08-17 20:17 UTC (permalink / raw)
To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

Currently a node has a number of zonelists, one for each zone type in the
system. Based on the zones allowed by a gfp mask, one of these zonelists is
selected. All of these zonelists occupy memory and consume cache lines.

This patch replaces the multiple zonelists in the node with a single zonelist
that contains all populated zones in the system. An iterator macro,
for_each_zone_zonelist(), is introduced that iterates through each zone in the
zonelist that is allowed by the GFP flags.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- arch/parisc/mm/init.c | 11 +- drivers/char/sysrq.c | 3 fs/buffer.c | 9 +- include/linux/gfp.h | 3 include/linux/mempolicy.h | 2 include/linux/mmzone.h | 39 +++++++++ mm/hugetlb.c | 8 +- mm/mempolicy.c | 6 - mm/oom_kill.c | 8 +- mm/page_alloc.c | 162 ++++++++++++++++++----------------------- mm/slab.c | 11 +- mm/slub.c | 11 +- mm/vmscan.c | 21 ++--- 13 files changed, 160 insertions(+), 134 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc3-010_use_zonelist/arch/parisc/mm/init.c --- linux-2.6.23-rc3-005_freepages_zonelist/arch/parisc/mm/init.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/arch/parisc/mm/init.c 2007-08-17 16:35:55.000000000 +0100 @@ -599,15 +599,18 @@ void show_mem(void) #ifdef CONFIG_DISCONTIGMEM { struct zonelist *zl; - int i, j, k; + int i, j; for (i = 0; i < npmem_ranges; i++) { + zl = node_zonelist(i); for (j = 0; j < MAX_NR_ZONES; j++) { - zl = NODE_DATA(i)->node_zonelists + j; + struct zone **z; + struct zone *zone; printk("Zone list for zone %d on node %d: ", j, i); - for (k = 0; zl->zones[k] != NULL; k++) - printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name); + for_each_zone_zonelist(zone, z, zl, j) + printk("[%d/%s] ", zone_to_nid(zone), + zone->name); printk("\n"); } } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/drivers/char/sysrq.c linux-2.6.23-rc3-010_use_zonelist/drivers/char/sysrq.c --- linux-2.6.23-rc3-005_freepages_zonelist/drivers/char/sysrq.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/drivers/char/sysrq.c 2007-08-17 16:35:55.000000000 +0100 @@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op static void moom_callback(struct work_struct *ignored) { - out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL], - GFP_KERNEL, 0); + 
out_of_memory(node_zonelist(0), GFP_KERNEL, 0); } static DECLARE_WORK(moom_work, moom_callback); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/fs/buffer.c linux-2.6.23-rc3-010_use_zonelist/fs/buffer.c --- linux-2.6.23-rc3-005_freepages_zonelist/fs/buffer.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/fs/buffer.c 2007-08-17 16:35:55.000000000 +0100 @@ -348,15 +348,16 @@ void invalidate_bdev(struct block_device static void free_more_memory(void) { struct zone **zones; - pg_data_t *pgdat; + int nid; wakeup_pdflush(1024); yield(); - for_each_online_pgdat(pgdat) { - zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones; + for_each_online_node(nid) { + zones = first_zones_zonelist(node_zonelist(nid), + gfp_zone(GFP_NOFS)); if (*zones) - try_to_free_pages(zones, 0, GFP_NOFS); + try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS); } } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/include/linux/gfp.h linux-2.6.23-rc3-010_use_zonelist/include/linux/gfp.h --- linux-2.6.23-rc3-005_freepages_zonelist/include/linux/gfp.h 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/include/linux/gfp.h 2007-08-17 16:35:55.000000000 +0100 @@ -151,8 +151,7 @@ static inline struct page *alloc_pages_n if (nid < 0) nid = numa_node_id(); - return __alloc_pages(gfp_mask, order, - NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask)); + return __alloc_pages(gfp_mask, order, node_zonelist(nid)); } #ifdef CONFIG_NUMA diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/include/linux/mempolicy.h linux-2.6.23-rc3-010_use_zonelist/include/linux/mempolicy.h --- linux-2.6.23-rc3-005_freepages_zonelist/include/linux/mempolicy.h 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/include/linux/mempolicy.h 2007-08-17 16:35:55.000000000 +0100 @@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_f static inline 
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags) { - return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags); + return node_zonelist(0); } static inline int do_migrate_pages(struct mm_struct *mm, diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/include/linux/mmzone.h linux-2.6.23-rc3-010_use_zonelist/include/linux/mmzone.h --- linux-2.6.23-rc3-005_freepages_zonelist/include/linux/mmzone.h 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/include/linux/mmzone.h 2007-08-17 16:35:55.000000000 +0100 @@ -437,7 +437,7 @@ extern struct page *mem_map; struct bootmem_data; typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; - struct zonelist node_zonelists[MAX_NR_ZONES]; + struct zonelist node_zonelist; int nr_zones; #ifdef CONFIG_FLAT_NODE_MEM_MAP struct page *node_mem_map; @@ -637,6 +637,43 @@ extern struct zone *next_zone(struct zon zone; \ zone = next_zone(zone)) +/* Return the zonelist belonging to a node of a given ID */ +static inline struct zonelist *node_zonelist(int nid) +{ + return &NODE_DATA(nid)->node_zonelist; +} + +/* Returns the first zone at or below highest_zoneidx in a zonelist */ +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist, + enum zone_type highest_zoneidx) +{ + struct zone **z; + for (z = zonelist->zones; zone_idx(*z) > highest_zoneidx; z++); + return z; +} + +/* Returns the next zone at or below highest_zoneidx in a zonelist */ +static inline struct zone **next_zones_zonelist(struct zone **z, + enum zone_type highest_zoneidx) +{ + for (++z; zone_idx(*z) > highest_zoneidx; z++); + return z; +} + +/** + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index + * @zone - The current zone in the iterator + * @z - The current pointer within zonelist->zones being iterated + * @zlist - The zonelist being iterated + * @highidx - The zone index of 
the highest zone to return + * + * This iterator iterates though all zones at or below a given zone index. + */ +#define for_each_zone_zonelist(zone, z, zlist, highidx) \ + for (z = first_zones_zonelist(zlist, highidx), zone = *z; \ + zone; \ + z = next_zones_zonelist(z, highidx), zone = *z) + #ifdef CONFIG_SPARSEMEM #include <asm/sparsemem.h> #endif diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/hugetlb.c linux-2.6.23-rc3-010_use_zonelist/mm/hugetlb.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/hugetlb.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/hugetlb.c 2007-08-17 16:35:55.000000000 +0100 @@ -73,11 +73,11 @@ static struct page *dequeue_huge_page(st struct page *page = NULL; struct zonelist *zonelist = huge_zonelist(vma, address, htlb_alloc_mask); - struct zone **z; + struct zone *zone, **z; - for (z = zonelist->zones; *z; z++) { - nid = zone_to_nid(*z); - if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) && + for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) { + nid = zone_to_nid(zone); + if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) && !list_empty(&hugepage_freelists[nid])) { page = list_entry(hugepage_freelists[nid].next, struct page, lru); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/mempolicy.c linux-2.6.23-rc3-010_use_zonelist/mm/mempolicy.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/mempolicy.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/mempolicy.c 2007-08-17 16:35:55.000000000 +0100 @@ -1116,7 +1116,7 @@ static struct zonelist *zonelist_policy( nd = 0; BUG(); } - return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp); + return node_zonelist(nd); } /* Do dynamic interleaving for a process */ @@ -1212,7 +1212,7 @@ struct zonelist *huge_zonelist(struct vm unsigned nid; nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT); - return NODE_DATA(nid)->node_zonelists + 
gfp_zone(gfp_flags); + return node_zonelist(nid); } return zonelist_policy(GFP_HIGHUSER, pol); } @@ -1226,7 +1226,7 @@ static struct page *alloc_page_interleav struct zonelist *zl; struct page *page; - zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp); + zl = node_zonelist(nid); page = __alloc_pages(gfp, order, zl); if (page && page_zone(page) == zl->zones[0]) inc_zone_page_state(page, NUMA_INTERLEAVE_HIT); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/oom_kill.c linux-2.6.23-rc3-010_use_zonelist/mm/oom_kill.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/oom_kill.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/oom_kill.c 2007-08-17 16:35:55.000000000 +0100 @@ -177,8 +177,10 @@ static inline int constrained_alloc(stru { #ifdef CONFIG_NUMA struct zone **z; + struct zone *zone; nodemask_t nodes; int node; + enum zone_type high_zoneidx = gfp_zone(gfp_mask); nodes_clear(nodes); /* node has memory ? */ @@ -186,9 +188,9 @@ static inline int constrained_alloc(stru if (NODE_DATA(node)->node_present_pages) node_set(node, nodes); - for (z = zonelist->zones; *z; z++) - if (cpuset_zone_allowed_softwall(*z, gfp_mask)) - node_clear(zone_to_nid(*z), nodes); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) + if (cpuset_zone_allowed_softwall(zone, gfp_mask)) + node_clear(zone_to_nid(zone), nodes); else return CONSTRAINT_CPUSET; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/page_alloc.c linux-2.6.23-rc3-010_use_zonelist/mm/page_alloc.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/page_alloc.c 2007-08-17 16:35:48.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/page_alloc.c 2007-08-17 17:02:38.000000000 +0100 @@ -1148,30 +1148,32 @@ static void zlc_mark_zone_full(struct zo */ static struct page * get_page_from_freelist(gfp_t gfp_mask, unsigned int order, - struct zonelist *zonelist, int alloc_flags) + struct zonelist *zonelist, int 
high_zoneidx, int alloc_flags) { struct zone **z; struct page *page = NULL; - int classzone_idx = zone_idx(zonelist->zones[0]); + struct zone *classzone; + int classzone_idx; struct zone *zone; nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ + z = first_zones_zonelist(zonelist, high_zoneidx); + classzone = *z; + classzone_idx = zone_idx(*z); + zonelist_scan: /* * Scan zonelist, looking for a zone with enough free. * See also cpuset_zone_allowed() comment in kernel/cpuset.c. */ - z = zonelist->zones; - - do { + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { if (NUMA_BUILD && zlc_active && !zlc_zone_worth_trying(zonelist, z, allowednodes)) continue; - zone = *z; if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) && - zone->zone_pgdat != zonelist->zones[0]->zone_pgdat)) + zone->zone_pgdat != classzone->zone_pgdat)) break; if ((alloc_flags & ALLOC_CPUSET) && !cpuset_zone_allowed_softwall(zone, gfp_mask)) @@ -1206,7 +1208,7 @@ try_next_zone: zlc_active = 1; did_zlc_setup = 1; } - } while (*(++z) != NULL); + } if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) { /* Disable zlc cache for second zonelist scan */ @@ -1224,6 +1226,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i struct zonelist *zonelist) { const gfp_t wait = gfp_mask & __GFP_WAIT; + enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone **z; struct page *page; struct reclaim_state reclaim_state; @@ -1246,7 +1249,7 @@ restart: } page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order, - zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET); + zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); if (page) goto got_pg; @@ -1290,7 +1293,8 @@ restart: * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. * See also cpuset_zone_allowed() comment in kernel/cpuset.c. 
*/ - page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags); + page = get_page_from_freelist(gfp_mask, order, zonelist, + high_zoneidx, alloc_flags); if (page) goto got_pg; @@ -1303,7 +1307,7 @@ rebalance: nofail_alloc: /* go through the zonelist yet again, ignoring mins */ page = get_page_from_freelist(gfp_mask, order, - zonelist, ALLOC_NO_WATERMARKS); + zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); if (page) goto got_pg; if (gfp_mask & __GFP_NOFAIL) { @@ -1335,7 +1339,7 @@ nofail_alloc: if (likely(did_some_progress)) { page = get_page_from_freelist(gfp_mask, order, - zonelist, alloc_flags); + zonelist, high_zoneidx, alloc_flags); if (page) goto got_pg; } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { @@ -1346,7 +1350,7 @@ nofail_alloc: * under heavy pressure. */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order, - zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET); + zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET); if (page) goto got_pg; @@ -1456,15 +1460,15 @@ EXPORT_SYMBOL(free_pages); static unsigned int nr_free_zone_pages(int offset) { + enum zone_type high_zoneidx = MAX_NR_ZONES - 1; + struct zone **z; + struct zone *zone; + /* Just pick one node, since fallback list is circular */ - pg_data_t *pgdat = NODE_DATA(numa_node_id()); unsigned int sum = 0; + struct zonelist *zonelist = node_zonelist(numa_node_id()); - struct zonelist *zonelist = pgdat->node_zonelists + offset; - struct zone **zonep = zonelist->zones; - struct zone *zone; - - for (zone = *zonep++; zone; zone = *zonep++) { + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { unsigned long size = zone->present_pages; unsigned long high = zone->pages_high; if (size > high) @@ -1823,17 +1827,14 @@ static int find_next_best_node(int node, */ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) { - enum zone_type i; int j; struct zonelist *zonelist; - for (i = 0; i < MAX_NR_ZONES; i++) { - zonelist = pgdat->node_zonelists + i; - for (j = 0; 
zonelist->zones[j] != NULL; j++) - ; - j = build_zonelists_node(NODE_DATA(node), zonelist, j, i); - zonelist->zones[j] = NULL; - } + zonelist = &pgdat->node_zonelist; + for (j = 0; zonelist->zones[j] != NULL; j++) + ; + j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1); + zonelist->zones[j] = NULL; } /* @@ -1846,27 +1847,24 @@ static int node_order[MAX_NUMNODES]; static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes) { - enum zone_type i; int pos, j, node; int zone_type; /* needs to be signed */ struct zone *z; struct zonelist *zonelist; - for (i = 0; i < MAX_NR_ZONES; i++) { - zonelist = pgdat->node_zonelists + i; - pos = 0; - for (zone_type = i; zone_type >= 0; zone_type--) { - for (j = 0; j < nr_nodes; j++) { - node = node_order[j]; - z = &NODE_DATA(node)->node_zones[zone_type]; - if (populated_zone(z)) { - zonelist->zones[pos++] = z; - check_highest_zone(zone_type); - } + zonelist = &pgdat->node_zonelist; + pos = 0; + for (zone_type = MAX_NR_ZONES-1; zone_type >= 0; zone_type--) { + for (j = 0; j < nr_nodes; j++) { + node = node_order[j]; + z = &NODE_DATA(node)->node_zones[zone_type]; + if (populated_zone(z)) { + zonelist->zones[pos++] = z; + check_highest_zone(zone_type); } } - zonelist->zones[pos] = NULL; } + zonelist->zones[pos] = NULL; } static int default_zonelist_order(void) @@ -1933,17 +1931,14 @@ static void set_zonelist_order(void) static void build_zonelists(pg_data_t *pgdat) { int j, node, load; - enum zone_type i; nodemask_t used_mask; int local_node, prev_node; struct zonelist *zonelist; int order = current_zonelist_order; - /* initialize zonelists */ - for (i = 0; i < MAX_NR_ZONES; i++) { - zonelist = pgdat->node_zonelists + i; - zonelist->zones[0] = NULL; - } + /* initialize zonelist */ + zonelist = &pgdat->node_zonelist; + zonelist->zones[0] = NULL; /* NUMA-aware ordering of nodes */ local_node = pgdat->node_id; @@ -1990,19 +1985,15 @@ static void build_zonelists(pg_data_t *p /* Construct the zonelist 
performance cache - see further mmzone.h */ static void build_zonelist_cache(pg_data_t *pgdat) { - int i; + struct zonelist *zonelist; + struct zonelist_cache *zlc; + struct zone **z; - for (i = 0; i < MAX_NR_ZONES; i++) { - struct zonelist *zonelist; - struct zonelist_cache *zlc; - struct zone **z; - - zonelist = pgdat->node_zonelists + i; - zonelist->zlcache_ptr = zlc = &zonelist->zlcache; - bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); - for (z = zonelist->zones; *z; z++) - zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z); - } + zonelist = &pgdat->node_zonelist; + zonelist->zlcache_ptr = zlc = &zonelist->zlcache; + bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); + for (z = zonelist->zones; *z; z++) + zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z); } @@ -2016,45 +2007,42 @@ static void set_zonelist_order(void) static void build_zonelists(pg_data_t *pgdat) { int node, local_node; - enum zone_type i,j; + enum zone_type j; + struct zonelist *zonelist; local_node = pgdat->node_id; - for (i = 0; i < MAX_NR_ZONES; i++) { - struct zonelist *zonelist; - zonelist = pgdat->node_zonelists + i; + zonelist = &pgdat->node_zonelist; + j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES-1); - j = build_zonelists_node(pgdat, zonelist, 0, i); - /* - * Now we build the zonelist so that it contains the zones - * of all the other nodes. 
- * We don't want to pressure a particular node, so when - * building the zones for node N, we make sure that the - * zones coming right after the local ones are those from - * node N+1 (modulo N) - */ - for (node = local_node + 1; node < MAX_NUMNODES; node++) { - if (!node_online(node)) - continue; - j = build_zonelists_node(NODE_DATA(node), zonelist, j, i); - } - for (node = 0; node < local_node; node++) { - if (!node_online(node)) - continue; - j = build_zonelists_node(NODE_DATA(node), zonelist, j, i); - } - - zonelist->zones[j] = NULL; + /* + * Now we build the zonelist so that it contains the zones + * of all the other nodes. + * We don't want to pressure a particular node, so when + * building the zones for node N, we make sure that the + * zones coming right after the local ones are those from + * node N+1 (modulo N) + */ + for (node = local_node + 1; node < MAX_NUMNODES; node++) { + if (!node_online(node)) + continue; + j = build_zonelists_node(NODE_DATA(node), zonelist, j, + MAX_NR_ZONES-1); + } + for (node = 0; node < local_node; node++) { + if (!node_online(node)) + continue; + j = build_zonelists_node(NODE_DATA(node), zonelist, j, + MAX_NR_ZONES-1); } + + zonelist->zones[j] = NULL; } /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */ static void build_zonelist_cache(pg_data_t *pgdat) { - int i; - - for (i = 0; i < MAX_NR_ZONES; i++) - pgdat->node_zonelists[i].zlcache_ptr = NULL; + pgdat->node_zonelist.zlcache_ptr = NULL; } #endif /* CONFIG_NUMA */ diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/slab.c linux-2.6.23-rc3-010_use_zonelist/mm/slab.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/slab.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/slab.c 2007-08-17 16:35:55.000000000 +0100 @@ -3214,14 +3214,15 @@ static void *fallback_alloc(struct kmem_ struct zonelist *zonelist; gfp_t local_flags; struct zone **z; + struct zone *zone; + enum zone_type 
high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; if (flags & __GFP_THISNODE) return NULL; - zonelist = &NODE_DATA(slab_node(current->mempolicy)) - ->node_zonelists[gfp_zone(flags)]; + zonelist = node_zonelist(slab_node(current->mempolicy)); local_flags = (flags & GFP_LEVEL_MASK); retry: @@ -3229,10 +3230,10 @@ retry: * Look through allowed nodes for objects available * from existing per node queues. */ - for (z = zonelist->zones; *z && !obj; z++) { - nid = zone_to_nid(*z); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + nid = zone_to_nid(zone); - if (cpuset_zone_allowed_hardwall(*z, flags) && + if (cpuset_zone_allowed_hardwall(zone, flags) && cache->nodelists[nid] && cache->nodelists[nid]->free_objects) obj = ____cache_alloc_node(cache, diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/slub.c linux-2.6.23-rc3-010_use_zonelist/mm/slub.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/slub.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/slub.c 2007-08-17 16:35:55.000000000 +0100 @@ -1276,6 +1276,8 @@ static struct page *get_any_partial(stru #ifdef CONFIG_NUMA struct zonelist *zonelist; struct zone **z; + struct zone *zone; + enum zone_type high_zoneidx = gfp_zone(flags); struct page *page; /* @@ -1299,14 +1301,13 @@ static struct page *get_any_partial(stru if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio) return NULL; - zonelist = &NODE_DATA(slab_node(current->mempolicy)) - ->node_zonelists[gfp_zone(flags)]; - for (z = zonelist->zones; *z; z++) { + zonelist = node_zonelist(slab_node(current->mempolicy)); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { struct kmem_cache_node *n; - n = get_node(s, zone_to_nid(*z)); + n = get_node(s, zone_to_nid(zone)); - if (n && cpuset_zone_allowed_hardwall(*z, flags) && + if (n && cpuset_zone_allowed_hardwall(zone, flags) && n->nr_partial > MIN_PARTIAL) { page = get_partial_node(n); if (page) diff -rup -X 
/usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-005_freepages_zonelist/mm/vmscan.c linux-2.6.23-rc3-010_use_zonelist/mm/vmscan.c --- linux-2.6.23-rc3-005_freepages_zonelist/mm/vmscan.c 2007-08-17 16:35:48.000000000 +0100 +++ linux-2.6.23-rc3-010_use_zonelist/mm/vmscan.c 2007-08-17 16:35:55.000000000 +0100 @@ -1079,13 +1079,11 @@ static unsigned long shrink_zones(int pr struct scan_control *sc) { unsigned long nr_reclaimed = 0; - struct zones **zones = zonelist->zones; - int i; + struct zone **z; + struct zone *zone; sc->all_unreclaimable = 1; - for (i = 0; zones[i] != NULL; i++) { - struct zone *zone = zones[i]; - + for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) { if (!populated_zone(zone)) continue; @@ -1126,8 +1124,9 @@ unsigned long try_to_free_pages(struct z unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; - struct zone **zones = zonelist->zones; - int i; + struct zone **z; + struct zone *zone; + enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct scan_control sc = { .gfp_mask = gfp_mask, .may_writepage = !laptop_mode, @@ -1139,9 +1138,7 @@ unsigned long try_to_free_pages(struct z count_vm_event(ALLOCSTALL); - for (i = 0; zones[i] != NULL; i++) { - struct zone *zone = zones[i]; - + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) continue; @@ -1195,9 +1192,7 @@ out: */ if (priority < 0) priority = 0; - for (i = 0; zones[i] != 0; i++) { - struct zone *zone = zones[i]; - + for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) { if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) continue; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists 2007-08-17 20:17 ` [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman @ 2007-08-17 20:59 ` Christoph Lameter 2007-08-21 8:51 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Christoph Lameter @ 2007-08-17 20:59 UTC (permalink / raw) To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On Fri, 17 Aug 2007, Mel Gorman wrote: > +/* Returns the first zone at or below highest_zoneidx in a zonelist */ > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist, > + enum zone_type highest_zoneidx) > +{ > + struct zone **z; > + for (z = zonelist->zones; zone_idx(*z) > highest_zoneidx; z++); > + return z; > +} The formatting above is a bit confusing. Add the required empty lines and put the ; on a separate line. > +/* Returns the next zone at or below highest_zoneidx in a zonelist */ > +static inline struct zone **next_zones_zonelist(struct zone **z, > + enum zone_type highest_zoneidx) > +{ > + for (++z; zone_idx(*z) > highest_zoneidx; z++); Looks weird too. ++z on an earlier line and then for ( ; zone_idx(*z) ...) ?
* Re: [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists 2007-08-17 20:59 ` Christoph Lameter @ 2007-08-21 8:51 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-21 8:51 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On (17/08/07 13:59), Christoph Lameter didst pronounce: > On Fri, 17 Aug 2007, Mel Gorman wrote: > > > +/* Returns the first zone at or below highest_zoneidx in a zonelist */ > > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist, > > + enum zone_type highest_zoneidx) > > +{ > > + struct zone **z; > > + for (z = zonelist->zones; zone_idx(*z) > highest_zoneidx; z++); > > + return z; > > +} > > The formatting above is a bit confusing. Add requires empty lines and put > the ; on a separate line. > > > > +/* Returns the next zone at or below highest_zoneidx in a zonelist */ > > +static inline struct zone **next_zones_zonelist(struct zone **z, > > + enum zone_type highest_zoneidx) > > +{ > > + for (++z; zone_idx(*z) > highest_zoneidx; z++); > > Looks weird too. > > ++z on an earlier line and then > > for ( ; zone_idx(*z) ...) > > ? > Ok, the relevant section now looks like +/* Returns the first zone at or below highest_zoneidx in a zonelist */ +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist, + enum zone_type highest_zoneidx) +{ + struct zone **z; + + for (z = zonelist->zones; + zone_idx(*z) > highest_zoneidx; + z++) + ; + + return z; +} + +/* Returns the next zone at or below highest_zoneidx in a zonelist */ +static inline struct zone **next_zones_zonelist(struct zone **z, + enum zone_type highest_zoneidx) +{ + /* Advance to the next zone in the zonelist */ + z++; + + /* Find the next suitable zone to use for the allocation */ + for (; zone_idx(*z) > highest_zoneidx; z++) + ; + + return z; +} Is that better? 
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer 2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman 2007-08-17 20:17 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 2007-08-17 20:17 ` [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman @ 2007-08-17 20:17 ` Mel Gorman 2007-08-17 21:02 ` Christoph Lameter 2007-08-17 20:18 ` [PATCH 4/6] Record how many zones can be safely skipped in the zonelist Mel Gorman ` (2 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-08-17 20:17 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm Using one zonelist per node requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a subtraction operation. struct zone is aligned on a node-interleave boundary so the pointer values have plenty of 0s in the least significant bits of the address. This patch embeds the zone_id of a zone in the zonelist->zones pointers. The real zone pointer is found using the zonelist_zone() helper function. The ID of the zone is found using zonelist_zone_idx(). To avoid accidental references, the zones field is renamed to _zones. 
Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- arch/parisc/mm/init.c | 2 - fs/buffer.c | 2 - include/linux/mmzone.h | 74 ++++++++++++++++++++++++++++++++++++++------ kernel/cpuset.c | 4 +- mm/hugetlb.c | 3 + mm/mempolicy.c | 32 +++++++++++-------- mm/oom_kill.c | 2 - mm/page_alloc.c | 51 +++++++++++++++--------------- mm/slab.c | 2 - mm/slub.c | 2 - mm/vmscan.c | 4 +- mm/vmstat.c | 5 +- 12 files changed, 124 insertions(+), 59 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc3-015_zoneid_zonelist/arch/parisc/mm/init.c --- linux-2.6.23-rc3-010_use_zonelist/arch/parisc/mm/init.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/arch/parisc/mm/init.c 2007-08-17 16:36:04.000000000 +0100 @@ -604,7 +604,7 @@ void show_mem(void) for (i = 0; i < npmem_ranges; i++) { zl = node_zonelist(i); for (j = 0; j < MAX_NR_ZONES; j++) { - struct zone **z; + unsigned long *z; struct zone *zone; printk("Zone list for zone %d on node %d: ", j, i); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/fs/buffer.c linux-2.6.23-rc3-015_zoneid_zonelist/fs/buffer.c --- linux-2.6.23-rc3-010_use_zonelist/fs/buffer.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/fs/buffer.c 2007-08-17 16:36:04.000000000 +0100 @@ -347,7 +347,7 @@ void invalidate_bdev(struct block_device */ static void free_more_memory(void) { - struct zone **zones; + unsigned long *zones; int nid; wakeup_pdflush(1024); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/include/linux/mmzone.h linux-2.6.23-rc3-015_zoneid_zonelist/include/linux/mmzone.h --- linux-2.6.23-rc3-010_use_zonelist/include/linux/mmzone.h 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/include/linux/mmzone.h 2007-08-17 16:52:13.000000000 +0100 @@ -404,7 +404,10 @@ struct zonelist_cache; struct zonelist { struct zonelist_cache 
*zlcache_ptr; // NULL or &zlcache - struct zone *zones[MAX_ZONES_PER_ZONELIST + 1]; // NULL delimited + unsigned long _zones[MAX_ZONES_PER_ZONELIST + 1]; /* Encoded pointer, + * 0 delimited, use + * zonelist_zone() + */ #ifdef CONFIG_NUMA struct zonelist_cache zlcache; // optional ... #endif @@ -637,6 +640,55 @@ extern struct zone *next_zone(struct zon zone; \ zone = next_zone(zone)) + +/* + * SMP will align zones to a large boundary so the zone ID will fit in the + * least significant biuts. Otherwise, ZONES_SHIFT must be 2 or less to + * fit + */ +#if (defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT > ZONES_SHIFT) || \ + ZONES_SHIFT <= 2 + +/* Similar to ZONES_MASK but is not available in this context */ +#define ZONELIST_ZONEIDX_MASK ((1UL << ZONES_SHIFT) - 1) + +/* zone_id is small enough to fit at bottom of zone pointer in zonelist */ +static inline struct zone *zonelist_zone(unsigned long zone_addr) +{ + return (struct zone *)(zone_addr & ~ZONELIST_ZONEIDX_MASK); +} + +static inline int zonelist_zone_idx(unsigned long zone_addr) +{ + /* ZONES_MASK not available in this context */ + return zone_addr & ZONELIST_ZONEIDX_MASK; +} + +static inline unsigned long encode_zone_idx(struct zone *zone) +{ + unsigned long encoded; + + encoded = (unsigned long)zone | zone_idx(zone); + BUG_ON(zonelist_zone(encoded) != zone); + return encoded; +} +#else +static inline struct zone *zonelist_zone(unsigned long zone_addr) +{ + return (struct zone *)zone_addr; +} + +static inline int zonelist_zone_idx(unsigned long zone_addr) +{ + return zone_idx((struct zone *)zone_addr); +} + +static inline unsigned long encode_zone_idx(struct zone *zone) +{ + return (unsigned long)zone; +} +#endif + /* Return the zonelist belonging to a node of a given ID */ static inline struct zonelist *node_zonelist(int nid) { @@ -644,19 +696,23 @@ static inline struct zonelist *node_zone } /* Returns the first zone at or below highest_zoneidx in a zonelist */ -static inline struct zone 
**first_zones_zonelist(struct zonelist *zonelist, +static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist, enum zone_type highest_zoneidx) { - struct zone **z; - for (z = zonelist->zones; zone_idx(*z) > highest_zoneidx; z++); + unsigned long *z; + for (z = zonelist->_zones; + zonelist_zone_idx(*z) > highest_zoneidx; + z++); return z; } /* Returns the next zone at or below highest_zoneidx in a zonelist */ -static inline struct zone **next_zones_zonelist(struct zone **z, +static inline unsigned long *next_zones_zonelist(unsigned long *z, enum zone_type highest_zoneidx) { - for (++z; zone_idx(*z) > highest_zoneidx; z++); + for (++z; + zonelist_zone_idx(*z) > highest_zoneidx; + z++); return z; } @@ -670,9 +726,9 @@ static inline struct zone **next_zones_z * This iterator iterates though all zones at or below a given zone index. */ #define for_each_zone_zonelist(zone, z, zlist, highidx) \ - for (z = first_zones_zonelist(zlist, highidx), zone = *z; \ - zone; \ - z = next_zones_zonelist(z, highidx), zone = *z) + for (z = first_zones_zonelist(zlist, highidx), zone = zonelist_zone(*z); \ + zone; \ + z = next_zones_zonelist(z, highidx), zone = zonelist_zone(*z)) #ifdef CONFIG_SPARSEMEM #include <asm/sparsemem.h> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/kernel/cpuset.c linux-2.6.23-rc3-015_zoneid_zonelist/kernel/cpuset.c --- linux-2.6.23-rc3-010_use_zonelist/kernel/cpuset.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/kernel/cpuset.c 2007-08-17 16:36:04.000000000 +0100 @@ -2336,8 +2336,8 @@ int cpuset_zonelist_valid_mems_allowed(s { int i; - for (i = 0; zl->zones[i]; i++) { - int nid = zone_to_nid(zl->zones[i]); + for (i = 0; zl->_zones[i]; i++) { + int nid = zone_to_nid(zonelist_zone(zl->_zones[i])); if (node_isset(nid, current->mems_allowed)) return 1; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/hugetlb.c 
linux-2.6.23-rc3-015_zoneid_zonelist/mm/hugetlb.c --- linux-2.6.23-rc3-010_use_zonelist/mm/hugetlb.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/hugetlb.c 2007-08-17 16:36:04.000000000 +0100 @@ -73,7 +73,8 @@ static struct page *dequeue_huge_page(st struct page *page = NULL; struct zonelist *zonelist = huge_zonelist(vma, address, htlb_alloc_mask); - struct zone *zone, **z; + struct zone *zone; + unsigned long *z; for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) { nid = zone_to_nid(zone); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/mempolicy.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/mempolicy.c --- linux-2.6.23-rc3-010_use_zonelist/mm/mempolicy.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/mempolicy.c 2007-08-17 16:54:10.000000000 +0100 @@ -154,7 +154,7 @@ static struct zonelist *bind_zonelist(no for_each_node_mask(nd, *nodes) { struct zone *z = &NODE_DATA(nd)->node_zones[k]; if (z->present_pages > 0) - zl->zones[num++] = z; + zl->_zones[num++] = encode_zone_idx(z); } if (k == 0) break; @@ -164,7 +164,7 @@ static struct zonelist *bind_zonelist(no kfree(zl); return ERR_PTR(-EINVAL); } - zl->zones[num] = NULL; + zl->_zones[num] = 0; return zl; } @@ -484,9 +484,11 @@ static void get_zonemask(struct mempolic nodes_clear(*nodes); switch (p->policy) { case MPOL_BIND: - for (i = 0; p->v.zonelist->zones[i]; i++) - node_set(zone_to_nid(p->v.zonelist->zones[i]), - *nodes); + for (i = 0; p->v.zonelist->_zones[i]; i++) { + struct zone *zone; + zone = zonelist_zone(p->v.zonelist->_zones[i]); + node_set(zone_to_nid(zone), *nodes); + } break; case MPOL_DEFAULT: break; @@ -1150,7 +1152,7 @@ unsigned slab_node(struct mempolicy *pol * Follow bind policy behavior and start allocation at the * first node. 
*/ - return zone_to_nid(policy->v.zonelist->zones[0]); + return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0])); case MPOL_PREFERRED: if (policy->v.preferred_node >= 0) @@ -1228,7 +1230,7 @@ static struct page *alloc_page_interleav zl = node_zonelist(nid); page = __alloc_pages(gfp, order, zl); - if (page && page_zone(page) == zl->zones[0]) + if (page && page_zone(page) == zonelist_zone(zl->_zones[0])) inc_zone_page_state(page, NUMA_INTERLEAVE_HIT); return page; } @@ -1353,10 +1355,14 @@ int __mpol_equal(struct mempolicy *a, st return a->v.preferred_node == b->v.preferred_node; case MPOL_BIND: { int i; - for (i = 0; a->v.zonelist->zones[i]; i++) - if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i]) + for (i = 0; a->v.zonelist->_zones[i]; i++) { + struct zone *za, *zb; + za = zonelist_zone(a->v.zonelist->_zones[i]); + zb = zonelist_zone(b->v.zonelist->_zones[i]); + if (za != zb) return 0; - return b->v.zonelist->zones[i] == NULL; + } + return b->v.zonelist->_zones[i] == 0; } default: BUG(); @@ -1674,12 +1680,12 @@ void mpol_rebind_policy(struct mempolicy break; case MPOL_BIND: { nodemask_t nodes; - struct zone **z; + unsigned long *z; struct zonelist *zonelist; nodes_clear(nodes); - for (z = pol->v.zonelist->zones; *z; z++) - node_set(zone_to_nid(*z), nodes); + for (z = pol->v.zonelist->_zones; *z; z++) + node_set(zone_to_nid(zonelist_zone(*z)), nodes); nodes_remap(tmp, nodes, *mpolmask, *newmask); nodes = tmp; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/oom_kill.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/oom_kill.c --- linux-2.6.23-rc3-010_use_zonelist/mm/oom_kill.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/oom_kill.c 2007-08-17 16:36:04.000000000 +0100 @@ -176,7 +176,7 @@ unsigned long badness(struct task_struct static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask) { #ifdef CONFIG_NUMA - struct zone **z; + unsigned long *z; struct zone *zone; 
nodemask_t nodes; int node; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/page_alloc.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/page_alloc.c --- linux-2.6.23-rc3-010_use_zonelist/mm/page_alloc.c 2007-08-17 17:02:38.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/page_alloc.c 2007-08-17 16:44:24.000000000 +0100 @@ -1087,7 +1087,7 @@ static nodemask_t *zlc_setup(struct zone * We are low on memory in the second scan, and should leave no stone * unturned looking for a free page. */ -static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z, +static int zlc_zone_worth_trying(struct zonelist *zonelist, unsigned long *z, nodemask_t *allowednodes) { struct zonelist_cache *zlc; /* cached zonelist speedup info */ @@ -1098,7 +1098,7 @@ static int zlc_zone_worth_trying(struct if (!zlc) return 1; - i = z - zonelist->zones; + i = z - zonelist->_zones; n = zlc->z_to_n[i]; /* This zone is worth trying if it is allowed but not full */ @@ -1110,7 +1110,7 @@ static int zlc_zone_worth_trying(struct * zlc->fullzones, so that subsequent attempts to allocate a page * from that zone don't waste time re-examining it. 
*/ -static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z) +static void zlc_mark_zone_full(struct zonelist *zonelist, unsigned long *z) { struct zonelist_cache *zlc; /* cached zonelist speedup info */ int i; /* index of *z in zonelist zones */ @@ -1119,7 +1119,7 @@ static void zlc_mark_zone_full(struct zo if (!zlc) return; - i = z - zonelist->zones; + i = z - zonelist->_zones; set_bit(i, zlc->fullzones); } @@ -1131,13 +1131,13 @@ static nodemask_t *zlc_setup(struct zone return NULL; } -static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z, +static int zlc_zone_worth_trying(struct zonelist *zonelist, unsigned long *z, nodemask_t *allowednodes) { return 1; } -static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z) +static void zlc_mark_zone_full(struct zonelist *zonelist, unsigned long *z) { } #endif /* CONFIG_NUMA */ @@ -1150,7 +1150,7 @@ static struct page * get_page_from_freelist(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, int high_zoneidx, int alloc_flags) { - struct zone **z; + unsigned long *z; struct page *page = NULL; struct zone *classzone; int classzone_idx; @@ -1160,8 +1160,8 @@ get_page_from_freelist(gfp_t gfp_mask, u int did_zlc_setup = 0; /* just call zlc_setup() one time */ z = first_zones_zonelist(zonelist, high_zoneidx); - classzone = *z; - classzone_idx = zone_idx(*z); + classzone = zonelist_zone(*z); + classzone_idx = zonelist_zone_idx(*z); zonelist_scan: /* @@ -1227,7 +1227,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i { const gfp_t wait = gfp_mask & __GFP_WAIT; enum zone_type high_zoneidx = gfp_zone(gfp_mask); - struct zone **z; + unsigned long *z; struct page *page; struct reclaim_state reclaim_state; struct task_struct *p = current; @@ -1241,9 +1241,9 @@ __alloc_pages(gfp_t gfp_mask, unsigned i return NULL; restart: - z = zonelist->zones; /* the list of zones suitable for gfp_mask */ + z = zonelist->_zones; /* the list of zones suitable for gfp_mask */ - if 
(unlikely(*z == NULL)) { + if (unlikely(zonelist_zone(*z) == NULL)) { /* Should this ever happen?? */ return NULL; } @@ -1264,8 +1264,8 @@ restart: if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) goto nopage; - for (z = zonelist->zones; *z; z++) - wakeup_kswapd(*z, order); + for (z = zonelist->_zones; *z; z++) + wakeup_kswapd(zonelist_zone(*z), order); /* * OK, we're below the kswapd watermark and have kicked background @@ -1461,7 +1461,7 @@ EXPORT_SYMBOL(free_pages); static unsigned int nr_free_zone_pages(int offset) { enum zone_type high_zoneidx = MAX_NR_ZONES - 1; - struct zone **z; + unsigned long *z; struct zone *zone; /* Just pick one node, since fallback list is circular */ @@ -1655,7 +1655,7 @@ static int build_zonelists_node(pg_data_ zone_type--; zone = pgdat->node_zones + zone_type; if (populated_zone(zone)) { - zonelist->zones[nr_zones++] = zone; + zonelist->_zones[nr_zones++] = encode_zone_idx(zone); check_highest_zone(zone_type); } @@ -1831,10 +1831,10 @@ static void build_zonelists_in_node_orde struct zonelist *zonelist; zonelist = &pgdat->node_zonelist; - for (j = 0; zonelist->zones[j] != NULL; j++) + for (j = 0; zonelist->_zones[j] != 0; j++) ; j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1); - zonelist->zones[j] = NULL; + zonelist->_zones[j] = 0; } /* @@ -1859,12 +1859,12 @@ static void build_zonelists_in_zone_orde node = node_order[j]; z = &NODE_DATA(node)->node_zones[zone_type]; if (populated_zone(z)) { - zonelist->zones[pos++] = z; + zonelist->_zones[pos++] = encode_zone_idx(z); check_highest_zone(zone_type); } } } - zonelist->zones[pos] = NULL; + zonelist->_zones[pos] = 0; } static int default_zonelist_order(void) @@ -1938,7 +1938,7 @@ static void build_zonelists(pg_data_t *p /* initialize zonelist */ zonelist = &pgdat->node_zonelist; - zonelist->zones[0] = NULL; + zonelist->_zones[0] = 0; /* NUMA-aware ordering of nodes */ local_node = pgdat->node_id; @@ -1987,13 +1987,14 @@ static void 
build_zonelist_cache(pg_data { struct zonelist *zonelist; struct zonelist_cache *zlc; - struct zone **z; + unsigned long *z; zonelist = &pgdat->node_zonelist; zonelist->zlcache_ptr = zlc = &zonelist->zlcache; bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST); - for (z = zonelist->zones; *z; z++) - zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z); + for (z = zonelist->_zones; *z; z++) + zlc->z_to_n[z - zonelist->_zones] = + zone_to_nid(zonelist_zone(*z)); } @@ -2036,7 +2037,7 @@ static void build_zonelists(pg_data_t *p MAX_NR_ZONES-1); } - zonelist->zones[j] = NULL; + zonelist->_zones[j] = 0; } /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */ diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/slab.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/slab.c --- linux-2.6.23-rc3-010_use_zonelist/mm/slab.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/slab.c 2007-08-17 16:36:04.000000000 +0100 @@ -3213,7 +3213,7 @@ static void *fallback_alloc(struct kmem_ { struct zonelist *zonelist; gfp_t local_flags; - struct zone **z; + unsigned long *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/slub.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/slub.c --- linux-2.6.23-rc3-010_use_zonelist/mm/slub.c 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/slub.c 2007-08-17 16:36:04.000000000 +0100 @@ -1275,7 +1275,7 @@ static struct page *get_any_partial(stru { #ifdef CONFIG_NUMA struct zonelist *zonelist; - struct zone **z; + unsigned long *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); struct page *page; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/vmscan.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/vmscan.c --- linux-2.6.23-rc3-010_use_zonelist/mm/vmscan.c 2007-08-17 16:35:55.000000000 
+0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/vmscan.c 2007-08-17 16:36:04.000000000 +0100 @@ -1079,7 +1079,7 @@ static unsigned long shrink_zones(int pr struct scan_control *sc) { unsigned long nr_reclaimed = 0; - struct zone **z; + unsigned long *z; struct zone *zone; sc->all_unreclaimable = 1; @@ -1124,7 +1124,7 @@ unsigned long try_to_free_pages(struct z unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; - struct zone **z; + unsigned long *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct scan_control sc = { diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-010_use_zonelist/mm/vmstat.c linux-2.6.23-rc3-015_zoneid_zonelist/mm/vmstat.c --- linux-2.6.23-rc3-010_use_zonelist/mm/vmstat.c 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-015_zoneid_zonelist/mm/vmstat.c 2007-08-17 16:52:54.000000000 +0100 @@ -381,11 +381,12 @@ EXPORT_SYMBOL(refresh_vm_stats); */ void zone_statistics(struct zonelist *zonelist, struct zone *z) { - if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) { + if (z->zone_pgdat == zonelist_zone(zonelist->_zones[0])->zone_pgdat) { __inc_zone_state(z, NUMA_HIT); } else { __inc_zone_state(z, NUMA_MISS); - __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN); + __inc_zone_state(zonelist_zone(zonelist->_zones[0]), + NUMA_FOREIGN); } if (z->node == numa_node_id()) __inc_zone_state(z, NUMA_LOCAL); ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer 2007-08-17 20:17 ` [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer Mel Gorman @ 2007-08-17 21:02 ` Christoph Lameter 2007-08-21 8:54 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Christoph Lameter @ 2007-08-17 21:02 UTC (permalink / raw) To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On Fri, 17 Aug 2007, Mel Gorman wrote: > +/* > + * SMP will align zones to a large boundary so the zone ID will fit in the > + * least significant biuts. Otherwise, ZONES_SHIFT must be 2 or less to > + * fit ZONES_SHIFT is always 2 or less.... Acked-by: Christoph Lameter <clameter@sgi.com>
* Re: [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer 2007-08-17 21:02 ` Christoph Lameter @ 2007-08-21 8:54 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-21 8:54 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On (17/08/07 14:02), Christoph Lameter didst pronounce: > On Fri, 17 Aug 2007, Mel Gorman wrote: > > > +/* > > + * SMP will align zones to a large boundary so the zone ID will fit in the > > + * least significant biuts. Otherwise, ZONES_SHIFT must be 2 or less to > > + * fit > > ZONES_SHIFT is always 2 or less.... > Yeah, I get that, but I was trying to future-proof it at build time. However, there is no need to have dead code on the off-chance it is eventually used. Failing the compile should be enough, so now the check looks like: +/* + * SMP will align zones to a large boundary so the zone ID will fit in the + * least significant biuts. Otherwise, ZONES_SHIFT must be 2 or less to + * fit. Error if it's not + */ +#if (defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT < ZONES_SHIFT) || \ + ZONES_SHIFT > 2 +#error There is not enough space to embed zone IDs in the zonelist +#endif + > Acked-by: Christoph Lameter <clameter@sgi.com> > Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 4/6] Record how many zones can be safely skipped in the zonelist 2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman ` (2 preceding siblings ...) 2007-08-17 20:17 ` [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer Mel Gorman @ 2007-08-17 20:18 ` Mel Gorman 2007-08-17 21:03 ` Christoph Lameter 2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman 2007-08-17 20:18 ` [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() Mel Gorman 5 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-08-17 20:18 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm This patch is mainly the work of Kamezawa-san. As there is only one zonelist, it must be filtered for zones that are unusable by the GFP flags. As the zonelists very rarely change during the lifetime of the system, it is known in advance how many zones can be skipped from the beginning of the zonelist for each zone type returned by gfp_zone. This patch adds a gfp_skip[] array to struct zonelist to record how many zones may be skipped. 
From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/mmzone.h | 9 ++++++++- mm/mempolicy.c | 2 ++ mm/page_alloc.c | 13 +++++++++++++ 3 files changed, 23 insertions(+), 1 deletion(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-015_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h --- linux-2.6.23-rc3-015_zoneid_zonelist/include/linux/mmzone.h 2007-08-17 16:52:13.000000000 +0100 +++ linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h 2007-08-17 16:56:20.000000000 +0100 @@ -404,6 +404,7 @@ struct zonelist_cache; struct zonelist { struct zonelist_cache *zlcache_ptr; // NULL or &zlcache + unsigned short gfp_skip[MAX_NR_ZONES]; unsigned long _zones[MAX_ZONES_PER_ZONELIST + 1]; /* Encoded pointer, * 0 delimited, use * zonelist_zone() @@ -695,12 +696,18 @@ static inline struct zonelist *node_zone return &NODE_DATA(nid)->node_zonelist; } +static inline unsigned long *zonelist_gfp_skip(struct zonelist *zonelist, + enum zone_type highest_zoneidx) +{ + return zonelist->_zones + zonelist->gfp_skip[highest_zoneidx]; +} + /* Returns the first zone at or below highest_zoneidx in a zonelist */ static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist, enum zone_type highest_zoneidx) { unsigned long *z; - for (z = zonelist->_zones; + for (z = zonelist_gfp_skip(zonelist, highest_zoneidx); zonelist_zone_idx(*z) > highest_zoneidx; z++); return z; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-015_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c --- linux-2.6.23-rc3-015_zoneid_zonelist/mm/mempolicy.c 2007-08-17 16:54:10.000000000 +0100 +++ linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c 2007-08-17 16:55:31.000000000 +0100 @@ -140,10 +140,12 @@ static struct zonelist *bind_zonelist(no max = 1 + MAX_NR_ZONES * nodes_weight(*nodes); max++; /* space for zlcache_ptr (see mmzone.h) */ + max += sizeof(unsigned 
short) * MAX_NR_ZONES; /* gfp_skip */ zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL); if (!zl) return ERR_PTR(-ENOMEM); zl->zlcache_ptr = NULL; + memset(zl->gfp_skip, 0, sizeof(zl->gfp_skip)); num = 0; /* First put in the highest zones from all nodes, then all the next lower zones etc. Avoid empty zones because the memory allocator diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-015_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c --- linux-2.6.23-rc3-015_zoneid_zonelist/mm/page_alloc.c 2007-08-17 16:44:24.000000000 +0100 +++ linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c 2007-08-17 16:55:31.000000000 +0100 @@ -2048,6 +2048,18 @@ static void build_zonelist_cache(pg_data #endif /* CONFIG_NUMA */ +static void build_zonelist_gfpskip(pg_data_t *pgdat) +{ + enum zone_type target; + struct zonelist *zl = &pgdat->node_zonelist; + + for (target = 0; target < MAX_NR_ZONES; target++) { + unsigned long *z; + z = first_zones_zonelist(zl, target); + zl->gfp_skip[target] = z - zl->_zones; + } +} + /* return values int ....just for stop_machine_run() */ static int __build_all_zonelists(void *dummy) { @@ -2056,6 +2068,7 @@ static int __build_all_zonelists(void *d for_each_online_node(nid) { build_zonelists(NODE_DATA(nid)); build_zonelist_cache(NODE_DATA(nid)); + build_zonelist_gfpskip(NODE_DATA(nid)); } return 0; } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 4/6] Record how many zones can be safely skipped in the zonelist 2007-08-17 20:18 ` [PATCH 4/6] Record how many zones can be safely skipped in the zonelist Mel Gorman @ 2007-08-17 21:03 ` Christoph Lameter 2007-08-21 8:58 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Christoph Lameter @ 2007-08-17 21:03 UTC (permalink / raw) To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm Is there any performance improvement because of this patch? It looks like processing got more expensive since an additional cacheline needs to be fetched to get the skip factor.
* Re: [PATCH 4/6] Record how many zones can be safely skipped in the zonelist 2007-08-17 21:03 ` Christoph Lameter @ 2007-08-21 8:58 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-21 8:58 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On (17/08/07 14:03), Christoph Lameter didst pronounce: > Is there any performance improvement because of this patch? It looks > like processing got more expensive since an additional cacheline needs to > be fetches to get the skip factor. > It's a small gain on a few machines. Where I thought it was more likely to be a win is on x86-64 NUMA machines particularly if the zonelist ordering was zone order as there would be potentially many nodes to skip. Kernbench didn't show up any regressions for the other machines but the userspace portion of that workload is unlikely to notice the loss of a cache line. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask 2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman ` (3 preceding siblings ...) 2007-08-17 20:18 ` [PATCH 4/6] Record how many zones can be safely skipped in the zonelist Mel Gorman @ 2007-08-17 20:18 ` Mel Gorman 2007-08-17 21:29 ` Christoph Lameter 2007-08-17 20:18 ` [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() Mel Gorman 5 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-08-17 20:18 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm The MPOL_BIND policy creates a zonelist that is used for allocations belonging to that thread that can use the policy_zone. As the zonelist is already being filtered based on a zone id, this patch adds a version of __alloc_pages() that takes a nodemask for further filtering. This eliminates the need for MPOL_BIND to create a custom zonelist. The practical upside of this is that allocations using MPOL_BIND should now use nodes closer to the running CPU first instead of using nodes in numeric order. 
Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/buffer.c | 2 include/linux/cpuset.h | 4 - include/linux/gfp.h | 4 + include/linux/mempolicy.h | 3 include/linux/mmzone.h | 59 +++++++++++++--- kernel/cpuset.c | 16 +--- mm/mempolicy.c | 145 +++++++++++------------------------------ mm/page_alloc.c | 34 ++++++--- 8 files changed, 128 insertions(+), 139 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/fs/buffer.c linux-2.6.23-rc3-030_filter_nodemask/fs/buffer.c --- linux-2.6.23-rc3-020_gfpskip/fs/buffer.c 2007-08-17 16:36:04.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/fs/buffer.c 2007-08-17 16:56:36.000000000 +0100 @@ -355,7 +355,7 @@ static void free_more_memory(void) for_each_online_node(nid) { zones = first_zones_zonelist(node_zonelist(nid), - gfp_zone(GFP_NOFS)); + NULL, gfp_zone(GFP_NOFS)); if (*zones) try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS); } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/cpuset.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/cpuset.h --- linux-2.6.23-rc3-020_gfpskip/include/linux/cpuset.h 2007-08-13 05:25:24.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/cpuset.h 2007-08-17 16:56:36.000000000 +0100 @@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo void cpuset_update_task_memory_state(void); #define cpuset_nodes_subset_current_mems_allowed(nodes) \ nodes_subset((nodes), current->mems_allowed) -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl); +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask); extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask); extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask); @@ -98,7 +98,7 @@ static inline void cpuset_init_current_m static inline void cpuset_update_task_memory_state(void) {} #define cpuset_nodes_subset_current_mems_allowed(nodes) (1) -static inline int cpuset_zonelist_valid_mems_allowed(struct 
zonelist *zl) +static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) { return 1; } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/gfp.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h --- linux-2.6.23-rc3-020_gfpskip/include/linux/gfp.h 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h 2007-08-17 16:56:36.000000000 +0100 @@ -141,6 +141,10 @@ static inline void arch_alloc_page(struc extern struct page * FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *)); +extern struct page * +FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, + struct zonelist *, nodemask_t *nodemask)); + static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/mempolicy.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/mempolicy.h --- linux-2.6.23-rc3-020_gfpskip/include/linux/mempolicy.h 2007-08-17 16:35:55.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/mempolicy.h 2007-08-17 16:56:36.000000000 +0100 @@ -63,9 +63,8 @@ struct mempolicy { atomic_t refcnt; short policy; /* See MPOL_* above */ union { - struct zonelist *zonelist; /* bind */ short preferred_node; /* preferred */ - nodemask_t nodes; /* interleave */ + nodemask_t nodes; /* interleave/bind */ /* undefined for default */ } v; nodemask_t cpuset_mems_allowed; /* mempolicy relative to these nodes */ diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/mmzone.h --- linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h 2007-08-17 16:56:20.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/mmzone.h 2007-08-17 17:31:05.000000000 +0100 @@ -696,6 +696,16 @@ static inline struct zonelist *node_zone return &NODE_DATA(nid)->node_zonelist; } +static 
inline int zone_in_nodemask(unsigned long zone_addr, + nodemask_t *nodes) +{ +#ifdef CONFIG_NUMA + return node_isset(zonelist_zone(zone_addr)->node, *nodes); +#else + return 1; +#endif /* CONFIG_NUMA */ +} + static inline unsigned long *zonelist_gfp_skip(struct zonelist *zonelist, enum zone_type highest_zoneidx) { @@ -704,26 +714,57 @@ static inline unsigned long *zonelist_gf /* Returns the first zone at or below highest_zoneidx in a zonelist */ static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist, + nodemask_t *nodes, enum zone_type highest_zoneidx) { - unsigned long *z; - for (z = zonelist_gfp_skip(zonelist, highest_zoneidx); - zonelist_zone_idx(*z) > highest_zoneidx; - z++); + unsigned long *z = zonelist_gfp_skip(zonelist, highest_zoneidx); + + /* Only filter based on the nodemask if it's set */ + if (likely(nodes == NULL)) + for (;zonelist_zone_idx(*z) > highest_zoneidx; + z++); + else + for (;zonelist_zone_idx(*z) > highest_zoneidx || + !zone_in_nodemask(*z, nodes); + z++); return z; } /* Returns the next zone at or below highest_zoneidx in a zonelist */ static inline unsigned long *next_zones_zonelist(unsigned long *z, + nodemask_t *nodes, enum zone_type highest_zoneidx) { - for (++z; - zonelist_zone_idx(*z) > highest_zoneidx; - z++); + z++; + + /* Only filter based on the nodemask if it's set */ + if (likely(nodes == NULL)) + for (;zonelist_zone_idx(*z) > highest_zoneidx; + z++); + else + for (;zonelist_zone_idx(*z) > highest_zoneidx || + !zone_in_nodemask(*z, nodes); + z++); return z; } /** + * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask + * @zone - The current zone in the iterator + * @z - The current pointer within zonelist->zones being iterated + * @zlist - The zonelist being iterated + * @highidx - The zone index of the highest zone to return + * @nodemask - Nodemask allowed by the allocator + * + * This iterator iterates though all 
zones at or below a given zone index and + * within a given nodemask + */ +#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \ + for (z = first_zones_zonelist(zlist, nodemask, highidx), zone = zonelist_zone(*z); \ + zone; \ + z = next_zones_zonelist(z, nodemask, highidx), zone = zonelist_zone(*z)) + +/** * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index * @zone - The current zone in the iterator * @z - The current pointer within zonelist->zones being iterated @@ -733,9 +774,7 @@ static inline unsigned long *next_zones_ * This iterator iterates though all zones at or below a given zone index. */ #define for_each_zone_zonelist(zone, z, zlist, highidx) \ - for (z = first_zones_zonelist(zlist, highidx), zone = zonelist_zone(*z); \ - zone; \ - z = next_zones_zonelist(z, highidx), zone = zonelist_zone(*z)) + for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL) #ifdef CONFIG_SPARSEMEM #include <asm/sparsemem.h> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/kernel/cpuset.c linux-2.6.23-rc3-030_filter_nodemask/kernel/cpuset.c --- linux-2.6.23-rc3-020_gfpskip/kernel/cpuset.c 2007-08-17 16:36:04.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/kernel/cpuset.c 2007-08-17 16:56:36.000000000 +0100 @@ -2327,21 +2327,19 @@ nodemask_t cpuset_mems_allowed(struct ta } /** - * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed - * @zl: the zonelist to be checked + * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed + * @nodemask: the nodemask to be checked * - * Are any of the nodes on zonelist zl allowed in current->mems_allowed? + * Are any of the nodes in the nodemask allowed in current->mems_allowed? 
*/ -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl) +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) { - int i; - - for (i = 0; zl->_zones[i]; i++) { - int nid = zone_to_nid(zonelist_zone(zl->_zones[i])); + int nid; + for_each_node_mask(nid, *nodemask) if (node_isset(nid, current->mems_allowed)) return 1; - } + return 0; } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c linux-2.6.23-rc3-030_filter_nodemask/mm/mempolicy.c --- linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c 2007-08-17 16:55:31.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/mm/mempolicy.c 2007-08-17 17:00:07.000000000 +0100 @@ -131,43 +131,20 @@ static int mpol_check_policy(int mode, n return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL; } -/* Generate a custom zonelist for the BIND policy. */ -static struct zonelist *bind_zonelist(nodemask_t *nodes) +/* Check that the nodemask contains at least one populated zone */ +static int is_valid_nodemask(nodemask_t *nodemask) { - struct zonelist *zl; - int num, max, nd; - enum zone_type k; + int nd, k; - max = 1 + MAX_NR_ZONES * nodes_weight(*nodes); - max++; /* space for zlcache_ptr (see mmzone.h) */ - max += sizeof(unsigned short) * MAX_NR_ZONES; /* gfp_skip */ - zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL); - if (!zl) - return ERR_PTR(-ENOMEM); - zl->zlcache_ptr = NULL; - memset(zl->gfp_skip, 0, sizeof(zl->gfp_skip)); - num = 0; - /* First put in the highest zones from all nodes, then all the next - lower zones etc. Avoid empty zones because the memory allocator - doesn't like them. If you implement node hot removal you - have to fix that. 
*/ + /* Check that there is something useful in this mask */ k = policy_zone; - while (1) { - for_each_node_mask(nd, *nodes) { - struct zone *z = &NODE_DATA(nd)->node_zones[k]; - if (z->present_pages > 0) - zl->_zones[num++] = encode_zone_idx(z); - } - if (k == 0) - break; - k--; - } - if (num == 0) { - kfree(zl); - return ERR_PTR(-EINVAL); + for_each_node_mask(nd, *nodemask) { + struct zone *z = &NODE_DATA(nd)->node_zones[k]; + if (z->present_pages > 0) + return 1; } - zl->_zones[num] = 0; - return zl; + + return 0; } /* Create a new policy */ @@ -198,12 +175,11 @@ static struct mempolicy *mpol_new(int mo policy->v.preferred_node = -1; break; case MPOL_BIND: - policy->v.zonelist = bind_zonelist(nodes); - if (IS_ERR(policy->v.zonelist)) { - void *error_code = policy->v.zonelist; + if (!is_valid_nodemask(nodes)) { kmem_cache_free(policy_cache, policy); - return error_code; + return ERR_PTR(-EINVAL); } + policy->v.nodes = *nodes; break; } policy->policy = mode; @@ -481,19 +457,13 @@ long do_set_mempolicy(int mode, nodemask /* Fill a zone bitmap for a policy */ static void get_zonemask(struct mempolicy *p, nodemask_t *nodes) { - int i; nodes_clear(*nodes); switch (p->policy) { - case MPOL_BIND: - for (i = 0; p->v.zonelist->_zones[i]; i++) { - struct zone *zone; - zone = zonelist_zone(p->v.zonelist->_zones[i]); - node_set(zone_to_nid(zone), *nodes); - } - break; case MPOL_DEFAULT: break; + case MPOL_BIND: + /* Fall through */ case MPOL_INTERLEAVE: *nodes = p->v.nodes; break; @@ -1094,6 +1064,17 @@ static struct mempolicy * get_vma_policy return pol; } +/* Return a nodemask represnting a mempolicy */ +static nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy) +{ + /* Lower zones don't get a nodemask applied for MPOL_BIND */ + if (policy->policy == MPOL_BIND && + gfp_zone(gfp) >= policy_zone && + cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)) + return &policy->v.nodes; + + return NULL; +} /* Return a zonelist representing a mempolicy */ static struct 
zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy) { @@ -1106,11 +1087,6 @@ static struct zonelist *zonelist_policy( nd = numa_node_id(); break; case MPOL_BIND: - /* Lower zones don't get a policy applied */ - /* Careful: current->mems_allowed might have moved */ - if (gfp_zone(gfp) >= policy_zone) - if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist)) - return policy->v.zonelist; /*FALL THROUGH*/ case MPOL_INTERLEAVE: /* should not happen */ case MPOL_DEFAULT: @@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol case MPOL_INTERLEAVE: return interleave_nodes(policy); - case MPOL_BIND: + case MPOL_BIND: { /* * Follow bind policy behavior and start allocation at the * first node. */ - return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0])); + struct zonelist *zonelist; + unsigned long *z; + enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL); + zonelist = &NODE_DATA(numa_node_id())->node_zonelist; + z = first_zones_zonelist(zonelist, &policy->v.nodes, + highest_zoneidx); + return zone_to_nid(zonelist_zone(*z)); + } case MPOL_PREFERRED: if (policy->v.preferred_node >= 0) @@ -1272,7 +1255,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area nid = interleave_nid(pol, vma, addr, PAGE_SHIFT); return alloc_page_interleave(gfp, 0, nid); } - return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol)); + return __alloc_pages_nodemask(gfp, 0, + zonelist_policy(gfp, pol), nodemask_policy(gfp, pol)); } /** @@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem } *new = *old; atomic_set(&new->refcnt, 1); - if (new->policy == MPOL_BIND) { - int sz = ksize(old->v.zonelist); - new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL); - if (!new->v.zonelist) { - kmem_cache_free(policy_cache, new); - return ERR_PTR(-ENOMEM); - } - } return new; } @@ -1351,21 +1327,12 @@ int __mpol_equal(struct mempolicy *a, st switch (a->policy) { case MPOL_DEFAULT: return 1; + case MPOL_BIND: + /* Fall through */ case MPOL_INTERLEAVE: return 
nodes_equal(a->v.nodes, b->v.nodes); case MPOL_PREFERRED: return a->v.preferred_node == b->v.preferred_node; - case MPOL_BIND: { - int i; - for (i = 0; a->v.zonelist->_zones[i]; i++) { - struct zone *za, *zb; - za = zonelist_zone(a->v.zonelist->_zones[i]); - zb = zonelist_zone(b->v.zonelist->_zones[i]); - if (za != zb) - return 0; - } - return b->v.zonelist->_zones[i] == 0; - } default: BUG(); return 0; @@ -1377,8 +1344,6 @@ void __mpol_free(struct mempolicy *p) { if (!atomic_dec_and_test(&p->refcnt)) return; - if (p->policy == MPOL_BIND) - kfree(p->v.zonelist); p->policy = MPOL_DEFAULT; kmem_cache_free(policy_cache, p); } @@ -1668,6 +1633,8 @@ void mpol_rebind_policy(struct mempolicy switch (pol->policy) { case MPOL_DEFAULT: break; + case MPOL_BIND: + /* Fall through */ case MPOL_INTERLEAVE: nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask); pol->v.nodes = tmp; @@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy *mpolmask, *newmask); *mpolmask = *newmask; break; - case MPOL_BIND: { - nodemask_t nodes; - unsigned long *z; - struct zonelist *zonelist; - - nodes_clear(nodes); - for (z = pol->v.zonelist->_zones; *z; z++) - node_set(zone_to_nid(zonelist_zone(*z)), nodes); - nodes_remap(tmp, nodes, *mpolmask, *newmask); - nodes = tmp; - - zonelist = bind_zonelist(&nodes); - - /* If no mem, then zonelist is NULL and we keep old zonelist. - * If that old zonelist has no remaining mems_allowed nodes, - * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT. 
- */ - - if (!IS_ERR(zonelist)) { - /* Good - got mem - substitute new zonelist */ - kfree(pol->v.zonelist); - pol->v.zonelist = zonelist; - } - *mpolmask = *newmask; - break; - } default: BUG(); break; @@ -1768,9 +1709,7 @@ static inline int mpol_to_str(char *buff break; case MPOL_BIND: - get_zonemask(pol, &nodes); - break; - + /* Fall through */ case MPOL_INTERLEAVE: nodes = pol->v.nodes; break; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c --- linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c 2007-08-17 16:55:31.000000000 +0100 +++ linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c 2007-08-17 17:00:27.000000000 +0100 @@ -1147,7 +1147,7 @@ static void zlc_mark_zone_full(struct zo * a page. */ static struct page * -get_page_from_freelist(gfp_t gfp_mask, unsigned int order, +get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, struct zonelist *zonelist, int high_zoneidx, int alloc_flags) { unsigned long *z; @@ -1159,7 +1159,7 @@ get_page_from_freelist(gfp_t gfp_mask, u int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ - z = first_zones_zonelist(zonelist, high_zoneidx); + z = first_zones_zonelist(zonelist, nodemask, high_zoneidx); classzone = zonelist_zone(*z); classzone_idx = zonelist_zone_idx(*z); @@ -1168,7 +1168,8 @@ zonelist_scan: * Scan zonelist, looking for a zone with enough free. * See also cpuset_zone_allowed() comment in kernel/cpuset.c. */ - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + for_each_zone_zonelist_nodemask(zone, z, zonelist, + high_zoneidx, nodemask) { if (NUMA_BUILD && zlc_active && !zlc_zone_worth_trying(zonelist, z, allowednodes)) continue; @@ -1222,8 +1223,8 @@ try_next_zone: * This is the 'heart' of the zoned buddy allocator. 
*/ struct page * fastcall -__alloc_pages(gfp_t gfp_mask, unsigned int order, - struct zonelist *zonelist) +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, nodemask_t *nodemask) { const gfp_t wait = gfp_mask & __GFP_WAIT; enum zone_type high_zoneidx = gfp_zone(gfp_mask); @@ -1248,7 +1249,7 @@ restart: return NULL; } - page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order, + page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); if (page) goto got_pg; @@ -1293,7 +1294,7 @@ restart: * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. * See also cpuset_zone_allowed() comment in kernel/cpuset.c. */ - page = get_page_from_freelist(gfp_mask, order, zonelist, + page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags); if (page) goto got_pg; @@ -1306,7 +1307,7 @@ rebalance: if (!(gfp_mask & __GFP_NOMEMALLOC)) { nofail_alloc: /* go through the zonelist yet again, ignoring mins */ - page = get_page_from_freelist(gfp_mask, order, + page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); if (page) goto got_pg; @@ -1338,7 +1339,7 @@ nofail_alloc: cond_resched(); if (likely(did_some_progress)) { - page = get_page_from_freelist(gfp_mask, order, + page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags); if (page) goto got_pg; @@ -1349,8 +1350,9 @@ nofail_alloc: * a parallel oom killing, we must fail if we're still * under heavy pressure. 
*/ - page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order, - zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET); + page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, + order, zonelist, high_zoneidx, + ALLOC_WMARK_HIGH|ALLOC_CPUSET); if (page) goto got_pg; @@ -1394,6 +1396,14 @@ got_pg: return page; } +struct page * fastcall +__alloc_pages(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist) +{ + return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); +} + + EXPORT_SYMBOL(__alloc_pages); /* @@ -2055,7 +2065,7 @@ static void build_zonelist_gfpskip(pg_da for (target = 0; target < MAX_NR_ZONES; target++) { unsigned long *z; - z = first_zones_zonelist(zl, target); + z = first_zones_zonelist(zl, NULL, target); zl->gfp_skip[target] = z - zl->_zones; } } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask 2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman @ 2007-08-17 21:29 ` Christoph Lameter 2007-08-21 9:12 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Christoph Lameter @ 2007-08-17 21:29 UTC (permalink / raw) To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On Fri, 17 Aug 2007, Mel Gorman wrote: > @@ -696,6 +696,16 @@ static inline struct zonelist *node_zone > return &NODE_DATA(nid)->node_zonelist; > } > > +static inline int zone_in_nodemask(unsigned long zone_addr, > + nodemask_t *nodes) > +{ > +#ifdef CONFIG_NUMA > + return node_isset(zonelist_zone(zone_addr)->node, *nodes); > +#else > + return 1; > +#endif /* CONFIG_NUMA */ > +} > + This is dereferencing the zone in a filtering operation. I wonder if we could encode the node in the zone_addr as well? x86_64 aligns zones on page boundaries. So we have 10 bits left after taking 2 for the zone id. > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl) > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) > { > - int i; > - > - for (i = 0; zl->_zones[i]; i++) { > - int nid = zone_to_nid(zonelist_zone(zl->_zones[i])); > + int nid; > > + for_each_node_mask(nid, *nodemask) > if (node_isset(nid, current->mems_allowed)) > return 1; > - } > + > return 0; Hmmm... This is equivalent to nodemask_t temp; nodes_and(temp, nodemask, current->mems_allowed); return !nodes_empty(temp); which avoids the loop over all nodes. > - } > - if (num == 0) { > - kfree(zl); > - return ERR_PTR(-EINVAL); > + for_each_node_mask(nd, *nodemask) { > + struct zone *z = &NODE_DATA(nd)->node_zones[k]; > + if (z->present_pages > 0) > + return 1; Here you could use an and with the N_HIGH_MEMORY or N_NORMAL_MEMORY nodemask. 
> @@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol > case MPOL_INTERLEAVE: > return interleave_nodes(policy); > > - case MPOL_BIND: > + case MPOL_BIND: { No { } needed. > /* > * Follow bind policy behavior and start allocation at the > * first node. > */ > - return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0])); > + struct zonelist *zonelist; > + unsigned long *z; > + enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL); > + zonelist = &NODE_DATA(numa_node_id())->node_zonelist; > + z = first_zones_zonelist(zonelist, &policy->v.nodes, > + highest_zoneidx); > + return zone_to_nid(zonelist_zone(*z)); > + } > > case MPOL_PREFERRED: > if (policy->v.preferred_node >= 0) > @@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem > } > *new = *old; > atomic_set(&new->refcnt, 1); > - if (new->policy == MPOL_BIND) { > - int sz = ksize(old->v.zonelist); > - new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL); > - if (!new->v.zonelist) { > - kmem_cache_free(policy_cache, new); > - return ERR_PTR(-ENOMEM); > - } > - } > return new; That is a good optimization. > @@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy > *mpolmask, *newmask); > *mpolmask = *newmask; > break; > - case MPOL_BIND: { > - nodemask_t nodes; > - unsigned long *z; > - struct zonelist *zonelist; > - > - nodes_clear(nodes); > - for (z = pol->v.zonelist->_zones; *z; z++) > - node_set(zone_to_nid(zonelist_zone(*z)), nodes); > - nodes_remap(tmp, nodes, *mpolmask, *newmask); > - nodes = tmp; > - > - zonelist = bind_zonelist(&nodes); > - > - /* If no mem, then zonelist is NULL and we keep old zonelist. > - * If that old zonelist has no remaining mems_allowed nodes, > - * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT. > - */ > - > - if (!IS_ERR(zonelist)) { > - /* Good - got mem - substitute new zonelist */ > - kfree(pol->v.zonelist); > - pol->v.zonelist = zonelist; > - } > - *mpolmask = *newmask; > - break; > - } Simply dropped? 
We still need to recalculate the node_mask depending on the new cpuset environment! ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask 2007-08-17 21:29 ` Christoph Lameter @ 2007-08-21 9:12 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-21 9:12 UTC (permalink / raw) To: Christoph Lameter; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On (17/08/07 14:29), Christoph Lameter didst pronounce: > On Fri, 17 Aug 2007, Mel Gorman wrote: > > > @@ -696,6 +696,16 @@ static inline struct zonelist *node_zone > > return &NODE_DATA(nid)->node_zonelist; > > } > > > > +static inline int zone_in_nodemask(unsigned long zone_addr, > > + nodemask_t *nodes) > > +{ > > +#ifdef CONFIG_NUMA > > + return node_isset(zonelist_zone(zone_addr)->node, *nodes); > > +#else > > + return 1; > > +#endif /* CONFIG_NUMA */ > > +} > > + > > This is dereferencind the zone in a filtering operation. I wonder if > we could encode the node in the zone_addr as well? x86_64 aligns zones on > page boundaries. So we have 10 bits left after taking 2 for the zone id. > I had considered it but not gotten around to an implementation. A quick look shows that it is likely to be a win on x86_64 and ppc64 as in those places NODES_SHIFT is small enough to fit into the lower bits of the zone addresses. It does not appear to be the case on IA-64 though. The INTERNODE_CACHE_SHIFT will be around 7 but the NODES_SHIFT defaults to 10 so it will not fit. I'll try it out anyway. > > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl) > > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) > > { > > - int i; > > - > > - for (i = 0; zl->_zones[i]; i++) { > > - int nid = zone_to_nid(zonelist_zone(zl->_zones[i])); > > + int nid; > > > > + for_each_node_mask(nid, *nodemask) > > if (node_isset(nid, current->mems_allowed)) > > return 1; > > - } > > + > > return 0; > > Hmmm... This is equivalent to > > nodemask_t temp; > > nodes_and(temp, nodemask, current->mems_allowed); > return !nodes_empty(temp); > > which avoids the loop over all nodes. 
> Good point. I've replaced the code with your version. > > - } > > - if (num == 0) { > > - kfree(zl); > > - return ERR_PTR(-EINVAL); > > + for_each_node_mask(nd, *nodemask) { > > + struct zone *z = &NODE_DATA(nd)->node_zones[k]; > > + if (z->present_pages > 0) > > + return 1; > > Here you could use an and with the N_HIGH_MEMORY or N_NORMAL_MEMORY > nodemask. > I'm basing against 2.6.23-rc3 at the moment. I'll add an additional patch later to use the N_HIGH_MEMORY and N_NORMAL_MEMORY nodemasks. > > @@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol > > case MPOL_INTERLEAVE: > > return interleave_nodes(policy); > > > > - case MPOL_BIND: > > + case MPOL_BIND: { > > No { } needed. > > > /* > > * Follow bind policy behavior and start allocation at the > > * first node. > > */ > > - return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0])); > > + struct zonelist *zonelist; > > + unsigned long *z; Without the {}, it would fail to compile here > > + enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL); > > + zonelist = &NODE_DATA(numa_node_id())->node_zonelist; > > + z = first_zones_zonelist(zonelist, &policy->v.nodes, > > + highest_zoneidx); > > + return zone_to_nid(zonelist_zone(*z)); > > + } > > > > case MPOL_PREFERRED: > > if (policy->v.preferred_node >= 0) > > > @@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem > > } > > *new = *old; > > atomic_set(&new->refcnt, 1); > > - if (new->policy == MPOL_BIND) { > > - int sz = ksize(old->v.zonelist); > > - new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL); > > - if (!new->v.zonelist) { > > - kmem_cache_free(policy_cache, new); > > - return ERR_PTR(-ENOMEM); > > - } > > - } > > return new; > > That is a good optimization. 
> Thanks > > @@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy > > *mpolmask, *newmask); > > *mpolmask = *newmask; > > break; > > - case MPOL_BIND: { > > - nodemask_t nodes; > > - unsigned long *z; > > - struct zonelist *zonelist; > > - > > - nodes_clear(nodes); > > - for (z = pol->v.zonelist->_zones; *z; z++) > > - node_set(zone_to_nid(zonelist_zone(*z)), nodes); > > - nodes_remap(tmp, nodes, *mpolmask, *newmask); > > - nodes = tmp; > > - > > - zonelist = bind_zonelist(&nodes); > > - > > - /* If no mem, then zonelist is NULL and we keep old zonelist. > > - * If that old zonelist has no remaining mems_allowed nodes, > > - * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT. > > - */ > > - > > - if (!IS_ERR(zonelist)) { > > - /* Good - got mem - substitute new zonelist */ > > - kfree(pol->v.zonelist); > > - pol->v.zonelist = zonelist; > > - } > > - *mpolmask = *newmask; > > - break; > > - } > > Simply dropped? We still need to recalculate the node_mask depending on > the new cpuset environment! > It's not simply dropped. The previous patch chunk made the MPOL_BIND case fall through to take the same action as MPOL_INTERLEAVE. Is that wrong? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() 2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman ` (4 preceding siblings ...) 2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman @ 2007-08-17 20:18 ` Mel Gorman 2007-08-17 21:07 ` Christoph Lameter 5 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-08-17 20:18 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm One PPC64 machine using gcc 3.4.6 fails to boot when __alloc_pages_nodemask() uses the FASTCALL calling convention. It is not clear why this machine in particular is affected as other PPC64 machines boot. The only unusual aspect of the machine is that it has memoryless nodes but I couldn't see any problem using them. The error received looks like Initializing hardware... storageUnable to handle kernel paging request for data at address 0xffffffff Faulting instruction address: 0xc0000000001aaa0c cpu 0x4: Vector: 300 (Data Access) at [c00000000fea7650] pc: c0000000001aaa0c: .strnlen+0x10/0x3c lr: c0000000001ab880: .vsnprintf+0x378/0x644 sp: c00000000fea78d0 msr: 9000000000009032 dar: ffffffff dsisr: 40000000 current = 0xc00000003fe4d7a0 paca = 0xc000000000487300 pid = 1178, comm = 05-wait_for_sys enter ? 
for help [link register ] c0000000001ab880 .vsnprintf+0x378/0x644 [c00000000fea78d0] c0000000003cad35 (unreliable) [c00000000fea7990] c0000000001abc70 .sprintf+0x3c/0x4c [c00000000fea7a10] c00000000021d5c0 .show_uevent+0x150/0x1a4 [c00000000fea7bb0] c00000000021cedc .dev_attr_show+0x44/0x60 [c00000000fea7c30] c000000000143874 .sysfs_read_file+0x128/0x208 [c00000000fea7cf0] c0000000000d71bc .vfs_read+0x134/0x1f8 [c00000000fea7d90] c0000000000d75f4 .sys_read+0x4c/0x8c [c00000000fea7e30] c00000000000852c syscall_exit+0x0/0x40 --- Exception: c01 (System Call) at 000000000ff65894 SP (ff90f730) is in userspace This patch creates an inline version of __alloc_pages called __alloc_pages_internal() which allows the machine to boot. Both __alloc_pages and __alloc_pages_nodemask use this internal function but only __alloc_pages() uses FASTCALL. Opinions as to why FASTCALL breaks on one machine are welcome. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/gfp.h | 3 +-- mm/page_alloc.c | 13 ++++++++++--- 2 files changed, 11 insertions(+), 5 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc3-035_nofastcall/include/linux/gfp.h --- linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h 2007-08-17 16:56:36.000000000 +0100 +++ linux-2.6.23-rc3-035_nofastcall/include/linux/gfp.h 2007-08-17 17:00:37.000000000 +0100 @@ -142,8 +142,7 @@ extern struct page * FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *)); extern struct page * -FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int, - struct zonelist *, nodemask_t *nodemask)); +__alloc_pages_nodemask(gfp_t, unsigned int, struct zonelist *, nodemask_t *); static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c linux-2.6.23-rc3-035_nofastcall/mm/page_alloc.c --- 
linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c 2007-08-17 17:00:27.000000000 +0100 +++ linux-2.6.23-rc3-035_nofastcall/mm/page_alloc.c 2007-08-17 17:00:37.000000000 +0100 @@ -1222,8 +1222,8 @@ try_next_zone: /* * This is the 'heart' of the zoned buddy allocator. */ -struct page * fastcall -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, +static inline struct page * +__alloc_pages_internal(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, nodemask_t *nodemask) { const gfp_t wait = gfp_mask & __GFP_WAIT; @@ -1396,11 +1396,18 @@ got_pg: return page; } +struct page * +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, nodemask_t *nodemask) +{ + return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask); +} + struct page * fastcall __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { - return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); + return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() 2007-08-17 20:18 ` [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() Mel Gorman @ 2007-08-17 21:07 ` Christoph Lameter 2007-08-18 12:51 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: Christoph Lameter @ 2007-08-17 21:07 UTC (permalink / raw) To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm On Fri, 17 Aug 2007, Mel Gorman wrote: > Opinions as to why FASTCALL breaks on one machine are welcome. Could we get rid of FASTCALL? AFAIK the compiler should automatically choose the right calling convention? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() 2007-08-17 21:07 ` Christoph Lameter @ 2007-08-18 12:51 ` Andi Kleen 2007-08-21 10:25 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Andi Kleen @ 2007-08-18 12:51 UTC (permalink / raw) To: Christoph Lameter; +Cc: Mel Gorman, Lee.Schermerhorn, linux-kernel, linux-mm

On Friday 17 August 2007 23:07:33 Christoph Lameter wrote:
> On Fri, 17 Aug 2007, Mel Gorman wrote:
> >
> > Opinions as to why FASTCALL breaks on one machine are welcome.
>
> Could we get rid of FASTCALL? AFAIK the compiler should automatically
> choose the right calling convention?

It was a nop for some time because register parameters are always enabled on i386 and AFAIK no other architectures ever used it. Some out-of-tree trees seem to disable register parameters, though that's not really a concern.

-Andi

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() 2007-08-18 12:51 ` Andi Kleen @ 2007-08-21 10:25 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-21 10:25 UTC (permalink / raw) To: Andi Kleen; +Cc: Christoph Lameter, Lee.Schermerhorn, linux-kernel, linux-mm

On (18/08/07 14:51), Andi Kleen didst pronounce:
> On Friday 17 August 2007 23:07:33 Christoph Lameter wrote:
> > On Fri, 17 Aug 2007, Mel Gorman wrote:
> > >
> > > Opinions as to why FASTCALL breaks on one machine are welcome.
> >
> > Could we get rid of FASTCALL? AFAIK the compiler should automatically
> > choose the right calling convention?
>
> It was a nop for some time because register parameters are always enabled
> on i386 and AFAIK no other architectures ever used it. Some out-of-tree
> trees seem to disable register parameters, though that's not
> really a concern.
>

You're right. It now makes even less sense that it was a PPC64 machine that exhibited the problem. It should have made no difference at all.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 @ 2007-08-31 20:51 Mel Gorman 2007-08-31 20:51 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-08-31 20:51 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The main changes here are the changeover to -mm and the dropping of gfp_skip until it has proven performance benefit to scanning. The -mm switch is not straightforward as the patches collide heavily with the memoryless patches; this set has the memoryless patches as a pre-requisite for smooth merging. Node ID embedding in the zonelist->_zones was implemented but it was ineffectual. Only the VSMP sub-architecture on x86_64 has enough space to store the node ID, so I dropped the patch again. If there are no major objections to this, I'll push these patches towards Andrew for -mm and wider testing. The full description of the patchset is after the changelog.

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patch collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines, noticeably better performance on gcc-3.4.
  No difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
  currently inconclusive but it definitely consumes at least one cache line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask.
  This removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered an alternative fix for the MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The third patch introduces filtering of the zonelists based on a nodemask. The fourth patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. The final patch replaces some static inline functions with macros. This is purely for gcc 3.4 and possibly older versions that produce inferior code.
For readability, the patch can be dropped, but if performance problems are discovered, the compiler version and this final patch should be considered.

Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists.

These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to 3.05%
Elapsed time on Kernbench:   -0.25% to 2.96%
page_test from aim9:         -6.98% to 5.60%
brk_test from aim9:          -3.94% to 4.11%
fork_test from aim9:         -5.72% to 4.14%
exec_test from aim9:         -1.02% to 1.56%

The TBench figures were too variable between runs to draw conclusions from but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-08-31 20:51 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman @ 2007-08-31 20:51 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-08-31 20:51 UTC (permalink / raw) To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 9 ++++++--- 3 files changed, 8 insertions(+), 5 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-mm1-clean/include/linux/swap.h linux-2.6.23-rc3-mm1-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc3-mm1-clean/include/linux/swap.h 2007-08-22 11:32:13.000000000 +0100 +++ linux-2.6.23-rc3-mm1-005_freepages_zonelist/include/linux/swap.h 2007-08-31 16:54:44.000000000 +0100 @@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-mm1-clean/mm/page_alloc.c linux-2.6.23-rc3-mm1-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc3-mm1-clean/mm/page_alloc.c 2007-08-22 11:32:13.000000000 +0100 +++ linux-2.6.23-rc3-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-08-31 
16:54:44.000000000 +0100 @@ -1665,7 +1665,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-mm1-clean/mm/vmscan.c linux-2.6.23-rc3-mm1-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc3-mm1-clean/mm/vmscan.c 2007-08-22 11:32:13.000000000 +0100 +++ linux-2.6.23-rc3-mm1-005_freepages_zonelist/mm/vmscan.c 2007-08-31 16:54:44.000000000 +0100 @@ -1180,10 +1180,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1221,7 +1222,8 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { int priority; int ret = 0; @@ -1229,6 +1231,7 @@ unsigned long try_to_free_pages(struct z unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1256,7 +1259,7 @@ unsigned long try_to_free_pages(struct z sc.nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, &sc); + nr_reclaimed += shrink_zones(priority, zonelist, &sc); shrink_slab(sc.nr_scanned, gfp_mask, lru_pages); if (reclaim_state) { nr_reclaimed += reclaim_state->reclaimed_slab; ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 @ 2007-09-11 15:19 Mel Gorman 2007-09-11 15:19 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-09-11 15:19 UTC (permalink / raw) To: apw; +Cc: Mel Gorman, linux-kernel, linux-mm

This is the latest version of one-zonelist and it should be solid enough for wider testing. To briefly summarise, the patchset replaces multiple zonelists-per-node with one zonelist that is filtered based on nodemask and GFP flags. I've dropped the patch that replaces inline functions with macros from the end as it obscures the code for something that may or may not be a performance benefit on older compilers. If we see performance regressions that might have something to do with it, the patch is trivial to bring forward.

Andrew, please merge to -mm for wider testing and consideration for merging to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE and MPOL_BIND.

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patch collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines, noticeably better performance on gcc-3.4.
  No difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit.
  Tests are currently inconclusive but it definitely consumes at least one cache line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask.
  This removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered an alternative fix for the MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered.
The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-09-11 15:19 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman @ 2007-09-11 15:19 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-09-11 15:19 UTC (permalink / raw) To: apw; +Cc: Mel Gorman, linux-kernel, linux-mm The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 13 ++++++++----- 3 files changed, 10 insertions(+), 7 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h 2007-09-10 16:06:06.000000000 +0100 @@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); extern int __isolate_lru_page(struct page *page, int mode); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c 2007-09-10 09:29:14.000000000 +0100 +++ 
linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-10 16:06:06.000000000 +0100 @@ -1667,7 +1667,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c 2007-09-10 16:06:06.000000000 +0100 @@ -1207,10 +1207,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1248,7 +1249,7 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, +unsigned long do_try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask, struct scan_control *sc) { int priority; @@ -1257,6 +1258,7 @@ unsigned long do_try_to_free_pages(struc unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; count_vm_event(ALLOCSTALL); @@ -1275,7 +1277,7 @@ unsigned long do_try_to_free_pages(struc sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, sc); + nr_reclaimed += shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit containers @@ -1333,7 +1335,8 @@ out: return ret; } -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1346,7 +1349,7 @@ unsigned long try_to_free_pages(struct z .isolate_pages = isolate_pages_global, }; - return do_try_to_free_pages(zones, gfp_mask, &sc); + return do_try_to_free_pages(zonelist, gfp_mask, &sc); } #ifdef CONFIG_CONTAINER_MEM_CONT ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend) @ 2007-09-11 21:30 Mel Gorman 2007-09-11 21:30 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-09-11 21:30 UTC (permalink / raw) To: Lee.Schermerhorn, akpm, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

(Sorry for the resend, I mucked up the TO: line in the earlier sending)

This is the latest version of one-zonelist and it should be solid enough for wider testing. To briefly summarise, the patchset replaces multiple zonelists-per-node with one zonelist that is filtered based on nodemask and GFP flags. I've dropped the patch that replaces inline functions with macros from the end as it obscures the code for something that may or may not be a performance benefit on older compilers. If we see performance regressions that might have something to do with it, the patch is trivial to bring forward.

Andrew, please merge to -mm for wider testing and consideration for merging to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE and MPOL_BIND.

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patch collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines, noticeably better performance on gcc-3.4.
  No difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit.
  Tests are currently inconclusive but it definitely consumes at least one cache line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask.
  This removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered an alternative fix for the MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered.
The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-09-11 21:30 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend) Mel Gorman @ 2007-09-11 21:30 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-09-11 21:30 UTC (permalink / raw) To: Lee.Schermerhorn, akpm, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 13 ++++++++----- 3 files changed, 10 insertions(+), 7 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h 2007-09-10 16:06:06.000000000 +0100 @@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); extern int __isolate_lru_page(struct page *page, int mode); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c 2007-09-10 
09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-10 16:06:06.000000000 +0100 @@ -1667,7 +1667,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c 2007-09-10 16:06:06.000000000 +0100 @@ -1207,10 +1207,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1248,7 +1249,7 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, +unsigned long do_try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask, struct scan_control *sc) { int priority; @@ -1257,6 +1258,7 @@ unsigned long do_try_to_free_pages(struc unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; count_vm_event(ALLOCSTALL); @@ -1275,7 +1277,7 @@ unsigned long do_try_to_free_pages(struc sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, sc); + nr_reclaimed += shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit containers @@ -1333,7 +1335,8 @@ out: return ret; } -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1346,7 +1349,7 @@ unsigned long try_to_free_pages(struct z .isolate_pages = isolate_pages_global, }; - return do_try_to_free_pages(zones, gfp_mask, &sc); + return do_try_to_free_pages(zonelist, gfp_mask, &sc); } #ifdef CONFIG_CONTAINER_MEM_CONT ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6 @ 2007-09-12 21:04 Mel Gorman 2007-09-12 21:05 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-09-12 21:04 UTC (permalink / raw) To: Lee.Schermerhorn, kamezawa.hiroyu, clameter Cc: Mel Gorman, linux-kernel, linux-mm

Kamezawa-san,

This version implements your idea of storing a zone pointer and zone_idx in a structure within the zonelist instead of encoding information in a pointer. It has worked out quite well. The performance is comparable on the tests I've run, with similar gains/losses to those I've seen with pointer packing, but this code may be easier to understand. However, the zonelist has doubled in size and consumes more cache lines.

I did not put the node_idx into the structure as it was not clear that there was a real gain from doing that, as the node ID is now rarely used. However, it would be trivial to add if it could be demonstrated to be of real benefit on workloads that make heavy use of nodemasks. I do not have an appropriate test environment for measuring that, but perhaps someone else does. If they are willing to check it out, I'll roll a suitable patch.

Any opinions on whether the slight gain in apparent performance in kernbench is worth the cacheline? It's very difficult to craft a benchmark that notices the extra line being used so this could be a hand-waving issue.

Changelog since V6
o Instead of encoding zone index information in a pointer, this version
  introduces a structure that stores a zone pointer and its index

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel.
Host of memoryless patches collisions dealt with o Do not call wakeup_kswapd() for every zone in a zonelist o Dropped the FASTCALL removal o Have cursor in iterator advance earlier o Use nodes_and in cpuset_nodes_valid_mems_allowed() o Use defines instead of inlines, noticeably better performance on gcc-3.4. No difference on later compilers such as gcc 4.1 o Dropped gfp_skip patch until it is proven to be of benefit. Tests are currently inconclusive but it definitely consumes at least one cache line Changelog since V3 o Fix compile error in the parisc change o Calculate gfp_zone only once in __alloc_pages o Calculate classzone_idx properly in get_page_from_freelist o Alter check so that zone id embedded may still be used on UP o Use Kamezawa-san's suggestion for skipping zones in zonelist o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This removes the need for MPOL_BIND to have a custom zonelist o Move zonelist iterators and helpers to mm.h o Change _zones from struct zone * to unsigned long Changelog since V2 o shrink_zones() uses zonelist instead of zonelist->zones o hugetlb uses zonelist iterator o zone_idx information is embedded in zonelist pointers o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid) Changelog since V1 o Break up the patch into 3 patches o Introduce iterators for zonelists o Performance regression test The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE issue in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases.
The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask, which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64, both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from, but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time PhD Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-09-12 21:04 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6 Mel Gorman @ 2007-09-12 21:05 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-09-12 21:05 UTC (permalink / raw) To: Lee.Schermerhorn, kamezawa.hiroyu, clameter Cc: Mel Gorman, linux-kernel, linux-mm The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 13 ++++++++----- 3 files changed, 10 insertions(+), 7 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h 2007-09-12 16:05:11.000000000 +0100 @@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); extern int __isolate_lru_page(struct page *page, int mode); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c 2007-09-10 
09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-12 16:05:11.000000000 +0100 @@ -1667,7 +1667,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c 2007-09-12 16:05:11.000000000 +0100 @@ -1207,10 +1207,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1248,7 +1249,7 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, +unsigned long do_try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask, struct scan_control *sc) { int priority; @@ -1257,6 +1258,7 @@ unsigned long do_try_to_free_pages(struc unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; count_vm_event(ALLOCSTALL); @@ -1275,7 +1277,7 @@ unsigned long do_try_to_free_pages(struc sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, sc); + nr_reclaimed += shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit containers @@ -1333,7 +1335,8 @@ out: return ret; } -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1346,7 +1349,7 @@ unsigned long try_to_free_pages(struct z .isolate_pages = isolate_pages_global, }; - return do_try_to_free_pages(zones, gfp_mask, &sc); + return do_try_to_free_pages(zonelist, gfp_mask, &sc); } #ifdef CONFIG_CONTAINER_MEM_CONT ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7 @ 2007-09-13 17:52 Mel Gorman 2007-09-13 17:52 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-09-13 17:52 UTC (permalink / raw) To: Lee.Schermerhorn Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, kamezawa.hiroyu, clameter Hi Lee, This is the patchset I would like tested. It has Kamezawa-san's approach for using a structure instead of pointer packing. While it consumes more cache, as Christoph pointed out, it should be an easier starting point to optimise once workloads are identified that can show performance gains/regressions. The pointer packing is a potential optimisation but once in place, it's difficult to alter again. Please let me know how it works out for you. Changelog since V7 o Fix build bug in relation to memory controller combined with one-zonelist o Use while() instead of a stupid looking for() Changelog since V6 o Instead of encoding zone index information in a pointer, this version introduces a structure that stores a zone pointer and its index Changelog since V5 o Rebase to 2.6.23-rc4-mm1 o Drop patch that replaces inline functions with macros Changelog since V4 o Rebase to -mm kernel. Host of memoryless patches collisions dealt with o Do not call wakeup_kswapd() for every zone in a zonelist o Dropped the FASTCALL removal o Have cursor in iterator advance earlier o Use nodes_and in cpuset_nodes_valid_mems_allowed() o Use defines instead of inlines, noticeably better performance on gcc-3.4. No difference on later compilers such as gcc 4.1 o Dropped gfp_skip patch until it is proven to be of benefit.
Tests are currently inconclusive but it definitely consumes at least one cache line Changelog since V3 o Fix compile error in the parisc change o Calculate gfp_zone only once in __alloc_pages o Calculate classzone_idx properly in get_page_from_freelist o Alter check so that zone id embedded may still be used on UP o Use Kamezawa-san's suggestion for skipping zones in zonelist o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This removes the need for MPOL_BIND to have a custom zonelist o Move zonelist iterators and helpers to mm.h o Change _zones from struct zone * to unsigned long Changelog since V2 o shrink_zones() uses zonelist instead of zonelist->zones o hugetlb uses zonelist iterator o zone_idx information is embedded in zonelist pointers o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid) Changelog since V1 o Break up the patch into 3 patches o Introduce iterators for zonelists o Performance regression test The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE issue in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask, which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered.
The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64, both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from, but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time PhD Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-09-13 17:52 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7 Mel Gorman @ 2007-09-13 17:52 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-09-13 17:52 UTC (permalink / raw) To: Lee.Schermerhorn Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, kamezawa.hiroyu, clameter The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 19 +++++++++++-------- 3 files changed, 13 insertions(+), 10 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h 2007-09-13 11:57:20.000000000 +0100 @@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); extern int __isolate_lru_page(struct page *page, int mode); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c 
2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-13 11:57:20.000000000 +0100 @@ -1667,7 +1667,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c 2007-09-10 09:29:14.000000000 +0100 +++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c 2007-09-13 11:57:20.000000000 +0100 @@ -1207,10 +1207,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1248,7 +1249,7 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, +unsigned long do_try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask, struct scan_control *sc) { int priority; @@ -1257,6 +1258,7 @@ unsigned long do_try_to_free_pages(struc unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; count_vm_event(ALLOCSTALL); @@ -1275,7 +1277,7 @@ unsigned long do_try_to_free_pages(struc sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, sc); + nr_reclaimed += shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit containers @@ -1333,7 +1335,8 @@ out: return ret; } -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1346,7 +1349,7 @@ unsigned long try_to_free_pages(struct z .isolate_pages = isolate_pages_global, }; - return do_try_to_free_pages(zones, gfp_mask, &sc); + return do_try_to_free_pages(zonelist, gfp_mask, &sc); } #ifdef CONFIG_CONTAINER_MEM_CONT @@ -1370,11 +1373,11 @@ unsigned long try_to_free_mem_container_ .isolate_pages = mem_container_isolate_pages, }; int node; - struct zone **zones; + struct zonelist *zonelist; for_each_online_node(node) { - zones = NODE_DATA(node)->node_zonelists[ZONE_USERPAGES].zones; - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc)) + zonelist = &NODE_DATA(node)->node_zonelists[ZONE_USERPAGES]; + if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc)) return 1; } return 0; ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 @ 2007-09-28 14:23 Mel Gorman 2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-09-28 14:23 UTC (permalink / raw) To: akpm Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes, kamezawa.hiroyu, clameter Hi Andrew, This is the one-zonelist patchset again. There were multiple collisions with patches in -mm like the policy cleanups, policy refcounting, the memory controller patches and OOM killer changes. The functionality of the code has not changed since the last release. I'm still hoping to merge this to -mm when it is considered a bit more stable. I've added David Rientjes to the cc as the OOM-zone-locking code is affected by this patchset now and I want to be sure I didn't accidentally break it. The changes to try_set_zone_oom() are the most important here. I believe the code is equivalent but a second opinion would not hurt. Changelog since V7 o Rebase to 2.6.23-rc8-mm2 Changelog since V6 o Fix build bug in relation to memory controller combined with one-zonelist o Use while() instead of a stupid looking for() o Instead of encoding zone index information in a pointer, this version introduces a structure that stores a zone pointer and its index Changelog since V5 o Rebase to 2.6.23-rc4-mm1 o Drop patch that replaces inline functions with macros Changelog since V4 o Rebase to -mm kernel. Host of memoryless patches collisions dealt with o Do not call wakeup_kswapd() for every zone in a zonelist o Dropped the FASTCALL removal o Have cursor in iterator advance earlier o Use nodes_and in cpuset_nodes_valid_mems_allowed() o Use defines instead of inlines, noticeably better performance on gcc-3.4. No difference on later compilers such as gcc 4.1 o Dropped gfp_skip patch until it is proven to be of benefit.
Tests are currently inconclusive but it definitely consumes at least one cache line Changelog since V3 o Fix compile error in the parisc change o Calculate gfp_zone only once in __alloc_pages o Calculate classzone_idx properly in get_page_from_freelist o Alter check so that zone id embedded may still be used on UP o Use Kamezawa-san's suggestion for skipping zones in zonelist o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This removes the need for MPOL_BIND to have a custom zonelist o Move zonelist iterators and helpers to mm.h o Change _zones from struct zone * to unsigned long Changelog since V2 o shrink_zones() uses zonelist instead of zonelist->zones o hugetlb uses zonelist iterator o zone_idx information is embedded in zonelist pointers o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid) Changelog since V1 o Break up the patch into 3 patches o Introduce iterators for zonelists o Performance regression test The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE issue in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask, which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered.
The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The fifth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64, both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from, but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time PhD Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages 2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman @ 2007-09-28 14:23 ` Mel Gorman 0 siblings, 0 replies; 27+ messages in thread From: Mel Gorman @ 2007-09-28 14:23 UTC (permalink / raw) To: akpm Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes, kamezawa.hiroyu, clameter The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> --- include/linux/swap.h | 2 +- mm/page_alloc.c | 2 +- mm/vmscan.c | 21 ++++++++++++--------- 3 files changed, 14 insertions(+), 11 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/include/linux/swap.h linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/swap.h --- linux-2.6.23-rc8-mm2-clean/include/linux/swap.h 2007-09-27 14:41:05.000000000 +0100 +++ linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/swap.h 2007-09-28 15:48:35.000000000 +0100 @@ -185,7 +185,7 @@ extern void move_tail_pages(void); extern void swap_setup(void); /* linux/mm/vmscan.c */ -extern unsigned long try_to_free_pages(struct zone **zones, int order, +extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask); extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, gfp_t gfp_mask); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/mm/page_alloc.c linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c --- linux-2.6.23-rc8-mm2-clean/mm/page_alloc.c 2007-09-27 14:41:05.000000000 +0100 +++ 
linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c 2007-09-28 15:48:35.000000000 +0100 @@ -1668,7 +1668,7 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask); + did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/mm/vmscan.c linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/vmscan.c --- linux-2.6.23-rc8-mm2-clean/mm/vmscan.c 2007-09-27 14:41:05.000000000 +0100 +++ linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/vmscan.c 2007-09-28 15:48:35.000000000 +0100 @@ -1204,10 +1204,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zone **zones, +static unsigned long shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { unsigned long nr_reclaimed = 0; + struct zone **zones = zonelist->zones; int i; sc->all_unreclaimable = 1; @@ -1245,8 +1246,8 @@ static unsigned long shrink_zones(int pr * holds filesystem locks which prevent writeout this might not work, and the * allocation attempt will fail. 
*/ -static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, - struct scan_control *sc) +static unsigned long do_try_to_free_pages(struct zonelist *zonelist, + gfp_t gfp_mask, struct scan_control *sc) { int priority; int ret = 0; @@ -1254,6 +1255,7 @@ static unsigned long do_try_to_free_page unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; + struct zone **zones = zonelist->zones; int i; count_vm_event(ALLOCSTALL); @@ -1272,7 +1274,7 @@ static unsigned long do_try_to_free_page sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zones, sc); + nr_reclaimed += shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1330,7 +1332,8 @@ out: return ret; } -unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask) +unsigned long try_to_free_pages(struct zonelist *zonelist, int order, + gfp_t gfp_mask) { struct scan_control sc = { .gfp_mask = gfp_mask, @@ -1343,7 +1346,7 @@ unsigned long try_to_free_pages(struct z .isolate_pages = isolate_pages_global, }; - return do_try_to_free_pages(zones, gfp_mask, &sc); + return do_try_to_free_pages(zonelist, gfp_mask, &sc); } #ifdef CONFIG_CGROUP_MEM_CONT @@ -1362,12 +1365,12 @@ unsigned long try_to_free_mem_cgroup_pag .isolate_pages = mem_cgroup_isolate_pages, }; int node; - struct zone **zones; + struct zonelist *zonelist; int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE); for_each_online_node(node) { - zones = NODE_DATA(node)->node_zonelists[target_zone].zones; - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc)) + zonelist = &NODE_DATA(node)->node_zonelists[target_zone]; + if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc)) return 1; } return 0; ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9 @ 2007-11-09 14:32 Mel Gorman 2007-11-09 14:32 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2007-11-09 14:32 UTC (permalink / raw) To: akpm Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes, nacc, kamezawa.hiroyu, clameter This is basically a rebase to the broken-out -mm tree. Since v8, two fixes have been applied for problems that showed up during testing. Most machines I test -mm on are failing to boot for a variety of reasons, but the two machines that did boot appeared to work fine. Changelog since V8 o Rebase to 2.6.24-rc2 o Added ack for the OOM changes o Behave correctly when GFP_THISNODE and a node ID are specified o Clear up warning over type of nodes_intersects() function Changelog since V7 o Rebase to 2.6.23-rc8-mm2 Changelog since V6 o Fix build bug in relation to memory controller combined with one-zonelist o Use while() instead of a stupid looking for() o Instead of encoding zone index information in a pointer, this version introduces a structure that stores a zone pointer and its index Changelog since V5 o Rebase to 2.6.23-rc4-mm1 o Drop patch that replaces inline functions with macros Changelog since V4 o Rebase to -mm kernel. Host of memoryless patches collisions dealt with o Do not call wakeup_kswapd() for every zone in a zonelist o Dropped the FASTCALL removal o Have cursor in iterator advance earlier o Use nodes_and in cpuset_nodes_valid_mems_allowed() o Use defines instead of inlines, noticeably better performance on gcc-3.4. No difference on later compilers such as gcc 4.1 o Dropped gfp_skip patch until it is proven to be of benefit.
Tests are currently inconclusive but it definitely consumes at least one cache line Changelog since V3 o Fix compile error in the parisc change o Calculate gfp_zone only once in __alloc_pages o Calculate classzone_idx properly in get_page_from_freelist o Alter check so that zone id embedded may still be used on UP o Use Kamezawa-san's suggestion for skipping zones in zonelist o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This removes the need for MPOL_BIND to have a custom zonelist o Move zonelist iterators and helpers to mm.h o Change _zones from struct zone * to unsigned long Changelog since V2 o shrink_zones() uses zonelist instead of zonelist->zones o hugetlb uses zonelist iterator o zone_idx information is embedded in zonelist pointers o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid) Changelog since V1 o Break up the patch into 3 patches o Introduce iterators for zonelists o Performance regression test The following patches replace multiple zonelists per node with one zonelist that is filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE issue in 2.6.23 to the previously discussed hack that filters only custom zonelists. As a bonus, the patchset reduces the cache footprint of the kernel and should improve performance in a number of cases. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask, which simplifies patches later in the set. The third patch replaces multiple zonelists with two zonelists that are filtered.
The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fourth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The fifth patch introduces filtering of the zonelists based on a nodemask. The final patch replaces the two zonelists with one zonelist. A nodemask is created when __GFP_THISNODE is specified to filter the list. The nodelists could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE is used often enough to be worth the effort. Performance results varied depending on the machine configuration but were usually small performance gains. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.23-rc3-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. Total CPU time on Kernbench: -0.67% to 3.05% Elapsed time on Kernbench: -0.25% to 2.96% page_test from aim9: -6.98% to 5.60% brk_test from aim9: -3.94% to 4.11% fork_test from aim9: -5.72% to 4.14% exec_test from aim9: -1.02% to 1.56% The TBench figures were too variable between runs to draw conclusions from but there didn't appear to be any regressions there. The hackbench results for both sockets and pipes were within noise. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 27+ messages in thread
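The core mechanism the final patch describes -- keeping one full zonelist per
node and skipping entries whose zone type is above what the GFP mask allows --
can be sketched as a stand-alone userspace model. Everything below (the
simplified zone names, `next_usable_zone()`, the NULL-terminated array) is an
invented illustration, not the kernel's actual API:

```c
#include <stddef.h>

/* Simplified model: each node keeps one zonelist ordered from most- to
 * least-preferred zone, and callers skip entries above the highest zone
 * type their GFP mask permits. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE,
		 MAX_NR_ZONES };

struct zone {
	enum zone_type idx;	/* zone type of this zone */
	int nid;		/* node the zone belongs to */
};

struct zonelist {
	struct zone *zones[MAX_NR_ZONES + 1];	/* NULL-terminated */
};

/* Return the next usable zone at or below highest_idx, starting at
 * *cursor; the cursor is advanced past the returned entry so repeated
 * calls walk the whole list. Returns NULL when the list is exhausted. */
struct zone *next_usable_zone(struct zonelist *zl, int *cursor,
			      enum zone_type highest_idx)
{
	struct zone *z;

	while ((z = zl->zones[*cursor]) != NULL) {
		(*cursor)++;
		if (z->idx <= highest_idx)
			return z;
	}
	return NULL;
}
```

A caller loops on next_usable_zone() with a cursor starting at zero, which
gives the same effect as the old per-GFP-type zonelists without storing a
separate list for every zone type.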
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2007-11-09 14:32 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9 Mel Gorman
@ 2007-11-09 14:32 ` Mel Gorman
  0 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2007-11-09 14:32 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
      nacc, kamezawa.hiroyu, clameter

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |   21 ++++++++++++---------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-b1106/include/linux/swap.h linux-2.6.24-rc1-mm-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.24-rc1-mm-b1106/include/linux/swap.h	2007-11-08 19:04:09.000000000 +0000
+++ linux-2.6.24-rc1-mm-005_freepages_zonelist/include/linux/swap.h	2007-11-08 19:05:07.000000000 +0000
@@ -181,7 +181,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 					gfp_t gfp_mask);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-b1106/mm/page_alloc.c linux-2.6.24-rc1-mm-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.24-rc1-mm-b1106/mm/page_alloc.c	2007-11-08 19:04:17.000000000 +0000
+++ linux-2.6.24-rc1-mm-005_freepages_zonelist/mm/page_alloc.c	2007-11-08 19:05:07.000000000 +0000
@@ -1647,7 +1647,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-b1106/mm/vmscan.c linux-2.6.24-rc1-mm-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.24-rc1-mm-b1106/mm/vmscan.c	2007-11-08 19:04:17.000000000 +0000
+++ linux-2.6.24-rc1-mm-005_freepages_zonelist/mm/vmscan.c	2007-11-08 19:06:49.000000000 +0000
@@ -1216,10 +1216,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	sc->all_unreclaimable = 1;
@@ -1257,8 +1258,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
-					struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+					gfp_t gfp_mask, struct scan_control *sc)
 {
 	int priority;
 	int ret = 0;
@@ -1266,6 +1267,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	count_vm_event(ALLOCSTALL);
@@ -1285,7 +1287,7 @@ static unsigned long do_try_to_free_page
 		sc->nr_io_pages = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1344,7 +1346,8 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1357,7 +1360,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return do_try_to_free_pages(zonelist, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1376,11 +1379,11 @@ unsigned long try_to_free_mem_cgroup_pag
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
 	int node = numa_node_id();
-	struct zone **zones;
+	struct zonelist *zonelist;
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
-	zones = NODE_DATA(node)->node_zonelists[target_zone].zones;
-	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+	zonelist = &NODE_DATA(node)->node_zonelists[target_zone];
+	if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 		return 1;
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v10
@ 2007-11-21  0:38 Mel Gorman
  2007-11-21  0:39 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2007-11-21  0:38 UTC (permalink / raw)
  To: Lee.Schermerhorn, clameter
  Cc: Mel Gorman, linux-kernel, kamezawa.hiroyu, linux-mm

This release brings the number of zonelists to two instead of one. Getting
all the corner cases right for __GFP_THISNODE and one zonelist was turning
into a complicated mess. Not only was it affecting too many paths, it
reached the point where it should be reviewed as a standalone change. Many
of the aims of the earlier sets are met by having two zonelists: the hack
is still removed, the number of zonelists is reduced and the MPOL_BIND
policy still behaves sensibly. I believe this to be a reasonable starting
point, leaving the full one-zonelist approach to be tackled later.

A few bugs and issues highlighted by review have been fixed up and are
briefly described in the changelog. There are concerns over the stability
of mainline and -mm at the moment, and the evidence is on
http://test.kernel.org, so we should verify for sure it is still ok.

The set passes a slightly modified numactl regression test on x86_64. The
slight modification was required because numastat behaves differently than
the regression test expects (nodes in reverse order). Lee, can you confirm
it still hasn't regressed with your tests before another attempt is made to
push it, please?

Changelog since V9
o Rebase to 2.6.24-rc2-mm1
o Lookup the nodemask for each allocator callsite in mempolicy.c
o Update NUMA statistics based on preferred zone, not first zonelist entry
o When __GFP_THISNODE is specified with MPOL_BIND and the current node is
  not in the allowed nodemask, the first node in the mask will be used
o Stick with using two zonelists instead of one because of excessive
  complexity with corner cases

Changelog since V8
o Rebase to 2.6.24-rc2
o Added ack for the OOM changes
o Behave correctly when GFP_THISNODE and a node ID are specified
o Clear up warning over type of nodes_intersects() function

Changelog since V7
o Rebase to 2.6.23-rc8-mm2

Changelog since V6
o Fix build bug in relation to memory controller combined with one-zonelist
o Use while() instead of a stupid-looking for()
o Instead of encoding zone index information in a pointer, this version
  introduces a structure that stores a zone pointer and its index

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patch collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines; noticeably better performance on gcc-3.4,
  no difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
  currently inconclusive, but it definitely consumes at least one cache line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
  removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o Replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE. This should be considered an alternative fix for the
MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack
that filters only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask, which simplifies patches later in the set. The third patch
replaces multiple zonelists with two zonelists that are filtered. The two
zonelists are needed because the memoryless patchset introduces a second set
of zonelists for __GFP_THISNODE. The fourth patch introduces helper macros
for retrieving the zone and node indices of entries in a zonelist. The final
patch introduces filtering of the zonelists based on a nodemask. Two
zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads, the gain/loss will
depend on how much the userspace portion of the benchmark benefits from
having more cache available due to reduced referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.24-rc2-mm1. The set and these machines are a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

                             loss    to gain
Total CPU time on Kernbench: -1.54% to  0.54%
Elapsed   time on Kernbench: -0.75% to  0.42%
page_test from aim9:         -8.23% to 10.71%
brk_test  from aim9:         -3.32% to  4.78%
fork_test from aim9:         -0.44% to  0.38%
exec_test from aim9:         -0.95% to  1.11%

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 27+ messages in thread
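As a rough illustration of the two-zonelist arrangement this version settles
on, one can model a node carrying a general fallback list plus a node-local
list that is selected when __GFP_THISNODE semantics are requested. The
`model_*` struct and helper names below are invented stand-ins for
illustration, not the kernel's actual types:

```c
#include <stdbool.h>

/* Model of two zonelists per node: index 0 is the full fallback list
 * ordering zones across all nodes, index 1 contains only the local
 * node's zones and is used when the caller demands node-local memory
 * (the __GFP_THISNODE case). */
#define MODEL_FALLBACK_ZONELIST	0
#define MODEL_THISNODE_ZONELIST	1

struct model_zonelist {
	int zone_ids[8];	/* placeholder payload */
};

struct model_pgdat {
	struct model_zonelist node_zonelists[2];
};

/* Pick the zonelist for an allocation: the node-local list when
 * __GFP_THISNODE-style behaviour is requested, otherwise the full
 * fallback list. This mirrors the role of node_zonelist() in the set. */
struct model_zonelist *model_node_zonelist(struct model_pgdat *pgdat,
					   bool gfp_thisnode)
{
	int idx = gfp_thisnode ? MODEL_THISNODE_ZONELIST
			       : MODEL_FALLBACK_ZONELIST;
	return &pgdat->node_zonelists[idx];
}
```

The design point is that only the list *selection* depends on the GFP mask;
which zones within the chosen list are usable is decided during iteration.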
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2007-11-21  0:38 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v10 Mel Gorman
@ 2007-11-21  0:39 ` Mel Gorman
  0 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2007-11-21  0:39 UTC (permalink / raw)
  To: Lee.Schermerhorn, clameter
  Cc: Mel Gorman, linux-kernel, kamezawa.hiroyu, linux-mm

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |   21 ++++++++++++---------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-hotfixes/include/linux/swap.h linux-2.6.24-rc2-mm1-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.24-rc2-mm1-hotfixes/include/linux/swap.h	2007-11-15 11:28:03.000000000 +0000
+++ linux-2.6.24-rc2-mm1-005_freepages_zonelist/include/linux/swap.h	2007-11-20 23:25:22.000000000 +0000
@@ -181,7 +181,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 					gfp_t gfp_mask);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-hotfixes/mm/page_alloc.c linux-2.6.24-rc2-mm1-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.24-rc2-mm1-hotfixes/mm/page_alloc.c	2007-11-15 11:28:11.000000000 +0000
+++ linux-2.6.24-rc2-mm1-005_freepages_zonelist/mm/page_alloc.c	2007-11-20 23:25:22.000000000 +0000
@@ -1619,7 +1619,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc2-mm1-hotfixes/mm/vmscan.c linux-2.6.24-rc2-mm1-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.24-rc2-mm1-hotfixes/mm/vmscan.c	2007-11-15 11:28:11.000000000 +0000
+++ linux-2.6.24-rc2-mm1-005_freepages_zonelist/mm/vmscan.c	2007-11-20 23:25:22.000000000 +0000
@@ -1216,10 +1216,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	sc->all_unreclaimable = 1;
@@ -1257,8 +1258,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
-					struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+					gfp_t gfp_mask, struct scan_control *sc)
 {
 	int priority;
 	int ret = 0;
@@ -1266,6 +1267,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	count_vm_event(ALLOCSTALL);
@@ -1285,7 +1287,7 @@ static unsigned long do_try_to_free_page
 		sc->nr_io_pages = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1344,7 +1346,8 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1357,7 +1360,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return do_try_to_free_pages(zonelist, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1376,11 +1379,11 @@ unsigned long try_to_free_mem_cgroup_pag
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
 	int node = numa_node_id();
-	struct zone **zones;
+	struct zonelist *zonelist;
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
-	zones = NODE_DATA(node)->node_zonelists[target_zone].zones;
-	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+	zonelist = &NODE_DATA(node)->node_zonelists[target_zone];
+	if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 		return 1;
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r2
@ 2007-12-11 20:21 Mel Gorman
  2007-12-11 20:22 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2007-12-11 20:21 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
      kamezawa.hiroyu, clameter

This is a rebase of the two-zonelist patchset to 2.6.24-rc4-mm1 with some
warnings cleared up. The warnings were not picked up before because they
were introduced early in the set and cleared up by the end. This might have
hurt bisecting, so they were worth fixing even if the end result was
correct.

Tests looked good, both numactltest (slightly modified) and performance
tests. I believe Lee has been testing heavily with a version of the
patchset almost identical to this and hasn't complained. If Lee is happy
enough, can you merge these to -mm for wider testing please, Andrew?

Changelog since V10
o Rebase to 2.6.24-rc4-mm1
o Clear up warnings in fs/buffer.c early in the patchset

Changelog since V9
o Rebase to 2.6.24-rc2-mm1
o Lookup the nodemask for each allocator callsite in mempolicy.c
o Update NUMA statistics based on preferred zone, not first zonelist entry
o When __GFP_THISNODE is specified with MPOL_BIND and the current node is
  not in the allowed nodemask, the first node in the mask will be used
o Stick with using two zonelists instead of one because of excessive
  complexity with corner cases

Changelog since V8
o Rebase to 2.6.24-rc2
o Added ack for the OOM changes
o Behave correctly when GFP_THISNODE and a node ID are specified
o Clear up warning over type of nodes_intersects() function

Changelog since V7
o Rebase to 2.6.23-rc8-mm2

Changelog since V6
o Fix build bug in relation to memory controller combined with one-zonelist
o Use while() instead of a stupid-looking for()
o Instead of encoding zone index information in a pointer, this version
  introduces a structure that stores a zone pointer and its index

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patch collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines; noticeably better performance on gcc-3.4,
  no difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
  currently inconclusive, but it definitely consumes at least one cache line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
  removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o Replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE. This should be considered an alternative fix for the
MPOL_BIND+ZONE_MOVABLE problem in 2.6.23 to the previously discussed hack
that filters only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask, which simplifies patches later in the set. The third patch
replaces multiple zonelists with two zonelists that are filtered. The two
zonelists are needed because the memoryless patchset introduces a second set
of zonelists for __GFP_THISNODE. The fourth patch introduces helper macros
for retrieving the zone and node indices of entries in a zonelist. The final
patch introduces filtering of the zonelists based on a nodemask. Two
zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration. In real
workloads, the gain/loss will depend on how much the userspace portion of
the benchmark benefits from having more cache available due to reduced
referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

                             loss    to gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 27+ messages in thread
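The nodemask filtering that the final patch adds amounts to one extra check
in the zonelist walk: an entry is only returned when its node is set in the
caller-supplied mask, which is how MPOL_BIND can restrict allocations
without a custom zonelist. A simplified stand-alone sketch with invented
types (the kernel uses nodemask_t and real zone structures, not a plain
bitmask):

```c
#include <stddef.h>

#define MODEL_MAX_NODES 8

struct model_zone {
	int nid;	/* node this zone belongs to */
	int zone_idx;	/* zone type index */
};

/* Return the next zonelist entry whose node is allowed by 'nodemask'
 * (bit n set => node n allowed), starting at *cursor over a
 * NULL-terminated entry array; the cursor is advanced past the
 * returned entry. Returns NULL when no allowed entry remains. */
struct model_zone *model_next_allowed(struct model_zone **entries,
				      int *cursor, unsigned int nodemask)
{
	struct model_zone *z;

	while ((z = entries[*cursor]) != NULL) {
		(*cursor)++;
		if (nodemask & (1u << z->nid))
			return z;
	}
	return NULL;
}
```

With a mask of all ones this degenerates to a plain zonelist walk, which is
why the unfiltered and MPOL_BIND cases can share one iterator.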
* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2007-12-11 20:21 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r2 Mel Gorman
@ 2007-12-11 20:22 ` Mel Gorman
  0 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2007-12-11 20:22 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
      kamezawa.hiroyu, clameter

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---
 fs/buffer.c          |    8 ++++----
 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |   21 ++++++++++++---------
 4 files changed, 18 insertions(+), 15 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc4-mm1-clean/fs/buffer.c linux-2.6.24-rc4-mm1-005_freepages_zonelist/fs/buffer.c
--- linux-2.6.24-rc4-mm1-clean/fs/buffer.c	2007-12-07 12:14:06.000000000 +0000
+++ linux-2.6.24-rc4-mm1-005_freepages_zonelist/fs/buffer.c	2007-12-07 15:13:16.000000000 +0000
@@ -368,16 +368,16 @@ void invalidate_bdev(struct block_device
  */
 static void free_more_memory(void)
 {
-	struct zone **zones;
+	struct zonelist *zonelist;
 	pg_data_t *pgdat;
 
 	wakeup_pdflush(1024);
 	yield();
 
 	for_each_online_pgdat(pgdat) {
-		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
-		if (*zones)
-			try_to_free_pages(zones, 0, GFP_NOFS);
+		zonelist = &pgdat->node_zonelists[gfp_zone(GFP_NOFS)];
+		if (zonelist->zones[0])
+			try_to_free_pages(zonelist, 0, GFP_NOFS);
 	}
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc4-mm1-clean/include/linux/swap.h linux-2.6.24-rc4-mm1-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.24-rc4-mm1-clean/include/linux/swap.h	2007-12-07 12:14:07.000000000 +0000
+++ linux-2.6.24-rc4-mm1-005_freepages_zonelist/include/linux/swap.h	2007-12-07 12:17:22.000000000 +0000
@@ -181,7 +181,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 					gfp_t gfp_mask);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc4-mm1-clean/mm/page_alloc.c linux-2.6.24-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.24-rc4-mm1-clean/mm/page_alloc.c	2007-12-07 12:14:07.000000000 +0000
+++ linux-2.6.24-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c	2007-12-07 12:17:22.000000000 +0000
@@ -1624,7 +1624,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc4-mm1-clean/mm/vmscan.c linux-2.6.24-rc4-mm1-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.24-rc4-mm1-clean/mm/vmscan.c	2007-12-07 12:14:07.000000000 +0000
+++ linux-2.6.24-rc4-mm1-005_freepages_zonelist/mm/vmscan.c	2007-12-07 12:19:14.000000000 +0000
@@ -1267,10 +1267,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
@@ -1322,8 +1323,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
-					struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+					gfp_t gfp_mask, struct scan_control *sc)
 {
 	int priority;
 	int ret = 0;
@@ -1331,6 +1332,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	count_vm_event(ALLOCSTALL);
@@ -1354,7 +1356,7 @@ static unsigned long do_try_to_free_page
 		sc->nr_io_pages = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1419,7 +1421,8 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1432,7 +1435,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return do_try_to_free_pages(zonelist, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1450,11 +1453,11 @@ unsigned long try_to_free_mem_cgroup_pag
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
-	struct zone **zones;
+	struct zonelist *zonelist;
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
-	zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
-	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+	zonelist = &NODE_DATA(numa_node_id())->node_zonelists[target_zone];
+	if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 		return 1;
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2007-12-11 20:22 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman
2007-08-17 20:17 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-08-17 20:17 ` [PATCH 2/6] Use one zonelist that is filtered instead of multiple zonelists Mel Gorman
2007-08-17 20:59   ` Christoph Lameter
2007-08-21  8:51     ` Mel Gorman
2007-08-17 20:17 ` [PATCH 3/6] Embed zone_id information within the zonelist->zones pointer Mel Gorman
2007-08-17 21:02   ` Christoph Lameter
2007-08-21  8:54     ` Mel Gorman
2007-08-17 20:18 ` [PATCH 4/6] Record how many zones can be safely skipped in the zonelist Mel Gorman
2007-08-17 21:03   ` Christoph Lameter
2007-08-21  8:58     ` Mel Gorman
2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-08-17 21:29   ` Christoph Lameter
2007-08-21  9:12     ` Mel Gorman
2007-08-17 20:18 ` [PATCH 6/6] Do not use FASTCALL for __alloc_pages_nodemask() Mel Gorman
2007-08-17 21:07   ` Christoph Lameter
2007-08-18 12:51   ` Andi Kleen
2007-08-21 10:25     ` Mel Gorman
2007-08-31 20:51 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman
2007-08-31 20:51 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-11 15:19 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman
2007-09-11 15:19 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-11 21:30 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend) Mel Gorman
2007-09-11 21:30 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-12 21:04 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6 Mel Gorman
2007-09-12 21:05 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-13 17:52 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7 Mel Gorman
2007-09-13 17:52 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-11-09 14:32 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9 Mel Gorman
2007-11-09 14:32 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-11-21  0:38 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v10 Mel Gorman
2007-11-21  0:39 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-12-11 20:21 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r2 Mel Gorman
2007-12-11 20:22 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman