From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 346E26B004D for ; Tue, 25 Aug 2009 15:56:57 -0400 (EDT) Received: from spaceape14.eur.corp.google.com (spaceape14.eur.corp.google.com [172.28.16.148]) by smtp-out.google.com with ESMTP id n7PJug9s001992 for ; Tue, 25 Aug 2009 12:57:01 -0700 Received: from pxi39 (pxi39.prod.google.com [10.243.27.39]) by spaceape14.eur.corp.google.com with ESMTP id n7P8Abcx019730 for ; Tue, 25 Aug 2009 01:12:33 -0700 Received: by pxi39 with SMTP id 39so5720824pxi.8 for ; Tue, 25 Aug 2009 01:10:37 -0700 (PDT) Date: Tue, 25 Aug 2009 01:10:34 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 1/5] hugetlb: rework hstate_next_node_* functions In-Reply-To: <20090824192544.10317.6291.sendpatchset@localhost.localdomain> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192544.10317.6291.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Andrew Morton , Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > [PATCH 1/5] hugetlb: rework hstate_next_node* functions > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: > + cleaned up comments, removed some deemed unnecessary, > add some suggested by review > + removed check for !current in huge_mpol_nodes_allowed(). > + added 'current->comm' to warning message in huge_mpol_nodes_allowed(). > + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to > catch out of range node id. > + add examples to patch description > > V3: > + factored this "cleanup" patch out of V2 patch 2/3 > + moved ahead of patch to add nodes_allowed mask to alloc funcs > as this patch is somewhat independent from using task mempolicy > to control huge page allocation and freeing. > > Modify the hstate_next_node* functions to allow them to be called to > obtain the "start_nid". Then, whereas prior to this patch we > unconditionally called hstate_next_node_to_{alloc|free}(), whether > or not we successfully allocated/freed a huge page on the node, > now we only call these functions on failure to alloc/free to advance > to next allowed node. > > Factor out the next_node_allowed() function to handle wrap at end > of node_online_map. In this version, the allowed nodes include all > of the online nodes. > > Reviewed-by: Mel Gorman > Signed-off-by: Lee Schermerhorn Acked-by: David Rientjes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
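To make the wrap-at-end traversal described above concrete, here is a minimal userspace sketch; a plain unsigned long stands in for nodemask_t, and MAX_NUMNODES, the mask value and the helper's shape are illustrative assumptions rather than the kernel implementation:

        #include <stdio.h>

        #define MAX_NUMNODES 8

        /*
         * next allowed node strictly after nid, wrapping at MAX_NUMNODES
         * the way next_node()/first_node() do; assumes mask is non-empty
         */
        static int next_node_allowed(int nid, unsigned long mask)
        {
                int next = nid;

                do {
                        next++;
                        if (next == MAX_NUMNODES)
                                next = 0;       /* wrap at end */
                } while (!(mask & (1UL << next)));

                return next;
        }

        int main(void)
        {
                unsigned long online = 0x0d;    /* nodes 0, 2 and 3 "online" */
                int nid = 0, i;

                for (i = 0; i < 6; i++) {
                        nid = next_node_allowed(nid, online);
                        printf("%d ", nid);     /* prints: 2 3 0 2 3 0 */
                }
                printf("\n");
                return 0;
        }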
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id EA88C6B00A0 for ; Tue, 25 Aug 2009 16:05:24 -0400 (EDT) Received: from spaceape8.eur.corp.google.com (spaceape8.eur.corp.google.com [172.28.16.142]) by smtp-out.google.com with ESMTP id n7PK5NFS004589 for ; Tue, 25 Aug 2009 13:05:23 -0700 Received: from pzk36 (pzk36.prod.google.com [10.243.19.164]) by spaceape8.eur.corp.google.com with ESMTP id n7P8GSwP019964 for ; Tue, 25 Aug 2009 01:18:22 -0700 Received: by pzk36 with SMTP id 36so1455272pzk.12 for ; Tue, 25 Aug 2009 01:16:27 -0700 (PDT) Date: Tue, 25 Aug 2009 01:16:26 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns In-Reply-To: <20090824192637.10317.31039.sendpatchset@localhost.localdomain> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > [PATCH 2/4] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V3: > + moved this patch to after the "rework" of hstate_next_node_to_... > functions as this patch is more specific to using task mempolicy > to control huge page allocation and freeing. > > In preparation for constraining huge page allocation and freeing by the > controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer > to the allocate, free and surplus adjustment functions. For now, pass > NULL to indicate default behavior--i.e., use node_online_map. A > subsequent patch will derive a non-default mask from the controlling > task's numa mempolicy. > > Reviewed-by: Mel Gorman > Signed-off-by: Lee Schermerhorn > > mm/hugetlb.c | 102 ++++++++++++++++++++++++++++++++++++++--------------------- > 1 file changed, 67 insertions(+), 35 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > } > > /* > - * common helper function for hstate_next_node_to_{alloc|free}. > - * return next node in node_online_map, wrapping at end. > + * common helper functions for hstate_next_node_to_{alloc|free}. > + * We may have allocated or freed a huge page based on a different > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > + * be outside of *nodes_allowed. Ensure that we use the next > + * allowed node for alloc or free. 
> */ > -static int next_node_allowed(int nid) > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > { > - nid = next_node(nid, node_online_map); > + nid = next_node(nid, *nodes_allowed); > if (nid == MAX_NUMNODES) > - nid = first_node(node_online_map); > + nid = first_node(*nodes_allowed); > VM_BUG_ON(nid >= MAX_NUMNODES); > > return nid; > } > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > +{ > + if (!node_isset(nid, *nodes_allowed)) > + nid = next_node_allowed(nid, nodes_allowed); > + return nid; > +} Awkward name considering this doesn't simply return true or false as expected, it returns a nid. > + > /* > * Use a helper variable to find the next node and then > * copy it back to next_nid_to_alloc afterwards: > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > * But we don't need to use a spin_lock here: it really > * doesn't matter if occasionally a racer chooses the > - * same nid as we do. Move nid forward in the mask even > - * if we just successfully allocated a hugepage so that > - * the next caller gets hugepages on the next node. > + * same nid as we do. Move nid forward in the mask whether > + * or not we just successfully allocated a hugepage so that > + * the next allocation addresses the next node. > */ > -static int hstate_next_node_to_alloc(struct hstate *h) > +static int hstate_next_node_to_alloc(struct hstate *h, > + nodemask_t *nodes_allowed) > { > int nid, next_nid; > > - nid = h->next_nid_to_alloc; > - next_nid = next_node_allowed(nid); > + if (!nodes_allowed) > + nodes_allowed = &node_online_map; > + > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > + > + next_nid = next_node_allowed(nid, nodes_allowed); > h->next_nid_to_alloc = next_nid; > + > return nid; > } Don't need next_nid. > -static int alloc_fresh_huge_page(struct hstate *h) > +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) > { > struct page *page; > int start_nid; > int next_nid; > int ret = 0; > > - start_nid = hstate_next_node_to_alloc(h); > + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); > next_nid = start_nid; > > do { > @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct > ret = 1; > break; > } > - next_nid = hstate_next_node_to_alloc(h); > + next_nid = hstate_next_node_to_alloc(h, nodes_allowed); > } while (next_nid != start_nid); > > if (ret) > @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct > * whether or not we find a free huge page to free so that the > * next attempt to free addresses the next node. > */ > -static int hstate_next_node_to_free(struct hstate *h) > +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) > { > int nid, next_nid; > > - nid = h->next_nid_to_free; > - next_nid = next_node_allowed(nid); > + if (!nodes_allowed) > + nodes_allowed = &node_online_map; > + > + nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); > + > + next_nid = next_node_allowed(nid, nodes_allowed); > h->next_nid_to_free = next_nid; > + > return nid; > } Same. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
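Following up the two "Don't need next_nid" comments above, a sketch of how the helper variable could be dropped while still publishing only a fully wrapped value with a single store (the race the original helper-variable comment guards against); an illustrative suggestion, not the merged code:

        static int hstate_next_node_to_alloc(struct hstate *h,
                                                nodemask_t *nodes_allowed)
        {
                int nid;

                if (!nodes_allowed)
                        nodes_allowed = &node_online_map;

                nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
                /*
                 * next_node_allowed() is evaluated completely before the
                 * store, so a racing reader never sees an invalid
                 * intermediate nid such as MAX_NUMNODES.
                 */
                h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

                return nid;
        }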
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id D4A9D6B00B5 for ; Tue, 25 Aug 2009 16:35:21 -0400 (EDT) Received: from spaceape11.eur.corp.google.com (spaceape11.eur.corp.google.com [172.28.16.145]) by smtp-out.google.com with ESMTP id n7PKZH1E009306 for ; Tue, 25 Aug 2009 21:35:17 +0100 Received: from pxi32 (pxi32.prod.google.com [10.243.27.32]) by spaceape11.eur.corp.google.com with ESMTP id n7P8lsqW004038 for ; Tue, 25 Aug 2009 01:49:51 -0700 Received: by pxi32 with SMTP id 32so5132985pxi.25 for ; Tue, 25 Aug 2009 01:47:53 -0700 (PDT) Date: Tue, 25 Aug 2009 01:47:52 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy In-Reply-To: <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > This patch derives a "nodes_allowed" node mask from the numa > mempolicy of the task modifying the number of persistent huge > pages to control the allocation, freeing and adjusting of surplus > huge pages. This mask is derived as follows: > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > is produced. This will cause the hugetlb subsystem to use > node_online_map as the "nodes_allowed". This preserves the > behavior before this patch. > * For "preferred" mempolicy, including explicit local allocation, > a nodemask with the single preferred node will be produced. > "local" policy will NOT track any internode migrations of the > task adjusting nr_hugepages. > * For "bind" and "interleave" policy, the mempolicy's nodemask > will be used. > * Other than to inform the construction of the nodes_allowed node > mask, the actual mempolicy mode is ignored. That is, all modes > behave like interleave over the resulting nodes_allowed mask > with no "fallback". > > Notes: > > 1) This patch introduces a subtle change in behavior: huge page > allocation and freeing will be constrained by any mempolicy > that the task adjusting the huge page pool inherits from its > parent. This policy could come from a distant ancestor. The > adminstrator adjusting the huge page pool without explicitly > specifying a mempolicy via numactl might be surprised by this. > Additionaly, any mempolicy specified by numactl will be > constrained by the cpuset in which numactl is invoked. > > 2) Hugepages allocated at boot time use the node_online_map. > An additional patch could implement a temporary boot time > huge pages nodes_allowed command line parameter. > > 3) Using mempolicy to control persistent huge page allocation > and freeing requires no change to hugeadm when invoking > it via numactl, as shown in the examples below. However, > hugeadm could be enhanced to take the allowed nodes as an > argument and set its task mempolicy itself. This would allow > it to detect and warn about any non-default mempolicy that it > inherited from its parent, thus alleviating the issue described > in Note 1 above. 
> > See the updated documentation [next patch] for more information > about the implications of this patch. > > Examples: > > Starting with: > > Node 0 HugePages_Total: 0 > Node 1 HugePages_Total: 0 > Node 2 HugePages_Total: 0 > Node 3 HugePages_Total: 0 > > Default behavior [with or without this patch] balances persistent > hugepage allocation across nodes [with sufficient contiguous memory]: > > hugeadm --pool-pages-min=2048Kb:32 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 8 > Node 3 HugePages_Total: 8 > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > '--membind' because it allows multiple nodes to be specified > and it's easy to type]--we can allocate huge pages on > individual nodes or sets of nodes. So, starting from the > condition above, with 8 huge pages per node: > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The incremental 8 huge pages were restricted to node 2 by the > specified mempolicy. > > Similarly, we can use mempolicy to free persistent huge pages > from specified nodes: > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > yields: > > Node 0 HugePages_Total: 4 > Node 1 HugePages_Total: 4 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The 8 huge pages freed were balanced over nodes 0 and 1. > > Signed-off-by: Lee Schermerhorn > > include/linux/mempolicy.h | 3 ++ > mm/hugetlb.c | 14 ++++++---- > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 73 insertions(+), 5 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > } > return zl; > } > + > +/* > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > + * > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > + * to constraing the allocation and freeing of persistent huge pages > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > + * 'bind' policy in this context. An attempt to allocate a persistent huge > + * page will never "fallback" to another node inside the buddy system > + * allocator. > + * > + * If the task's mempolicy is "default" [NULL], just return NULL for > + * default behavior. Otherwise, extract the policy nodemask for 'bind' > + * or 'interleave' policy or construct a nodemask for 'preferred' or > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. > + * > + * N.B., it is the caller's responsibility to free a returned nodemask. > + */ > +nodemask_t *huge_mpol_nodes_allowed(void) > +{ > + nodemask_t *nodes_allowed = NULL; > + struct mempolicy *mempolicy; > + int nid; > + > + if (!current->mempolicy) > + return NULL; > + > + mpol_get(current->mempolicy); > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); I don't think using '\n' inside printk's is allowed anymore. 
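A sketch of the warning collapsed to a single line per that comment; the exact wording is illustrative:

        printk(KERN_WARNING "%s: unable to allocate nodes_allowed mask "
                "for huge page allocation, falling back to default\n",
                current->comm);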
> + goto out; > + } > + nodes_clear(*nodes_allowed); > + > + mempolicy = current->mempolicy; > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + if (mempolicy->flags & MPOL_F_LOCAL) > + nid = numa_node_id(); > + else > + nid = mempolicy->v.preferred_node; > + node_set(nid, *nodes_allowed); > + break; > + > + case MPOL_BIND: > + /* Fall through */ > + case MPOL_INTERLEAVE: > + *nodes_allowed = mempolicy->v.nodes; > + break; > + > + default: > + BUG(); > + } > + > +out: > + mpol_put(current->mempolicy); > + return nodes_allowed; > +} This should be all unnecessary, see below. > #endif > > /* Allocate a page in interleaved policy. > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > +extern nodemask_t *huge_mpol_nodes_allowed(void); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > return node_zonelist(0, gfp_flags); > } > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > + > static inline int do_migrate_pages(struct mm_struct *mm, > const nodemask_t *from_nodes, > const nodemask_t *to_nodes, int flags) > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > { > unsigned long min_count, ret; > + nodemask_t *nodes_allowed; > > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > Why can't you simply do this? struct mempolicy *pol = NULL; nodemask_t *nodes_allowed = &node_online_map; local_irq_disable(); pol = current->mempolicy; mpol_get(pol); local_irq_enable(); if (pol) { switch (pol->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: nodes_allowed = pol->v.nodes; break; case MPOL_PREFERRED: ... use NODEMASK_SCRATCH() ... default: BUG(); } } mpol_put(pol); and then use nodes_allowed throughout set_max_huge_pages()? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
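A fleshed-out sketch of the suggestion above, using only the mempolicy internals already quoted in this thread (mode, flags, v.preferred_node, v.nodes). Note that v.nodes is a nodemask_t, so the pointer form would be &pol->v.nodes; the preferred/local case is shown with a local mask rather than NODEMASK_SCRATCH() for brevity, and the pool-resize loops and locking details are elided:

        static unsigned long set_max_huge_pages(struct hstate *h,
                                                unsigned long count)
        {
                nodemask_t *nodes_allowed = &node_online_map;
                struct mempolicy *pol;
                nodemask_t preferred;

                if (h->order >= MAX_ORDER)
                        return h->max_huge_pages;

                pol = current->mempolicy;
                mpol_get(pol);                  /* NULL-safe */
                if (pol) {
                        switch (pol->mode) {
                        case MPOL_BIND:
                        case MPOL_INTERLEAVE:
                                nodes_allowed = &pol->v.nodes;
                                break;
                        case MPOL_PREFERRED:
                                nodes_clear(preferred);
                                node_set((pol->flags & MPOL_F_LOCAL) ?
                                         numa_node_id() : pol->v.preferred_node,
                                         preferred);
                                nodes_allowed = &preferred;
                                break;
                        default:
                                BUG();
                        }
                }

                /* ... existing grow/shrink loops, passing nodes_allowed ... */

                mpol_put(pol);
                return h->max_huge_pages;
        }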
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id D99B56B00BA for ; Tue, 25 Aug 2009 16:49:03 -0400 (EDT) Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy From: Lee Schermerhorn In-Reply-To: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Content-Type: text/plain Date: Tue, 25 Aug 2009 16:49:07 -0400 Message-Id: <1251233347.16229.0.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 2009-08-25 at 01:47 -0700, David Rientjes wrote: > On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > > > This patch derives a "nodes_allowed" node mask from the numa > > mempolicy of the task modifying the number of persistent huge > > pages to control the allocation, freeing and adjusting of surplus > > huge pages. This mask is derived as follows: > > > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > > is produced. This will cause the hugetlb subsystem to use > > node_online_map as the "nodes_allowed". This preserves the > > behavior before this patch. > > * For "preferred" mempolicy, including explicit local allocation, > > a nodemask with the single preferred node will be produced. > > "local" policy will NOT track any internode migrations of the > > task adjusting nr_hugepages. > > * For "bind" and "interleave" policy, the mempolicy's nodemask > > will be used. > > * Other than to inform the construction of the nodes_allowed node > > mask, the actual mempolicy mode is ignored. That is, all modes > > behave like interleave over the resulting nodes_allowed mask > > with no "fallback". > > > > Notes: > > > > 1) This patch introduces a subtle change in behavior: huge page > > allocation and freeing will be constrained by any mempolicy > > that the task adjusting the huge page pool inherits from its > > parent. This policy could come from a distant ancestor. The > > adminstrator adjusting the huge page pool without explicitly > > specifying a mempolicy via numactl might be surprised by this. > > Additionaly, any mempolicy specified by numactl will be > > constrained by the cpuset in which numactl is invoked. > > > > 2) Hugepages allocated at boot time use the node_online_map. > > An additional patch could implement a temporary boot time > > huge pages nodes_allowed command line parameter. > > > > 3) Using mempolicy to control persistent huge page allocation > > and freeing requires no change to hugeadm when invoking > > it via numactl, as shown in the examples below. However, > > hugeadm could be enhanced to take the allowed nodes as an > > argument and set its task mempolicy itself. This would allow > > it to detect and warn about any non-default mempolicy that it > > inherited from its parent, thus alleviating the issue described > > in Note 1 above. > > > > See the updated documentation [next patch] for more information > > about the implications of this patch. 
> > > > Examples: > > > > Starting with: > > > > Node 0 HugePages_Total: 0 > > Node 1 HugePages_Total: 0 > > Node 2 HugePages_Total: 0 > > Node 3 HugePages_Total: 0 > > > > Default behavior [with or without this patch] balances persistent > > hugepage allocation across nodes [with sufficient contiguous memory]: > > > > hugeadm --pool-pages-min=2048Kb:32 > > > > yields: > > > > Node 0 HugePages_Total: 8 > > Node 1 HugePages_Total: 8 > > Node 2 HugePages_Total: 8 > > Node 3 HugePages_Total: 8 > > > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > > '--membind' because it allows multiple nodes to be specified > > and it's easy to type]--we can allocate huge pages on > > individual nodes or sets of nodes. So, starting from the > > condition above, with 8 huge pages per node: > > > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > > > yields: > > > > Node 0 HugePages_Total: 8 > > Node 1 HugePages_Total: 8 > > Node 2 HugePages_Total: 16 > > Node 3 HugePages_Total: 8 > > > > The incremental 8 huge pages were restricted to node 2 by the > > specified mempolicy. > > > > Similarly, we can use mempolicy to free persistent huge pages > > from specified nodes: > > > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > > > yields: > > > > Node 0 HugePages_Total: 4 > > Node 1 HugePages_Total: 4 > > Node 2 HugePages_Total: 16 > > Node 3 HugePages_Total: 8 > > > > The 8 huge pages freed were balanced over nodes 0 and 1. > > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/mempolicy.h | 3 ++ > > mm/hugetlb.c | 14 ++++++---- > > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > > 3 files changed, 73 insertions(+), 5 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > > } > > return zl; > > } > > + > > +/* > > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > > + * > > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > > + * to constraing the allocation and freeing of persistent huge pages > > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > > + * 'bind' policy in this context. An attempt to allocate a persistent huge > > + * page will never "fallback" to another node inside the buddy system > > + * allocator. > > + * > > + * If the task's mempolicy is "default" [NULL], just return NULL for > > + * default behavior. Otherwise, extract the policy nodemask for 'bind' > > + * or 'interleave' policy or construct a nodemask for 'preferred' or > > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. > > + * > > + * N.B., it is the caller's responsibility to free a returned nodemask. 
> > + */ > > +nodemask_t *huge_mpol_nodes_allowed(void) > > +{ > > + nodemask_t *nodes_allowed = NULL; > > + struct mempolicy *mempolicy; > > + int nid; > > + > > + if (!current->mempolicy) > > + return NULL; > > + > > + mpol_get(current->mempolicy); > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > + if (!nodes_allowed) { > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > + "for huge page allocation.\nFalling back to default.\n", > > + current->comm); > > I don't think using '\n' inside printk's is allowed anymore. OK, will remove. > > > + goto out; > > + } > > + nodes_clear(*nodes_allowed); > > + > > + mempolicy = current->mempolicy; > > + switch (mempolicy->mode) { > > + case MPOL_PREFERRED: > > + if (mempolicy->flags & MPOL_F_LOCAL) > > + nid = numa_node_id(); > > + else > > + nid = mempolicy->v.preferred_node; > > + node_set(nid, *nodes_allowed); > > + break; > > + > > + case MPOL_BIND: > > + /* Fall through */ > > + case MPOL_INTERLEAVE: > > + *nodes_allowed = mempolicy->v.nodes; > > + break; > > + > > + default: > > + BUG(); > > + } > > + > > +out: > > + mpol_put(current->mempolicy); > > + return nodes_allowed; > > +} > > This should be all unnecessary, see below. > > > #endif > > > > /* Allocate a page in interleaved policy. > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > > unsigned long addr, gfp_t gfp_flags, > > struct mempolicy **mpol, nodemask_t **nodemask); > > +extern nodemask_t *huge_mpol_nodes_allowed(void); > > extern unsigned slab_node(struct mempolicy *policy); > > > > extern enum zone_type policy_zone; > > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > > return node_zonelist(0, gfp_flags); > > } > > > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > > + > > static inline int do_migrate_pages(struct mm_struct *mm, > > const nodemask_t *from_nodes, > > const nodemask_t *to_nodes, int flags) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > { > > unsigned long min_count, ret; > > + nodemask_t *nodes_allowed; > > > > if (h->order >= MAX_ORDER) > > return h->max_huge_pages; > > > > Why can't you simply do this? > > struct mempolicy *pol = NULL; > nodemask_t *nodes_allowed = &node_online_map; > > local_irq_disable(); > pol = current->mempolicy; > mpol_get(pol); > local_irq_enable(); > if (pol) { > switch (pol->mode) { > case MPOL_BIND: > case MPOL_INTERLEAVE: > nodes_allowed = pol->v.nodes; > break; > case MPOL_PREFERRED: > ... use NODEMASK_SCRATCH() ... > default: > BUG(); > } > } > mpol_put(pol); > > and then use nodes_allowed throughout set_max_huge_pages()? Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages(). 
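For reference, the caller side of that contract as the hunks above imply it: a NULL return means "use node_online_map", anything else is a kmalloc()'d mask the caller must free. The shape below is illustrative, with the pool-adjustment body summarized in a comment:

        static unsigned long set_max_huge_pages(struct hstate *h,
                                                unsigned long count)
        {
                nodemask_t *nodes_allowed;

                if (h->order >= MAX_ORDER)
                        return h->max_huge_pages;

                nodes_allowed = huge_mpol_nodes_allowed(); /* NULL => all online */

                /*
                 * ... grow/shrink the pool, passing nodes_allowed to the
                 * alloc/free/surplus helpers, which fall back to
                 * &node_online_map when it is NULL ...
                 */

                kfree(nodes_allowed);   /* kfree(NULL) is a harmless no-op */
                return h->max_huge_pages;
        }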
NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure it will return a kmalloc()'d nodemask, which I need because a NULL nodemask pointer means "all online nodes" [really all nodes with memory, I suppose] and I need a pointer to kmalloc()'d nodemask to return from huge_mpol_nodes_allowed(). I want to keep the access to the internals of mempolicy in mempolicy.[ch], thus the call out to huge_mpol_nodes_allowed(), instead of open coding it. It's not really a hot path, so I didn't want to fuss with a static inline in the header, even tho' this is the only call site. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 766AD6B00BC for ; Tue, 25 Aug 2009 16:49:25 -0400 (EDT) Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes From: Lee Schermerhorn In-Reply-To: <20090825101906.GB4427@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> Content-Type: text/plain; charset="UTF-8" Date: Tue, 25 Aug 2009 16:49:29 -0400 Message-Id: <1251233369.16229.1.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 2009-08-25 at 11:19 +0100, Mel Gorman wrote: > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > PATCH/RFC 5/4 hugetlb: register per node hugepages attributes > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > V2: remove dependency on kobject private bitfield. Search > > global hstates then all per node hstates for kobject > > match in attribute show/store functions. > > > > V3: rebase atop the mempolicy-based hugepage alloc/free; > > use custom "nodes_allowed" to restrict alloc/free to > > a specific node via per node attributes. Per node > > attribute overrides mempolicy. I.e., mempolicy only > > applies to global attributes. > > > > To demonstrate feasibility--if not advisability--of supporting > > both mempolicy-based persistent huge page management with per > > node "override" attributes. > > > > This patch adds the per huge page size control/query attributes > > to the per node sysdevs: > > > > /sys/devices/system/node/node/hugepages/hugepages-/ > > nr_hugepages - r/w > > free_huge_pages - r/o > > surplus_huge_pages - r/o > > > > The patch attempts to re-use/share as much of the existing > > global hstate attribute initialization and handling, and the > > "nodes_allowed" constraint processing as possible. > > In set_max_huge_pages(), a node id < 0 indicates a change to > > global hstate parameters. In this case, any non-default task > > mempolicy will be used to generate the nodes_allowed mask. A > > node id > 0 indicates a node specific update and the count > > argument specifies the target count for the node. From this > > info, we compute the target global count for the hstate and > > construct a nodes_allowed node mask contain only the specified > > node. 
Thus, setting the node specific nr_hugepages via the > > per node attribute effectively overrides any task mempolicy. > > > > > > Issue: dependency of base driver [node] dependency on hugetlbfs module. > > We want to keep all of the hstate attribute registration and handling > > in the hugetlb module. However, we need to call into this code to > > register the per node hstate attributes on node hot plug. > > > > With this patch: > > > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB > > ./ ../ free_hugepages nr_hugepages surplus_hugepages > > > > Starting from: > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 0 > > Node 2 HugePages_Free: 0 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > vm.nr_hugepages = 0 > > > > Allocate 16 persistent huge pages on node 2: > > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages > > > > [Note that this is equivalent to: > > numactl -m 2 hugeadmin --pool-pages-min 2M:+16 > > ] > > > > Yields: > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 16 > > Node 2 HugePages_Free: 16 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > vm.nr_hugepages = 16 > > > > Global controls work as expected--reduce pool to 8 persistent huge pages: > > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 8 > > Node 2 HugePages_Free: 8 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > > > > > Signed-off-by: Lee Schermerhorn > > > > drivers/base/node.c | 2 > > include/linux/hugetlb.h | 6 + > > include/linux/node.h | 3 > > mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++------- > > 4 files changed, 197 insertions(+), 27 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400 > > @@ -200,6 +200,7 @@ int register_node(struct node *node, int > > sysdev_create_file(&node->sysdev, &attr_distance); > > > > scan_unevictable_register_node(node); > > + hugetlb_register_node(node); > > } > > return error; > > } > > @@ -220,6 +221,7 @@ void unregister_node(struct node *node) > > sysdev_remove_file(&node->sysdev, &attr_distance); > > > > scan_unevictable_unregister_node(node); > > + hugetlb_unregister_node(node); > > > > sysdev_unregister(&node->sysdev); > > } > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 
-0400 > > @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate > > return size_to_hstate(PAGE_SIZE << compound_order(page)); > > } > > > > +struct node; > > +extern void hugetlb_register_node(struct node *); > > +extern void hugetlb_unregister_node(struct node *); > > + > > #else > > struct hstate {}; > > #define alloc_bootmem_huge_page(h) NULL > > @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug > > { > > return 1; > > } > > +#define hugetlb_register_node(NP) > > +#define hugetlb_unregister_node(NP) > > #endif > > > > This also needs to be done for the !NUMA case. Try building without NUMA > set and you get the following with this patch applied > > CC mm/hugetlb.o > mm/hugetlb.c: In function 'hugetlb_exit': > mm/hugetlb.c:1629: error: implicit declaration of function 'hugetlb_unregister_all_nodes' > mm/hugetlb.c: In function 'hugetlb_init': > mm/hugetlb.c:1665: error: implicit declaration of function 'hugetlb_register_all_nodes' > make[1]: *** [mm/hugetlb.o] Error 1 > make: *** [mm] Error 2 Ouch! Sorry. Will add stubs. > > > > #endif /* _LINUX_HUGETLB_H */ > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:56.000000000 -0400 > > @@ -24,6 +24,7 @@ > > #include > > > > #include > > +#include > > #include "internal.h" > > > > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; > > @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs > > return ret; > > } > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > +{ > > This name is a bit weird. It's creating a nodemask with just a single > node allowed. > > Is there something wrong with using the existing function > nodemask_of_node()? If stack is the problem, perhaps there is some macro > magic that would allow a nodemask to be either declared on the stack or > kmalloc'd. Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a block nested inside the context where it's invoked. I would be declaring the nodemask in the compound else clause and don't want to access it [via the nodes_allowed pointer] from outside of there. > > > + nodemask_t *nodes_allowed; > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > + if (!nodes_allowed) { > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > + "for huge page allocation.\nFalling back to default.\n", > > + current->comm); > > + } else { > > + nodes_clear(*nodes_allowed); > > + node_set(nid, *nodes_allowed); > > + } > > + return nodes_allowed; > > +} > > + > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > > + int nid) > > { > > unsigned long min_count, ret; > > nodemask_t *nodes_allowed; > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > > if (h->order >= MAX_ORDER) > > return h->max_huge_pages; > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > hugetlb is a bit littered with magic numbers being passed into functions. > Attempts have been made to clear them up as patches change > that area. 
Would it be possible to define something like > > #define HUGETLB_OBEY_MEMPOLICY -1 > > for the nid here as opposed to passing in -1? I know -1 is used in the page > allocator functions but there it means "current node" and here it means > "obey mempolicies". Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a per node attribute". It means "derive nodes allowed from memory policy, if non-default, else use nodes_online_map" [which is not exactly the same as obeying memory policy]. But, I can see defining a symbolic constant such as NO_NODE[_ID_SPECIFIED]. I'll try next spin. > > > + else { > > + /* > > + * incoming 'count' is for node 'nid' only, so > > + * adjust count to global, but restrict alloc/free > > + * to the specified node. > > + */ > > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > > + nodes_allowed = nodes_allowed_from_node(nid); > > + } > > > > /* > > * Increase the pool size > > @@ -1338,34 +1365,69 @@ out: > > static struct kobject *hugepages_kobj; > > static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + struct node *node = &node_devices[nid]; > > + int hi; > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) > > Does that hi mean hello, high, nid or hstate_idx? > > hstate_idx would appear to be the appropriate name here. Or just plain 'i', like in the following, pre-existing function? > > > + if (node->hstate_kobjs[hi] == kobj) { > > + if (nidp) > > + *nidp = nid; > > + return &hstates[hi]; > > + } > > + } > > Ok.... so, there is a struct node array for the sysdev and this patch adds > references to the "hugepages" directory kobject and the subdirectories for > each page size. We walk all the objects until we find a match. Obviously, > this adds a dependency of base node support on hugetlbfs which feels backwards > and you call that out in your leader. > > Can this be the other way around? i.e. The struct hstate has an array of > kobjects arranged by nid that is filled in when the node is registered? > There will only be one kobject-per-pagesize-per-node so it seems like it > would work. I confess, I haven't prototyped this to be 100% sure. This will take a bit longer to sort out. I do want to change the registration, tho', so that hugetlb.c registers it's single node register/unregister functions with base/node.c to remove the source level dependency in that direction. node.c will only register nodes on hot plug as it's initialized too early, relative to hugetlb.c to register them at init time. This should break the call dependency of base/node.c on the hugetlb module. As far as moving the per node attributes' kobjects to the hugetlb global hstate arrays... Have to think about that. I agree that it would be nice to remove the source level [header] dependency. 
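A hypothetical shape for the inversion discussed above -- hugetlb, rather than the node driver, owning the per-node kobjects, so that base node code needs no hugetlb types. The node_hstate name and array are invented here for illustration and are not from the posted patch:

        struct node_hstate {
                struct kobject *hugepages_kobj;
                struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
        };
        /* lives in mm/hugetlb.c, so struct node stays hugetlb-free */
        static struct node_hstate node_hstates[MAX_NUMNODES];

        static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
        {
                int nid, i;

                for (nid = 0; nid < nr_node_ids; nid++)
                        for (i = 0; i < HUGE_MAX_HSTATE; i++)
                                if (node_hstates[nid].hstate_kobjs[i] == kobj) {
                                        if (nidp)
                                                *nidp = nid;
                                        return &hstates[i];
                                }
                return NULL;
        }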
> > > + > > + BUG(); > > + return NULL; > > +} > > + > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > > { > > int i; > > + > > for (i = 0; i < HUGE_MAX_HSTATE; i++) > > - if (hstate_kobjs[i] == kobj) > > + if (hstate_kobjs[i] == kobj) { > > + if (nidp) > > + *nidp = -1; > > return &hstates[i]; > > - BUG(); > > - return NULL; > > + } > > + > > + return kobj_to_node_hstate(kobj, nidp); > > } > > > > static ssize_t nr_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > > + struct hstate *h; > > + unsigned long nr_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + nr_huge_pages = h->nr_huge_pages; > > Here is another magic number except it means something slightly > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would > be nice if these different special nid values could be named, preferably > collapsed to being one "core" thing. Again, it means "NO NODE ID specified" [via per node attribute]. Again, I'll address this with a single constant. > > > + else > > + nr_huge_pages = h->nr_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", nr_huge_pages); > > } > > + > > static ssize_t nr_hugepages_store(struct kobject *kobj, > > struct kobj_attribute *attr, const char *buf, size_t count) > > { > > - int err; > > unsigned long input; > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h; > > + int nid; > > + int err; > > > > err = strict_strtoul(buf, 10, &input); > > if (err) > > return 0; > > > > - h->max_huge_pages = set_max_huge_pages(h, input); > > "input" is a bit meaningless. The function you are passing to calls this > parameter "count". Can you match the naming please? Otherwise, I might > guess that this is a "delta" which occurs elsewhere in the hugetlb code. I guess I can change that. It's the pre-exiting name, and 'count' was already used. 
Guess I can change 'count' to 'len' and 'input' to 'count' > > > + h = kobj_to_hstate(kobj, &nid); > > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > > > return count; > > } > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > + > > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > > } > > + > > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > > struct kobj_attribute *attr, const char *buf, size_t count) > > { > > int err; > > unsigned long input; > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > > err = strict_strtoul(buf, 10, &input); > > if (err) > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > > static ssize_t free_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->free_huge_pages); > > + struct hstate *h; > > + unsigned long free_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + free_huge_pages = h->free_huge_pages; > > + else > > + free_huge_pages = h->free_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", free_huge_pages); > > } > > HSTATE_ATTR_RO(free_hugepages); > > > > static ssize_t resv_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > return sprintf(buf, "%lu\n", h->resv_huge_pages); > > } > > HSTATE_ATTR_RO(resv_hugepages); > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > > static ssize_t surplus_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > > + struct hstate *h; > > + unsigned long surplus_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + surplus_huge_pages = h->surplus_huge_pages; > > + else > > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", surplus_huge_pages); > > } > > HSTATE_ATTR_RO(surplus_hugepages); > > > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > > .attrs = hstate_attrs, > > }; > > > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > > + struct kobject *parent, > > + struct kobject **hstate_kobjs, > > + struct attribute_group *hstate_attr_group) > > { > > int retval; > > + int hi = h - hstates; > > > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > > - hugepages_kobj); > > - if (!hstate_kobjs[h - hstates]) > > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > > + if (!hstate_kobjs[hi]) > > return -ENOMEM; > > > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > > - &hstate_attr_group); > > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > > if (retval) > > - kobject_put(hstate_kobjs[h - hstates]); > > + kobject_put(hstate_kobjs[hi]); > > > > return retval; > > } > > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo > > return; > > > > for_each_hstate(h) { > > - err = 
hugetlb_sysfs_add_hstate(h); > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > > + hstate_kobjs, &hstate_attr_group); > > if (err) > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > > h->name); > > } > > } > > > > +#ifdef CONFIG_NUMA > > +static struct attribute *per_node_hstate_attrs[] = { > > + &nr_hugepages_attr.attr, > > + &free_hugepages_attr.attr, > > + &surplus_hugepages_attr.attr, > > + NULL, > > +}; > > + > > +static struct attribute_group per_node_hstate_attr_group = { > > + .attrs = per_node_hstate_attrs, > > +}; > > + > > + > > +void hugetlb_unregister_node(struct node *node) > > +{ > > + struct hstate *h; > > + > > + for_each_hstate(h) { > > + kobject_put(node->hstate_kobjs[h - hstates]); > > + node->hstate_kobjs[h - hstates] = NULL; > > + } > > + > > + kobject_put(node->hugepages_kobj); > > + node->hugepages_kobj = NULL; > > +} > > + > > +static void hugetlb_unregister_all_nodes(void) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) > > + hugetlb_unregister_node(&node_devices[nid]); > > +} > > + > > +void hugetlb_register_node(struct node *node) > > +{ > > + struct hstate *h; > > + int err; > > + > > + if (!hugepages_kobj) > > + return; /* too early */ > > + > > + node->hugepages_kobj = kobject_create_and_add("hugepages", > > + &node->sysdev.kobj); > > + if (!node->hugepages_kobj) > > + return; > > + > > + for_each_hstate(h) { > > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > > + node->hstate_kobjs, > > + &per_node_hstate_attr_group); > > + if (err) > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > > + " for node %d\n", > > + h->name, node->sysdev.id); > > + } > > +} > > + > > +static void hugetlb_register_all_nodes(void) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + struct node *node = &node_devices[nid]; > > + if (node->sysdev.id == nid && !node->hugepages_kobj) > > + hugetlb_register_node(node); > > + } > > +} > > +#endif > > + > > static void __exit hugetlb_exit(void) > > { > > struct hstate *h; > > > > + hugetlb_unregister_all_nodes(); > > + > > for_each_hstate(h) { > > kobject_put(hstate_kobjs[h - hstates]); > > } > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > > > hugetlb_sysfs_init(); > > > > + hugetlb_register_all_nodes(); > > + > > return 0; > > } > > module_init(hugetlb_init); > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > > > if (write) > > - h->max_huge_pages = set_max_huge_pages(h, tmp); > > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > > > return 0; > > } > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > @@ -21,9 +21,12 @@ > > > > #include > > #include > > +#include > > > > struct node { > > struct sys_device sysdev; > > + struct kobject *hugepages_kobj; > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > }; > > > > struct memory_block; > > > > I'm not against this idea and think it can work side-by-side with the memory > policies. I believe it does need a bit more cleaning up before merging > though. I also wasn't able to test this yet due to various build and > deploy issues. OK. I'll do the cleanup. 
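A sketch of the !CONFIG_NUMA stubs promised earlier in this reply, with the function names taken from the build log above:

        #ifdef CONFIG_NUMA
        /* real hugetlb_register_all_nodes()/hugetlb_unregister_all_nodes() */
        #else
        static void hugetlb_register_all_nodes(void) { }
        static void hugetlb_unregister_all_nodes(void) { }
        #endif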
I have tested this atop the mempolicy version by working around the build issues that I thought were just temporary glitches in the mmotm series. In my [limited] experience, one can interleave numactl+hugeadm with setting values via the per node attributes and it does the right thing. No heavy testing with racing tasks, tho'. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 6B62D6B00BD for ; Tue, 25 Aug 2009 16:49:31 -0400 (EDT) Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns From: Lee Schermerhorn In-Reply-To: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> Content-Type: text/plain Date: Tue, 25 Aug 2009 16:49:34 -0400 Message-Id: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 2009-08-25 at 01:16 -0700, David Rientjes wrote: > On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > > > [PATCH 2/4] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > V3: > > + moved this patch to after the "rework" of hstate_next_node_to_... > > functions as this patch is more specific to using task mempolicy > > to control huge page allocation and freeing. > > > > In preparation for constraining huge page allocation and freeing by the > > controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer > > to the allocate, free and surplus adjustment functions. For now, pass > > NULL to indicate default behavior--i.e., use node_online_map. A > > subsqeuent patch will derive a non-default mask from the controlling > > task's numa mempolicy. > > > > Reviewed-by: Mel Gorman > > Signed-off-by: Lee Schermerhorn > > > > mm/hugetlb.c | 102 ++++++++++++++++++++++++++++++++++++++--------------------- > > 1 file changed, 67 insertions(+), 35 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > > } > > > > /* > > - * common helper function for hstate_next_node_to_{alloc|free}. > > - * return next node in node_online_map, wrapping at end. > > + * common helper functions for hstate_next_node_to_{alloc|free}. > > + * We may have allocated or freed a huge pages based on a different > > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > > + * be outside of *nodes_allowed. Ensure that we use the next > > + * allowed node for alloc or free. 
> > */ > > -static int next_node_allowed(int nid) > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > > { > > - nid = next_node(nid, node_online_map); > > + nid = next_node(nid, *nodes_allowed); > > if (nid == MAX_NUMNODES) > > - nid = first_node(node_online_map); > > + nid = first_node(*nodes_allowed); > > VM_BUG_ON(nid >= MAX_NUMNODES); > > > > return nid; > > } > > > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > > +{ > > + if (!node_isset(nid, *nodes_allowed)) > > + nid = next_node_allowed(nid, nodes_allowed); > > + return nid; > > +} > > Awkward name considering this doesn't simply return true or false as > expected, it returns a nid. Well, it's not a predicate function so I wouldn't expect true or false return, but I can see how the trailing "allowed" can sound like we're asking the question "Is this node allowed?". Maybe, "get_this_node_allowed()" or "get_start_node_allowed" [we return the nid to "startnid"], ... Or, do you have a suggestion? > > > + > > /* > > * Use a helper variable to find the next node and then > > * copy it back to next_nid_to_alloc afterwards: > > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > * But we don't need to use a spin_lock here: it really > > * doesn't matter if occasionally a racer chooses the > > - * same nid as we do. Move nid forward in the mask even > > - * if we just successfully allocated a hugepage so that > > - * the next caller gets hugepages on the next node. > > + * same nid as we do. Move nid forward in the mask whether > > + * or not we just successfully allocated a hugepage so that > > + * the next allocation addresses the next node. > > */ > > -static int hstate_next_node_to_alloc(struct hstate *h) > > +static int hstate_next_node_to_alloc(struct hstate *h, > > + nodemask_t *nodes_allowed) > > { > > int nid, next_nid; > > > > - nid = h->next_nid_to_alloc; > > - next_nid = next_node_allowed(nid); > > + if (!nodes_allowed) > > + nodes_allowed = &node_online_map; > > + > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > + > > + next_nid = next_node_allowed(nid, nodes_allowed); > > h->next_nid_to_alloc = next_nid; > > + > > return nid; > > } > > Don't need next_nid. Well, the pre-existing comment block indicated that the use of the apparently spurious next_nid variable is necessary to close a race. Not sure whether that comment still applies with this rework. What do you think? > > > -static int alloc_fresh_huge_page(struct hstate *h) > > +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) > > { > > struct page *page; > > int start_nid; > > int next_nid; > > int ret = 0; > > > > - start_nid = hstate_next_node_to_alloc(h); > > + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); > > next_nid = start_nid; > > > > do { > > @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct > > ret = 1; > > break; > > } > > - next_nid = hstate_next_node_to_alloc(h); > > + next_nid = hstate_next_node_to_alloc(h, nodes_allowed); > > } while (next_nid != start_nid); > > > > if (ret) > > @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct > > * whether or not we find a free huge page to free so that the > > * next attempt to free addresses the next node. 
> > */ > > -static int hstate_next_node_to_free(struct hstate *h) > > +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) > > { > > int nid, next_nid; > > > > - nid = h->next_nid_to_free; > > - next_nid = next_node_allowed(nid); > > + if (!nodes_allowed) > > + nodes_allowed = &node_online_map; > > + > > + nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); > > + > > + next_nid = next_node_allowed(nid, nodes_allowed); > > h->next_nid_to_free = next_nid; > > + > > return nid; > > } > > Same. Yes, and I modeled this on "next to alloc", with the extra next_nid for the same reason. Do we dare remove it? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 6A8006B00BD for ; Tue, 25 Aug 2009 16:49:36 -0400 (EDT) Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes From: Lee Schermerhorn In-Reply-To: <20090825133516.GE21335@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825133516.GE21335@csn.ul.ie> Content-Type: text/plain Date: Tue, 25 Aug 2009 16:49:40 -0400 Message-Id: <1251233380.16229.3.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote: > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > @@ -21,9 +21,12 @@ > > > > #include > > #include > > +#include > > > > Is this header inclusion necessary? It does not appear to be required by > the structure modification (which is iffy in itself as discussed in the > earlier mail) and it breaks build on x86-64. Hi, Mel: I recall that it is necessary to build. You can try w/o it. 
> > CC arch/x86/kernel/setup_percpu.o > In file included from include/linux/pagemap.h:10, > from include/linux/mempolicy.h:62, > from include/linux/hugetlb.h:8, > from include/linux/node.h:24, > from include/linux/cpu.h:23, > from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, > from arch/x86/kernel/setup_percpu.c:19: > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 > make[1]: *** [arch/x86/kernel] Error 2 I saw this. I've been testing on x86_64. I *thought* that it only started showing up in a recent mmotm from changes in the linux-next patch--e.g., a failure to set ARCH_HAS_KMAP or to handle appropriately !ARCH_HAS_KMAP in highmem.h But maybe that was coincidental with my adding the include. Lee > > > > > struct node { > > struct sys_device sysdev; > > + struct kobject *hugepages_kobj; > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > }; > > > > struct memory_block; > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id EA8CF6B00D9 for ; Tue, 25 Aug 2009 17:59:28 -0400 (EDT) Received: from zps18.corp.google.com (zps18.corp.google.com [172.25.146.18]) by smtp-out.google.com with ESMTP id n7PLxJaS004906 for ; Tue, 25 Aug 2009 14:59:24 -0700 Received: from pzk3 (pzk3.prod.google.com [10.243.19.131]) by zps18.corp.google.com with ESMTP id n7PLwjkq014881 for ; Tue, 25 Aug 2009 14:59:17 -0700 Received: by pzk3 with SMTP id 3so1888859pzk.31 for ; Tue, 25 Aug 2009 14:59:17 -0700 (PDT) Date: Tue, 25 Aug 2009 14:59:11 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns In-Reply-To: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> <1251233374.16229.2.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 25 Aug 2009, Lee Schermerhorn wrote: > > > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > > > } > > > > > > /* > > > - * common helper function for hstate_next_node_to_{alloc|free}. > > > - * return next node in node_online_map, wrapping at end. > > > + * common helper functions for hstate_next_node_to_{alloc|free}. 
> > > + * We may have allocated or freed a huge pages based on a different > > > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > > > + * be outside of *nodes_allowed. Ensure that we use the next > > > + * allowed node for alloc or free. > > > */ > > > -static int next_node_allowed(int nid) > > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > > > { > > > - nid = next_node(nid, node_online_map); > > > + nid = next_node(nid, *nodes_allowed); > > > if (nid == MAX_NUMNODES) > > > - nid = first_node(node_online_map); > > > + nid = first_node(*nodes_allowed); > > > VM_BUG_ON(nid >= MAX_NUMNODES); > > > > > > return nid; > > > } > > > > > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > > > +{ > > > + if (!node_isset(nid, *nodes_allowed)) > > > + nid = next_node_allowed(nid, nodes_allowed); > > > + return nid; > > > +} > > > > Awkward name considering this doesn't simply return true or false as > > expected, it returns a nid. > > Well, it's not a predicate function so I wouldn't expect true or false > return, but I can see how the trailing "allowed" can sound like we're > asking the question "Is this node allowed?". Maybe, > "get_this_node_allowed()" or "get_start_node_allowed" [we return the nid > to "startnid"], ... Or, do you have a suggestion? > this_node_allowed() just seemed like a very similar name to cpuset_zone_allowed() in the cpuset code, which does return true or false depending on whether the zone is allowed by current's cpuset. As usual with the mempolicy discussions, I come from a biased cpuset perspective :) > > > > > + > > > /* > > > * Use a helper variable to find the next node and then > > > * copy it back to next_nid_to_alloc afterwards: > > > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > > * But we don't need to use a spin_lock here: it really > > > * doesn't matter if occasionally a racer chooses the > > > - * same nid as we do. Move nid forward in the mask even > > > - * if we just successfully allocated a hugepage so that > > > - * the next caller gets hugepages on the next node. > > > + * same nid as we do. Move nid forward in the mask whether > > > + * or not we just successfully allocated a hugepage so that > > > + * the next allocation addresses the next node. > > > */ > > > -static int hstate_next_node_to_alloc(struct hstate *h) > > > +static int hstate_next_node_to_alloc(struct hstate *h, > > > + nodemask_t *nodes_allowed) > > > { > > > int nid, next_nid; > > > > > > - nid = h->next_nid_to_alloc; > > > - next_nid = next_node_allowed(nid); > > > + if (!nodes_allowed) > > > + nodes_allowed = &node_online_map; > > > + > > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > > + > > > + next_nid = next_node_allowed(nid, nodes_allowed); > > > h->next_nid_to_alloc = next_nid; > > > + > > > return nid; > > > } > > > > Don't need next_nid. > > Well, the pre-existing comment block indicated that the use of the > apparently spurious next_nid variable is necessary to close a race. Not > sure whether that comment still applies with this rework. What do you > think? > What race is it closing exactly if gcc is going to optimize it out anyways? I think you can safely fold the following into your patch. 
--- diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -659,15 +659,14 @@ static int this_node_allowed(int nid, nodemask_t *nodes_allowed) static int hstate_next_node_to_alloc(struct hstate *h, nodemask_t *nodes_allowed) { - int nid, next_nid; + int nid; if (!nodes_allowed) nodes_allowed = &node_online_map; nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); - next_nid = next_node_allowed(nid, nodes_allowed); - h->next_nid_to_alloc = next_nid; + h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed); return nid; } @@ -707,15 +706,14 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) */ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) { - int nid, next_nid; + int nid; if (!nodes_allowed) nodes_allowed = &node_online_map; nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); - next_nid = next_node_allowed(nid, nodes_allowed); - h->next_nid_to_free = next_nid; + h->next_nid_to_free = next_node_allowed(nid, nodes_allowed); return nid; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 616C26B0087 for ; Wed, 26 Aug 2009 06:56:12 -0400 (EDT) Date: Wed, 26 Aug 2009 11:12:03 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090826101202.GE10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825133516.GE21335@csn.ul.ie> <1251233380.16229.3.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1251233380.16229.3.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Aug 25, 2009 at 04:49:40PM -0400, Lee Schermerhorn wrote: > On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote: > > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > > > > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > > @@ -21,9 +21,12 @@ > > > > > > #include > > > #include > > > +#include > > > > > > > Is this header inclusion necessary? It does not appear to be required by > > the structure modification (which is iffy in itself as discussed in the > > earlier mail) and it breaks build on x86-64. > > Hi, Mel: > > I recall that it is necessary to build. You can try w/o it. > I did, it appeared to work but I didn't dig deep as to why. 
> > > > CC arch/x86/kernel/setup_percpu.o > > In file included from include/linux/pagemap.h:10, > > from include/linux/mempolicy.h:62, > > from include/linux/hugetlb.h:8, > > from include/linux/node.h:24, > > from include/linux/cpu.h:23, > > from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, > > from arch/x86/kernel/setup_percpu.c:19: > > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here > > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here > > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here > > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 > > make[1]: *** [arch/x86/kernel] Error 2 > > I saw this. I've been testing on x86_64. I *thought* that it only > started showing up in a recent mmotm from changes in the linux-next > patch--e.g., a failure to set ARCH_HAS_KMAP or to handle appropriately > !ARCH_HAS_KMAP in highmem.h But maybe that was coincidental with my > adding the include. > Maybe we were looking at different mmotm's -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 62A5F6B0096 for ; Wed, 26 Aug 2009 06:56:13 -0400 (EDT) Date: Tue, 25 Aug 2009 11:19:07 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090825101906.GB4427@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > PATCH/RFC 5/4 hugetlb: register per node hugepages attributes > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: remove dependency on kobject private bitfield. Search > global hstates then all per node hstates for kobject > match in attribute show/store functions. > > V3: rebase atop the mempolicy-based hugepage alloc/free; > use custom "nodes_allowed" to restrict alloc/free to > a specific node via per node attributes. Per node > attribute overrides mempolicy. I.e., mempolicy only > applies to global attributes. > > To demonstrate feasibility--if not advisability--of supporting > both mempolicy-based persistent huge page management with per > node "override" attributes. 
> > This patch adds the per huge page size control/query attributes > to the per node sysdevs: > > /sys/devices/system/node/node/hugepages/hugepages-/ > nr_hugepages - r/w > free_huge_pages - r/o > surplus_huge_pages - r/o > > The patch attempts to re-use/share as much of the existing > global hstate attribute initialization and handling, and the > "nodes_allowed" constraint processing as possible. > In set_max_huge_pages(), a node id < 0 indicates a change to > global hstate parameters. In this case, any non-default task > mempolicy will be used to generate the nodes_allowed mask. A > node id > 0 indicates a node specific update and the count > argument specifies the target count for the node. From this > info, we compute the target global count for the hstate and > construct a nodes_allowed node mask contain only the specified > node. Thus, setting the node specific nr_hugepages via the > per node attribute effectively overrides any task mempolicy. > > > Issue: dependency of base driver [node] dependency on hugetlbfs module. > We want to keep all of the hstate attribute registration and handling > in the hugetlb module. However, we need to call into this code to > register the per node hstate attributes on node hot plug. > > With this patch: > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB > ./ ../ free_hugepages nr_hugepages surplus_hugepages > > Starting from: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 0 > Node 2 HugePages_Free: 0 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 0 > > Allocate 16 persistent huge pages on node 2: > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages > > [Note that this is equivalent to: > numactl -m 2 hugeadmin --pool-pages-min 2M:+16 > ] > > Yields: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 16 > Node 2 HugePages_Free: 16 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 16 > > Global controls work as expected--reduce pool to 8 persistent huge pages: > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 8 > Node 2 HugePages_Free: 8 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > > > Signed-off-by: Lee Schermerhorn > > drivers/base/node.c | 2 > include/linux/hugetlb.h | 6 + > include/linux/node.h | 3 > mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++------- > 4 files changed, 197 insertions(+), 27 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400 > @@ -200,6 +200,7 @@ int register_node(struct node *node, int > sysdev_create_file(&node->sysdev, &attr_distance); > > 
scan_unevictable_register_node(node); > + hugetlb_register_node(node); > } > return error; > } > @@ -220,6 +221,7 @@ void unregister_node(struct node *node) > sysdev_remove_file(&node->sysdev, &attr_distance); > > scan_unevictable_unregister_node(node); > + hugetlb_unregister_node(node); > > sysdev_unregister(&node->sysdev); > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 -0400 > @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate > return size_to_hstate(PAGE_SIZE << compound_order(page)); > } > > +struct node; > +extern void hugetlb_register_node(struct node *); > +extern void hugetlb_unregister_node(struct node *); > + > #else > struct hstate {}; > #define alloc_bootmem_huge_page(h) NULL > @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug > { > return 1; > } > +#define hugetlb_register_node(NP) > +#define hugetlb_unregister_node(NP) > #endif > This also needs to be done for the !NUMA case. Try building without NUMA set and you get the following with this patch applied: CC mm/hugetlb.o mm/hugetlb.c: In function 'hugetlb_exit': mm/hugetlb.c:1629: error: implicit declaration of function 'hugetlb_unregister_all_nodes' mm/hugetlb.c: In function 'hugetlb_init': mm/hugetlb.c:1665: error: implicit declaration of function 'hugetlb_register_all_nodes' make[1]: *** [mm/hugetlb.o] Error 1 make: *** [mm] Error 2 > #endif /* _LINUX_HUGETLB_H */ > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:56.000000000 -0400 > @@ -24,6 +24,7 @@ > #include > > #include > +#include > #include "internal.h" > > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; > @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs > return ret; > } > > +static nodemask_t *nodes_allowed_from_node(int nid) > +{ This name is a bit weird. It's creating a nodemask with just a single node allowed. Is there something wrong with using the existing function nodemask_of_node()? If stack is the problem, perhaps there is some macro magic that would allow a nodemask to be either declared on the stack or kmalloc'd.
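For reference, nodemask_of_node() looks roughly like the sketch below in include/linux/nodemask.h of this vintage (quoted from memory, so verify against the tree being built); the point is that the mask it produces is an anonymous on-stack value materialized inside the expression:

	#define nodemask_of_node(node)					\
	({								\
		typeof(_unused_nodemask_arg_) m;			\
		if (sizeof(m) == sizeof(unsigned long)) {		\
			m.bits[0] = 1UL << (node);			\
		} else {						\
			nodes_clear(m);					\
			node_set((node), m);				\
		}							\
		m;							\
	})

That block-local 'm' is exactly what makes it awkward to hand a pointer to the mask out of the enclosing scope, which is where the kmalloc'd variant in this patch comes from.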
> + nodemask_t *nodes_allowed; > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); > + } else { > + nodes_clear(*nodes_allowed); > + node_set(nid, *nodes_allowed); > + } > + return nodes_allowed; > +} > + > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > + int nid) > { > unsigned long min_count, ret; > nodemask_t *nodes_allowed; > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > > - nodes_allowed = huge_mpol_nodes_allowed(); > + if (nid < 0) > + nodes_allowed = huge_mpol_nodes_allowed(); hugetlb is a bit littered with magic numbers been passed into functions. Attempts have been made to clear them up as according as patches change that area. Would it be possible to define something like #define HUGETLB_OBEY_MEMPOLICY -1 for the nid here as opposed to passing in -1? I know -1 is used in the page allocator functions but there it means "current node" and here it means "obey mempolicies". > + else { > + /* > + * incoming 'count' is for node 'nid' only, so > + * adjust count to global, but restrict alloc/free > + * to the specified node. > + */ > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > + nodes_allowed = nodes_allowed_from_node(nid); > + } > > /* > * Increase the pool size > @@ -1338,34 +1365,69 @@ out: > static struct kobject *hugepages_kobj; > static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node *node = &node_devices[nid]; > + int hi; > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) Does that hi mean hello, high, nid or hstate_idx? hstate_idx would appear to be the appropriate name here. > + if (node->hstate_kobjs[hi] == kobj) { > + if (nidp) > + *nidp = nid; > + return &hstates[hi]; > + } > + } Ok.... so, there is a struct node array for the sysdev and this patch adds references to the "hugepages" directory kobject and the subdirectories for each page size. We walk all the objects until we find a match. Obviously, this adds a dependency of base node support on hugetlbfs which feels backwards and you call that out in your leader. Can this be the other way around? i.e. The struct hstate has an array of kobjects arranged by nid that is filled in when the node is registered? There will only be one kobject-per-pagesize-per-node so it seems like it would work. I confess, I haven't prototyped this to be 100% sure. 
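To make the suggested inversion concrete, here is a rough, untested sketch of what it could look like if each hstate owned its per node kobjects instead of struct node carrying hugetlb state; the field name is hypothetical:

	/* Sketch only: hstate owns one kobject per node, filled in when
	 * the node is registered; not the patch as posted. */
	struct hstate {
		/* ... existing fields ... */
		struct kobject *node_kobjs[MAX_NUMNODES];
	};

	static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
	{
		struct hstate *h;
		int nid;

		for_each_hstate(h)
			for (nid = 0; nid < nr_node_ids; nid++)
				if (h->node_kobjs[nid] == kobj) {
					if (nidp)
						*nidp = nid;
					return h;
				}
		return NULL;
	}

The lookup cost is the same as in the posted version, but node.h would no longer need to know anything about hugetlb; the price is a MAX_NUMNODES-sized pointer array per hstate.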
> + > + BUG(); > + return NULL; > +} > + > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > { > int i; > + > for (i = 0; i < HUGE_MAX_HSTATE; i++) > - if (hstate_kobjs[i] == kobj) > + if (hstate_kobjs[i] == kobj) { > + if (nidp) > + *nidp = -1; > return &hstates[i]; > - BUG(); > - return NULL; > + } > + > + return kobj_to_node_hstate(kobj, nidp); > } > > static ssize_t nr_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > + struct hstate *h; > + unsigned long nr_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + nr_huge_pages = h->nr_huge_pages; Here is another magic number except it means something slightly different. It means NR_GLOBAL_HUGEPAGES or something similar. It would be nice if these different special nid values could be named, preferably collapsed to being one "core" thing. > + else > + nr_huge_pages = h->nr_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", nr_huge_pages); > } > + > static ssize_t nr_hugepages_store(struct kobject *kobj, > struct kobj_attribute *attr, const char *buf, size_t count) > { > - int err; > unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h; > + int nid; > + int err; > > err = strict_strtoul(buf, 10, &input); > if (err) > return 0; > > - h->max_huge_pages = set_max_huge_pages(h, input); "input" is a bit meaningless. The function you are passing to calls this parameter "count". Can you match the naming please? Otherwise, I might guess that this is a "delta" which occurs elsewhere in the hugetlb code. > + h = kobj_to_hstate(kobj, &nid); > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > return count; > } > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > + > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > } > + > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > struct kobj_attribute *attr, const char *buf, size_t count) > { > int err; > unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > err = strict_strtoul(buf, 10, &input); > if (err) > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > static ssize_t free_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->free_huge_pages); > + struct hstate *h; > + unsigned long free_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + free_huge_pages = h->free_huge_pages; > + else > + free_huge_pages = h->free_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", free_huge_pages); > } > HSTATE_ATTR_RO(free_hugepages); > > static ssize_t resv_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > return sprintf(buf, "%lu\n", h->resv_huge_pages); > } > HSTATE_ATTR_RO(resv_hugepages); > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > static ssize_t surplus_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = 
kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > + struct hstate *h; > + unsigned long surplus_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + surplus_huge_pages = h->surplus_huge_pages; > + else > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", surplus_huge_pages); > } > HSTATE_ATTR_RO(surplus_hugepages); > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > .attrs = hstate_attrs, > }; > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > + struct kobject *parent, > + struct kobject **hstate_kobjs, > + struct attribute_group *hstate_attr_group) > { > int retval; > + int hi = h - hstates; > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > - hugepages_kobj); > - if (!hstate_kobjs[h - hstates]) > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > + if (!hstate_kobjs[hi]) > return -ENOMEM; > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > - &hstate_attr_group); > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > if (retval) > - kobject_put(hstate_kobjs[h - hstates]); > + kobject_put(hstate_kobjs[hi]); > > return retval; > } > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo > return; > > for_each_hstate(h) { > - err = hugetlb_sysfs_add_hstate(h); > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > + hstate_kobjs, &hstate_attr_group); > if (err) > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > h->name); > } > } > > +#ifdef CONFIG_NUMA > +static struct attribute *per_node_hstate_attrs[] = { > + &nr_hugepages_attr.attr, > + &free_hugepages_attr.attr, > + &surplus_hugepages_attr.attr, > + NULL, > +}; > + > +static struct attribute_group per_node_hstate_attr_group = { > + .attrs = per_node_hstate_attrs, > +}; > + > + > +void hugetlb_unregister_node(struct node *node) > +{ > + struct hstate *h; > + > + for_each_hstate(h) { > + kobject_put(node->hstate_kobjs[h - hstates]); > + node->hstate_kobjs[h - hstates] = NULL; > + } > + > + kobject_put(node->hugepages_kobj); > + node->hugepages_kobj = NULL; > +} > + > +static void hugetlb_unregister_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) > + hugetlb_unregister_node(&node_devices[nid]); > +} > + > +void hugetlb_register_node(struct node *node) > +{ > + struct hstate *h; > + int err; > + > + if (!hugepages_kobj) > + return; /* too early */ > + > + node->hugepages_kobj = kobject_create_and_add("hugepages", > + &node->sysdev.kobj); > + if (!node->hugepages_kobj) > + return; > + > + for_each_hstate(h) { > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > + node->hstate_kobjs, > + &per_node_hstate_attr_group); > + if (err) > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > + " for node %d\n", > + h->name, node->sysdev.id); > + } > +} > + > +static void hugetlb_register_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node *node = &node_devices[nid]; > + if (node->sysdev.id == nid && !node->hugepages_kobj) > + hugetlb_register_node(node); > + } > +} > +#endif > + > static void __exit hugetlb_exit(void) > { > struct hstate *h; > > + hugetlb_unregister_all_nodes(); > + > for_each_hstate(h) { > kobject_put(hstate_kobjs[h - hstates]); > } > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > hugetlb_sysfs_init(); > > + hugetlb_register_all_nodes(); > + > return 
0; > } > module_init(hugetlb_init); > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > if (write) > - h->max_huge_pages = set_max_huge_pages(h, tmp); > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > return 0; > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > @@ -21,9 +21,12 @@ > > #include > #include > +#include > > struct node { > struct sys_device sysdev; > + struct kobject *hugepages_kobj; > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > }; > > struct memory_block; > I'm not against this idea and think it can work side-by-side with the memory policies. I believe it does need a bit more cleaning up before merging though. I also wasn't able to test this yet due to various build and deploy issues. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 1451E6B00BF for ; Wed, 26 Aug 2009 06:56:14 -0400 (EDT) Date: Wed, 26 Aug 2009 11:11:22 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090826101122.GD10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1251233369.16229.1.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote: > > > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > > +{ > > > > This name is a bit weird. It's creating a nodemask with just a single > > node allowed. > > > > Is there something wrong with using the existing function > > nodemask_of_node()? If stack is the problem, perhaps there is some macro > > magic that would allow a nodemask to be either declared on the stack or > > kmalloc'd. > > Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a > block nested inside the context where it's invoked. I would be > declaring the nodemask in the compound else clause and don't want to > access it [via the nodes_allowed pointer] from outside of there. > So, the existence of the mask on the stack is the problem. I can understand that; they are potentially quite large. Would it be possible to add a helper alongside it like init_nodemask_of_node() that does the same work as nodemask_of_node() but takes a nodemask parameter? nodemask_of_node() would reuse init_nodemask_of_node(), except that it declares the nodemask on the stack.
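A minimal sketch of that helper pair, assuming the nodemask.h conventions quoted earlier (a real version would presumably keep the sizeof(unsigned long) fast path):

	/* Initialize a caller-provided mask to contain a single node. */
	#define init_nodemask_of_node(mask, node)			\
	do {								\
		nodes_clear(*(mask));					\
		node_set((node), *(mask));				\
	} while (0)

	/* nodemask_of_node() becomes a thin wrapper that still returns
	 * an on-stack mask. */
	#define nodemask_of_node(node)					\
	({								\
		typeof(_unused_nodemask_arg_) m;			\
		init_nodemask_of_node(&m, (node));			\
		m;							\
	})

The kmalloc'd path in nodes_allowed_from_node() could then call init_nodemask_of_node(nodes_allowed, nid) on the heap-allocated mask instead of open-coding the nodes_clear()/node_set() pair.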
> > > > > + nodemask_t *nodes_allowed; > > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > > + if (!nodes_allowed) { > > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > > + "for huge page allocation.\nFalling back to default.\n", > > > + current->comm); > > > + } else { > > > + nodes_clear(*nodes_allowed); > > > + node_set(nid, *nodes_allowed); > > > + } > > > + return nodes_allowed; > > > +} > > > + > > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > > > + int nid) > > > { > > > unsigned long min_count, ret; > > > nodemask_t *nodes_allowed; > > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > > > if (h->order >= MAX_ORDER) > > > return h->max_huge_pages; > > > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > > + if (nid < 0) > > > + nodes_allowed = huge_mpol_nodes_allowed(); > > > > hugetlb is a bit littered with magic numbers been passed into functions. > > Attempts have been made to clear them up as according as patches change > > that area. Would it be possible to define something like > > > > #define HUGETLB_OBEY_MEMPOLICY -1 > > > > for the nid here as opposed to passing in -1? I know -1 is used in the page > > allocator functions but there it means "current node" and here it means > > "obey mempolicies". > > Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a > per node attribute". It means "derive nodes allowed from memory policy, > if non-default, else use nodes_online_map" [which is not exactly the > same as obeying memory policy]. > > But, I can see defining a symbolic constant such as > NO_NODE[_ID_SPECIFIED]. I'll try next spin. > That NO_NODE_ID_SPECIFIED was the underlying definition I was looking for. It makes sense at both sites. > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + struct node *node = &node_devices[nid]; > > > + int hi; > > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) > > > > Does that hi mean hello, high, nid or hstate_idx? > > > > hstate_idx would appear to be the appropriate name here. > > Or just plain 'i', like in the following, pre-existing function? > Whichever suits you best. If hstate_idx is really what it is, I see no harm in using it but 'i' is an index and I'd sooner recognise that than the less meaningful "hi". > > > > > + if (node->hstate_kobjs[hi] == kobj) { > > > + if (nidp) > > > + *nidp = nid; > > > + return &hstates[hi]; > > > + } > > > + } > > > > Ok.... so, there is a struct node array for the sysdev and this patch adds > > references to the "hugepages" directory kobject and the subdirectories for > > each page size. We walk all the objects until we find a match. Obviously, > > this adds a dependency of base node support on hugetlbfs which feels backwards > > and you call that out in your leader. > > > > Can this be the other way around? i.e. The struct hstate has an array of > > kobjects arranged by nid that is filled in when the node is registered? > > There will only be one kobject-per-pagesize-per-node so it seems like it > > would work. I confess, I haven't prototyped this to be 100% sure. > > This will take a bit longer to sort out. 
I do want to change the > registration, tho', so that hugetlb.c registers it's single node > register/unregister functions with base/node.c to remove the source > level dependency in that direction. node.c will only register nodes on > hot plug as it's initialized too early, relative to hugetlb.c to > register them at init time. This should break the call dependency of > base/node.c on the hugetlb module. > > As far as moving the per node attributes' kobjects to the hugetlb global > hstate arrays... Have to think about that. I agree that it would be > nice to remove the source level [header] dependency. > FWIW, I see no problem with the mempolicy stuff going ahead separately from this patch after the few relatively minor cleanups highlighted in the thread and tackling this patch as a separate cycle. It's up to you really. > > > > > + > > > + BUG(); > > > + return NULL; > > > +} > > > + > > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > > > { > > > int i; > > > + > > > for (i = 0; i < HUGE_MAX_HSTATE; i++) > > > - if (hstate_kobjs[i] == kobj) > > > + if (hstate_kobjs[i] == kobj) { > > > + if (nidp) > > > + *nidp = -1; > > > return &hstates[i]; > > > - BUG(); > > > - return NULL; > > > + } > > > + > > > + return kobj_to_node_hstate(kobj, nidp); > > > } > > > > > > static ssize_t nr_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > > > + struct hstate *h; > > > + unsigned long nr_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + nr_huge_pages = h->nr_huge_pages; > > > > Here is another magic number except it means something slightly > > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would > > be nice if these different special nid values could be named, preferably > > collapsed to being one "core" thing. > > Again, it means "NO NODE ID specified" [via per node attribute]. Again, > I'll address this with a single constant. > > > > > > + else > > > + nr_huge_pages = h->nr_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", nr_huge_pages); > > > } > > > + > > > static ssize_t nr_hugepages_store(struct kobject *kobj, > > > struct kobj_attribute *attr, const char *buf, size_t count) > > > { > > > - int err; > > > unsigned long input; > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h; > > > + int nid; > > > + int err; > > > > > > err = strict_strtoul(buf, 10, &input); > > > if (err) > > > return 0; > > > > > > - h->max_huge_pages = set_max_huge_pages(h, input); > > > > "input" is a bit meaningless. The function you are passing to calls this > > parameter "count". Can you match the naming please? Otherwise, I might > > guess that this is a "delta" which occurs elsewhere in the hugetlb code. > > I guess I can change that. It's the pre-exiting name, and 'count' was > already used. Guess I can change 'count' to 'len' and 'input' to > 'count' Makes sense. 
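With that renaming, the store function might read as below; this is a sketch against the helpers introduced in this series, not the final code:

	static ssize_t nr_hugepages_store(struct kobject *kobj,
			struct kobj_attribute *attr, const char *buf, size_t len)
	{
		unsigned long count;	/* requested persistent pool size */
		struct hstate *h;
		int nid;
		int err;

		err = strict_strtoul(buf, 10, &count);
		if (err)
			return 0;

		h = kobj_to_hstate(kobj, &nid);
		h->max_huge_pages = set_max_huge_pages(h, count, nid);

		return len;
	}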
> > > > > + h = kobj_to_hstate(kobj, &nid); > > > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > > > > > return count; > > > } > > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > + > > > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > > > } > > > + > > > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > > > struct kobj_attribute *attr, const char *buf, size_t count) > > > { > > > int err; > > > unsigned long input; > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > > > > err = strict_strtoul(buf, 10, &input); > > > if (err) > > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > > > static ssize_t free_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->free_huge_pages); > > > + struct hstate *h; > > > + unsigned long free_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + free_huge_pages = h->free_huge_pages; > > > + else > > > + free_huge_pages = h->free_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", free_huge_pages); > > > } > > > HSTATE_ATTR_RO(free_hugepages); > > > > > > static ssize_t resv_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > return sprintf(buf, "%lu\n", h->resv_huge_pages); > > > } > > > HSTATE_ATTR_RO(resv_hugepages); > > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > > > static ssize_t surplus_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > > > + struct hstate *h; > > > + unsigned long surplus_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + surplus_huge_pages = h->surplus_huge_pages; > > > + else > > > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", surplus_huge_pages); > > > } > > > HSTATE_ATTR_RO(surplus_hugepages); > > > > > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > > > .attrs = hstate_attrs, > > > }; > > > > > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > > > + struct kobject *parent, > > > + struct kobject **hstate_kobjs, > > > + struct attribute_group *hstate_attr_group) > > > { > > > int retval; > > > + int hi = h - hstates; > > > > > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > > > - hugepages_kobj); > > > - if (!hstate_kobjs[h - hstates]) > > > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > > > + if (!hstate_kobjs[hi]) > > > return -ENOMEM; > > > > > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > > > - &hstate_attr_group); > > > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > > > if (retval) > > > - kobject_put(hstate_kobjs[h - hstates]); > > > + kobject_put(hstate_kobjs[hi]); > > > > > > return 
retval; > > > } > > > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo > > > return; > > > > > > for_each_hstate(h) { > > > - err = hugetlb_sysfs_add_hstate(h); > > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > > > + hstate_kobjs, &hstate_attr_group); > > > if (err) > > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > > > h->name); > > > } > > > } > > > > > > +#ifdef CONFIG_NUMA > > > +static struct attribute *per_node_hstate_attrs[] = { > > > + &nr_hugepages_attr.attr, > > > + &free_hugepages_attr.attr, > > > + &surplus_hugepages_attr.attr, > > > + NULL, > > > +}; > > > + > > > +static struct attribute_group per_node_hstate_attr_group = { > > > + .attrs = per_node_hstate_attrs, > > > +}; > > > + > > > + > > > +void hugetlb_unregister_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + > > > + for_each_hstate(h) { > > > + kobject_put(node->hstate_kobjs[h - hstates]); > > > + node->hstate_kobjs[h - hstates] = NULL; > > > + } > > > + > > > + kobject_put(node->hugepages_kobj); > > > + node->hugepages_kobj = NULL; > > > +} > > > + > > > +static void hugetlb_unregister_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) > > > + hugetlb_unregister_node(&node_devices[nid]); > > > +} > > > + > > > +void hugetlb_register_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + int err; > > > + > > > + if (!hugepages_kobj) > > > + return; /* too early */ > > > + > > > + node->hugepages_kobj = kobject_create_and_add("hugepages", > > > + &node->sysdev.kobj); > > > + if (!node->hugepages_kobj) > > > + return; > > > + > > > + for_each_hstate(h) { > > > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > > > + node->hstate_kobjs, > > > + &per_node_hstate_attr_group); > > > + if (err) > > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > > > + " for node %d\n", > > > + h->name, node->sysdev.id); > > > + } > > > +} > > > + > > > +static void hugetlb_register_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + struct node *node = &node_devices[nid]; > > > + if (node->sysdev.id == nid && !node->hugepages_kobj) > > > + hugetlb_register_node(node); > > > + } > > > +} > > > +#endif > > > + > > > static void __exit hugetlb_exit(void) > > > { > > > struct hstate *h; > > > > > > + hugetlb_unregister_all_nodes(); > > > + > > > for_each_hstate(h) { > > > kobject_put(hstate_kobjs[h - hstates]); > > > } > > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > > > > > hugetlb_sysfs_init(); > > > > > > + hugetlb_register_all_nodes(); > > > + > > > return 0; > > > } > > > module_init(hugetlb_init); > > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > > > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > > > > > if (write) > > > - h->max_huge_pages = set_max_huge_pages(h, tmp); > > > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > > > > > return 0; > > > } > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > > @@ -21,9 +21,12 @@ > > > > > > #include > > > #include > > > +#include > > > > > > struct node { > > > struct sys_device sysdev; > > > + struct kobject *hugepages_kobj; > > > + struct 
kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > > }; > > > > > > struct memory_block; > > > > > > > I'm not against this idea and think it can work side-by-side with the memory > > policies. I believe it does need a bit more cleaning up before merging > > though. I also wasn't able to test this yet due to various build and > > deploy issues. > > OK. I'll do the cleanup. I have tested this atop the mempolicy > version by working around the build issues that I thought were just > temporary glitches in the mmotm series. In my [limited] experience, one > can interleave numactl+hugeadm with setting values via the per node > attributes and it does the right thing. No heavy testing with racing > tasks, tho'. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 3F29F6B0144 for ; Wed, 26 Aug 2009 07:16:14 -0400 (EDT) Date: Wed, 26 Aug 2009 10:58:35 +0100 From: Mel Gorman Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Message-ID: <20090826095835.GB10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> <1251233374.16229.2.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: David Rientjes , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Aug 25, 2009 at 04:49:34PM -0400, Lee Schermerhorn wrote: > > > > > > +static int hstate_next_node_to_alloc(struct hstate *h, > > > + nodemask_t *nodes_allowed) > > > { > > > int nid, next_nid; > > > > > > - nid = h->next_nid_to_alloc; > > > - next_nid = next_node_allowed(nid); > > > + if (!nodes_allowed) > > > + nodes_allowed = &node_online_map; > > > + > > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > > + > > > + next_nid = next_node_allowed(nid, nodes_allowed); > > > h->next_nid_to_alloc = next_nid; > > > + > > > return nid; > > > } > > > > Don't need next_nid. > > Well, the pre-existing comment block indicated that the use of the > apparently spurious next_nid variable is necessary to close a race. Not > sure whether that comment still applies with this rework. What do you > think? > The original intention was not to return h->next_nid_to_alloc because there is a race window where it's MAX_NUMNODES. nid is a stack-local variable here; it should not become MAX_NUMNODES by accident because this_node_allowed() and next_node_allowed() both take care not to return MAX_NUMNODES, so it is safe as a return value even in the presence of races with the code structure you currently have. I think it's safe to have nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed); return nid; because at worst, in the presence of races, h->next_nid_to_alloc gets assigned the same value twice, but never MAX_NUMNODES.
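To spell the worst case out, assume two racers A and B, h->next_nid_to_alloc == 2, and nodes 0-3 all allowed; the simplified function from the fold-in above is repeated here with the interleaving as a comment (illustration only):

	/*
	 * Benign race with next_nid removed:
	 *
	 *   A: nid = this_node_allowed(2, mask);                   -> 2
	 *   B: nid = this_node_allowed(2, mask);                   -> 2
	 *   A: h->next_nid_to_alloc = next_node_allowed(2, mask);  -> 3
	 *   B: h->next_nid_to_alloc = next_node_allowed(2, mask);  -> 3
	 *
	 * Both racers return a valid nid (here both pick node 2) and the
	 * shared field is stored twice with the same in-range value; it
	 * can never hold MAX_NUMNODES because both helpers wrap.
	 */
	static int hstate_next_node_to_alloc(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		int nid;

		if (!nodes_allowed)
			nodes_allowed = &node_online_map;

		nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
		h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

		return nid;
	}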
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 881DC6B00D4 for ; Wed, 26 Aug 2009 07:36:18 -0400 (EDT) Date: Tue, 25 Aug 2009 11:22:04 +0100 From: Mel Gorman Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Message-ID: <20090825102204.GC4427@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, Aug 24, 2009 at 03:27:52PM -0400, Lee Schermerhorn wrote: > [PATCH 3/4] hugetlb: derive huge pages nodes allowed from task mempolicy > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: > + cleaned up comments, removed some deemed unnecessary, > add some suggested by review > + removed check for !current in huge_mpol_nodes_allowed(). > + added 'current->comm' to warning message in huge_mpol_nodes_allowed(). > + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to > catch out of range node id. > + add examples to patch description > > V3: Factored this patch from V2 patch 2/3 > > V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages() > > This patch derives a "nodes_allowed" node mask from the numa > mempolicy of the task modifying the number of persistent huge > pages to control the allocation, freeing and adjusting of surplus > huge pages. This mask is derived as follows: > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > is produced. This will cause the hugetlb subsystem to use > node_online_map as the "nodes_allowed". This preserves the > behavior before this patch. > * For "preferred" mempolicy, including explicit local allocation, > a nodemask with the single preferred node will be produced. > "local" policy will NOT track any internode migrations of the > task adjusting nr_hugepages. > * For "bind" and "interleave" policy, the mempolicy's nodemask > will be used. > * Other than to inform the construction of the nodes_allowed node > mask, the actual mempolicy mode is ignored. That is, all modes > behave like interleave over the resulting nodes_allowed mask > with no "fallback". > > Notes: > > 1) This patch introduces a subtle change in behavior: huge page > allocation and freeing will be constrained by any mempolicy > that the task adjusting the huge page pool inherits from its > parent. This policy could come from a distant ancestor. The > adminstrator adjusting the huge page pool without explicitly > specifying a mempolicy via numactl might be surprised by this. > Additionaly, any mempolicy specified by numactl will be > constrained by the cpuset in which numactl is invoked. > > 2) Hugepages allocated at boot time use the node_online_map. 
> An additional patch could implement a temporary boot time > huge pages nodes_allowed command line parameter. > > 3) Using mempolicy to control persistent huge page allocation > and freeing requires no change to hugeadm when invoking > it via numactl, as shown in the examples below. However, > hugeadm could be enhanced to take the allowed nodes as an > argument and set its task mempolicy itself. This would allow > it to detect and warn about any non-default mempolicy that it > inherited from its parent, thus alleviating the issue described > in Note 1 above. > > See the updated documentation [next patch] for more information > about the implications of this patch. > > Examples: > > Starting with: > > Node 0 HugePages_Total: 0 > Node 1 HugePages_Total: 0 > Node 2 HugePages_Total: 0 > Node 3 HugePages_Total: 0 > > Default behavior [with or without this patch] balances persistent > hugepage allocation across nodes [with sufficient contiguous memory]: > > hugeadm --pool-pages-min=2048Kb:32 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 8 > Node 3 HugePages_Total: 8 > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > '--membind' because it allows multiple nodes to be specified > and it's easy to type]--we can allocate huge pages on > individual nodes or sets of nodes. So, starting from the > condition above, with 8 huge pages per node: > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The incremental 8 huge pages were restricted to node 2 by the > specified mempolicy. > > Similarly, we can use mempolicy to free persistent huge pages > from specified nodes: > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > yields: > > Node 0 HugePages_Total: 4 > Node 1 HugePages_Total: 4 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The 8 huge pages freed were balanced over nodes 0 and 1. > > Signed-off-by: Lee Schermerhorn I haven't been able to test this yet because of some build and deploy issues but I didn't spot anything wrong when eyeballing the patch. For the moment; Acked-by: Mel Gorman > > include/linux/mempolicy.h | 3 ++ > mm/hugetlb.c | 14 ++++++---- > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 73 insertions(+), 5 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > } > return zl; > } > + > +/* > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > + * > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > + * to constraing the allocation and freeing of persistent huge pages > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > + * 'bind' policy in this context. An attempt to allocate a persistent huge > + * page will never "fallback" to another node inside the buddy system > + * allocator. > + * > + * If the task's mempolicy is "default" [NULL], just return NULL for > + * default behavior. 
Otherwise, extract the policy nodemask for 'bind' > + * or 'interleave' policy or construct a nodemask for 'preferred' or > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. > + * > + * N.B., it is the caller's responsibility to free a returned nodemask. > + */ > +nodemask_t *huge_mpol_nodes_allowed(void) > +{ > + nodemask_t *nodes_allowed = NULL; > + struct mempolicy *mempolicy; > + int nid; > + > + if (!current->mempolicy) > + return NULL; > + > + mpol_get(current->mempolicy); > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); > + goto out; > + } > + nodes_clear(*nodes_allowed); > + > + mempolicy = current->mempolicy; > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + if (mempolicy->flags & MPOL_F_LOCAL) > + nid = numa_node_id(); > + else > + nid = mempolicy->v.preferred_node; > + node_set(nid, *nodes_allowed); > + break; > + > + case MPOL_BIND: > + /* Fall through */ > + case MPOL_INTERLEAVE: > + *nodes_allowed = mempolicy->v.nodes; > + break; > + > + default: > + BUG(); > + } > + > +out: > + mpol_put(current->mempolicy); > + return nodes_allowed; > +} > #endif > > /* Allocate a page in interleaved policy. > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > +extern nodemask_t *huge_mpol_nodes_allowed(void); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > return node_zonelist(0, gfp_flags); > } > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > + > static inline int do_migrate_pages(struct mm_struct *mm, > const nodemask_t *from_nodes, > const nodemask_t *to_nodes, int flags) > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > { > unsigned long min_count, ret; > + nodemask_t *nodes_allowed; > > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > > + nodes_allowed = huge_mpol_nodes_allowed(); > + > /* > * Increase the pool size > * First take pages out of surplus state. Then make up the > @@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages( > */ > spin_lock(&hugetlb_lock); > while (h->surplus_huge_pages && count > persistent_huge_pages(h)) { > - if (!adjust_pool_surplus(h, NULL, -1)) > + if (!adjust_pool_surplus(h, nodes_allowed, -1)) > break; > } > > @@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages( > * and reducing the surplus. 
> */ > spin_unlock(&hugetlb_lock); > - ret = alloc_fresh_huge_page(h, NULL); > + ret = alloc_fresh_huge_page(h, nodes_allowed); > spin_lock(&hugetlb_lock); > if (!ret) > goto out; > @@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages( > */ > min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; > min_count = max(count, min_count); > - try_to_free_low(h, min_count, NULL); > + try_to_free_low(h, min_count, nodes_allowed); > while (min_count < persistent_huge_pages(h)) { > - if (!free_pool_huge_page(h, NULL, 0)) > + if (!free_pool_huge_page(h, nodes_allowed, 0)) > break; > } > while (count < persistent_huge_pages(h)) { > - if (!adjust_pool_surplus(h, NULL, 1)) > + if (!adjust_pool_surplus(h, nodes_allowed, 1)) > break; > } > out: > ret = persistent_huge_pages(h); > spin_unlock(&hugetlb_lock); > + kfree(nodes_allowed); > return ret; > } > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id ADB8D6B00D8 for ; Wed, 26 Aug 2009 07:51:13 -0400 (EDT) Date: Tue, 25 Aug 2009 14:35:16 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090825133516.GE21335@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > @@ -21,9 +21,12 @@ > > #include <linux/sysdev.h> > #include <linux/cpumask.h> > +#include <linux/hugetlb.h> > Is this header inclusion necessary? It does not appear to be required by the structure modification (which is iffy in itself as discussed in the earlier mail) and it breaks build on x86-64.
CC arch/x86/kernel/setup_percpu.o In file included from include/linux/pagemap.h:10, from include/linux/mempolicy.h:62, from include/linux/hugetlb.h:8, from include/linux/node.h:24, from include/linux/cpu.h:23, from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, from arch/x86/kernel/setup_percpu.c:19: include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 make[1]: *** [arch/x86/kernel] Error 2 > struct node { > struct sys_device sysdev; > + struct kobject *hugepages_kobj; > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > }; > > struct memory_block; > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 1A1416B004D for ; Wed, 26 Aug 2009 14:02:29 -0400 (EDT) Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes From: Lee Schermerhorn In-Reply-To: <20090826101122.GD10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> Content-Type: text/plain Date: Wed, 26 Aug 2009 14:02:27 -0400 Message-Id: <1251309747.4409.45.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-08-26 at 11:11 +0100, Mel Gorman wrote: > On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote: > > > > > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > > > +{ > > > > > > This name is a bit weird. It's creating a nodemask with just a single > > > node allowed. > > > > > > Is there something wrong with using the existing function > > > nodemask_of_node()? If stack is the problem, perhaps there is some macro > > > magic that would allow a nodemask to be either declared on the stack or > > > kmalloc'd. > > > > Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a > > block nested inside the context where it's invoked. I would be > > declaring the nodemask in the compound else clause and don't want to > > access it [via the nodes_allowed pointer] from outside of there. > > > > So, the existence of the mask on the stack is the problem. I can > understand that, they are potentially quite large.
> > Would it be possible to add a helper alongside it like > init_nodemask_of_node() that does the same work as nodemask_of_node() > but takes a nodemask parameter? nodemask_of_node() would reuse the > init_nodemask_of_node() except it declares the nodemask on the stack. > Here's the patch that introduces the helper function that I propose. I'll send an update of the subject patch that uses this macro and, I think, addresses your other issues via a separate message. This patch applies just before the "register per node attributes" patch. Once we can agree on these [or subsequent] changes, I'll repost the entire updated series. Lee --- PATCH 4/6 - hugetlb: introduce alloc_nodemask_of_node() Against: 2.6.31-rc6-mmotm-090820-1918 Introduce nodemask macro to allocate a nodemask and initialize it to contain a single node, using existing nodemask_of_node() macro. Coded as a macro to avoid header dependency hell. This will be used to construct the huge pages "nodes_allowed" nodemask for a single node when a persistent huge page pool page count is modified via a per node sysfs attribute. Signed-off-by: Lee Schermerhorn include/linux/nodemask.h | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-24 10:16:56.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-26 12:38:31.000000000 -0400 @@ -257,6 +257,16 @@ static inline int __next_node(int n, con m; \ }) +#define alloc_nodemask_of_node(node) \ +({ \ + typeof(_unused_nodemask_arg_) *nmp; \ + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ + if (nmp) \ + *nmp = nodemask_of_node(node); \ + nmp; \ +}) + + #define first_unset_node(mask) __first_unset_node(&(mask)) static inline int __first_unset_node(const nodemask_t *maskp) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Wed, 26 Aug 2009 14:04:03 -0400 Message-ID: <1251309843.4409.48.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090826101122.GD10955@csn.ul.ie> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com Proposed revised patch attached. Some comments in-line... On Wed, 2009-08-26 at 11:11 +0100, Mel Gorman wrote: > On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote: > > > > > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > > > +{ > > > > > > This name is a bit weird. It's creating a nodemask with just a single > > > node allowed. > > > > > > Is there something wrong with using the existing function > > > nodemask_of_node()?
If stack is the problem, perhaps there is some macro > > > magic that would allow a nodemask to be either declared on the stack or > > > kmalloc'd. > > > > Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a > > block nested inside the context where it's invoked. I would be > > declaring the nodemask in the compound else clause and don't want to > > access it [via the nodes_allowed pointer] from outside of there. > > > > So, the existence of the mask on the stack is the problem. I can > understand that, they are potentially quite large. > > Would it be possible to add a helper alongside it like > init_nodemask_of_node() that does the same work as nodemask_of_node() > but takes a nodemask parameter? nodemask_of_node() would reuse the > init_nodemask_of_node() except it declares the nodemask on the stack. Now use "alloc_nodemask_of_node()" to alloc/init a nodemask with a single node. > > > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > > > > + int nid) > > > > { > > > > unsigned long min_count, ret; > > > > nodemask_t *nodes_allowed; > > > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > > > > if (h->order >= MAX_ORDER) > > > > return h->max_huge_pages; > > > > > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > > > + if (nid < 0) > > > > + nodes_allowed = huge_mpol_nodes_allowed(); > > > > > > hugetlb is a bit littered with magic numbers being passed into functions. > > > Attempts have been made to clear them up as patches change > > > that area. Would it be possible to define something like > > > > > > #define HUGETLB_OBEY_MEMPOLICY -1 > > > > > > for the nid here as opposed to passing in -1? I know -1 is used in the page > > > allocator functions but there it means "current node" and here it means > > > "obey mempolicies". > > > > Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a > > per node attribute". It means "derive nodes allowed from memory policy, > > if non-default, else use nodes_online_map" [which is not exactly the > > same as obeying memory policy]. > > > > But, I can see defining a symbolic constant such as > > NO_NODE[_ID_SPECIFIED]. I'll try next spin. > > > > That NO_NODE_ID_SPECIFIED was the underlying definition I was looking > for. It makes sense at both sites. Done. > > > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > > > +{ > > > > + int nid; > > > > + > > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > > + struct node *node = &node_devices[nid]; > > > > + int hi; > > > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) > > > > > > Does that hi mean hello, high, nid or hstate_idx? > > > > > > hstate_idx would appear to be the appropriate name here. > > > > Or just plain 'i', like in the following, pre-existing function? > > > > Whichever suits you best. If hstate_idx is really what it is, I see no > harm in using it but 'i' is an index and I'd sooner recognise that than > the less meaningful "hi". Changed to 'i'. > > > > > > Ok.... so, there is a struct node array for the sysdev and this patch adds > > > references to the "hugepages" directory kobject and the subdirectories for > > > each page size. We walk all the objects until we find a match.
Obviously, > > > this adds a dependency of base node support on hugetlbfs which feels backwards > > > and you call that out in your leader. > > > > > > Can this be the other way around? i.e. The struct hstate has an array of > > > kobjects arranged by nid that is filled in when the node is registered? > > > There will only be one kobject-per-pagesize-per-node so it seems like it > > > would work. I confess, I haven't prototyped this to be 100% sure. > > > > This will take a bit longer to sort out. I do want to change the > > registration, tho', so that hugetlb.c registers its single node > > register/unregister functions with base/node.c to remove the source > > level dependency in that direction. node.c will only register nodes on > > hot plug as it's initialized too early, relative to hugetlb.c to > > register them at init time. This should break the call dependency of > > base/node.c on the hugetlb module. > > > > As far as moving the per node attributes' kobjects to the hugetlb global > > hstate arrays... Have to think about that. I agree that it would be > > nice to remove the source level [header] dependency. > > > > FWIW, I see no problem with the mempolicy stuff going ahead separately from > this patch after the few relatively minor cleanups highlighted in the thread > and tackling this patch as a separate cycle. It's up to you really. I took a look at it and propose the attached rework. I moved all of the per node per hstate kobj pointers to hugetlb.c. hugetlb.c now registers its single node register/unregister functions with base/node.c to support hot-plug. If hugetlbfs never registers with node.c, it will never try to register. This patch applies atop the "introduce alloc_nodemask_of_node()" patch I sent earlier. Let me know what you think. > > > > > > > > static ssize_t nr_hugepages_show(struct kobject *kobj, > > > > struct kobj_attribute *attr, char *buf) > > > > { > > > > - struct hstate *h = kobj_to_hstate(kobj); > > > > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > > > > + struct hstate *h; > > > > + unsigned long nr_huge_pages; > > > > + int nid; > > > > + > > > > + h = kobj_to_hstate(kobj, &nid); > > > > + if (nid < 0) > > > > + nr_huge_pages = h->nr_huge_pages; > > > > > > Here is another magic number except it means something slightly > > > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would > > > be nice if these different special nid values could be named, preferably > > > collapsed to being one "core" thing. > > > > Again, it means "NO NODE ID specified" [via per node attribute]. Again, > > I'll address this with a single constant. Fixed. > > > > > > > > > + else > > > > + nr_huge_pages = h->nr_huge_pages_node[nid]; > > > > + > > > > + return sprintf(buf, "%lu\n", nr_huge_pages); > > > > } > > > > + > > > > static ssize_t nr_hugepages_store(struct kobject *kobj, > > > > struct kobj_attribute *attr, const char *buf, size_t count) > > > > { > > > > - int err; > > > > unsigned long input; > > > > - struct hstate *h = kobj_to_hstate(kobj); > > > > + struct hstate *h; > > > > + int nid; > > > > + int err; > > > > > > > > err = strict_strtoul(buf, 10, &input); > > > > if (err) > > > > return 0; > > > > > > > > - h->max_huge_pages = set_max_huge_pages(h, input); > > > > > > "input" is a bit meaningless. The function you are passing to calls this > > > parameter "count". Can you match the naming please? Otherwise, I might > > > guess that this is a "delta" which occurs elsewhere in the hugetlb code. > > > > I guess I can change that.
It's the pre-existing name, and 'count' was > > already used. Guess I can change 'count' to 'len' and 'input' to > > 'count' > > Makes sense. fixed. > > > > > > > > + h = kobj_to_hstate(kobj, &nid); > > > > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > > > > > > > return count; > > > > } > > > I'm not against this idea and think it can work side-by-side with the memory > > > policies. I believe it does need a bit more cleaning up before merging > > > though. I also wasn't able to test this yet due to various build and > > > deploy issues. > > > > OK. I'll do the cleanup. I have tested this atop the mempolicy > > version by working around the build issues that I thought were just > > temporary glitches in the mmotm series. In my [limited] experience, one > > can interleave numactl+hugeadm with setting values via the per node > > attributes and it does the right thing. No heavy testing with racing > > tasks, tho'. > > This revised patch also removes the include of hugetlb.h from node.h. Lee --- PATCH 5/6 hugetlb: register per node hugepages attributes Against: 2.6.31-rc6-mmotm-090820-1918 V2: remove dependency on kobject private bitfield. Search global hstates then all per node hstates for kobject match in attribute show/store functions. V3: rebase atop the mempolicy-based hugepage alloc/free; use custom "nodes_allowed" to restrict alloc/free to a specific node via per node attributes. Per node attribute overrides mempolicy. I.e., mempolicy only applies to global attributes. V4: Fix issues raised by Mel Gorman: + add !NUMA versions of hugetlb_[un]register_node() + rename 'hi' to 'i' in kobj_to_node_hstate() + rename (count, input) to (len, count) in nr_hugepages_store() + moved per node hugepages_kobj and hstate_kobjs[] from the struct node [sysdev] to hugetlb.c private arrays. + changed registration mechanism so that hugetlbfs [a module] registers its attributes registration callbacks with the node driver, eliminating the dependency between the node driver and hugetlbfs. From its init func, hugetlbfs will register all on-line nodes' hugepage sysfs attributes along with hugetlbfs' attributes register/unregister functions. The node driver will use these functions to [un]register nodes with hugetlbfs on node hot-plug. + replaced hugetlb.c private "nodes_allowed_from_node()" with generic "alloc_nodemask_of_node()". This patch adds the per huge page size control/query attributes to the per node sysdevs: /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/ nr_hugepages - r/w free_huge_pages - r/o surplus_huge_pages - r/o The patch attempts to re-use/share as much of the existing global hstate attribute initialization and handling, and the "nodes_allowed" constraint processing as possible. Calling set_max_huge_pages() with no node indicates a change to global hstate parameters. In this case, any non-default task mempolicy will be used to generate the nodes_allowed mask. A valid node id indicates an update to that node's hstate parameters, and the count argument specifies the target count for the specified node. From this info, we compute the target global count for the hstate and construct a nodes_allowed node mask containing only the specified node. Setting the node specific nr_hugepages via the per node attribute effectively ignores any task mempolicy or cpuset constraints.
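To make the count conversion concrete, it distills to the following illustrative fragment (the helper name is made up for exposition; only the one-line computation itself appears in the patch below):

    /*
     * Convert a per node target (node_target pages on one node) into the
     * global target that set_max_huge_pages() operates on: every other
     * node's pages are kept, and the node's own share is replaced.
     */
    static unsigned long global_target(unsigned long global_total,
                                       unsigned long node_total,
                                       unsigned long node_target)
    {
        return node_target + global_total - node_total;
    }

For example, starting from 8 pages on each of 4 nodes, writing 16 to node 2's nr_hugepages gives global_target(32, 8, 16) == 40, and the eight additional allocations are confined to node 2 by the single-node nodes_allowed mask.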
With this patch: (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB ./ ../ free_hugepages nr_hugepages surplus_hugepages Starting from: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 0 Node 2 HugePages_Free: 0 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 0 Allocate 16 persistent huge pages on node 2: (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages [Note that this is equivalent to: numactl -m 2 hugeadm --pool-pages-min 2M:+16 ] Yields: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 16 Node 2 HugePages_Free: 16 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 16 Global controls work as expected--reduce pool to 8 persistent huge pages: (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 8 Node 2 HugePages_Free: 8 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 Signed-off-by: Lee Schermerhorn drivers/base/node.c | 27 +++++ include/linux/node.h | 6 + include/linux/numa.h | 2 mm/hugetlb.c | 245 ++++++++++++++++++++++++++++++++++++++++++++------- 4 files changed, 250 insertions(+), 30 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-26 12:37:03.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-26 13:01:54.000000000 -0400 @@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct } static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL); +/* + * hugetlbfs per node attributes registration interface + */ +NODE_REGISTRATION_FUNC __hugetlb_register_node; +NODE_REGISTRATION_FUNC __hugetlb_unregister_node; + +static inline void hugetlb_register_node(struct node *node) +{ + if (__hugetlb_register_node) + __hugetlb_register_node(node); +} + +static inline void hugetlb_unregister_node(struct node *node) +{ + if (__hugetlb_unregister_node) + __hugetlb_unregister_node(node); +} + +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, + NODE_REGISTRATION_FUNC unregister) +{ + __hugetlb_register_node = doregister; + __hugetlb_unregister_node = unregister; +} + /* * register_node - Setup a sysfs device for a node.
@@ -200,6 +225,7 @@ int register_node(struct node *node, int sysdev_create_file(&node->sysdev, &attr_distance); scan_unevictable_register_node(node); + hugetlb_register_node(node); } return error; } @@ -220,6 +246,7 @@ void unregister_node(struct node *node) sysdev_remove_file(&node->sysdev, &attr_distance); scan_unevictable_unregister_node(node); + hugetlb_unregister_node(node); sysdev_unregister(&node->sysdev); } Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-26 12:37:04.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-26 13:01:54.000000000 -0400 @@ -24,6 +24,7 @@ #include <asm/io.h> #include <linux/hugetlb.h> +#include <linux/node.h> #include "internal.h" const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs } #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, + int nid) { unsigned long min_count, ret; nodemask_t *nodes_allowed; @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages( if (h->order >= MAX_ORDER) return h->max_huge_pages; - nodes_allowed = huge_mpol_nodes_allowed(); + if (nid == NO_NODEID_SPECIFIED) + nodes_allowed = huge_mpol_nodes_allowed(); + else { + /* + * incoming 'count' is for node 'nid' only, so + * adjust count to global, but restrict alloc/free + * to the specified node. + */ + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; + nodes_allowed = alloc_nodemask_of_node(nid); + if (!nodes_allowed) + printk(KERN_WARNING "%s unable to allocate allowed " + "nodes mask for huge page allocation/free. 
" + "Falling back to default.\n", current->comm); + } /* * Increase the pool size @@ -1329,51 +1345,71 @@ out: static struct kobject *hugepages_kobj; static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; -static struct hstate *kobj_to_hstate(struct kobject *kobj) +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp); + +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) { int i; + for (i = 0; i < HUGE_MAX_HSTATE; i++) - if (hstate_kobjs[i] == kobj) + if (hstate_kobjs[i] == kobj) { + if (nidp) + *nidp = NO_NODEID_SPECIFIED; return &hstates[i]; - BUG(); - return NULL; + } + + return kobj_to_node_hstate(kobj, nidp); } static ssize_t nr_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->nr_huge_pages); + struct hstate *h; + unsigned long nr_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid == NO_NODEID_SPECIFIED) + nr_huge_pages = h->nr_huge_pages; + else + nr_huge_pages = h->nr_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", nr_huge_pages); } + static ssize_t nr_hugepages_store(struct kobject *kobj, - struct kobj_attribute *attr, const char *buf, size_t count) + struct kobj_attribute *attr, const char *buf, size_t len) { + unsigned long count; + struct hstate *h; + int nid; int err; - unsigned long input; - struct hstate *h = kobj_to_hstate(kobj); - err = strict_strtoul(buf, 10, &input); + err = strict_strtoul(buf, 10, &count); if (err) return 0; - h->max_huge_pages = set_max_huge_pages(h, input); + h = kobj_to_hstate(kobj, &nid); + h->max_huge_pages = set_max_huge_pages(h, count, nid); - return count; + return len; } HSTATE_ATTR(nr_hugepages); static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); + return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); } + static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long input; - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); err = strict_strtoul(buf, 10, &input); if (err) @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); static ssize_t free_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->free_huge_pages); + struct hstate *h; + unsigned long free_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid == NO_NODEID_SPECIFIED) + free_huge_pages = h->free_huge_pages; + else + free_huge_pages = h->free_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", free_huge_pages); } HSTATE_ATTR_RO(free_hugepages); static ssize_t resv_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); return sprintf(buf, "%lu\n", h->resv_huge_pages); } HSTATE_ATTR_RO(resv_hugepages); @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages); static ssize_t surplus_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->surplus_huge_pages); + struct hstate *h; + unsigned long surplus_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid == NO_NODEID_SPECIFIED) + 
surplus_huge_pages = h->surplus_huge_pages; + else + surplus_huge_pages = h->surplus_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", surplus_huge_pages); } HSTATE_ATTR_RO(surplus_hugepages); @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att .attrs = hstate_attrs, }; -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, + struct kobject *parent, + struct kobject **hstate_kobjs, + struct attribute_group *hstate_attr_group) { int retval; + int hi = h - hstates; - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, - hugepages_kobj); - if (!hstate_kobjs[h - hstates]) + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); + if (!hstate_kobjs[hi]) return -ENOMEM; - retval = sysfs_create_group(hstate_kobjs[h - hstates], - &hstate_attr_group); + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); if (retval) - kobject_put(hstate_kobjs[h - hstates]); + kobject_put(hstate_kobjs[hi]); return retval; } @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo return; for_each_hstate(h) { - err = hugetlb_sysfs_add_hstate(h); + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, + hstate_kobjs, &hstate_attr_group); if (err) printk(KERN_ERR "Hugetlb: Unable to add hstate %s", h->name); } } +#ifdef CONFIG_NUMA + +struct node_hstate { + struct kobject *hugepages_kobj; + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; +}; +struct node_hstate node_hstates[MAX_NUMNODES]; + +static struct attribute *per_node_hstate_attrs[] = { + &nr_hugepages_attr.attr, + &free_hugepages_attr.attr, + &surplus_hugepages_attr.attr, + NULL, +}; + +static struct attribute_group per_node_hstate_attr_group = { + .attrs = per_node_hstate_attrs, +}; + +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) { + struct node_hstate *nhs = &node_hstates[nid]; + int i; + for (i = 0; i < HUGE_MAX_HSTATE; i++) + if (nhs->hstate_kobjs[i] == kobj) { + if (nidp) + *nidp = nid; + return &hstates[i]; + } + } + + BUG(); + return NULL; +} + +void hugetlb_unregister_node(struct node *node) +{ + struct hstate *h; + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; + + if (!nhs->hugepages_kobj) + return; + + for_each_hstate(h) + if (nhs->hstate_kobjs[h - hstates]) { + kobject_put(nhs->hstate_kobjs[h - hstates]); + nhs->hstate_kobjs[h - hstates] = NULL; + } + + kobject_put(nhs->hugepages_kobj); + nhs->hugepages_kobj = NULL; +} + +static void hugetlb_unregister_all_nodes(void) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) + hugetlb_unregister_node(&node_devices[nid]); + + register_hugetlbfs_with_node(NULL, NULL); +} + +void hugetlb_register_node(struct node *node) +{ + struct hstate *h; + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; + int err; + + if (nhs->hugepages_kobj) + return; /* already allocated */ + + nhs->hugepages_kobj = kobject_create_and_add("hugepages", + &node->sysdev.kobj); + if (!nhs->hugepages_kobj) + return; + + for_each_hstate(h) { + err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj, + nhs->hstate_kobjs, + &per_node_hstate_attr_group); + if (err) { + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" + " for node %d\n", + h->name, node->sysdev.id); + hugetlb_unregister_node(node); + break; + } + } +} + +static void hugetlb_register_all_nodes(void) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) { + struct node *node = &node_devices[nid]; + if (node->sysdev.id == nid) + 
hugetlb_register_node(node); + } + + register_hugetlbfs_with_node(hugetlb_register_node, + hugetlb_unregister_node); +} +#else /* !CONFIG_NUMA */ + +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) +{ + BUG(); + if (nidp) + *nidp = -1; + return NULL; +} + +static void hugetlb_unregister_all_nodes(void) { } + +static void hugetlb_register_all_nodes(void) { } + +#endif + static void __exit hugetlb_exit(void) { struct hstate *h; + hugetlb_unregister_all_nodes(); + for_each_hstate(h) { kobject_put(hstate_kobjs[h - hstates]); } @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void) hugetlb_sysfs_init(); + hugetlb_register_all_nodes(); + return 0; } module_init(hugetlb_init); @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta proc_doulongvec_minmax(table, write, file, buffer, length, ppos); if (write) - h->max_huge_pages = set_max_huge_pages(h, tmp); + h->max_huge_pages = set_max_huge_pages(h, tmp, + NO_NODEID_SPECIFIED); return 0; } Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/numa.h 2009-08-26 12:37:03.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h 2009-08-26 12:58:54.000000000 -0400 @@ -10,4 +10,6 @@ #define MAX_NUMNODES (1 << NODES_SHIFT) +#define NO_NODEID_SPECIFIED (-1) + #endif /* _LINUX_NUMA_H */ Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-26 12:37:03.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-26 12:40:19.000000000 -0400 @@ -28,6 +28,7 @@ struct node { struct memory_block; extern struct node node_devices[]; +typedef void (*NODE_REGISTRATION_FUNC)(struct node *); extern int register_node(struct node *, int, struct node *); extern void unregister_node(struct node *node); @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns extern int register_mem_sect_under_node(struct memory_block *mem_blk, int nid); extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk); +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, + NODE_REGISTRATION_FUNC unregister); #else static inline int register_one_node(int nid) { @@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un { return 0; } + +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, + NODE_REGISTRATION_FUNC unregister) { } #endif #define to_node(sys_device) container_of(sys_device, struct node, sysdev) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 47BF46B0055 for ; Wed, 26 Aug 2009 15:48:06 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id n7QJm1qq017688 for ; Wed, 26 Aug 2009 12:48:01 -0700 Received: from pxi8 (pxi8.prod.google.com [10.243.27.8]) by wpaz5.hot.corp.google.com with ESMTP id n7QJkNm6023889 for ; Wed, 26 Aug 2009 12:47:59 -0700 Received: by pxi8 with SMTP id 8so463564pxi.9 for ; Wed, 26 Aug 2009 12:47:58 -0700 (PDT) Date: Wed, 26 Aug 2009 12:47:57 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes In-Reply-To: <1251309747.4409.45.camel@useless.americas.hpqcorp.net> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > Against: 2.6.31-rc6-mmotm-090820-1918 > > Introduce nodemask macro to allocate a nodemask and > initialize it to contain a single node, using existing > nodemask_of_node() macro. Coded as a macro to avoid header > dependency hell. > > This will be used to construct the huge pages "nodes_allowed" > nodemask for a single node when a persistent huge page > pool page count is modified via a per node sysfs attribute. > > Signed-off-by: Lee Schermerhorn > > include/linux/nodemask.h | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-24 10:16:56.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-26 12:38:31.000000000 -0400 > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con > m; \ > }) > > +#define alloc_nodemask_of_node(node) \ > +({ \ > + typeof(_unused_nodemask_arg_) *nmp; \ > + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ > + if (nmp) \ > + *nmp = nodemask_of_node(node); \ > + nmp; \ > +}) > + > + > #define first_unset_node(mask) __first_unset_node(&(mask)) > static inline int __first_unset_node(const nodemask_t *maskp) > { I think it would probably be better to use the generic NODEMASK_ALLOC() interface by requiring it to pass the entire type (including "struct") as part of the first parameter. Then it automatically takes care of dynamically allocating large nodemasks vs. allocating them on the stack. 
Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case to be this: #define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL); and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct nodemask_scratch, x), and then doing this in your code: NODEMASK_ALLOC(nodemask_t, nodes_allowed); if (nodes_allowed) *nodes_allowed = nodemask_of_node(node); The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can probably be made more general to handle cases like this. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id C40456B004F for ; Wed, 26 Aug 2009 16:46:40 -0400 (EDT) Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes From: Lee Schermerhorn In-Reply-To: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> Content-Type: text/plain Date: Wed, 26 Aug 2009 16:46:43 -0400 Message-Id: <1251319603.4409.92.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Mel Gorman , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote: > On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > Introduce nodemask macro to allocate a nodemask and > > initialize it to contain a single node, using existing > > nodemask_of_node() macro. Coded as a macro to avoid header > > dependency hell. > > > > This will be used to construct the huge pages "nodes_allowed" > > nodemask for a single node when a persistent huge page > > pool page count is modified via a per node sysfs attribute. > > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/nodemask.h | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-24 10:16:56.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-26 12:38:31.000000000 -0400 > > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con > > m; \ > > }) > > > > +#define alloc_nodemask_of_node(node) \ > > +({ \ > > + typeof(_unused_nodemask_arg_) *nmp; \ > > + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ > > + if (nmp) \ > > + *nmp = nodemask_of_node(node); \ > > + nmp; \ > > +}) > > + > > + > > #define first_unset_node(mask) __first_unset_node(&(mask)) > > static inline int __first_unset_node(const nodemask_t *maskp) > > { > > I think it would probably be better to use the generic NODEMASK_ALLOC() > interface by requiring it to pass the entire type (including "struct") as > part of the first parameter. Then it automatically takes care of > dynamically allocating large nodemasks vs. allocating them on the stack. 
> > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case > > to be this: > > > > #define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL); > > > > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct > > nodemask_scratch, x), and then doing this in your code: > > > > NODEMASK_ALLOC(nodemask_t, nodes_allowed); > > if (nodes_allowed) > > *nodes_allowed = nodemask_of_node(node); > > > > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can > > probably be made more general to handle cases like this. I just don't know what that would accomplish. Heck, I'm not all that happy with the alloc_nodemask_from_node() because it's allocating both a hidden nodemask_t and a pointer thereto on the stack just to return a pointer to a kmalloc()ed nodemask_t--which is what I want/need here. One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al] is that it declares the pointer variable as well as initializing it, perhaps with kmalloc(), ... Indeed, its purpose is to replace on stack nodemask declarations. So, to use it at the start of, e.g., set_max_huge_pages() where I can safely use it throughout the function, I'll end up allocating the nodes_allowed mask on every call, whether or not a node is specified or there is a non-default mempolicy. If it turns out that no node was specified and we have default policy, we need to free the mask and NULL out nodes_allowed up front so that we get default behavior. That seems uglier to me than only allocating the nodemask when we know we need one. I'm not opposed to using a generic function/macro where one exists that suits my purposes. I just don't see one. I tried to create one--alloc_nodemask_from_node(), and to keep Mel happy, I tried to reuse nodemask_of_node() to initialize it. I'm really not happy with the results--because of those extra, hidden stack variables. I could eliminate those by creating an out-of-line function, but there's no good place to put a generic nodemask function--no nodemask.c. I'm leaning towards going back to my original hugetlb-private "nodes_allowed_from_node()" or such. I can use nodemask_of_node to initialize it, if that will make Mel happy, but trying to force fit an existing "generic" function just because it's generic seems pointless. So, I'm going to let this series rest until I hear back from you and Mel on how to proceed with this. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 74CF96B004F for ; Thu, 27 Aug 2009 05:52:03 -0400 (EDT) Date: Thu, 27 Aug 2009 10:52:10 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090827095210.GB21183@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> <1251319603.4409.92.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1251319603.4409.92.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: David Rientjes , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, Aug 26, 2009 at 04:46:43PM -0400, Lee Schermerhorn wrote: > On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote: > > On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > > > Introduce nodemask macro to allocate a nodemask and > > > initialize it to contain a single node, using existing > > > nodemask_of_node() macro. Coded as a macro to avoid header > > > dependency hell. > > > > > > This will be used to construct the huge pages "nodes_allowed" > > > nodemask for a single node when a persistent huge page > > > pool page count is modified via a per node sysfs attribute. > > > > > > Signed-off-by: Lee Schermerhorn > > > > > > include/linux/nodemask.h | 10 ++++++++++ > > > 1 file changed, 10 insertions(+) > > > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-24 10:16:56.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-26 12:38:31.000000000 -0400 > > > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con > > > m; \ > > > }) > > > > > > +#define alloc_nodemask_of_node(node) \ > > > +({ \ > > > + typeof(_unused_nodemask_arg_) *nmp; \ > > > + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ > > > + if (nmp) \ > > > + *nmp = nodemask_of_node(node); \ > > > + nmp; \ > > > +}) > > > + > > > + > > > #define first_unset_node(mask) __first_unset_node(&(mask)) > > > static inline int __first_unset_node(const nodemask_t *maskp) > > > { > > > > I think it would probably be better to use the generic NODEMASK_ALLOC() > > interface by requiring it to pass the entire type (including "struct") as > > part of the first parameter. Then it automatically takes care of > > dynamically allocating large nodemasks vs. allocating them on the stack. 
> > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case > > to be this: > > > > #define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL); > > > > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct > > nodemask_scratch, x), and then doing this in your code: > > > > NODEMASK_ALLOC(nodemask_t, nodes_allowed); > > if (nodes_allowed) > > *nodes_allowed = nodemask_of_node(node); > > > > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can > > probably be made more general to handle cases like this. > > I just don't know what that would accomplish. Heck, I'm not all that > happy with the alloc_nodemask_from_node() because it's allocating both a > hidden nodemask_t and a pointer thereto on the stack just to return a > pointer to a kmalloc()ed nodemask_t--which is what I want/need here. > > One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al] > is that it declares the pointer variable as well as initializing it, > perhaps with kmalloc(), ... Indeed, its purpose is to replace on > stack nodemask declarations. > > So, to use it at the start of, e.g., set_max_huge_pages() where I can > safely use it throughout the function, I'll end up allocating the > nodes_allowed mask on every call, whether or not a node is specified or > there is a non-default mempolicy. If it turns out that no node was > specified and we have default policy, we need to free the mask and NULL > out nodes_allowed up front so that we get default behavior. That seems > uglier to me than only allocating the nodemask when we know we need one. > > I'm not opposed to using a generic function/macro where one exists that > suits my purposes. I just don't see one. I tried to create > one--alloc_nodemask_from_node(), and to keep Mel happy, I tried to reuse > nodemask_of_node() to initialize it. I'm really not happy with the > results--because of those extra, hidden stack variables. I could > eliminate those by creating an out-of-line function, but there's no good > place to put a generic nodemask function--no nodemask.c. > Ok. When I brought the subject up, it looked like you were creating a hugetlbfs-specific helper that looked like it would have generic helpers. While that is still the case, it's looking like generic helpers make things worse and hide side-effects in helper functions that might cause greater difficulty in the future. I'm happier to go with the existing code than I was before so consider my objection dropped. > I'm leaning towards going back to my original hugetlb-private > "nodes_allowed_from_node()" or such. I can use nodemask_of_node to > initialize it, if that will make Mel happy, but trying to force fit an > existing "generic" function just because it's generic seems pointless. > > So, I'm going to let this series rest until I hear back from you and Mel > on how to proceed with this. > I hate to do it to you, but at this point, I'm leaning towards your current approach. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Thu, 27 Aug 2009 11:23:39 +0100 Message-ID: <20090827102338.GC21183@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309843.4409.48.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1251309843.4409.48.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Wed, Aug 26, 2009 at 02:04:03PM -0400, Lee Schermerhorn wrote: > > This revised patch also removes the include of hugetlb.h from node.h. > > Lee > > --- > > PATCH 5/6 hugetlb: register per node hugepages attributes > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: remove dependency on kobject private bitfield. Search > global hstates then all per node hstates for kobject > match in attribute show/store functions. > > V3: rebase atop the mempolicy-based hugepage alloc/free; > use custom "nodes_allowed" to restrict alloc/free to > a specific node via per node attributes. Per node > attribute overrides mempolicy. I.e., mempolicy only > applies to global attributes. > > V4: Fix issues raised by Mel Gorman: > + add !NUMA versions of hugetlb_[un]register_node() > + rename 'hi' to 'i' in kobj_to_node_hstate() > + rename (count, input) to (len, count) in nr_hugepages_store() > + moved per node hugepages_kobj and hstate_kobjs[] from the > struct node [sysdev] to hugetlb.c private arrays. > + changed registration mechanism so that hugetlbfs [a module] > registers its attributes registration callbacks with the node > driver, eliminating the dependency between the node driver > and hugetlbfs. From its init func, hugetlbfs will register > all on-line nodes' hugepage sysfs attributes along with > hugetlbfs' attributes register/unregister functions. The > node driver will use these functions to [un]register nodes > with hugetlbfs on node hot-plug. > + replaced hugetlb.c private "nodes_allowed_from_node()" with > generic "alloc_nodemask_of_node()". > > This patch adds the per huge page size control/query attributes > to the per node sysdevs: > > /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/ > nr_hugepages - r/w > free_huge_pages - r/o > surplus_huge_pages - r/o > > The patch attempts to re-use/share as much of the existing > global hstate attribute initialization and handling, and the > "nodes_allowed" constraint processing as possible. > Calling set_max_huge_pages() with no node indicates a change to > global hstate parameters. In this case, any non-default task > mempolicy will be used to generate the nodes_allowed mask. A > valid node id indicates an update to that node's hstate > parameters, and the count argument specifies the target count > for the specified node. From this info, we compute the target > global count for the hstate and construct a nodes_allowed node > mask containing only the specified node.
> > Setting the node specific nr_hugepages via the per node attribute > effectively ignores any task mempolicy or cpuset constraints. > > With this patch: > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB > ./ ../ free_hugepages nr_hugepages surplus_hugepages > > Starting from: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 0 > Node 2 HugePages_Free: 0 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 0 > > Allocate 16 persistent huge pages on node 2: > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages > > [Note that this is equivalent to: > numactl -m 2 hugeadmin --pool-pages-min 2M:+16 > ] > > Yields: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 16 > Node 2 HugePages_Free: 16 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 16 > > Global controls work as expected--reduce pool to 8 persistent huge pages: > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 8 > Node 2 HugePages_Free: 8 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > > Signed-off-by: Lee Schermerhorn > > drivers/base/node.c | 27 +++++ > include/linux/node.h | 6 + > include/linux/numa.h | 2 > mm/hugetlb.c | 245 ++++++++++++++++++++++++++++++++++++++++++++------- > 4 files changed, 250 insertions(+), 30 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-26 12:37:03.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-26 13:01:54.000000000 -0400 > @@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct > } > static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL); > > +/* > + * hugetlbfs per node attributes registration interface > + */ > +NODE_REGISTRATION_FUNC __hugetlb_register_node; > +NODE_REGISTRATION_FUNC __hugetlb_unregister_node; > + > +static inline void hugetlb_register_node(struct node *node) > +{ > + if (__hugetlb_register_node) > + __hugetlb_register_node(node); > +} > + > +static inline void hugetlb_unregister_node(struct node *node) > +{ > + if (__hugetlb_unregister_node) > + __hugetlb_unregister_node(node); > +} > + > +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, > + NODE_REGISTRATION_FUNC unregister) > +{ > + __hugetlb_register_node = doregister; > + __hugetlb_unregister_node = unregister; > +} > + > I think I get this. Basically, you want to avoid the functions being called too early before sysfs is initialised and still work with hotplug later. So early in boot, no registeration happens. sysfs and hugetlbfs get initialised and at that point, these hooks become active, all nodes registered and hotplug later continues to work. Is that accurate? Can it get a comment? 
> /* > * register_node - Setup a sysfs device for a node. > @@ -200,6 +225,7 @@ int register_node(struct node *node, int > sysdev_create_file(&node->sysdev, &attr_distance); > > scan_unevictable_register_node(node); > + hugetlb_register_node(node); > } > return error; > } > @@ -220,6 +246,7 @@ void unregister_node(struct node *node) > sysdev_remove_file(&node->sysdev, &attr_distance); > > scan_unevictable_unregister_node(node); > + hugetlb_unregister_node(node); > > sysdev_unregister(&node->sysdev); > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-26 12:37:04.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-26 13:01:54.000000000 -0400 > @@ -24,6 +24,7 @@ > #include > > #include > +#include > #include "internal.h" > > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; > @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs > } > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > + int nid) > { > unsigned long min_count, ret; > nodemask_t *nodes_allowed; > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages( > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > > - nodes_allowed = huge_mpol_nodes_allowed(); > + if (nid == NO_NODEID_SPECIFIED) > + nodes_allowed = huge_mpol_nodes_allowed(); > + else { > + /* > + * incoming 'count' is for node 'nid' only, so > + * adjust count to global, but restrict alloc/free > + * to the specified node. > + */ > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > + nodes_allowed = alloc_nodemask_of_node(nid); alloc_nodemask_of_node() isn't defined anywhere. > + if (!nodes_allowed) > + printk(KERN_WARNING "%s unable to allocate allowed " > + "nodes mask for huge page allocation/free. 
" > + "Falling back to default.\n", current->comm); > + } > > /* > * Increase the pool size > @@ -1329,51 +1345,71 @@ out: > static struct kobject *hugepages_kobj; > static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp); > + > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > { > int i; > + > for (i = 0; i < HUGE_MAX_HSTATE; i++) > - if (hstate_kobjs[i] == kobj) > + if (hstate_kobjs[i] == kobj) { > + if (nidp) > + *nidp = NO_NODEID_SPECIFIED; > return &hstates[i]; > - BUG(); > - return NULL; > + } > + > + return kobj_to_node_hstate(kobj, nidp); > } > > static ssize_t nr_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > + struct hstate *h; > + unsigned long nr_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid == NO_NODEID_SPECIFIED) > + nr_huge_pages = h->nr_huge_pages; > + else > + nr_huge_pages = h->nr_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", nr_huge_pages); > } > + > static ssize_t nr_hugepages_store(struct kobject *kobj, > - struct kobj_attribute *attr, const char *buf, size_t count) > + struct kobj_attribute *attr, const char *buf, size_t len) > { > + unsigned long count; > + struct hstate *h; > + int nid; > int err; > - unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > > - err = strict_strtoul(buf, 10, &input); > + err = strict_strtoul(buf, 10, &count); > if (err) > return 0; > > - h->max_huge_pages = set_max_huge_pages(h, input); > + h = kobj_to_hstate(kobj, &nid); > + h->max_huge_pages = set_max_huge_pages(h, count, nid); > > - return count; > + return len; > } > HSTATE_ATTR(nr_hugepages); > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > + > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > } > + > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > struct kobj_attribute *attr, const char *buf, size_t count) > { > int err; > unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > err = strict_strtoul(buf, 10, &input); > if (err) > @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > static ssize_t free_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->free_huge_pages); > + struct hstate *h; > + unsigned long free_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid == NO_NODEID_SPECIFIED) > + free_huge_pages = h->free_huge_pages; > + else > + free_huge_pages = h->free_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", free_huge_pages); > } > HSTATE_ATTR_RO(free_hugepages); > > static ssize_t resv_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > return sprintf(buf, "%lu\n", h->resv_huge_pages); > } > HSTATE_ATTR_RO(resv_hugepages); > @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages); > static ssize_t surplus_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { 
> - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > + struct hstate *h; > + unsigned long surplus_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid == NO_NODEID_SPECIFIED) > + surplus_huge_pages = h->surplus_huge_pages; > + else > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", surplus_huge_pages); > } > HSTATE_ATTR_RO(surplus_hugepages); > > @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att > .attrs = hstate_attrs, > }; > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > + struct kobject *parent, > + struct kobject **hstate_kobjs, > + struct attribute_group *hstate_attr_group) > { > int retval; > + int hi = h - hstates; > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > - hugepages_kobj); > - if (!hstate_kobjs[h - hstates]) > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > + if (!hstate_kobjs[hi]) > return -ENOMEM; > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > - &hstate_attr_group); > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > if (retval) > - kobject_put(hstate_kobjs[h - hstates]); > + kobject_put(hstate_kobjs[hi]); > > return retval; > } > @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo > return; > > for_each_hstate(h) { > - err = hugetlb_sysfs_add_hstate(h); > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > + hstate_kobjs, &hstate_attr_group); > if (err) > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > h->name); > } > } > > +#ifdef CONFIG_NUMA > + > +struct node_hstate { > + struct kobject *hugepages_kobj; > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > +}; > +struct node_hstate node_hstates[MAX_NUMNODES]; > + > +static struct attribute *per_node_hstate_attrs[] = { > + &nr_hugepages_attr.attr, > + &free_hugepages_attr.attr, > + &surplus_hugepages_attr.attr, > + NULL, > +}; > + > +static struct attribute_group per_node_hstate_attr_group = { > + .attrs = per_node_hstate_attrs, > +}; > + > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node_hstate *nhs = &node_hstates[nid]; > + int i; > + for (i = 0; i < HUGE_MAX_HSTATE; i++) > + if (nhs->hstate_kobjs[i] == kobj) { > + if (nidp) > + *nidp = nid; > + return &hstates[i]; > + } > + } > + > + BUG(); > + return NULL; > +} Ok, this looks nicer in that the dependencies between hugetlbfs and base node support are going the right direction. 
> + > +void hugetlb_unregister_node(struct node *node) > +{ > + struct hstate *h; > + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; > + > + if (!nhs->hugepages_kobj) > + return; > + > + for_each_hstate(h) > + if (nhs->hstate_kobjs[h - hstates]) { > + kobject_put(nhs->hstate_kobjs[h - hstates]); > + nhs->hstate_kobjs[h - hstates] = NULL; > + } > + > + kobject_put(nhs->hugepages_kobj); > + nhs->hugepages_kobj = NULL; > +} > + > +static void hugetlb_unregister_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) > + hugetlb_unregister_node(&node_devices[nid]); > + > + register_hugetlbfs_with_node(NULL, NULL); > +} > + > +void hugetlb_register_node(struct node *node) > +{ > + struct hstate *h; > + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; > + int err; > + > + if (nhs->hugepages_kobj) > + return; /* already allocated */ > + > + nhs->hugepages_kobj = kobject_create_and_add("hugepages", > + &node->sysdev.kobj); > + if (!nhs->hugepages_kobj) > + return; > + > + for_each_hstate(h) { > + err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj, > + nhs->hstate_kobjs, > + &per_node_hstate_attr_group); > + if (err) { > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > + " for node %d\n", > + h->name, node->sysdev.id); > + hugetlb_unregister_node(node); > + break; > + } > + } > +} > + > +static void hugetlb_register_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node *node = &node_devices[nid]; > + if (node->sysdev.id == nid) > + hugetlb_register_node(node); > + } > + > + register_hugetlbfs_with_node(hugetlb_register_node, > + hugetlb_unregister_node); > +} > +#else /* !CONFIG_NUMA */ > + > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > +{ > + BUG(); > + if (nidp) > + *nidp = -1; > + return NULL; > +} > + > +static void hugetlb_unregister_all_nodes(void) { } > + > +static void hugetlb_register_all_nodes(void) { } > + > +#endif > + > static void __exit hugetlb_exit(void) > { > struct hstate *h; > > + hugetlb_unregister_all_nodes(); > + > for_each_hstate(h) { > kobject_put(hstate_kobjs[h - hstates]); > } > @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void) > > hugetlb_sysfs_init(); > > + hugetlb_register_all_nodes(); > + > return 0; > } > module_init(hugetlb_init); > @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > if (write) > - h->max_huge_pages = set_max_huge_pages(h, tmp); > + h->max_huge_pages = set_max_huge_pages(h, tmp, > + NO_NODEID_SPECIFIED); > > return 0; > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/numa.h 2009-08-26 12:37:03.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h 2009-08-26 12:58:54.000000000 -0400 > @@ -10,4 +10,6 @@ > > #define MAX_NUMNODES (1 << NODES_SHIFT) > > +#define NO_NODEID_SPECIFIED (-1) > + > #endif /* _LINUX_NUMA_H */ > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-26 12:37:03.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-26 12:40:19.000000000 -0400 > @@ -28,6 +28,7 @@ struct node { > > struct memory_block; > extern struct node node_devices[]; > +typedef void 
(*NODE_REGISTRATION_FUNC)(struct node *); > > extern int register_node(struct node *, int, struct node *); > extern void unregister_node(struct node *node); > @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns > extern int register_mem_sect_under_node(struct memory_block *mem_blk, > int nid); > extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk); > +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, > + NODE_REGISTRATION_FUNC unregister); > #else > static inline int register_one_node(int nid) > { > @@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un > { > return 0; > } > + > +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC do, > + NODE_REGISTRATION_FUNC un) { } "do" is a keyword. This won't compile on !NUMA. needs to be called doregister and unregister or basically anything other than "do" > #endif > > #define to_node(sys_device) container_of(sys_device, struct node, sysdev) > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Thu, 27 Aug 2009 12:52:10 -0400 Message-ID: <1251391930.4374.89.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309843.4409.48.camel@useless.americas.hpqcorp.net> <20090827102338.GC21183@csn.ul.ie> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090827102338.GC21183@csn.ul.ie> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Thu, 2009-08-27 at 11:23 +0100, Mel Gorman wrote: > On Wed, Aug 26, 2009 at 02:04:03PM -0400, Lee Schermerhorn wrote: > > > > This revised patch also removes the include of hugetlb.h from node.h. > > > > Lee > > > > --- > > > > PATCH 5/6 hugetlb: register per node hugepages attributes > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > V2: remove dependency on kobject private bitfield. Search > > global hstates then all per node hstates for kobject > > match in attribute show/store functions. > > > > V3: rebase atop the mempolicy-based hugepage alloc/free; > > use custom "nodes_allowed" to restrict alloc/free to > > a specific node via per node attributes. Per node > > attribute overrides mempolicy. I.e., mempolicy only > > applies to global attributes. > > > > V4: Fix issues raised by Mel Gorman: > > + add !NUMA versions of hugetlb_[un]register_node() > > + rename 'hi' to 'i' in kobj_to_node_hstate() > > + rename (count, input) to (len, count) in nr_hugepages_store() > > + moved per node hugepages_kobj and hstate_kobjs[] from the > > struct node [sysdev] to hugetlb.c private arrays. 
> > + changed registration mechanism so that hugetlbfs [a module] > > register its attributes registration callbacks with the node > > driver, eliminating the dependency between the node driver > > and hugetlbfs. From it's init func, hugetlbfs will register > > all on-line nodes' hugepage sysfs attributes along with > > hugetlbfs' attributes register/unregister functions. The > > node driver will use these functions to [un]register nodes > > with hugetlbfs on node hot-plug. > > + replaced hugetlb.c private "nodes_allowed_from_node()" with > > generic "alloc_nodemask_of_node()". > > > > This patch adds the per huge page size control/query attributes > > to the per node sysdevs: > > > > /sys/devices/system/node/node/hugepages/hugepages-/ > > nr_hugepages - r/w > > free_huge_pages - r/o > > surplus_huge_pages - r/o > > > > The patch attempts to re-use/share as much of the existing > > global hstate attribute initialization and handling, and the > > "nodes_allowed" constraint processing as possible. > > Calling set_max_huge_pages() with no node indicates a change to > > global hstate parameters. In this case, any non-default task > > mempolicy will be used to generate the nodes_allowed mask. A > > valid node id indicates an update to that node's hstate > > parameters, and the count argument specifies the target count > > for the specified node. From this info, we compute the target > > global count for the hstate and construct a nodes_allowed node > > mask contain only the specified node. > > > > Setting the node specific nr_hugepages via the per node attribute > > effectively ignores any task mempolicy or cpuset constraints. > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-26 12:37:03.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-26 13:01:54.000000000 -0400 > > @@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct > > } > > static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL); > > > > +/* > > + * hugetlbfs per node attributes registration interface > > + */ > > +NODE_REGISTRATION_FUNC __hugetlb_register_node; > > +NODE_REGISTRATION_FUNC __hugetlb_unregister_node; > > + > > +static inline void hugetlb_register_node(struct node *node) > > +{ > > + if (__hugetlb_register_node) > > + __hugetlb_register_node(node); > > +} > > + > > +static inline void hugetlb_unregister_node(struct node *node) > > +{ > > + if (__hugetlb_unregister_node) > > + __hugetlb_unregister_node(node); > > +} > > + > > +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, > > + NODE_REGISTRATION_FUNC unregister) > > +{ > > + __hugetlb_register_node = doregister; > > + __hugetlb_unregister_node = unregister; > > +} > > + > > > > I think I get this. Basically, you want to avoid the functions being > called too early before sysfs is initialised and still work with hotplug > later. So early in boot, no registeration happens. sysfs and hugetlbfs > get initialised and at that point, these hooks become active, all nodes > registered and hotplug later continues to work. > > Is that accurate? Can it get a comment? Yes, you got it, and yes, I'll add a comment. I had explained it in the patch description [V4], but that's not too useful to someone coming along later... 
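A comment along these lines would capture it (wording drafted here for illustration, not taken from any posted version of the patch):

	/*
	 * Node sysdevs can be registered early in boot, before sysfs and
	 * hugetlbfs have initialized, so the registration hooks below start
	 * out NULL and node registration is then a no-op for hugetlb.
	 * When hugetlbfs initializes, it installs its callbacks here via
	 * register_hugetlbfs_with_node() and registers all on-line nodes'
	 * hugepage attributes itself; from that point on, node hot-plug
	 * [un]registers per node attributes through these hooks.
	 */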
> > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages( > > if (h->order >= MAX_ORDER) > > return h->max_huge_pages; > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > + if (nid == NO_NODEID_SPECIFIED) > > + nodes_allowed = huge_mpol_nodes_allowed(); > > + else { > > + /* > > + * incoming 'count' is for node 'nid' only, so > > + * adjust count to global, but restrict alloc/free > > + * to the specified node. > > + */ > > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > > + nodes_allowed = alloc_nodemask_of_node(nid); > > alloc_nodemask_of_node() isn't defined anywhere. Well, that's because the patch that defines it is in a message that I meant to send before this one. I see it's in my Drafts folder. I'll attach that patch below. I'm rebasing against the 0827 mmotm, and I'll resend the rebased series. However, I wanted to get your opinion of the nodemask patch below. > > > > +#ifdef CONFIG_NUMA > > + > > +struct node_hstate { > > + struct kobject *hugepages_kobj; > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > +}; > > +struct node_hstate node_hstates[MAX_NUMNODES]; > > + > > +static struct attribute *per_node_hstate_attrs[] = { > > + &nr_hugepages_attr.attr, > > + &free_hugepages_attr.attr, > > + &surplus_hugepages_attr.attr, > > + NULL, > > +}; > > + > > +static struct attribute_group per_node_hstate_attr_group = { > > + .attrs = per_node_hstate_attrs, > > +}; > > + > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + struct node_hstate *nhs = &node_hstates[nid]; > > + int i; > > + for (i = 0; i < HUGE_MAX_HSTATE; i++) > > + if (nhs->hstate_kobjs[i] == kobj) { > > + if (nidp) > > + *nidp = nid; > > + return &hstates[i]; > > + } > > + } > > + > > + BUG(); > > + return NULL; > > +} > > Ok, this looks nicer in that the dependencies between hugetlbfs and base > node support are going the right direction. Agreed. I removed that "issue" from the patch description. > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-26 12:37:03.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-26 12:40:19.000000000 -0400 > > @@ -28,6 +28,7 @@ struct node { > > > > struct memory_block; > > extern struct node node_devices[]; > > +typedef void (*NODE_REGISTRATION_FUNC)(struct node *); > > > > extern int register_node(struct node *, int, struct node *); > > extern void unregister_node(struct node *node); > > @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns > > extern int register_mem_sect_under_node(struct memory_block *mem_blk, > > int nid); > > extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk); > > +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister, > > + NODE_REGISTRATION_FUNC unregister); > > #else > > static inline int register_one_node(int nid) > > { > > @@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un > > { > > return 0; > > } > > + > > +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC do, > > + NODE_REGISTRATION_FUNC un) { } > > "do" is a keyword. This won't compile on !NUMA. needs to be called > doregister and unregister or basically anything other than "do" Sorry. Last minute, obviously untested, addition. I have built the reworked code with and without NUMA. 
Here's my current "alloc_nodemask_of_node()" patch. What do you think about going with this? PATCH 4/6 - hugetlb: introduce alloc_nodemask_of_node() Against: 2.6.31-rc6-mmotm-090820-1918 Introduce nodemask macro to allocate a nodemask and initialize it to contain a single node, using existing nodemask_of_node() macro. Coded as a macro to avoid header dependency hell. This will be used to construct the huge pages "nodes_allowed" nodemask for a single node when a persistent huge page pool page count is modified via a per node sysfs attribute. Signed-off-by: Lee Schermerhorn include/linux/nodemask.h | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-27 09:16:39.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-27 09:52:21.000000000 -0400 @@ -245,18 +245,31 @@ static inline int __next_node(int n, con return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1)); } +#define init_nodemask_of_nodes(mask, node) \ + nodes_clear(*(mask)); \ + node_set((node), *(mask)); + #define nodemask_of_node(node) \ ({ \ typeof(_unused_nodemask_arg_) m; \ if (sizeof(m) == sizeof(unsigned long)) { \ m.bits[0] = 1UL<<(node); \ } else { \ - nodes_clear(m); \ - node_set((node), m); \ + init_nodemask_of_nodes(&m, (node)); \ } \ m; \ }) +#define alloc_nodemask_of_node(node) \ +({ \ + typeof(_unused_nodemask_arg_) *nmp; \ + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ + if (nmp) \ + init_nodemask_of_nodes(nmp, (node)); \ + nmp; \ +}) + + #define first_unset_node(mask) __first_unset_node(&(mask)) static inline int __first_unset_node(const nodemask_t *maskp) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Thu, 27 Aug 2009 12:35:20 -0700 (PDT) Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> <1251319603.4409.92.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1251401727; bh=0RG3C6ZL2QUGR6ofNztafPQouYA=; h=DomainKey-Signature:Date:From:X-X-Sender:To:cc:Subject: In-Reply-To:Message-ID:References:User-Agent:MIME-Version: Content-Type:X-System-Of-Record; b=m5qcWOik+Lri4ME9fWGWE4r76CMagBy 2zlhy1LE4bzuXgSHDFYprqRRsiva93fN4e3ugkcOw8EbRJxN106w8Kw== In-Reply-To: <1251319603.4409.92.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, linux-numa@vger.kernel.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > > I think it would probably be better to use the generic NODEMASK_ALLOC() > > interface by requiring it to pass the entire type (including "struct") as > > part of the first parameter. Then it automatically takes care of > > dynamically allocating large nodemasks vs. allocating them on the stack. > > > > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case > > to be this: > > > > #define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL); > > > > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct > > nodemask_scratch, x), and then doing this in your code: > > > > NODEMASK_ALLOC(nodemask_t, nodes_allowed); > > if (nodes_allowed) > > *nodes_allowed = nodemask_of_node(node); > > > > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can > > probably be made more general to handle cases like this. > > I just don't know what that would accomplish. Heck, I'm not all that > happy with the alloc_nodemask_from_node() because it's allocating both a > hidden nodemask_t and a pointer thereto on the stack just to return a > pointer to a kmalloc()ed nodemask_t--which is what I want/need here. > > One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al] > is that it declares the pointer variable as well as initializing it, > perhaps with kmalloc(), ... Indeed, it's purpose is to replace on > stack nodemask declarations. > Right, which is why I suggest we only have one such interface to dynamically allocate nodemasks when NODES_SHIFT > 8. That's what defines NODEMASK_ALLOC() as being special: it's taking NODES_SHIFT into consideration just like CPUMASK_ALLOC() would take NR_CPUS into consideration. Your use case is the intended purpose of NODEMASK_ALLOC() and I see no reason why your code can't use the same interface with some modification and it's in the best interest of a maintainability to not duplicate specialized cases where pre-existing interfaces can be used (or improved, in this case). 
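Spelled out, the NODES_SHIFT-conditional definition being described would look something like the sketch below; the NODEMASK_FREE() counterpart is added here only to make the on-stack case symmetric and is not part of the proposal in this thread:

	#if NODES_SHIFT > 8	/* nodemask_t is too large for the stack */
	#define NODEMASK_ALLOC(x, m)	x *m = kmalloc(sizeof(*m), GFP_KERNEL)
	#define NODEMASK_FREE(m)	kfree(m)
	#else
	#define NODEMASK_ALLOC(x, m)	x _##m, *m = &_##m
	#define NODEMASK_FREE(m)	do { } while (0)
	#endif

Either way the caller's view is uniform: NODEMASK_ALLOC(nodemask_t, nodes_allowed) followed by a NULL check (the on-stack variant can never fail, so the check is simply harmless there).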
> So, to use it at the start of, e.g., set_max_huge_pages() where I can > safely use it throughout the function, I'll end up allocating the > nodes_allowed mask on every call, whether or not a node is specified or > there is a non-default mempolicy. If it turns out that no node was > specified and we have default policy, we need to free the mask and NULL > out nodes_allowed up front so that we get default behavior. That seems > uglier to me that only allocating the nodemask when we know we need one. > Not with my suggested code of disabling local irqs, getting a reference to the mempolicy so it can't be freed, reenabling, and then only using NODEMASK_ALLOC() in the switch statement on mpol->mode for MPOL_PREFERRED. > I'm not opposed to using a generic function/macro where one exists that > suits my purposes. I just don't see one. I tried to create > one--alloc_nodemask_from_node(), and to keep Mel happy, I tried to reuse > nodemask_from_node() to initialize it. I'm really not happy with the > results--because of those extra, hidden stack variables. I could > eliminate those by creating a out of line function, but there's no good > place to put a generic nodemask function--no nodemask.c. > Using NODEMASK_ALLOC(nodes_allowed) wouldn't really be a hidden stack variable, would it? I think most developers would assume that it is some automatic variable called `nodes_allowed' since it's later referenced (and only needs to be in the case of MPOL_PREFERRED if my mpol_get() solution with disabled local irqs is used). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 1C4426B004F for ; Thu, 27 Aug 2009 15:40:51 -0400 (EDT) Received: from wpaz33.hot.corp.google.com (wpaz33.hot.corp.google.com [172.24.198.97]) by smtp-out.google.com with ESMTP id n7RJeo6v004623 for ; Thu, 27 Aug 2009 20:40:50 +0100 Received: from wa-out-1112.google.com (wafj32.prod.google.com [10.114.186.32]) by wpaz33.hot.corp.google.com with ESMTP id n7RJelqt013440 for ; Thu, 27 Aug 2009 12:40:47 -0700 Received: by wa-out-1112.google.com with SMTP id j32so303549waf.29 for ; Thu, 27 Aug 2009 12:40:47 -0700 (PDT) Date: Thu, 27 Aug 2009 12:40:44 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy In-Reply-To: <1251233347.16229.0.camel@useless.americas.hpqcorp.net> Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> <1251233347.16229.0.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, 25 Aug 2009, Lee Schermerhorn wrote: > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > > > @@ -1257,10 +1257,13 @@ static int 
adjust_pool_surplus(struct hs > > > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > > { > > > unsigned long min_count, ret; > > > + nodemask_t *nodes_allowed; > > > > > > if (h->order >= MAX_ORDER) > > > return h->max_huge_pages; > > > > > > > Why can't you simply do this? > > > > struct mempolicy *pol = NULL; > > nodemask_t *nodes_allowed = &node_online_map; > > > > local_irq_disable(); > > pol = current->mempolicy; > > mpol_get(pol); > > local_irq_enable(); > > if (pol) { > > switch (pol->mode) { > > case MPOL_BIND: > > case MPOL_INTERLEAVE: > > nodes_allowed = pol->v.nodes; > > break; > > case MPOL_PREFERRED: > > ... use NODEMASK_SCRATCH() ... > > default: > > BUG(); > > } > > } > > mpol_put(pol); > > > > and then use nodes_allowed throughout set_max_huge_pages()? > > > Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages(). Yeah, the above code would all be in set_max_huge_pages() and huge_mpol_nodes_allowed() would be removed. > NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure > it will return a kmalloc()'d nodemask, which I need because a NULL > nodemask pointer means "all online nodes" [really all nodes with memory, > I suppose] and I need a pointer to kmalloc()'d nodemask to return from > huge_mpol_nodes_allowed(). I want to keep the access to the internals > of mempolicy in mempolicy.[ch], thus the call out to > huge_mpol_nodes_allowed(), instead of open coding it. Ok, so you could add a mempolicy.c helper function that returns nodemask_t * and either points to mpol->v.nodes for most cases after getting a reference on mpol with mpol_get() or points to a dynamically allocated NODEMASK_ALLOC() on a nodemask created for MPOL_PREFERRED. This works nicely because either way you still have a reference to mpol, so you'll need to call into a mpol_nodemask_free() function which can use the same switch statement: void mpol_nodemask_free(struct mempolicy *mpol, struct nodemask_t *nodes_allowed) { switch (mpol->mode) { case MPOL_PREFERRED: kfree(nodes_allowed); break; default: break; } mpol_put(mpol); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
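The get-side helper described in prose above would pair with mpol_nodemask_free() roughly as follows; a sketch only, with the function name and out-parameter invented, and the "local" preferred-policy details elided:

	nodemask_t *mpol_nodes_allowed(struct mempolicy **mpolp)
	{
		struct mempolicy *mpol;
		nodemask_t *nodes_allowed = NULL;

		local_irq_disable();
		mpol = current->mempolicy;
		mpol_get(mpol);		/* NULL-safe; dropped by mpol_nodemask_free() */
		local_irq_enable();

		*mpolp = mpol;
		if (!mpol)
			return NULL;	/* default policy: no constraint */

		switch (mpol->mode) {
		case MPOL_BIND:
		case MPOL_INTERLEAVE:
			nodes_allowed = &mpol->v.nodes;	/* points into the mpol */
			break;
		case MPOL_PREFERRED: {
			/* kmalloc directly: an on-stack mask could not be returned */
			nodemask_t *nm = kmalloc(sizeof(*nm), GFP_KERNEL);
			if (nm)
				*nm = nodemask_of_node(mpol->v.preferred_node);
			nodes_allowed = nm;	/* kfree()d by mpol_nodemask_free() */
			break;
		}
		default:
			BUG();
		}
		return nodes_allowed;
	}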
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id DFE906B00A6 for ; Fri, 28 Aug 2009 06:09:17 -0400 (EDT) Date: Fri, 28 Aug 2009 11:09:20 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Message-ID: <20090828100919.GC5054@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309843.4409.48.camel@useless.americas.hpqcorp.net> <20090827102338.GC21183@csn.ul.ie> <1251391930.4374.89.camel@useless.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1251391930.4374.89.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Thu, Aug 27, 2009 at 12:52:10PM -0400, Lee Schermerhorn wrote: > > > > > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages( > > > if (h->order >= MAX_ORDER) > > > return h->max_huge_pages; > > > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > > + if (nid == NO_NODEID_SPECIFIED) > > > + nodes_allowed = huge_mpol_nodes_allowed(); > > > + else { > > > + /* > > > + * incoming 'count' is for node 'nid' only, so > > > + * adjust count to global, but restrict alloc/free > > > + * to the specified node. > > > + */ > > > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > > > + nodes_allowed = alloc_nodemask_of_node(nid); > > > > alloc_nodemask_of_node() isn't defined anywhere. > > > Well, that's because the patch that defines it is in a message that I > meant to send before this one. I see it's in my Drafts folder. I'll > attach that patch below. I'm rebasing against the 0827 mmotm, and I'll > resend the rebased series. However, I wanted to get your opinion of the > nodemask patch below. > It looks very reasonable to my eye. The caller must know that kfree() is used to free it instead of free_nodemask_of_node() but it's not worth getting into a twist over. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Fri, 28 Aug 2009 08:56:52 -0400 Message-ID: <1251464212.9989.52.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> <1251319603.4409.92.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: David Rientjes Cc: Mel Gorman , linux-mm@kvack.org, linux-numa@vger.kernel.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Thu, 2009-08-27 at 12:35 -0700, David Rientjes wrote: > On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > > > > I think it would probably be better to use the generic NODEMASK_ALLOC() > > > interface by requiring it to pass the entire type (including "struct") as > > > part of the first parameter. Then it automatically takes care of > > > dynamically allocating large nodemasks vs. allocating them on the stack. > > > > > > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case > > > to be this: > > > > > > #define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL); > > > > > > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct > > > nodemask_scratch, x), and then doing this in your code: > > > > > > NODEMASK_ALLOC(nodemask_t, nodes_allowed); > > > if (nodes_allowed) > > > *nodes_allowed = nodemask_of_node(node); > > > > > > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can > > > probably be made more general to handle cases like this. > > > > I just don't know what that would accomplish. Heck, I'm not all that > > happy with the alloc_nodemask_from_node() because it's allocating both a > > hidden nodemask_t and a pointer thereto on the stack just to return a > > pointer to a kmalloc()ed nodemask_t--which is what I want/need here. > > > > One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al] > > is that it declares the pointer variable as well as initializing it, > > perhaps with kmalloc(), ... Indeed, it's purpose is to replace on > > stack nodemask declarations. > > > > Right, which is why I suggest we only have one such interface to > dynamically allocate nodemasks when NODES_SHIFT > 8. That's what defines > NODEMASK_ALLOC() as being special: it's taking NODES_SHIFT into > consideration just like CPUMASK_ALLOC() would take NR_CPUS into > consideration. Your use case is the intended purpose of NODEMASK_ALLOC() > and I see no reason why your code can't use the same interface with some > modification and it's in the best interest of a maintainability to not > duplicate specialized cases where pre-existing interfaces can be used (or > improved, in this case). > > > So, to use it at the start of, e.g., set_max_huge_pages() where I can > > safely use it throughout the function, I'll end up allocating the > > nodes_allowed mask on every call, whether or not a node is specified or > > there is a non-default mempolicy. 
If it turns out that no node was > > specified and we have default policy, we need to free the mask and NULL > > out nodes_allowed up front so that we get default behavior. That seems > > uglier to me that only allocating the nodemask when we know we need one. > > > > Not with my suggested code of disabling local irqs, getting a reference to > the mempolicy so it can't be freed, reenabling, and then only using > NODEMASK_ALLOC() in the switch statement on mpol->mode for MPOL_PREFERRED. > > > I'm not opposed to using a generic function/macro where one exists that > > suits my purposes. I just don't see one. I tried to create > > one--alloc_nodemask_from_node(), and to keep Mel happy, I tried to reuse > > nodemask_from_node() to initialize it. I'm really not happy with the > > results--because of those extra, hidden stack variables. I could > > eliminate those by creating a out of line function, but there's no good > > place to put a generic nodemask function--no nodemask.c. > > > > Using NODEMASK_ALLOC(nodes_allowed) wouldn't really be a hidden stack > variable, would it? I think most developers would assume that it is > some automatic variable called `nodes_allowed' since it's later referenced > (and only needs to be in the case of MPOL_PREFERRED if my mpol_get() > solution with disabled local irqs is used). David: I'm going to repost my series with the version of alloc_nodemask_of_node() that I sent our yesterday. My entire implementation is based on nodes_allowed, in set_max_huge_pages() being a pointer to a nodemask. nodes_allowed must be NULL for default behavior [NO_NODEID_SPECIFIED && default mempolicy]. It only gets allocated when nid >0 or task has non-default memory policy. This seems to work fairly well for both the mempolicy based constraint and the per node attributes. Please take a look at this series. If you want to propose a patch to rework the nodes_allowed allocation, have at it. I'm satisfied with the current implementation. Now, we have a couple of options: Mel said he's willing to proceed with the mempolicy based constraint and leave the per node attributes to a follow up submit. If you want to take over the per node attributes feature and rework it, I can extract it from the series, including the doc update and turn it over to you. Or, we can try to submit the current implementation and follow up with patches to rework the generic nodemask support as you propose. Let me know how you want to proceed. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 0/5] hugetlb: numa control of persistent huge pages alloc/free Date: Mon, 24 Aug 2009 15:24:37 -0400 Message-ID: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com PATCH 0/5 hugetlb: numa control of persistent huge pages alloc/free Against: 2.6.31-rc6-mmotm-090820-1918 This is V4 of a series of patches to provide control over the location of the allocation and freeing of persistent huge pages on a NUMA platform. 
This series uses the task NUMA mempolicy of the task modifying "nr_hugepages" to constrain the affected nodes. This method is based on Mel Gorman's suggestion to use task mempolicy. One of the benefits of this method is that it does not *require* modification to hugeadm(8) to use this feature. One of the possible downsides is that task mempolicy is limited by cpuset constraints. V4 add a subset of the hugepages sysfs attributes to each per node system device directory under: /sys/devices/node/node[0-9]*/hugepages. The per node attibutes allow direct assignment of a huge page count on a specific node, regardless of the task's mempolicy or cpuset constraints. Note, I haven't implemented a boot time parameter to constrain the boot time allocation of huge pages. This can be added if anyone feels strongly that it is required. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 1/5] hugetlb: rework hstate_next_node_* functions Date: Mon, 24 Aug 2009 15:25:44 -0400 Message-ID: <20090824192544.10317.6291.sendpatchset@localhost.localdomain> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com [PATCH 1/5] hugetlb: rework hstate_next_node* functions Against: 2.6.31-rc6-mmotm-090820-1918 V2: + cleaned up comments, removed some deemed unnecessary, add some suggested by review + removed check for !current in huge_mpol_nodes_allowed(). + added 'current->comm' to warning message in huge_mpol_nodes_allowed(). + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to catch out of range node id. + add examples to patch description V3: + factored this "cleanup" patch out of V2 patch 2/3 + moved ahead of patch to add nodes_allowed mask to alloc funcs as this patch is somewhat independent from using task mempolicy to control huge page allocation and freeing. Modify the hstate_next_node* functions to allow them to be called to obtain the "start_nid". Then, whereas prior to this patch we unconditionally called hstate_next_node_to_{alloc|free}(), whether or not we successfully allocated/freed a huge page on the node, now we only call these functions on failure to alloc/free to advance to next allowed node. Factor out the next_node_allowed() function to handle wrap at end of node_online_map. In this version, the allowed nodes include all of the online nodes. Reviewed-by: Mel Gorman Signed-off-by: Lee Schermerhorn mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 45 insertions(+), 25 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 @@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag } /* + * common helper function for hstate_next_node_to_{alloc|free}. + * return next node in node_online_map, wrapping at end. 
+ */ +static int next_node_allowed(int nid) +{ + nid = next_node(nid, node_online_map); + if (nid == MAX_NUMNODES) + nid = first_node(node_online_map); + VM_BUG_ON(nid >= MAX_NUMNODES); + + return nid; +} + +/* * Use a helper variable to find the next node and then * copy it back to next_nid_to_alloc afterwards: * otherwise there's a window in which a racer might @@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag */ static int hstate_next_node_to_alloc(struct hstate *h) { - int next_nid; - next_nid = next_node(h->next_nid_to_alloc, node_online_map); - if (next_nid == MAX_NUMNODES) - next_nid = first_node(node_online_map); + int nid, next_nid; + + nid = h->next_nid_to_alloc; + next_nid = next_node_allowed(nid); h->next_nid_to_alloc = next_nid; - return next_nid; + return nid; } static int alloc_fresh_huge_page(struct hstate *h) @@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct int next_nid; int ret = 0; - start_nid = h->next_nid_to_alloc; + start_nid = hstate_next_node_to_alloc(h); next_nid = start_nid; do { page = alloc_fresh_huge_page_node(h, next_nid); - if (page) + if (page) { ret = 1; + break; + } next_nid = hstate_next_node_to_alloc(h); - } while (!page && next_nid != start_nid); + } while (next_nid != start_nid); if (ret) count_vm_event(HTLB_BUDDY_PGALLOC); @@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct } /* - * helper for free_pool_huge_page() - find next node - * from which to free a huge page + * helper for free_pool_huge_page() - return the next node + * from which to free a huge page. Advance the next node id + * whether or not we find a free huge page to free so that the + * next attempt to free addresses the next node. */ static int hstate_next_node_to_free(struct hstate *h) { - int next_nid; - next_nid = next_node(h->next_nid_to_free, node_online_map); - if (next_nid == MAX_NUMNODES) - next_nid = first_node(node_online_map); + int nid, next_nid; + + nid = h->next_nid_to_free; + next_nid = next_node_allowed(nid); h->next_nid_to_free = next_nid; - return next_nid; + return nid; } /* @@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs int next_nid; int ret = 0; - start_nid = h->next_nid_to_free; + start_nid = hstate_next_node_to_free(h); next_nid = start_nid; do { @@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs } update_and_free_page(h, page); ret = 1; + break; } next_nid = hstate_next_node_to_free(h); - } while (!ret && next_nid != start_nid); + } while (next_nid != start_nid); return ret; } @@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc void *addr; addr = __alloc_bootmem_node_nopanic( - NODE_DATA(h->next_nid_to_alloc), + NODE_DATA(hstate_next_node_to_alloc(h)), huge_page_size(h), huge_page_size(h), 0); - hstate_next_node_to_alloc(h); if (addr) { /* * Use the beginning of the huge page to store the @@ -1167,29 +1185,31 @@ static int adjust_pool_surplus(struct hs VM_BUG_ON(delta != -1 && delta != 1); if (delta < 0) - start_nid = h->next_nid_to_alloc; + start_nid = hstate_next_node_to_alloc(h); else - start_nid = h->next_nid_to_free; + start_nid = hstate_next_node_to_free(h); next_nid = start_nid; do { int nid = next_nid; if (delta < 0) { - next_nid = hstate_next_node_to_alloc(h); /* * To shrink on this node, there must be a surplus page */ - if (!h->surplus_huge_pages_node[nid]) + if (!h->surplus_huge_pages_node[nid]) { + next_nid = hstate_next_node_to_alloc(h); continue; + } } if (delta > 0) { - next_nid = hstate_next_node_to_free(h); /* * Surplus cannot exceed the total number of pages */ if 
(h->surplus_huge_pages_node[nid] >= - h->nr_huge_pages_node[nid]) + h->nr_huge_pages_node[nid]) { + next_nid = hstate_next_node_to_free(h); continue; + } } h->surplus_huge_pages += delta; From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Date: Mon, 24 Aug 2009 15:26:37 -0400 Message-ID: <20090824192637.10317.31039.sendpatchset@localhost.localdomain> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com [PATCH 2/4] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Against: 2.6.31-rc6-mmotm-090820-1918 V3: + moved this patch to after the "rework" of hstate_next_node_to_... functions as this patch is more specific to using task mempolicy to control huge page allocation and freeing. In preparation for constraining huge page allocation and freeing by the controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer to the allocate, free and surplus adjustment functions. For now, pass NULL to indicate default behavior--i.e., use node_online_map. A subsqeuent patch will derive a non-default mask from the controlling task's numa mempolicy. Reviewed-by: Mel Gorman Signed-off-by: Lee Schermerhorn mm/hugetlb.c | 102 ++++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 67 insertions(+), 35 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag } /* - * common helper function for hstate_next_node_to_{alloc|free}. - * return next node in node_online_map, wrapping at end. + * common helper functions for hstate_next_node_to_{alloc|free}. + * We may have allocated or freed a huge pages based on a different + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might + * be outside of *nodes_allowed. Ensure that we use the next + * allowed node for alloc or free. */ -static int next_node_allowed(int nid) +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) { - nid = next_node(nid, node_online_map); + nid = next_node(nid, *nodes_allowed); if (nid == MAX_NUMNODES) - nid = first_node(node_online_map); + nid = first_node(*nodes_allowed); VM_BUG_ON(nid >= MAX_NUMNODES); return nid; } +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) +{ + if (!node_isset(nid, *nodes_allowed)) + nid = next_node_allowed(nid, nodes_allowed); + return nid; +} + /* * Use a helper variable to find the next node and then * copy it back to next_nid_to_alloc afterwards: @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. * But we don't need to use a spin_lock here: it really * doesn't matter if occasionally a racer chooses the - * same nid as we do. 
Move nid forward in the mask even - * if we just successfully allocated a hugepage so that - * the next caller gets hugepages on the next node. + * same nid as we do. Move nid forward in the mask whether + * or not we just successfully allocated a hugepage so that + * the next allocation addresses the next node. */ -static int hstate_next_node_to_alloc(struct hstate *h) +static int hstate_next_node_to_alloc(struct hstate *h, + nodemask_t *nodes_allowed) { int nid, next_nid; - nid = h->next_nid_to_alloc; - next_nid = next_node_allowed(nid); + if (!nodes_allowed) + nodes_allowed = &node_online_map; + + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); + + next_nid = next_node_allowed(nid, nodes_allowed); h->next_nid_to_alloc = next_nid; + return nid; } -static int alloc_fresh_huge_page(struct hstate *h) +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) { struct page *page; int start_nid; int next_nid; int ret = 0; - start_nid = hstate_next_node_to_alloc(h); + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); next_nid = start_nid; do { @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct ret = 1; break; } - next_nid = hstate_next_node_to_alloc(h); + next_nid = hstate_next_node_to_alloc(h, nodes_allowed); } while (next_nid != start_nid); if (ret) @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct * whether or not we find a free huge page to free so that the * next attempt to free addresses the next node. */ -static int hstate_next_node_to_free(struct hstate *h) +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) { int nid, next_nid; - nid = h->next_nid_to_free; - next_nid = next_node_allowed(nid); + if (!nodes_allowed) + nodes_allowed = &node_online_map; + + nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); + + next_nid = next_node_allowed(nid, nodes_allowed); h->next_nid_to_free = next_nid; + return nid; } @@ -705,13 +726,14 @@ static int hstate_next_node_to_free(stru * balanced over allowed nodes. * Called with hugetlb_lock locked. */ -static int free_pool_huge_page(struct hstate *h, bool acct_surplus) +static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, + bool acct_surplus) { int start_nid; int next_nid; int ret = 0; - start_nid = hstate_next_node_to_free(h); + start_nid = hstate_next_node_to_free(h, nodes_allowed); next_nid = start_nid; do { @@ -735,7 +757,7 @@ static int free_pool_huge_page(struct hs ret = 1; break; } - next_nid = hstate_next_node_to_free(h); + next_nid = hstate_next_node_to_free(h, nodes_allowed); } while (next_nid != start_nid); return ret; @@ -937,7 +959,7 @@ static void return_unused_surplus_pages( * on-line nodes for us and will handle the hstate accounting. 
*/ while (nr_pages--) { - if (!free_pool_huge_page(h, 1)) + if (!free_pool_huge_page(h, NULL, 1)) break; } } @@ -1047,7 +1069,7 @@ int __weak alloc_bootmem_huge_page(struc void *addr; addr = __alloc_bootmem_node_nopanic( - NODE_DATA(hstate_next_node_to_alloc(h)), + NODE_DATA(hstate_next_node_to_alloc(h, NULL)), huge_page_size(h), huge_page_size(h), 0); if (addr) { @@ -1102,7 +1124,7 @@ static void __init hugetlb_hstate_alloc_ if (h->order >= MAX_ORDER) { if (!alloc_bootmem_huge_page(h)) break; - } else if (!alloc_fresh_huge_page(h)) + } else if (!alloc_fresh_huge_page(h, NULL)) break; } h->max_huge_pages = i; @@ -1144,16 +1166,22 @@ static void __init report_hugepages(void } #ifdef CONFIG_HIGHMEM -static void try_to_free_low(struct hstate *h, unsigned long count) +static void try_to_free_low(struct hstate *h, unsigned long count, + nodemask_t *nodes_allowed) { int i; if (h->order >= MAX_ORDER) return; + if (!nodes_allowed) + nodes_allowed = &node_online_map; + for (i = 0; i < MAX_NUMNODES; ++i) { struct page *page, *next; struct list_head *freel = &h->hugepage_freelists[i]; + if (!node_isset(i, *nodes_allowed)) + continue; list_for_each_entry_safe(page, next, freel, lru) { if (count >= h->nr_huge_pages) return; @@ -1167,7 +1195,8 @@ static void try_to_free_low(struct hstat } } #else -static inline void try_to_free_low(struct hstate *h, unsigned long count) +static inline void try_to_free_low(struct hstate *h, unsigned long count, + nodemask_t *nodes_allowed) { } #endif @@ -1177,7 +1206,8 @@ static inline void try_to_free_low(struc * balanced by operating on them in a round-robin fashion. * Returns 1 if an adjustment was made. */ -static int adjust_pool_surplus(struct hstate *h, int delta) +static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed, + int delta) { int start_nid, next_nid; int ret = 0; @@ -1185,9 +1215,9 @@ static int adjust_pool_surplus(struct hs VM_BUG_ON(delta != -1 && delta != 1); if (delta < 0) - start_nid = hstate_next_node_to_alloc(h); + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); else - start_nid = hstate_next_node_to_free(h); + start_nid = hstate_next_node_to_free(h, nodes_allowed); next_nid = start_nid; do { @@ -1197,7 +1227,8 @@ static int adjust_pool_surplus(struct hs * To shrink on this node, there must be a surplus page */ if (!h->surplus_huge_pages_node[nid]) { - next_nid = hstate_next_node_to_alloc(h); + next_nid = hstate_next_node_to_alloc(h, + nodes_allowed); continue; } } @@ -1207,7 +1238,8 @@ static int adjust_pool_surplus(struct hs */ if (h->surplus_huge_pages_node[nid] >= h->nr_huge_pages_node[nid]) { - next_nid = hstate_next_node_to_free(h); + next_nid = hstate_next_node_to_free(h, + nodes_allowed); continue; } } @@ -1242,7 +1274,7 @@ static unsigned long set_max_huge_pages( */ spin_lock(&hugetlb_lock); while (h->surplus_huge_pages && count > persistent_huge_pages(h)) { - if (!adjust_pool_surplus(h, -1)) + if (!adjust_pool_surplus(h, NULL, -1)) break; } @@ -1253,7 +1285,7 @@ static unsigned long set_max_huge_pages( * and reducing the surplus. 
*/ spin_unlock(&hugetlb_lock); - ret = alloc_fresh_huge_page(h); + ret = alloc_fresh_huge_page(h, NULL); spin_lock(&hugetlb_lock); if (!ret) goto out; @@ -1277,13 +1309,13 @@ static unsigned long set_max_huge_pages( */ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; min_count = max(count, min_count); - try_to_free_low(h, min_count); + try_to_free_low(h, min_count, NULL); while (min_count < persistent_huge_pages(h)) { - if (!free_pool_huge_page(h, 0)) + if (!free_pool_huge_page(h, NULL, 0)) break; } while (count < persistent_huge_pages(h)) { - if (!adjust_pool_surplus(h, 1)) + if (!adjust_pool_surplus(h, NULL, 1)) break; } out: From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Date: Mon, 24 Aug 2009 15:27:52 -0400 Message-ID: <20090824192752.10317.96125.sendpatchset@localhost.localdomain> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Against: 2.6.31-rc6-mmotm-090820-1918 V2: + cleaned up comments, removed some deemed unnecessary, add some suggested by review + removed check for !current in huge_mpol_nodes_allowed(). + added 'current->comm' to warning message in huge_mpol_nodes_allowed(). + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to catch out of range node id. + add examples to patch description V3: Factored this patch from V2 patch 2/3 V4: added back missing "kfree(nodes_allowed)" in set_max_huge_pages() This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages. This mask is derived as follows: * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch. * For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages. * For "bind" and "interleave" policy, the mempolicy's nodemask will be used. * Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback". Notes: 1) This patch introduces a subtle change in behavior: huge page allocation and freeing will be constrained by any mempolicy that the task adjusting the huge page pool inherits from its parent. This policy could come from a distant ancestor. The administrator adjusting the huge page pool without explicitly specifying a mempolicy via numactl might be surprised by this. Additionally, any mempolicy specified by numactl will be constrained by the cpuset in which numactl is invoked. 2) Hugepages allocated at boot time use the node_online_map.
An additional patch could implement a temporary boot time huge pages nodes_allowed command line parameter. 3) Using mempolicy to control persistent huge page allocation and freeing requires no change to hugeadm when invoking it via numactl, as shown in the examples below. However, hugeadm could be enhanced to take the allowed nodes as an argument and set its task mempolicy itself. This would allow it to detect and warn about any non-default mempolicy that it inherited from its parent, thus alleviating the issue described in Note 1 above. See the updated documentation [next patch] for more information about the implications of this patch. Examples: Starting with: Node 0 HugePages_Total: 0 Node 1 HugePages_Total: 0 Node 2 HugePages_Total: 0 Node 3 HugePages_Total: 0 Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]: hugeadm --pool-pages-min=2048Kb:32 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 8 Node 3 HugePages_Total: 8 Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node: numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The incremental 8 huge pages were restricted to node 2 by the specified mempolicy. Similarly, we can use mempolicy to free persistent huge pages from specified nodes: numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 yields: Node 0 HugePages_Total: 4 Node 1 HugePages_Total: 4 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The 8 huge pages freed were balanced over nodes 0 and 1. Signed-off-by: Lee Schermerhorn include/linux/mempolicy.h | 3 ++ mm/hugetlb.c | 14 ++++++---- mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 73 insertions(+), 5 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm } return zl; } + +/* + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. + * + * Returns a [pointer to a] nodelist based on the current task's mempolicy + * to constrain the allocation and freeing of persistent huge pages. + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like + * 'bind' policy in this context. An attempt to allocate a persistent huge + * page will never "fallback" to another node inside the buddy system + * allocator. + * + * If the task's mempolicy is "default" [NULL], just return NULL for + * default behavior. Otherwise, extract the policy nodemask for 'bind' + * or 'interleave' policy or construct a nodemask for 'preferred' or + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. + * + * N.B., it is the caller's responsibility to free a returned nodemask.
+ */ +nodemask_t *huge_mpol_nodes_allowed(void) +{ + nodemask_t *nodes_allowed = NULL; + struct mempolicy *mempolicy; + int nid; + + if (!current->mempolicy) + return NULL; + + mpol_get(current->mempolicy); + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); + if (!nodes_allowed) { + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " + "for huge page allocation.\nFalling back to default.\n", + current->comm); + goto out; + } + nodes_clear(*nodes_allowed); + + mempolicy = current->mempolicy; + switch (mempolicy->mode) { + case MPOL_PREFERRED: + if (mempolicy->flags & MPOL_F_LOCAL) + nid = numa_node_id(); + else + nid = mempolicy->v.preferred_node; + node_set(nid, *nodes_allowed); + break; + + case MPOL_BIND: + /* Fall through */ + case MPOL_INTERLEAVE: + *nodes_allowed = mempolicy->v.nodes; + break; + + default: + BUG(); + } + +out: + mpol_put(current->mempolicy); + return nodes_allowed; +} #endif /* Allocate a page in interleaved policy. Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol, nodemask_t **nodemask); +extern nodemask_t *huge_mpol_nodes_allowed(void); extern unsigned slab_node(struct mempolicy *policy); extern enum zone_type policy_zone; @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone return node_zonelist(0, gfp_flags); } +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } + static inline int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags) Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) { unsigned long min_count, ret; + nodemask_t *nodes_allowed; if (h->order >= MAX_ORDER) return h->max_huge_pages; + nodes_allowed = huge_mpol_nodes_allowed(); + /* * Increase the pool size * First take pages out of surplus state. Then make up the @@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages( */ spin_lock(&hugetlb_lock); while (h->surplus_huge_pages && count > persistent_huge_pages(h)) { - if (!adjust_pool_surplus(h, NULL, -1)) + if (!adjust_pool_surplus(h, nodes_allowed, -1)) break; } @@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages( * and reducing the surplus. 
*/ spin_unlock(&hugetlb_lock); - ret = alloc_fresh_huge_page(h, NULL); + ret = alloc_fresh_huge_page(h, nodes_allowed); spin_lock(&hugetlb_lock); if (!ret) goto out; @@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages( */ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; min_count = max(count, min_count); - try_to_free_low(h, min_count, NULL); + try_to_free_low(h, min_count, nodes_allowed); while (min_count < persistent_huge_pages(h)) { - if (!free_pool_huge_page(h, NULL, 0)) + if (!free_pool_huge_page(h, nodes_allowed, 0)) break; } while (count < persistent_huge_pages(h)) { - if (!adjust_pool_surplus(h, NULL, 1)) + if (!adjust_pool_surplus(h, nodes_allowed, 1)) break; } out: ret = persistent_huge_pages(h); spin_unlock(&hugetlb_lock); + kfree(nodes_allowed); return ret; } From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Mon, 24 Aug 2009 15:29:02 -0400 Message-ID: <20090824192902.10317.94512.sendpatchset@localhost.localdomain> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com PATCH/RFC 5/4 hugetlb: register per node hugepages attributes Against: 2.6.31-rc6-mmotm-090820-1918 V2: remove dependency on kobject private bitfield. Search global hstates then all per node hstates for kobject match in attribute show/store functions. V3: rebase atop the mempolicy-based hugepage alloc/free; use custom "nodes_allowed" to restrict alloc/free to a specific node via per node attributes. Per node attribute overrides mempolicy. I.e., mempolicy only applies to global attributes. To demonstrate feasibility--if not advisability--of supporting both mempolicy-based persistent huge page management with per node "override" attributes. This patch adds the per huge page size control/query attributes to the per node sysdevs: /sys/devices/system/node/node/hugepages/hugepages-/ nr_hugepages - r/w free_huge_pages - r/o surplus_huge_pages - r/o The patch attempts to re-use/share as much of the existing global hstate attribute initialization and handling, and the "nodes_allowed" constraint processing as possible. In set_max_huge_pages(), a node id < 0 indicates a change to global hstate parameters. In this case, any non-default task mempolicy will be used to generate the nodes_allowed mask. A node id > 0 indicates a node specific update and the count argument specifies the target count for the node. From this info, we compute the target global count for the hstate and construct a nodes_allowed node mask contain only the specified node. Thus, setting the node specific nr_hugepages via the per node attribute effectively overrides any task mempolicy. Issue: dependency of base driver [node] dependency on hugetlbfs module. We want to keep all of the hstate attribute registration and handling in the hugetlb module. However, we need to call into this code to register the per node hstate attributes on node hot plug. 
With this patch: (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB ./ ../ free_hugepages nr_hugepages surplus_hugepages Starting from: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 0 Node 2 HugePages_Free: 0 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 0 Allocate 16 persistent huge pages on node 2: (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages [Note that this is equivalent to: numactl -m 2 hugeadm --pool-pages-min=2M:+16 ] Yields: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 16 Node 2 HugePages_Free: 16 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 16 Global controls work as expected--reduce pool to 8 persistent huge pages: (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 8 Node 2 HugePages_Free: 8 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 Signed-off-by: Lee Schermerhorn drivers/base/node.c | 2 include/linux/hugetlb.h | 6 + include/linux/node.h | 3 mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++------- 4 files changed, 197 insertions(+), 27 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400 @@ -200,6 +200,7 @@ int register_node(struct node *node, int sysdev_create_file(&node->sysdev, &attr_distance); scan_unevictable_register_node(node); + hugetlb_register_node(node); } return error; } @@ -220,6 +221,7 @@ void unregister_node(struct node *node) sysdev_remove_file(&node->sysdev, &attr_distance); scan_unevictable_unregister_node(node); + hugetlb_unregister_node(node); sysdev_unregister(&node->sysdev); } Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 -0400 @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate return size_to_hstate(PAGE_SIZE << compound_order(page)); } +struct node; +extern void hugetlb_register_node(struct node *); +extern void hugetlb_unregister_node(struct node *); + #else struct hstate {}; #define alloc_bootmem_huge_page(h) NULL @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug { return 1; } +#define hugetlb_register_node(NP) +#define hugetlb_unregister_node(NP) #endif #endif /* _LINUX_HUGETLB_H */ Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24
12:12:56.000000000 -0400 @@ -24,6 +24,7 @@ #include <asm/io.h> #include <linux/hugetlb.h> +#include <linux/node.h> #include "internal.h" const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs return ret; } +static nodemask_t *nodes_allowed_from_node(int nid) +{ + nodemask_t *nodes_allowed; + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); + if (!nodes_allowed) { + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " + "for huge page allocation.\nFalling back to default.\n", + current->comm); + } else { + nodes_clear(*nodes_allowed); + node_set(nid, *nodes_allowed); + } + return nodes_allowed; +} + #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, + int nid) { unsigned long min_count, ret; nodemask_t *nodes_allowed; @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( if (h->order >= MAX_ORDER) return h->max_huge_pages; - nodes_allowed = huge_mpol_nodes_allowed(); + if (nid < 0) + nodes_allowed = huge_mpol_nodes_allowed(); + else { + /* + * incoming 'count' is for node 'nid' only, so + * adjust count to global, but restrict alloc/free + * to the specified node. + */ + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; + nodes_allowed = nodes_allowed_from_node(nid); + } /* * Increase the pool size @@ -1338,34 +1365,69 @@ out: static struct kobject *hugepages_kobj; static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; -static struct hstate *kobj_to_hstate(struct kobject *kobj) +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) { + struct node *node = &node_devices[nid]; + int hi; + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) + if (node->hstate_kobjs[hi] == kobj) { + if (nidp) + *nidp = nid; + return &hstates[hi]; + } + } + + BUG(); + return NULL; +} + +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) { int i; + for (i = 0; i < HUGE_MAX_HSTATE; i++) - if (hstate_kobjs[i] == kobj) + if (hstate_kobjs[i] == kobj) { + if (nidp) + *nidp = -1; return &hstates[i]; - BUG(); - return NULL; + } + + return kobj_to_node_hstate(kobj, nidp); } static ssize_t nr_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->nr_huge_pages); + struct hstate *h; + unsigned long nr_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid < 0) + nr_huge_pages = h->nr_huge_pages; + else + nr_huge_pages = h->nr_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", nr_huge_pages); } + static ssize_t nr_hugepages_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - int err; unsigned long input; - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h; + int nid; + int err; err = strict_strtoul(buf, 10, &input); if (err) return 0; - h->max_huge_pages = set_max_huge_pages(h, input); + h = kobj_to_hstate(kobj, &nid); + h->max_huge_pages = set_max_huge_pages(h, input, nid); return count; } @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); + return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); } + static ssize_t
nr_overcommit_hugepages_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { int err; unsigned long input; - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); err = strict_strtoul(buf, 10, &input); if (err) @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); static ssize_t free_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->free_huge_pages); + struct hstate *h; + unsigned long free_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid < 0) + free_huge_pages = h->free_huge_pages; + else + free_huge_pages = h->free_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", free_huge_pages); } HSTATE_ATTR_RO(free_hugepages); static ssize_t resv_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); + struct hstate *h = kobj_to_hstate(kobj, NULL); return sprintf(buf, "%lu\n", h->resv_huge_pages); } HSTATE_ATTR_RO(resv_hugepages); @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); static ssize_t surplus_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - struct hstate *h = kobj_to_hstate(kobj); - return sprintf(buf, "%lu\n", h->surplus_huge_pages); + struct hstate *h; + unsigned long surplus_huge_pages; + int nid; + + h = kobj_to_hstate(kobj, &nid); + if (nid < 0) + surplus_huge_pages = h->surplus_huge_pages; + else + surplus_huge_pages = h->surplus_huge_pages_node[nid]; + + return sprintf(buf, "%lu\n", surplus_huge_pages); } HSTATE_ATTR_RO(surplus_hugepages); @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att .attrs = hstate_attrs, }; -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, + struct kobject *parent, + struct kobject **hstate_kobjs, + struct attribute_group *hstate_attr_group) { int retval; + int hi = h - hstates; - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, - hugepages_kobj); - if (!hstate_kobjs[h - hstates]) + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); + if (!hstate_kobjs[hi]) return -ENOMEM; - retval = sysfs_create_group(hstate_kobjs[h - hstates], - &hstate_attr_group); + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); if (retval) - kobject_put(hstate_kobjs[h - hstates]); + kobject_put(hstate_kobjs[hi]); return retval; } @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo return; for_each_hstate(h) { - err = hugetlb_sysfs_add_hstate(h); + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, + hstate_kobjs, &hstate_attr_group); if (err) printk(KERN_ERR "Hugetlb: Unable to add hstate %s", h->name); } } +#ifdef CONFIG_NUMA +static struct attribute *per_node_hstate_attrs[] = { + &nr_hugepages_attr.attr, + &free_hugepages_attr.attr, + &surplus_hugepages_attr.attr, + NULL, +}; + +static struct attribute_group per_node_hstate_attr_group = { + .attrs = per_node_hstate_attrs, +}; + + +void hugetlb_unregister_node(struct node *node) +{ + struct hstate *h; + + for_each_hstate(h) { + kobject_put(node->hstate_kobjs[h - hstates]); + node->hstate_kobjs[h - hstates] = NULL; + } + + kobject_put(node->hugepages_kobj); + node->hugepages_kobj = NULL; +} + +static void hugetlb_unregister_all_nodes(void) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) + hugetlb_unregister_node(&node_devices[nid]); +} + +void hugetlb_register_node(struct node 
*node) +{ + struct hstate *h; + int err; + + if (!hugepages_kobj) + return; /* too early */ + + node->hugepages_kobj = kobject_create_and_add("hugepages", + &node->sysdev.kobj); + if (!node->hugepages_kobj) + return; + + for_each_hstate(h) { + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, + node->hstate_kobjs, + &per_node_hstate_attr_group); + if (err) + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" + " for node %d\n", + h->name, node->sysdev.id); + } +} + +static void hugetlb_register_all_nodes(void) +{ + int nid; + + for (nid = 0; nid < nr_node_ids; nid++) { + struct node *node = &node_devices[nid]; + if (node->sysdev.id == nid && !node->hugepages_kobj) + hugetlb_register_node(node); + } +} +#endif + static void __exit hugetlb_exit(void) { struct hstate *h; + hugetlb_unregister_all_nodes(); + for_each_hstate(h) { kobject_put(hstate_kobjs[h - hstates]); } @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) hugetlb_sysfs_init(); + hugetlb_register_all_nodes(); + return 0; } module_init(hugetlb_init); @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta proc_doulongvec_minmax(table, write, file, buffer, length, ppos); if (write) - h->max_huge_pages = set_max_huge_pages(h, tmp); + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); return 0; } Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 @@ -21,9 +21,12 @@ #include <linux/sysdev.h> #include <linux/cpumask.h> +#include <linux/hugetlb.h> struct node { struct sys_device sysdev; + struct kobject *hugepages_kobj; + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; }; struct memory_block; From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 5/5] hugetlb: update hugetlb documentation for mempolicy based management. Date: Mon, 24 Aug 2009 15:30:12 -0400 Message-ID: <20090824193012.10317.70679.sendpatchset@localhost.localdomain> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com PATCH 5/5 hugetlb: update hugetlb documentation for mempolicy based management. Against: 2.6.31-rc6-mmotm-090820-1918 V2: Add brief description of per node attributes. This patch updates the kernel huge tlb documentation to describe the numa memory policy based huge page management. Additionally, the patch includes a fair amount of rework to improve consistency, eliminate duplication and set the context for documenting the memory policy interaction.
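As background for the documentation changes below, here is a stand-alone user-space sketch of what "numactl -m 2 <command that writes nr_hugepages>" amounts to in terms of set_mempolicy(2). It is an illustration only, not part of the patch; the node number and page count are arbitrary, and it assumes libnuma's <numaif.h> wrapper (link with -lnuma):

	#include <numaif.h>	/* set_mempolicy(), MPOL_BIND */
	#include <stdio.h>

	int main(void)
	{
		unsigned long nodemask = 1UL << 2;	/* node 2 only */
		FILE *fp;

		/* Constrain this task's memory policy to node 2 ... */
		if (set_mempolicy(MPOL_BIND, &nodemask,
				  8 * sizeof(nodemask)) != 0) {
			perror("set_mempolicy");
			return 1;
		}

		/* ... so that, with this series applied, the persistent
		 * huge pages allocated by the write below are all taken
		 * from node 2. */
		fp = fopen("/proc/sys/vm/nr_hugepages", "w");
		if (!fp) {
			perror("fopen");
			return 1;
		}
		fprintf(fp, "16\n");
		return fclose(fp) ? 1 : 0;
	}

Any of the policy modes discussed in the documentation below could be substituted for MPOL_BIND here; per the series, they all behave like interleave over the resulting allowed-nodes mask.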
Signed-off-by: Lee Schermerhorn Documentation/vm/hugetlbpage.txt | 257 ++++++++++++++++++++++++++------------- 1 file changed, 172 insertions(+), 85 deletions(-) Index: linux-2.6.31-rc6-mmotm-090820-1918/Documentation/vm/hugetlbpage.txt =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/Documentation/vm/hugetlbpage.txt 2009-08-24 12:12:44.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/Documentation/vm/hugetlbpage.txt 2009-08-24 12:50:49.000000000 -0400 @@ -11,23 +11,21 @@ This optimization is more critical now a (several GBs) are more readily available. Users can use the huge page support in Linux kernel by either using the mmap -system call or standard SYSv shared memory system calls (shmget, shmat). +system call or standard SYSV shared memory system calls (shmget, shmat). First the Linux kernel needs to be built with the CONFIG_HUGETLBFS (present under "File systems") and CONFIG_HUGETLB_PAGE (selected automatically when CONFIG_HUGETLBFS is selected) configuration options. -The kernel built with huge page support should show the number of configured -huge pages in the system by running the "cat /proc/meminfo" command. +The /proc/meminfo file provides information about the total number of hugetlb +pages preallocated in the kernel's huge page pool. It also displays +information about the number of free, reserved and surplus huge pages and the +[default] huge page size. The huge page size is needed for generating the +proper alignment and size of the arguments to system calls that map huge page +regions. -/proc/meminfo also provides information about the total number of hugetlb -pages configured in the kernel. It also displays information about the -number of free hugetlb pages at any time. It also displays information about -the configured huge page size - this is needed for generating the proper -alignment and size of the arguments to the above system calls. - -The output of "cat /proc/meminfo" will have lines like: +The output of "cat /proc/meminfo" will include lines like: ..... HugePages_Total: vvv @@ -53,26 +51,25 @@ HugePages_Surp is short for "surplus," /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb -pages in the kernel. Super user can dynamically request more (or free some -pre-configured) huge pages. -The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of huge pages is -possible only if there are enough hugetlb pages free that can be transferred -back to regular memory pool). - -Pages that are used as hugetlb pages are reserved inside the kernel and cannot -be used for other purposes. - -Once the kernel with Hugetlb page support is built and running, a user can -use either the mmap system call or shared memory system calls to start using -the huge pages. It is required that the system administrator preallocate -enough memory for huge page purposes. - -The administrator can preallocate huge pages on the kernel boot command line by -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages -requested. This is the most reliable method for preallocating huge pages as -memory has not yet become fragmented. +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre- +allocated in the kernel's huge page pool. These are called "persistent" +huge pages. 
A user with root privileges can dynamically allocate more or +free some persistent huge pages by increasing or decreasing the value of +'nr_hugepages'. + +Pages that are used as huge pages are reserved inside the kernel and cannot +be used for other purposes. Huge pages can not be swapped out under +memory pressure. + +Once a number of huge pages have been pre-allocated to the kernel huge page +pool, a user with appropriate privilege can use either the mmap system call +or shared memory system calls to use the huge pages. See the discussion of +Using Huge Pages, below. + +The administrator can preallocate persistent huge pages on the kernel boot +command line by specifying the "hugepages=N" parameter, where 'N' = the +number of huge pages requested. This is the most reliable method +of preallocating huge pages as memory has not yet become fragmented. Some platforms support multiple huge page sizes. To preallocate huge pages of a specific size, one must preceed the huge pages boot command parameters @@ -80,19 +77,24 @@ with a huge page size selection paramete be specified in bytes with optional scale suffix [kKmMgG]. The default huge page size may be selected with the "default_hugepagesz=" boot parameter. -/proc/sys/vm/nr_hugepages indicates the current number of configured [default -size] hugetlb pages in the kernel. Super user can dynamically request more -(or free some pre-configured) huge pages. - -Use the following command to dynamically allocate/deallocate default sized -huge pages: +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages +indicates the current number of pre-allocated huge pages of the default size. +Thus, one can use the following command to dynamically allocate/deallocate +default sized persistent huge pages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 default sized huge pages in the system. +This command will try to adjust the number of default sized huge pages in the +huge page pool to 20, allocating or freeing huge pages, as required. + On a NUMA platform, the kernel will attempt to distribute the huge page pool -over the all on-line nodes. These huge pages, allocated when nr_hugepages -is increased, are called "persistent huge pages". +over all the nodes specified by the NUMA memory policy of the task that +modifies nr_hugepages that contain sufficient available contiguous memory. +These nodes are called the huge pages "allowed nodes". The default for the +huge pages allowed nodes--when the task has default memory policy--is all +on-line nodes. See the discussion below of the interaction of task memory +policy, cpusets and per node attributes with the allocation and freeing of +persistent huge pages. The success or failure of huge page allocation depends on the amount of physically contiguous memory that is preset in system at the time of the @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att allocating extra pages on other nodes with sufficient available contiguous memory, if any. -System administrators may want to put this command in one of the local rc init -files. This will enable the kernel to request huge pages early in the boot -process when the possibility of getting physical contiguous pages is still -very high. Administrators can verify the number of huge pages actually -allocated by checking the sysctl or meminfo. To check the per node +System administrators may want to put this command in one of the local rc +init files.
This will enable the kernel to preallocate huge pages early in +the boot process when the possibility of getting physical contiguous pages +is still very high. Administrators can verify the number of huge pages +actually allocated by checking the sysctl or meminfo. To check the per node distribution of huge pages in a NUMA system, use: cat /sys/devices/system/node/node*/meminfo | fgrep Huge @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are requested by applications. Writing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain "surplus" -huge pages from the buddy allocator, when the normal pool is exhausted. As -these surplus huge pages go out of use, they are freed back to the buddy -allocator. +indicates that the hugetlb subsystem is allowed to try to obtain that +number of "surplus" huge pages from the kernel's normal page pool, when the +persistent huge page pool is exhausted. As these surplus huge pages become +unused, they are freed back to the kernel's normal page pool. -When increasing the huge page pool size via nr_hugepages, any surplus +When increasing the huge page pool size via nr_hugepages, any existing surplus pages will first be promoted to persistent huge pages. Then, additional huge pages will be allocated, if necessary and if possible, to fulfill -the new huge page pool size. +the new persistent huge page pool size. The administrator may shrink the pool of preallocated huge pages for the default huge page size by setting the nr_hugepages sysctl to a smaller value. The kernel will attempt to balance the freeing of huge pages -across all on-line nodes. Any free huge pages on the selected nodes will -be freed back to the buddy allocator. - -Caveat: Shrinking the pool via nr_hugepages such that it becomes less -than the number of huge pages in use will convert the balance to surplus -huge pages even if it would exceed the overcommit value. As long as -this condition holds, however, no more surplus huge pages will be -allowed on the system until one of the two sysctls are increased -sufficiently, or the surplus huge pages go out of use and are freed. +across all nodes in the memory policy of the task modifying nr_hugepages. +Any free huge pages on the selected nodes will be freed back to the kernel's +normal page pool. + +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that +it becomes less than the number of huge pages in use will convert the balance +of the in-use huge pages to surplus huge pages. This will occur even if +the number of surplus pages would exceed the overcommit value. As long as +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is +increased sufficiently, or the surplus huge pages go out of use and are freed-- +no more surplus huge pages will be allowed to be allocated. With support for multiple huge page pools at run-time available, much of -the huge page userspace interface has been duplicated in sysfs. The above -information applies to the default huge page size which will be -controlled by the /proc interfaces for backwards compatibility. The root -huge page control directory in sysfs is: +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. +The /proc interfaces discussed above have been retained for backwards +compatibility.
The root huge page control directory in sysfs is: /sys/kernel/mm/hugepages For each huge page size supported by the running kernel, a subdirectory -will exist, of the form +will exist, of the form: hugepages-${size}kB @@ -159,6 +162,98 @@ Inside each of these directories, the sa which function as described above for the default huge page-sized case. + +Interaction of Task Memory Policy with Huge Page Allocation/Freeing: + +Whether huge pages are allocated and freed via the /proc interface or +the /sysfs interface, the NUMA nodes from which huge pages are allocated +or freed are controlled by the NUMA memory policy of the task that modifies +the nr_hugepages parameter. [nr_overcommit_hugepages is a global limit.] + +The recommended method to allocate or free huge pages to/from the kernel +huge page pool, using the nr_hugepages example above, is: + + numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages. + +or, more succinctly: + + numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages. + +This will allocate or free abs(20 - nr_hugepages) to or from the nodes +specified in <node-list>, depending on whether nr_hugepages is initially +less than or greater than 20, respectively. No huge pages will be +allocated nor freed on any node not included in the specified <node-list>. + +Any memory policy mode--bind, preferred, local or interleave--may be +used. The effect on persistent huge page allocation will be as follows: + +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], + persistent huge pages will be distributed across the node or nodes + specified in the mempolicy as if "interleave" had been specified. + However, if a node in the policy does not contain sufficient contiguous + memory for a huge page, the allocation will not "fallback" to the nearest + neighbor node with sufficient contiguous memory. To do this would cause + undesirable imbalance in the distribution of the huge page pool, or + possibly, allocation of persistent huge pages on nodes not allowed by + the task's memory policy. + +2) One or more nodes may be specified with the bind or interleave policy. + If more than one node is specified with the preferred policy, only the + lowest numeric id will be used. Local policy will select the node where + the task is running at the time the nodes_allowed mask is constructed. + +3) For local policy to be deterministic, the task must be bound to a cpu or + cpus in a single node. Otherwise, the task could be migrated to some + other node at any time after launch and the resulting node will be + indeterminate. Thus, local policy is not very useful for this purpose. + Any of the other mempolicy modes may be used to specify a single node. + +4) The nodes allowed mask will be derived from any non-default task mempolicy, + whether this policy was set explicitly by the task itself or one of its + ancestors, such as numactl. This means that if the task is invoked from a + shell with non-default policy, that policy will be used. One can specify a + node list of "all" with numactl --interleave or --membind [-m] to achieve + interleaving over all nodes in the system or cpuset. + +5) Any task mempolicy specified--e.g., using numactl--will be constrained by + the resource limits of any cpuset in which the task runs. Thus, there will + be no way for a task with non-default policy running in a cpuset with a + subset of the system nodes to allocate huge pages outside the cpuset + without first moving to a cpuset that contains all of the desired nodes.
+ +6) Hugepages allocated at boot time always use the node_online_map. + + +Per Node Hugepages Attributes + +A subset of the contents of the root huge page control directory in sysfs, +described above, has been replicated under each "node" system device in: + + /sys/devices/system/node/node[0-9]*/hugepages/ + +Under this directory, the subdirectory for each supported huge page size +contains the following attribute files: + + nr_hugepages + free_hugepages + surplus_hugepages + +The free_hugepages and surplus_hugepages attribute files are read-only. They return the number +of free and surplus [overcommitted] huge pages, respectively, on the parent +node. + +The nr_hugepages attribute will return the total number of huge pages on the +specified node. When this attribute is written, the number of persistent huge +pages on the parent node will be adjusted to the specified value, if sufficient +resources exist, regardless of the task's mempolicy or cpuset constraints. + +Note that the numbers of overcommit and reserve pages remain global quantities, +as we don't know until fault time, when the faulting task's mempolicy is applied, +from which node the huge page allocation will be attempted. + + +Using Huge Pages: + If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs: @@ -204,9 +299,11 @@ mount of filesystem will be required for * requesting huge pages. * * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * huge pages. That means the addresses starting with 0x800000... will need - * to be specified. Specifying a fixed address is not required on ppc64, - * i386 or x86_64. + * huge pages. That means that if one requires a fixed address, a huge page + * aligned address starting with 0x800000... will be required. If a fixed + * address is not required, the kernel will select an address in the proper + * range. + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. * * Note: The default shared memory limit is quite low on many kernels, * you may need to increase it via: @@ -235,14 +332,8 @@ mount of filesystem will be required for #define dprintf(x) printf(x) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define SHMAT_FLAGS (SHM_RND) -#else -#define ADDR (void *)(0x0UL) +#define ADDR (void *)(0x0UL) /* let kernel choose address */ #define SHMAT_FLAGS (0) -#endif int main(void) { @@ -300,10 +391,12 @@ int main(void) * example, the app is requesting memory of size 256MB that is backed by * huge pages. * - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. - * That means the addresses starting with 0x800000... will need to be - * specified. Specifying a fixed address is not required on ppc64, i386 - * or x86_64. + * For the ia64 architecture, the Linux kernel reserves Region number 4 for + * huge pages. That means that if one requires a fixed address, a huge page + * aligned address starting with 0x800000... will be required. If a fixed + * address is not required, the kernel will select an address in the proper + * range. + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
*/ #include <stdlib.h> #include <stdio.h> @@ -315,14 +408,8 @@ int main(void) #define LENGTH (256UL*1024*1024) #define PROTECTION (PROT_READ | PROT_WRITE) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define FLAGS (MAP_SHARED | MAP_FIXED) -#else -#define ADDR (void *)(0x0UL) +#define ADDR (void *)(0x0UL) /* let kernel choose address */ #define FLAGS (MAP_SHARED) -#endif void check_bytes(char *addr) {
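The documentation's mmap example is truncated here by the diff context, so for convenience a condensed, self-contained version of it follows. This is a sketch only: the hugetlbfs mount point and file name are assumptions, and error handling is minimal.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define FILE_NAME "/mnt/huge/hugepagefile"  /* assumed mount point */
	#define LENGTH (256UL*1024*1024)
	#define PROTECTION (PROT_READ | PROT_WRITE)

	int main(void)
	{
		int fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
		char *addr;

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		/* NULL address: let the kernel choose a suitably aligned
		 * range, as the revised comment above recommends. */
		addr = mmap(NULL, LENGTH, PROTECTION, MAP_SHARED, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			unlink(FILE_NAME);
			exit(1);
		}
		addr[0] = 1;		/* fault in the first huge page */
		munmap(addr, LENGTH);
		close(fd);
		unlink(FILE_NAME);
		return 0;
	}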
From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Date: Tue, 25 Aug 2009 01:16:26 -0700 (PDT) Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1251188303; bh=rzx+DseYK9bERUjtMPic3xFu3Aw=; h=DomainKey-Signature:Date:From:X-X-Sender:To:cc:Subject: In-Reply-To:Message-ID:References:User-Agent:MIME-Version: Content-Type:X-System-Of-Record; b=aZnAqe8DdlxNGFh3RLj1ltIvuqqut9h lCRpsQ8FYzG9LD3sTFW00bVbqGKnO4E6IOBHAkcYMjRplW5SIZc+nGw== In-Reply-To: <20090824192637.10317.31039.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > [PATCH 2/4] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V3: > + moved this patch to after the "rework" of hstate_next_node_to_... > functions as this patch is more specific to using task mempolicy > to control huge page allocation and freeing. > > In preparation for constraining huge page allocation and freeing by the > controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer > to the allocate, free and surplus adjustment functions. For now, pass > NULL to indicate default behavior--i.e., use node_online_map. A > subsqeuent patch will derive a non-default mask from the controlling > task's numa mempolicy. > > Reviewed-by: Mel Gorman > Signed-off-by: Lee Schermerhorn > > mm/hugetlb.c | 102 ++++++++++++++++++++++++++++++++++++++--------------------- > 1 file changed, 67 insertions(+), 35 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > } > > /* > - * common helper function for hstate_next_node_to_{alloc|free}. > - * return next node in node_online_map, wrapping at end. > + * common helper functions for hstate_next_node_to_{alloc|free}. > + * We may have allocated or freed a huge pages based on a different > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > + * be outside of *nodes_allowed. Ensure that we use the next > + * allowed node for alloc or free.
> */ > -static int next_node_allowed(int nid) > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > { > - nid = next_node(nid, node_online_map); > + nid = next_node(nid, *nodes_allowed); > if (nid == MAX_NUMNODES) > - nid = first_node(node_online_map); > + nid = first_node(*nodes_allowed); > VM_BUG_ON(nid >= MAX_NUMNODES); > > return nid; > } > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > +{ > + if (!node_isset(nid, *nodes_allowed)) > + nid = next_node_allowed(nid, nodes_allowed); > + return nid; > +} Awkward name considering this doesn't simply return true or false as expected, it returns a nid. > + > /* > * Use a helper variable to find the next node and then > * copy it back to next_nid_to_alloc afterwards: > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > * But we don't need to use a spin_lock here: it really > * doesn't matter if occasionally a racer chooses the > - * same nid as we do. Move nid forward in the mask even > - * if we just successfully allocated a hugepage so that > - * the next caller gets hugepages on the next node. > + * same nid as we do. Move nid forward in the mask whether > + * or not we just successfully allocated a hugepage so that > + * the next allocation addresses the next node. > */ > -static int hstate_next_node_to_alloc(struct hstate *h) > +static int hstate_next_node_to_alloc(struct hstate *h, > + nodemask_t *nodes_allowed) > { > int nid, next_nid; > > - nid = h->next_nid_to_alloc; > - next_nid = next_node_allowed(nid); > + if (!nodes_allowed) > + nodes_allowed = &node_online_map; > + > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > + > + next_nid = next_node_allowed(nid, nodes_allowed); > h->next_nid_to_alloc = next_nid; > + > return nid; > } Don't need next_nid. > -static int alloc_fresh_huge_page(struct hstate *h) > +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) > { > struct page *page; > int start_nid; > int next_nid; > int ret = 0; > > - start_nid = hstate_next_node_to_alloc(h); > + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); > next_nid = start_nid; > > do { > @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct > ret = 1; > break; > } > - next_nid = hstate_next_node_to_alloc(h); > + next_nid = hstate_next_node_to_alloc(h, nodes_allowed); > } while (next_nid != start_nid); > > if (ret) > @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct > * whether or not we find a free huge page to free so that the > * next attempt to free addresses the next node. > */ > -static int hstate_next_node_to_free(struct hstate *h) > +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) > { > int nid, next_nid; > > - nid = h->next_nid_to_free; > - next_nid = next_node_allowed(nid); > + if (!nodes_allowed) > + nodes_allowed = &node_online_map; > + > + nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); > + > + next_nid = next_node_allowed(nid, nodes_allowed); > h->next_nid_to_free = next_nid; > + > return nid; > } Same. 
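For reference, the simplification suggested in the two comments above might look like this (a sketch against the quoted code, not an applied patch; hstate_next_node_to_free() would change the same way):

	static int hstate_next_node_to_alloc(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		int nid;

		if (!nodes_allowed)
			nodes_allowed = &node_online_map;

		/* return the current allowed node ... */
		nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
		/* ... and advance the round-robin cursor in one step */
		h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

		return nid;
	}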
From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Date: Tue, 25 Aug 2009 01:47:52 -0700 (PDT) Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1251190190; bh=pufH2oncBwHBerjrfiZQuCZW34Y=; h=DomainKey-Signature:Date:From:X-X-Sender:To:cc:Subject: In-Reply-To:Message-ID:References:User-Agent:MIME-Version: Content-Type:X-System-Of-Record; b=ONrFEv/iv9EMJI/v+06aKDhTEKVW3Mi jOxTOr0EHlXJsNqXlhI/p6vM8EDY8bgNGL4TLkfw3702+6oLoURWxQg== In-Reply-To: <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > This patch derives a "nodes_allowed" node mask from the numa > mempolicy of the task modifying the number of persistent huge > pages to control the allocation, freeing and adjusting of surplus > huge pages. This mask is derived as follows: > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > is produced. This will cause the hugetlb subsystem to use > node_online_map as the "nodes_allowed". This preserves the > behavior before this patch. > * For "preferred" mempolicy, including explicit local allocation, > a nodemask with the single preferred node will be produced. > "local" policy will NOT track any internode migrations of the > task adjusting nr_hugepages. > * For "bind" and "interleave" policy, the mempolicy's nodemask > will be used. > * Other than to inform the construction of the nodes_allowed node > mask, the actual mempolicy mode is ignored. That is, all modes > behave like interleave over the resulting nodes_allowed mask > with no "fallback". > > Notes: > > 1) This patch introduces a subtle change in behavior: huge page > allocation and freeing will be constrained by any mempolicy > that the task adjusting the huge page pool inherits from its > parent. This policy could come from a distant ancestor. The > adminstrator adjusting the huge page pool without explicitly > specifying a mempolicy via numactl might be surprised by this. > Additionaly, any mempolicy specified by numactl will be > constrained by the cpuset in which numactl is invoked. > > 2) Hugepages allocated at boot time use the node_online_map. > An additional patch could implement a temporary boot time > huge pages nodes_allowed command line parameter. > > 3) Using mempolicy to control persistent huge page allocation > and freeing requires no change to hugeadm when invoking > it via numactl, as shown in the examples below. However, > hugeadm could be enhanced to take the allowed nodes as an > argument and set its task mempolicy itself. This would allow > it to detect and warn about any non-default mempolicy that it > inherited from its parent, thus alleviating the issue described > in Note 1 above. > > See the updated documentation [next patch] for more information > about the implications of this patch. 
> > Examples: > > Starting with: > > Node 0 HugePages_Total: 0 > Node 1 HugePages_Total: 0 > Node 2 HugePages_Total: 0 > Node 3 HugePages_Total: 0 > > Default behavior [with or without this patch] balances persistent > hugepage allocation across nodes [with sufficient contiguous memory]: > > hugeadm --pool-pages-min=2048Kb:32 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 8 > Node 3 HugePages_Total: 8 > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > '--membind' because it allows multiple nodes to be specified > and it's easy to type]--we can allocate huge pages on > individual nodes or sets of nodes. So, starting from the > condition above, with 8 huge pages per node: > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The incremental 8 huge pages were restricted to node 2 by the > specified mempolicy. > > Similarly, we can use mempolicy to free persistent huge pages > from specified nodes: > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > yields: > > Node 0 HugePages_Total: 4 > Node 1 HugePages_Total: 4 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The 8 huge pages freed were balanced over nodes 0 and 1. > > Signed-off-by: Lee Schermerhorn > > include/linux/mempolicy.h | 3 ++ > mm/hugetlb.c | 14 ++++++---- > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 73 insertions(+), 5 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > } > return zl; > } > + > +/* > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > + * > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > + * to constraing the allocation and freeing of persistent huge pages > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > + * 'bind' policy in this context. An attempt to allocate a persistent huge > + * page will never "fallback" to another node inside the buddy system > + * allocator. > + * > + * If the task's mempolicy is "default" [NULL], just return NULL for > + * default behavior. Otherwise, extract the policy nodemask for 'bind' > + * or 'interleave' policy or construct a nodemask for 'preferred' or > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. > + * > + * N.B., it is the caller's responsibility to free a returned nodemask. > + */ > +nodemask_t *huge_mpol_nodes_allowed(void) > +{ > + nodemask_t *nodes_allowed = NULL; > + struct mempolicy *mempolicy; > + int nid; > + > + if (!current->mempolicy) > + return NULL; > + > + mpol_get(current->mempolicy); > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); I don't think using '\n' inside printk's is allowed anymore. 
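For reference, the warning collapses naturally into a single-line message without embedded '\n's (wording illustrative):

	printk(KERN_WARNING
		"%s: unable to allocate nodes allowed mask for huge page allocation, falling back to default\n",
		current->comm);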
> + goto out; > + } > + nodes_clear(*nodes_allowed); > + > + mempolicy = current->mempolicy; > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + if (mempolicy->flags & MPOL_F_LOCAL) > + nid = numa_node_id(); > + else > + nid = mempolicy->v.preferred_node; > + node_set(nid, *nodes_allowed); > + break; > + > + case MPOL_BIND: > + /* Fall through */ > + case MPOL_INTERLEAVE: > + *nodes_allowed = mempolicy->v.nodes; > + break; > + > + default: > + BUG(); > + } > + > +out: > + mpol_put(current->mempolicy); > + return nodes_allowed; > +} This should be all unnecessary, see below. > #endif > > /* Allocate a page in interleaved policy. > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > +extern nodemask_t *huge_mpol_nodes_allowed(void); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > return node_zonelist(0, gfp_flags); > } > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > + > static inline int do_migrate_pages(struct mm_struct *mm, > const nodemask_t *from_nodes, > const nodemask_t *to_nodes, int flags) > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > { > unsigned long min_count, ret; > + nodemask_t *nodes_allowed; > > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > Why can't you simply do this? struct mempolicy *pol = NULL; nodemask_t *nodes_allowed = &node_online_map; local_irq_disable(); pol = current->mempolicy; mpol_get(pol); local_irq_enable(); if (pol) { switch (pol->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: nodes_allowed = pol->v.nodes; break; case MPOL_PREFERRED: ... use NODEMASK_SCRATCH() ... default: BUG(); } } mpol_put(pol); and then use nodes_allowed throughout set_max_huge_pages()? 
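A sketch of the shape being suggested here -- resolve nodes_allowed inside hugetlb.c, defaulting to node_online_map, with the 'preferred' mask in caller-owned storage so nothing has to be kmalloc()'d in mempolicy.c. The helper name and the caller-supplied mask argument are illustrative assumptions, not code from the thread:

static nodemask_t *resolve_nodes_allowed(struct mempolicy *pol,
					 nodemask_t *preferred_mask)
{
	if (!pol)
		return &node_online_map;	/* default policy */

	switch (pol->mode) {
	case MPOL_BIND:
	case MPOL_INTERLEAVE:
		return &pol->v.nodes;
	case MPOL_PREFERRED:
		nodes_clear(*preferred_mask);
		if (pol->flags & MPOL_F_LOCAL)
			node_set(numa_node_id(), *preferred_mask);
		else
			node_set(pol->v.preferred_node, *preferred_mask);
		return preferred_mask;
	default:
		BUG();
	}
	return NULL;
}

set_max_huge_pages() could then declare the preferred mask at its own top level and pass it in (or use NODEMASK_SCRATCH() if a nodemask_t is too big for the stack), sidestepping the allocation-failure path entirely.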
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Tue, 25 Aug 2009 11:19:07 +0100 Message-ID: <20090825101906.GB4427@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Content-Disposition: inline In-Reply-To: <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="utf-8" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > PATCH/RFC 5/4 hugetlb: register per node hugepages attributes > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: remove dependency on kobject private bitfield. Search > global hstates then all per node hstates for kobject > match in attribute show/store functions. > > V3: rebase atop the mempolicy-based hugepage alloc/free; > use custom "nodes_allowed" to restrict alloc/free to > a specific node via per node attributes. Per node > attribute overrides mempolicy. I.e., mempolicy only > applies to global attributes. > > To demonstrate feasibility--if not advisability--of supporting > both mempolicy-based persistent huge page management and per > node "override" attributes. > > This patch adds the per huge page size control/query attributes > to the per node sysdevs: > > /sys/devices/system/node/node/hugepages/hugepages-/ > nr_hugepages - r/w > free_huge_pages - r/o > surplus_huge_pages - r/o > > The patch attempts to re-use/share as much of the existing > global hstate attribute initialization and handling, and the > "nodes_allowed" constraint processing as possible. > In set_max_huge_pages(), a node id < 0 indicates a change to > global hstate parameters. In this case, any non-default task > mempolicy will be used to generate the nodes_allowed mask. A > node id > 0 indicates a node specific update and the count > argument specifies the target count for the node. From this > info, we compute the target global count for the hstate and > construct a nodes_allowed node mask containing only the specified > node. Thus, setting the node specific nr_hugepages via the > per node attribute effectively overrides any task mempolicy. > > > Issue: dependency of base driver [node] on hugetlbfs module. > We want to keep all of the hstate attribute registration and handling > in the hugetlb module. However, we need to call into this code to > register the per node hstate attributes on node hot plug.
> > With this patch: > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB > ./ ../ free_hugepages nr_hugepages surplus_hugepages > > Starting from: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 0 > Node 2 HugePages_Free: 0 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 0 > > Allocate 16 persistent huge pages on node 2: > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages > > [Note that this is equivalent to: > numactl -m 2 hugeadm --pool-pages-min 2M:+16 > ] > > Yields: > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 16 > Node 2 HugePages_Free: 16 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > vm.nr_hugepages = 16 > > Global controls work as expected--reduce pool to 8 persistent huge pages: > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > Node 0 HugePages_Total: 0 > Node 0 HugePages_Free: 0 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > Node 2 HugePages_Total: 8 > Node 2 HugePages_Free: 8 > Node 2 HugePages_Surp: 0 > Node 3 HugePages_Total: 0 > Node 3 HugePages_Free: 0 > Node 3 HugePages_Surp: 0 > > > Signed-off-by: Lee Schermerhorn > > drivers/base/node.c | 2 > include/linux/hugetlb.h | 6 + > include/linux/node.h | 3 > mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++------- > 4 files changed, 197 insertions(+), 27 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400 > @@ -200,6 +200,7 @@ int register_node(struct node *node, int > sysdev_create_file(&node->sysdev, &attr_distance); > > scan_unevictable_register_node(node); > + hugetlb_register_node(node); > } > return error; > } > @@ -220,6 +221,7 @@ void unregister_node(struct node *node) > sysdev_remove_file(&node->sysdev, &attr_distance); > > scan_unevictable_unregister_node(node); > + hugetlb_unregister_node(node); > > sysdev_unregister(&node->sysdev); > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 -0400 > @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate > return size_to_hstate(PAGE_SIZE << compound_order(page)); > } > > +struct node; > +extern void hugetlb_register_node(struct node *); > +extern void
hugetlb_unregister_node(struct node *); > + > #else > struct hstate {}; > #define alloc_bootmem_huge_page(h) NULL > @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug > { > return 1; > } > +#define hugetlb_register_node(NP) > +#define hugetlb_unregister_node(NP) > #endif > This also needs to be done for the !NUMA case. Try building without NUMA set and you get the following with this patch applied CC mm/hugetlb.o mm/hugetlb.c: In function 'hugetlb_exit': mm/hugetlb.c:1629: error: implicit declaration of function 'hugetlb_unregister_all_nodes' mm/hugetlb.c: In function 'hugetlb_init': mm/hugetlb.c:1665: error: implicit declaration of function 'hugetlb_register_all_nodes' make[1]: *** [mm/hugetlb.o] Error 1 make: *** [mm] Error 2 > #endif /* _LINUX_HUGETLB_H */ > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:56.000000000 -0400 > @@ -24,6 +24,7 @@ > #include > > #include > +#include <linux/node.h> > #include "internal.h" > > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; > @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs > return ret; > } > > +static nodemask_t *nodes_allowed_from_node(int nid) > +{ This name is a bit weird. It's creating a nodemask with just a single node allowed. Is there something wrong with using the existing function nodemask_of_node()? If stack is the problem, perhaps there is some macro magic that would allow a nodemask to be either declared on the stack or kmalloc'd. > + nodemask_t *nodes_allowed; > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); > + } else { > + nodes_clear(*nodes_allowed); > + node_set(nid, *nodes_allowed); > + } > + return nodes_allowed; > +} > + > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > + int nid) > { > unsigned long min_count, ret; > nodemask_t *nodes_allowed; > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > > - nodes_allowed = huge_mpol_nodes_allowed(); > + if (nid < 0) > + nodes_allowed = huge_mpol_nodes_allowed(); hugetlb is a bit littered with magic numbers being passed into functions. Attempts have been made to clear them up as patches change that area. Would it be possible to define something like #define HUGETLB_OBEY_MEMPOLICY -1 for the nid here as opposed to passing in -1? I know -1 is used in the page allocator functions but there it means "current node" and here it means "obey mempolicies". > + else { > + /* > + * incoming 'count' is for node 'nid' only, so > + * adjust count to global, but restrict alloc/free > + * to the specified node.
> + */ > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > + nodes_allowed = nodes_allowed_from_node(nid); > + } > > /* > * Increase the pool size > @@ -1338,34 +1365,69 @@ out: > static struct kobject *hugepages_kobj; > static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node *node = &node_devices[nid]; > + int hi; > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) Does that hi mean hello, high, nid or hstate_idx? hstate_idx would appear to be the appropriate name here. > + if (node->hstate_kobjs[hi] == kobj) { > + if (nidp) > + *nidp = nid; > + return &hstates[hi]; > + } > + } Ok.... so, there is a struct node array for the sysdev and this patch adds references to the "hugepages" directory kobject and the subdirectories for each page size. We walk all the objects until we find a match. Obviously, this adds a dependency of base node support on hugetlbfs which feels backwards and you call that out in your leader. Can this be the other way around? i.e. The struct hstate has an array of kobjects arranged by nid that is filled in when the node is registered? There will only be one kobject-per-pagesize-per-node so it seems like it would work. I confess, I haven't prototyped this to be 100% sure. > + > + BUG(); > + return NULL; > +} > + > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > { > int i; > + > for (i = 0; i < HUGE_MAX_HSTATE; i++) > - if (hstate_kobjs[i] == kobj) > + if (hstate_kobjs[i] == kobj) { > + if (nidp) > + *nidp = -1; > return &hstates[i]; > - BUG(); > - return NULL; > + } > + > + return kobj_to_node_hstate(kobj, nidp); > } > > static ssize_t nr_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > + struct hstate *h; > + unsigned long nr_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + nr_huge_pages = h->nr_huge_pages; Here is another magic number except it means something slightly different. It means NR_GLOBAL_HUGEPAGES or something similar. It would be nice if these different special nid values could be named, preferably collapsed to being one "core" thing. > + else > + nr_huge_pages = h->nr_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", nr_huge_pages); > } > + > static ssize_t nr_hugepages_store(struct kobject *kobj, > struct kobj_attribute *attr, const char *buf, size_t count) > { > - int err; > unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h; > + int nid; > + int err; > > err = strict_strtoul(buf, 10, &input); > if (err) > return 0; > > - h->max_huge_pages = set_max_huge_pages(h, input); "input" is a bit meaningless. The function you are passing to calls this parameter "count". Can you match the naming please?
Otherwise, I might guess that this is a "delta" which occurs elsewhere in the hugetlb code. > + h = kobj_to_hstate(kobj, &nid); > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > return count; > } > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > + > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > } > + > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > struct kobj_attribute *attr, const char *buf, size_t count) > { > int err; > unsigned long input; > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > err = strict_strtoul(buf, 10, &input); > if (err) > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > static ssize_t free_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->free_huge_pages); > + struct hstate *h; > + unsigned long free_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + free_huge_pages = h->free_huge_pages; > + else > + free_huge_pages = h->free_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", free_huge_pages); > } > HSTATE_ATTR_RO(free_hugepages); > > static ssize_t resv_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > + struct hstate *h = kobj_to_hstate(kobj, NULL); > return sprintf(buf, "%lu\n", h->resv_huge_pages); > } > HSTATE_ATTR_RO(resv_hugepages); > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > static ssize_t surplus_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > - struct hstate *h = kobj_to_hstate(kobj); > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > + struct hstate *h; > + unsigned long surplus_huge_pages; > + int nid; > + > + h = kobj_to_hstate(kobj, &nid); > + if (nid < 0) > + surplus_huge_pages = h->surplus_huge_pages; > + else > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > + > + return sprintf(buf, "%lu\n", surplus_huge_pages); > } > HSTATE_ATTR_RO(surplus_hugepages); > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > .attrs = hstate_attrs, > }; > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > + struct kobject *parent, > + struct kobject **hstate_kobjs, > + struct attribute_group *hstate_attr_group) > { > int retval; > + int hi = h - hstates; > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > - hugepages_kobj); > - if (!hstate_kobjs[h - hstates]) > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > + if (!hstate_kobjs[hi]) > return -ENOMEM; > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > - &hstate_attr_group); > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > if (retval) > - kobject_put(hstate_kobjs[h - hstates]); > + kobject_put(hstate_kobjs[hi]); > > return retval; > } > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo > return; > > for_each_hstate(h) { > - err = hugetlb_sysfs_add_hstate(h); > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > + hstate_kobjs,
&hstate_attr_group); > if (err) > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > h->name); > } > } > > +#ifdef CONFIG_NUMA > +static struct attribute *per_node_hstate_attrs[] = { > + &nr_hugepages_attr.attr, > + &free_hugepages_attr.attr, > + &surplus_hugepages_attr.attr, > + NULL, > +}; > + > +static struct attribute_group per_node_hstate_attr_group = { > + .attrs = per_node_hstate_attrs, > +}; > + > + > +void hugetlb_unregister_node(struct node *node) > +{ > + struct hstate *h; > + > + for_each_hstate(h) { > + kobject_put(node->hstate_kobjs[h - hstates]); > + node->hstate_kobjs[h - hstates] = NULL; > + } > + > + kobject_put(node->hugepages_kobj); > + node->hugepages_kobj = NULL; > +} > + > +static void hugetlb_unregister_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) > + hugetlb_unregister_node(&node_devices[nid]); > +} > + > +void hugetlb_register_node(struct node *node) > +{ > + struct hstate *h; > + int err; > + > + if (!hugepages_kobj) > + return; /* too early */ > + > + node->hugepages_kobj = kobject_create_and_add("hugepages", > + &node->sysdev.kobj); > + if (!node->hugepages_kobj) > + return; > + > + for_each_hstate(h) { > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > + node->hstate_kobjs, > + &per_node_hstate_attr_group); > + if (err) > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > + " for node %d\n", > + h->name, node->sysdev.id); > + } > +} > + > +static void hugetlb_register_all_nodes(void) > +{ > + int nid; > + > + for (nid = 0; nid < nr_node_ids; nid++) { > + struct node *node = &node_devices[nid]; > + if (node->sysdev.id == nid && !node->hugepages_kobj) > + hugetlb_register_node(node); > + } > +} > +#endif > + > static void __exit hugetlb_exit(void) > { > struct hstate *h; > > + hugetlb_unregister_all_nodes(); > + > for_each_hstate(h) { > kobject_put(hstate_kobjs[h - hstates]); > } > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > hugetlb_sysfs_init(); > > + hugetlb_register_all_nodes(); > + > return 0; > } > module_init(hugetlb_init); > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > if (write) > - h->max_huge_pages = set_max_huge_pages(h, tmp); > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > return 0; > } > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > @@ -21,9 +21,12 @@ > > #include > #include > +#include <linux/hugetlb.h> > > struct node { > struct sys_device sysdev; > + struct kobject *hugepages_kobj; > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > }; > > struct memory_block; > I'm not against this idea and think it can work side-by-side with the memory policies. I believe it does need a bit more cleaning up before merging though. I also wasn't able to test this yet due to various build and deploy issues.
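For the record, a sketch of the !NUMA stubs in mm/hugetlb.c that the build failure reported above calls for; the fix actually posted may differ:

#ifdef CONFIG_NUMA
/* ... the per node registration functions shown above ... */
#else
static void hugetlb_register_all_nodes(void) { }
static void hugetlb_unregister_all_nodes(void) { }
#endif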
--=20 Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-numa" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Date: Tue, 25 Aug 2009 11:22:04 +0100 Message-ID: <20090825102204.GC4427@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Mon, Aug 24, 2009 at 03:27:52PM -0400, Lee Schermerhorn wrote: > [PATCH 3/4] hugetlb: derive huge pages nodes allowed from task mempolicy > > Against: 2.6.31-rc6-mmotm-090820-1918 > > V2: > + cleaned up comments, removed some deemed unnecessary, > add some suggested by review > + removed check for !current in huge_mpol_nodes_allowed(). > + added 'current->comm' to warning message in huge_mpol_nodes_allowed(). > + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to > catch out of range node id. > + add examples to patch description > > V3: Factored this patch from V2 patch 2/3 > > V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages() > > This patch derives a "nodes_allowed" node mask from the numa > mempolicy of the task modifying the number of persistent huge > pages to control the allocation, freeing and adjusting of surplus > huge pages. This mask is derived as follows: > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > is produced. This will cause the hugetlb subsystem to use > node_online_map as the "nodes_allowed". This preserves the > behavior before this patch. > * For "preferred" mempolicy, including explicit local allocation, > a nodemask with the single preferred node will be produced. > "local" policy will NOT track any internode migrations of the > task adjusting nr_hugepages. > * For "bind" and "interleave" policy, the mempolicy's nodemask > will be used. > * Other than to inform the construction of the nodes_allowed node > mask, the actual mempolicy mode is ignored. That is, all modes > behave like interleave over the resulting nodes_allowed mask > with no "fallback". > > Notes: > > 1) This patch introduces a subtle change in behavior: huge page > allocation and freeing will be constrained by any mempolicy > that the task adjusting the huge page pool inherits from its > parent. This policy could come from a distant ancestor. The > adminstrator adjusting the huge page pool without explicitly > specifying a mempolicy via numactl might be surprised by this. > Additionaly, any mempolicy specified by numactl will be > constrained by the cpuset in which numactl is invoked. > > 2) Hugepages allocated at boot time use the node_online_map. > An additional patch could implement a temporary boot time > huge pages nodes_allowed command line parameter. 
> > 3) Using mempolicy to control persistent huge page allocation > and freeing requires no change to hugeadm when invoking > it via numactl, as shown in the examples below. However, > hugeadm could be enhanced to take the allowed nodes as an > argument and set its task mempolicy itself. This would allow > it to detect and warn about any non-default mempolicy that it > inherited from its parent, thus alleviating the issue described > in Note 1 above. > > See the updated documentation [next patch] for more information > about the implications of this patch. > > Examples: > > Starting with: > > Node 0 HugePages_Total: 0 > Node 1 HugePages_Total: 0 > Node 2 HugePages_Total: 0 > Node 3 HugePages_Total: 0 > > Default behavior [with or without this patch] balances persistent > hugepage allocation across nodes [with sufficient contiguous memory]: > > hugeadm --pool-pages-min=2048Kb:32 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 8 > Node 3 HugePages_Total: 8 > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > '--membind' because it allows multiple nodes to be specified > and it's easy to type]--we can allocate huge pages on > individual nodes or sets of nodes. So, starting from the > condition above, with 8 huge pages per node: > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > yields: > > Node 0 HugePages_Total: 8 > Node 1 HugePages_Total: 8 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The incremental 8 huge pages were restricted to node 2 by the > specified mempolicy. > > Similarly, we can use mempolicy to free persistent huge pages > from specified nodes: > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > yields: > > Node 0 HugePages_Total: 4 > Node 1 HugePages_Total: 4 > Node 2 HugePages_Total: 16 > Node 3 HugePages_Total: 8 > > The 8 huge pages freed were balanced over nodes 0 and 1. > > Signed-off-by: Lee Schermerhorn I haven't been able to test this yet because of some build and deploy issues but I didn't spot anything wrong when eyeballing the patch. For the moment; Acked-by: Mel Gorman > > include/linux/mempolicy.h | 3 ++ > mm/hugetlb.c | 14 ++++++---- > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 73 insertions(+), 5 deletions(-) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > } > return zl; > } > + > +/* > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > + * > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > + * to constraing the allocation and freeing of persistent huge pages > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > + * 'bind' policy in this context. An attempt to allocate a persistent huge > + * page will never "fallback" to another node inside the buddy system > + * allocator. > + * > + * If the task's mempolicy is "default" [NULL], just return NULL for > + * default behavior. Otherwise, extract the policy nodemask for 'bind' > + * or 'interleave' policy or construct a nodemask for 'preferred' or > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. 
> + * > + * N.B., it is the caller's responsibility to free a returned nodemask. > + */ > +nodemask_t *huge_mpol_nodes_allowed(void) > +{ > + nodemask_t *nodes_allowed = NULL; > + struct mempolicy *mempolicy; > + int nid; > + > + if (!current->mempolicy) > + return NULL; > + > + mpol_get(current->mempolicy); > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > + if (!nodes_allowed) { > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > + "for huge page allocation.\nFalling back to default.\n", > + current->comm); > + goto out; > + } > + nodes_clear(*nodes_allowed); > + > + mempolicy = current->mempolicy; > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + if (mempolicy->flags & MPOL_F_LOCAL) > + nid = numa_node_id(); > + else > + nid = mempolicy->v.preferred_node; > + node_set(nid, *nodes_allowed); > + break; > + > + case MPOL_BIND: > + /* Fall through */ > + case MPOL_INTERLEAVE: > + *nodes_allowed = mempolicy->v.nodes; > + break; > + > + default: > + BUG(); > + } > + > +out: > + mpol_put(current->mempolicy); > + return nodes_allowed; > +} > #endif > > /* Allocate a page in interleaved policy. > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > +extern nodemask_t *huge_mpol_nodes_allowed(void); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > return node_zonelist(0, gfp_flags); > } > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > + > static inline int do_migrate_pages(struct mm_struct *mm, > const nodemask_t *from_nodes, > const nodemask_t *to_nodes, int flags) > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > { > unsigned long min_count, ret; > + nodemask_t *nodes_allowed; > > if (h->order >= MAX_ORDER) > return h->max_huge_pages; > > + nodes_allowed = huge_mpol_nodes_allowed(); > + > /* > * Increase the pool size > * First take pages out of surplus state. Then make up the > @@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages( > */ > spin_lock(&hugetlb_lock); > while (h->surplus_huge_pages && count > persistent_huge_pages(h)) { > - if (!adjust_pool_surplus(h, NULL, -1)) > + if (!adjust_pool_surplus(h, nodes_allowed, -1)) > break; > } > > @@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages( > * and reducing the surplus. 
> */ > spin_unlock(&hugetlb_lock); > - ret = alloc_fresh_huge_page(h, NULL); > + ret = alloc_fresh_huge_page(h, nodes_allowed); > spin_lock(&hugetlb_lock); > if (!ret) > goto out; > @@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages( > */ > min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; > min_count = max(count, min_count); > - try_to_free_low(h, min_count, NULL); > + try_to_free_low(h, min_count, nodes_allowed); > while (min_count < persistent_huge_pages(h)) { > - if (!free_pool_huge_page(h, NULL, 0)) > + if (!free_pool_huge_page(h, nodes_allowed, 0)) > break; > } > while (count < persistent_huge_pages(h)) { > - if (!adjust_pool_surplus(h, NULL, 1)) > + if (!adjust_pool_surplus(h, nodes_allowed, 1)) > break; > } > out: > ret = persistent_huge_pages(h); > spin_unlock(&hugetlb_lock); > + kfree(nodes_allowed); > return ret; > } > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Tue, 25 Aug 2009 14:35:16 +0100 Message-ID: <20090825133516.GE21335@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20090824192902.10317.94512.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > =================================================================== > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > @@ -21,9 +21,12 @@ > > #include > #include > +#include > Is this header inclusion necessary? It does not appear to be required by the structure modification (which is iffy in itself as discussed in the earlier mail) and it breaks build on x86-64. 
CC arch/x86/kernel/setup_percpu.o In file included from include/linux/pagemap.h:10, from include/linux/mempolicy.h:62, from include/linux/hugetlb.h:8, from include/linux/node.h:24, from include/linux/cpu.h:23, from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, from arch/x86/kernel/setup_percpu.c:19: include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 make[1]: *** [arch/x86/kernel] Error 2 > struct node { > struct sys_device sysdev; > + struct kobject *hugepages_kobj; > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > }; > > struct memory_block; > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Date: Tue, 25 Aug 2009 16:49:07 -0400 Message-ID: <1251233347.16229.0.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192752.10317.96125.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: David Rientjes Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, 2009-08-25 at 01:47 -0700, David Rientjes wrote: > On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > > > This patch derives a "nodes_allowed" node mask from the numa > > mempolicy of the task modifying the number of persistent huge > > pages to control the allocation, freeing and adjusting of surplus > > huge pages. This mask is derived as follows: > > > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer > > is produced. This will cause the hugetlb subsystem to use > > node_online_map as the "nodes_allowed". This preserves the > > behavior before this patch. > > * For "preferred" mempolicy, including explicit local allocation, > > a nodemask with the single preferred node will be produced. > > "local" policy will NOT track any internode migrations of the > > task adjusting nr_hugepages. > > * For "bind" and "interleave" policy, the mempolicy's nodemask > > will be used. > > * Other than to inform the construction of the nodes_allowed node > > mask, the actual mempolicy mode is ignored. That is, all modes > > behave like interleave over the resulting nodes_allowed mask > > with no "fallback". > > > > Notes: > > > > 1) This patch introduces a subtle change in behavior: huge page > > allocation and freeing will be constrained by any mempolicy > > that the task adjusting the huge page pool inherits from its > > parent. This policy could come from a distant ancestor. 
The > > adminstrator adjusting the huge page pool without explicitly > > specifying a mempolicy via numactl might be surprised by this. > > Additionaly, any mempolicy specified by numactl will be > > constrained by the cpuset in which numactl is invoked. > > > > 2) Hugepages allocated at boot time use the node_online_map. > > An additional patch could implement a temporary boot time > > huge pages nodes_allowed command line parameter. > > > > 3) Using mempolicy to control persistent huge page allocation > > and freeing requires no change to hugeadm when invoking > > it via numactl, as shown in the examples below. However, > > hugeadm could be enhanced to take the allowed nodes as an > > argument and set its task mempolicy itself. This would allow > > it to detect and warn about any non-default mempolicy that it > > inherited from its parent, thus alleviating the issue described > > in Note 1 above. > > > > See the updated documentation [next patch] for more information > > about the implications of this patch. > > > > Examples: > > > > Starting with: > > > > Node 0 HugePages_Total: 0 > > Node 1 HugePages_Total: 0 > > Node 2 HugePages_Total: 0 > > Node 3 HugePages_Total: 0 > > > > Default behavior [with or without this patch] balances persistent > > hugepage allocation across nodes [with sufficient contiguous memory]: > > > > hugeadm --pool-pages-min=2048Kb:32 > > > > yields: > > > > Node 0 HugePages_Total: 8 > > Node 1 HugePages_Total: 8 > > Node 2 HugePages_Total: 8 > > Node 3 HugePages_Total: 8 > > > > Applying mempolicy--e.g., with numactl [using '-m' a.k.a. > > '--membind' because it allows multiple nodes to be specified > > and it's easy to type]--we can allocate huge pages on > > individual nodes or sets of nodes. So, starting from the > > condition above, with 8 huge pages per node: > > > > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8 > > > > yields: > > > > Node 0 HugePages_Total: 8 > > Node 1 HugePages_Total: 8 > > Node 2 HugePages_Total: 16 > > Node 3 HugePages_Total: 8 > > > > The incremental 8 huge pages were restricted to node 2 by the > > specified mempolicy. > > > > Similarly, we can use mempolicy to free persistent huge pages > > from specified nodes: > > > > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8 > > > > yields: > > > > Node 0 HugePages_Total: 4 > > Node 1 HugePages_Total: 4 > > Node 2 HugePages_Total: 16 > > Node 3 HugePages_Total: 8 > > > > The 8 huge pages freed were balanced over nodes 0 and 1. > > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/mempolicy.h | 3 ++ > > mm/hugetlb.c | 14 ++++++---- > > mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > > 3 files changed, 73 insertions(+), 5 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400 > > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm > > } > > return zl; > > } > > + > > +/* > > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages. > > + * > > + * Returns a [pointer to a] nodelist based on the current task's mempolicy > > + * to constraing the allocation and freeing of persistent huge pages > > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like > > + * 'bind' policy in this context. 
An attempt to allocate a persistent huge > > + * page will never "fallback" to another node inside the buddy system > > + * allocator. > > + * > > + * If the task's mempolicy is "default" [NULL], just return NULL for > > + * default behavior. Otherwise, extract the policy nodemask for 'bind' > > + * or 'interleave' policy or construct a nodemask for 'preferred' or > > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t. > > + * > > + * N.B., it is the caller's responsibility to free a returned nodemask. > > + */ > > +nodemask_t *huge_mpol_nodes_allowed(void) > > +{ > > + nodemask_t *nodes_allowed = NULL; > > + struct mempolicy *mempolicy; > > + int nid; > > + > > + if (!current->mempolicy) > > + return NULL; > > + > > + mpol_get(current->mempolicy); > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > + if (!nodes_allowed) { > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > + "for huge page allocation.\nFalling back to default.\n", > > + current->comm); > > I don't think using '\n' inside printk's is allowed anymore. OK, will remove. > > > + goto out; > > + } > > + nodes_clear(*nodes_allowed); > > + > > + mempolicy = current->mempolicy; > > + switch (mempolicy->mode) { > > + case MPOL_PREFERRED: > > + if (mempolicy->flags & MPOL_F_LOCAL) > > + nid = numa_node_id(); > > + else > > + nid = mempolicy->v.preferred_node; > > + node_set(nid, *nodes_allowed); > > + break; > > + > > + case MPOL_BIND: > > + /* Fall through */ > > + case MPOL_INTERLEAVE: > > + *nodes_allowed = mempolicy->v.nodes; > > + break; > > + > > + default: > > + BUG(); > > + } > > + > > +out: > > + mpol_put(current->mempolicy); > > + return nodes_allowed; > > +} > > This should be all unnecessary, see below. > > > #endif > > > > /* Allocate a page in interleaved policy. > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400 > > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str > > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > > unsigned long addr, gfp_t gfp_flags, > > struct mempolicy **mpol, nodemask_t **nodemask); > > +extern nodemask_t *huge_mpol_nodes_allowed(void); > > extern unsigned slab_node(struct mempolicy *policy); > > > > extern enum zone_type policy_zone; > > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone > > return node_zonelist(0, gfp_flags); > > } > > > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; } > > + > > static inline int do_migrate_pages(struct mm_struct *mm, > > const nodemask_t *from_nodes, > > const nodemask_t *to_nodes, int flags) > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs > > static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > { > > unsigned long min_count, ret; > > + nodemask_t *nodes_allowed; > > > > if (h->order >= MAX_ORDER) > > return h->max_huge_pages; > > > > Why can't you simply do this? 
> > struct mempolicy *pol = NULL; > nodemask_t *nodes_allowed = &node_online_map; > > local_irq_disable(); > pol = current->mempolicy; > mpol_get(pol); > local_irq_enable(); > if (pol) { > switch (pol->mode) { > case MPOL_BIND: > case MPOL_INTERLEAVE: > nodes_allowed = pol->v.nodes; > break; > case MPOL_PREFERRED: > ... use NODEMASK_SCRATCH() ... > default: > BUG(); > } > } > mpol_put(pol); > > and then use nodes_allowed throughout set_max_huge_pages()? Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages(). NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure it will return a kmalloc()'d nodemask, which I need because a NULL nodemask pointer means "all online nodes" [really all nodes with memory, I suppose] and I need a pointer to a kmalloc()'d nodemask to return from huge_mpol_nodes_allowed(). I want to keep the access to the internals of mempolicy in mempolicy.[ch], thus the call out to huge_mpol_nodes_allowed(), instead of open coding it. It's not really a hot path, so I didn't want to fuss with a static inline in the header, even tho' this is the only call site. Lee From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Tue, 25 Aug 2009 16:49:29 -0400 Message-ID: <1251233369.16229.1.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <20090825101906.GB4427@csn.ul.ie> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="utf-8" To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, 2009-08-25 at 11:19 +0100, Mel Gorman wrote: > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > PATCH/RFC 5/4 hugetlb: register per node hugepages attributes > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > V2: remove dependency on kobject private bitfield. Search > > global hstates then all per node hstates for kobject > > match in attribute show/store functions. > > > > V3: rebase atop the mempolicy-based hugepage alloc/free; > > use custom "nodes_allowed" to restrict alloc/free to > > a specific node via per node attributes. Per node > > attribute overrides mempolicy. I.e., mempolicy only > > applies to global attributes. > > > > To demonstrate feasibility--if not advisability--of supporting > > both mempolicy-based persistent huge page management and per > > node "override" attributes. > > > > This patch adds the per huge page size control/query attributes > > to the per node sysdevs: > > > > /sys/devices/system/node/node/hugepages/hugepages-/ > > nr_hugepages - r/w > > free_huge_pages - r/o > > surplus_huge_pages - r/o > > > > The patch attempts to re-use/share as much of the existing > > global hstate attribute initialization and handling, and the > > "nodes_allowed" constraint processing as possible. > > In set_max_huge_pages(), a node id < 0 indicates a change to > > global hstate parameters. In this case, any non-default task > > mempolicy will be used to generate the nodes_allowed mask. A > > node id > 0 indicates a node specific update and the count > > argument specifies the target count for the node.
From this > > info, we compute the target global count for the hstate and > > construct a nodes_allowed node mask containing only the specified > > node. Thus, setting the node specific nr_hugepages via the > > per node attribute effectively overrides any task mempolicy. > > > > > > Issue: dependency of base driver [node] on hugetlbfs module. > > We want to keep all of the hstate attribute registration and handling > > in the hugetlb module. However, we need to call into this code to > > register the per node hstate attributes on node hot plug. > > > > With this patch: > > > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB > > ./ ../ free_hugepages nr_hugepages surplus_hugepages > > > > Starting from: > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 0 > > Node 2 HugePages_Free: 0 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > vm.nr_hugepages = 0 > > > > Allocate 16 persistent huge pages on node 2: > > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages > > > > [Note that this is equivalent to: > > numactl -m 2 hugeadm --pool-pages-min 2M:+16 > > ] > > > > Yields: > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 16 > > Node 2 HugePages_Free: 16 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > vm.nr_hugepages = 16 > > > > Global controls work as expected--reduce pool to 8 persistent huge pages: > > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > > > > Node 0 HugePages_Total: 0 > > Node 0 HugePages_Free: 0 > > Node 0 HugePages_Surp: 0 > > Node 1 HugePages_Total: 0 > > Node 1 HugePages_Free: 0 > > Node 1 HugePages_Surp: 0 > > Node 2 HugePages_Total: 8 > > Node 2 HugePages_Free: 8 > > Node 2 HugePages_Surp: 0 > > Node 3 HugePages_Total: 0 > > Node 3 HugePages_Free: 0 > > Node 3 HugePages_Surp: 0 > > > > > > Signed-off-by: Lee Schermerhorn > > > > drivers/base/node.c | 2 > > include/linux/hugetlb.h | 6 + > > include/linux/node.h | 3 > > mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++------- > > 4 files changed, 197 insertions(+), 27 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400 > > @@ -200,6 +200,7 @@ int register_node(struct node *node, int > > sysdev_create_file(&node->sysdev, &attr_distance); > > > > scan_unevictable_register_node(node); > > + hugetlb_register_node(node); > > } > > return error; > > } > > @@ -220,6 +221,7 @@ void unregister_node(struct node *node) > > sysdev_remove_file(&node->sysdev, &attr_distance); > > > > scan_unevictable_unregister_node(node); > > + hugetlb_unregister_node(node); > >
sysdev_unregister(&node->sysdev); > > } > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 -0400 > > @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate > > return size_to_hstate(PAGE_SIZE << compound_order(page)); > > } > > > > +struct node; > > +extern void hugetlb_register_node(struct node *); > > +extern void hugetlb_unregister_node(struct node *); > > + > > #else > > struct hstate {}; > > #define alloc_bootmem_huge_page(h) NULL > > @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug > > { > > return 1; > > } > > +#define hugetlb_register_node(NP) > > +#define hugetlb_unregister_node(NP) > > #endif > > > > This also needs to be done for the !NUMA case. Try building without NUMA > set and you get the following with this patch applied > > CC mm/hugetlb.o > mm/hugetlb.c: In function 'hugetlb_exit': > mm/hugetlb.c:1629: error: implicit declaration of function 'hugetlb_unregister_all_nodes' > mm/hugetlb.c: In function 'hugetlb_init': > mm/hugetlb.c:1665: error: implicit declaration of function 'hugetlb_register_all_nodes' > make[1]: *** [mm/hugetlb.o] Error 1 > make: *** [mm] Error 2 Ouch! Sorry. Will add stubs. > > > > #endif /* _LINUX_HUGETLB_H */ > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:56.000000000 -0400 > > @@ -24,6 +24,7 @@ > > #include > > > > #include > > +#include <linux/node.h> > > #include "internal.h" > > > > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; > > @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs > > return ret; > > } > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > +{ > > This name is a bit weird. It's creating a nodemask with just a single > node allowed. > > Is there something wrong with using the existing function > nodemask_of_node()? If stack is the problem, perhaps there is some macro > magic that would allow a nodemask to be either declared on the stack or > kmalloc'd. Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a block nested inside the context where it's invoked. I would be declaring the nodemask in the compound else clause and don't want to access it [via the nodes_allowed pointer] from outside of there.
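One way to get both properties -- no invisible on-stack mask and no pointer escaping the block that declared it -- is a helper that fills caller-owned storage; the caller then decides whether that storage is kmalloc()'d or declared at an outer scope. The name init_nodemask_of_node() is illustrative:

static void init_nodemask_of_node(nodemask_t *mask, int nid)
{
	nodes_clear(*mask);
	node_set(nid, *mask);
}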
> > > + nodemask_t *nodes_allowed; > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > + if (!nodes_allowed) { > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > + "for huge page allocation.\nFalling back to default.\n", > > + current->comm); > > + } else { > > + nodes_clear(*nodes_allowed); > > + node_set(nid, *nodes_allowed); > > + } > > + return nodes_allowed; > > +} > > + > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > > + int nid) > > { > > unsigned long min_count, ret; > > nodemask_t *nodes_allowed; > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > > if (h->order >= MAX_ORDER) > > return h->max_huge_pages; > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > + if (nid < 0) > > + nodes_allowed = huge_mpol_nodes_allowed(); > > hugetlb is a bit littered with magic numbers being passed into functions. > Attempts have been made to clear them up as patches change > that area. Would it be possible to define something like > > #define HUGETLB_OBEY_MEMPOLICY -1 > > for the nid here as opposed to passing in -1? I know -1 is used in the page > allocator functions but there it means "current node" and here it means > "obey mempolicies". Well, here it means NO_NODE_ID_SPECIFIED, or "we didn't get here via a per node attribute". It means "derive nodes allowed from memory policy, if non-default, else use node_online_map" [which is not exactly the same as obeying memory policy]. But, I can see defining a symbolic constant such as NO_NODE[_ID_SPECIFIED]. I'll try next spin. > > > + else { > > + /* > > + * incoming 'count' is for node 'nid' only, so > > + * adjust count to global, but restrict alloc/free > > + * to the specified node. > > + */ > > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > > + nodes_allowed = nodes_allowed_from_node(nid); > > + } > > > > /* > > * Increase the pool size > > @@ -1338,34 +1365,69 @@ out: > > static struct kobject *hugepages_kobj; > > static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + struct node *node = &node_devices[nid]; > > + int hi; > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) > > Does that hi mean hello, high, nid or hstate_idx? > > hstate_idx would appear to be the appropriate name here. Or just plain 'i', like in the following, pre-existing function? > > > + if (node->hstate_kobjs[hi] == kobj) { > > + if (nidp) > > + *nidp = nid; > > + return &hstates[hi]; > > + } > > + } > > Ok.... so, there is a struct node array for the sysdev and this patch adds > references to the "hugepages" directory kobject and the subdirectories for > each page size. We walk all the objects until we find a match. Obviously, > this adds a dependency of base node support on hugetlbfs which feels backwards > and you call that out in your leader. > > Can this be the other way around? i.e. The struct hstate has an array of > kobjects arranged by nid that is filled in when the node is registered?
> There will only be one kobject-per-pagesize-per-node so it seems like it > would work. I confess, I haven't prototyped this to be 100% sure. This will take a bit longer to sort out. I do want to change the registration, tho', so that hugetlb.c registers its single node register/unregister functions with base/node.c to remove the source level dependency in that direction. node.c will only register nodes on hot plug, as it is initialized too early relative to hugetlb.c to register them at init time. This should break the call dependency of base/node.c on the hugetlb module. As far as moving the per node attributes' kobjects to the hugetlb global hstate arrays... Have to think about that. I agree that it would be nice to remove the source level [header] dependency. > > > + > > + BUG(); > > + return NULL; > > +} > > + > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > > { > > int i; > > + > > for (i = 0; i < HUGE_MAX_HSTATE; i++) > > - if (hstate_kobjs[i] == kobj) > > + if (hstate_kobjs[i] == kobj) { > > + if (nidp) > > + *nidp = -1; > > return &hstates[i]; > > - BUG(); > > - return NULL; > > + } > > + > > + return kobj_to_node_hstate(kobj, nidp); > > } > > > > static ssize_t nr_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > > + struct hstate *h; > > + unsigned long nr_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + nr_huge_pages = h->nr_huge_pages; > > Here is another magic number except it means something slightly > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would > be nice if these different special nid values could be named, preferably > collapsed to being one "core" thing. Again, it means "NO NODE ID specified" [via per node attribute]. Again, I'll address this with a single constant. > > > + else > > + nr_huge_pages = h->nr_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", nr_huge_pages); > > } > > + > > static ssize_t nr_hugepages_store(struct kobject *kobj, > > struct kobj_attribute *attr, const char *buf, size_t count) > > { > > - int err; > > unsigned long input; > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h; > > + int nid; > > + int err; > > > > err = strict_strtoul(buf, 10, &input); > > if (err) > > return 0; > > > > - h->max_huge_pages = set_max_huge_pages(h, input); > > "input" is a bit meaningless. The function you are passing to calls this > parameter "count". Can you match the naming please? Otherwise, I might > guess that this is a "delta" which occurs elsewhere in the hugetlb code. I guess I can change that. It's the pre-existing name, and 'count' was already used.
Guess I can change 'count' to 'len' and 'input' to 'count'. > > > + h = kobj_to_hstate(kobj, &nid); > > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > > > return count; > > } > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > + > > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > > } > > + > > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > > struct kobj_attribute *attr, const char *buf, size_t count) > > { > > int err; > > unsigned long input; > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > > err = strict_strtoul(buf, 10, &input); > > if (err) > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > > static ssize_t free_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->free_huge_pages); > > + struct hstate *h; > > + unsigned long free_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + free_huge_pages = h->free_huge_pages; > > + else > > + free_huge_pages = h->free_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", free_huge_pages); > > } > > HSTATE_ATTR_RO(free_hugepages); > > > > static ssize_t resv_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > return sprintf(buf, "%lu\n", h->resv_huge_pages); > > } > > HSTATE_ATTR_RO(resv_hugepages); > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > > static ssize_t surplus_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > > + struct hstate *h; > > + unsigned long surplus_huge_pages; > > + int nid; > > + > > + h = kobj_to_hstate(kobj, &nid); > > + if (nid < 0) > > + surplus_huge_pages = h->surplus_huge_pages; > > + else > > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > > + > > + return sprintf(buf, "%lu\n", surplus_huge_pages); > > } > > HSTATE_ATTR_RO(surplus_hugepages); > > > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > > .attrs = hstate_attrs, > > }; > > > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > > + struct kobject *parent, > > + struct kobject **hstate_kobjs, > > + struct attribute_group *hstate_attr_group) > > { > > int retval; > > + int hi = h - hstates; > > > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > > - hugepages_kobj); > > - if (!hstate_kobjs[h - hstates]) > > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > > + if (!hstate_kobjs[hi]) > > return -ENOMEM; > > > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > > - &hstate_attr_group); > > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > > if (retval) > > - kobject_put(hstate_kobjs[h - hstates]); > > + kobject_put(hstate_kobjs[hi]); > > > > return retval; > > } > > @@ -1460,17 +1544,90 @@ static void __init
hugetlb_sysfs_init(vo > > return; > > > > for_each_hstate(h) { > > - err = hugetlb_sysfs_add_hstate(h); > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > > + hstate_kobjs, &hstate_attr_group); > > if (err) > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > > h->name); > > } > > } > > > > +#ifdef CONFIG_NUMA > > +static struct attribute *per_node_hstate_attrs[] = { > > + &nr_hugepages_attr.attr, > > + &free_hugepages_attr.attr, > > + &surplus_hugepages_attr.attr, > > + NULL, > > +}; > > + > > +static struct attribute_group per_node_hstate_attr_group = { > > + .attrs = per_node_hstate_attrs, > > +}; > > + > > + > > +void hugetlb_unregister_node(struct node *node) > > +{ > > + struct hstate *h; > > + > > + for_each_hstate(h) { > > + kobject_put(node->hstate_kobjs[h - hstates]); > > + node->hstate_kobjs[h - hstates] = NULL; > > + } > > + > > + kobject_put(node->hugepages_kobj); > > + node->hugepages_kobj = NULL; > > +} > > + > > +static void hugetlb_unregister_all_nodes(void) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) > > + hugetlb_unregister_node(&node_devices[nid]); > > +} > > + > > +void hugetlb_register_node(struct node *node) > > +{ > > + struct hstate *h; > > + int err; > > + > > + if (!hugepages_kobj) > > + return; /* too early */ > > + > > + node->hugepages_kobj = kobject_create_and_add("hugepages", > > + &node->sysdev.kobj); > > + if (!node->hugepages_kobj) > > + return; > > + > > + for_each_hstate(h) { > > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > > + node->hstate_kobjs, > > + &per_node_hstate_attr_group); > > + if (err) > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > > + " for node %d\n", > > + h->name, node->sysdev.id); > > + } > > +} > > + > > +static void hugetlb_register_all_nodes(void) > > +{ > > + int nid; > > + > > + for (nid = 0; nid < nr_node_ids; nid++) { > > + struct node *node = &node_devices[nid]; > > + if (node->sysdev.id == nid && !node->hugepages_kobj) > > + hugetlb_register_node(node); > > + } > > +} > > +#endif > > + > > static void __exit hugetlb_exit(void) > > { > > struct hstate *h; > > > > + hugetlb_unregister_all_nodes(); > > + > > for_each_hstate(h) { > > kobject_put(hstate_kobjs[h - hstates]); > > } > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > > > hugetlb_sysfs_init(); > > > > + hugetlb_register_all_nodes(); > > + > > return 0; > > } > > module_init(hugetlb_init); > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > > > if (write) > > - h->max_huge_pages = set_max_huge_pages(h, tmp); > > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > > > return 0; > > } > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > @@ -21,9 +21,12 @@ > > > > #include > > #include > > +#include > > > > struct node { > > struct sys_device sysdev; > > + struct kobject *hugepages_kobj; > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > }; > > > > struct memory_block; > > > I'm not
against this idea and think it can work side-by-side with the memory > policies. I believe it does need a bit more cleaning up before merging > though. I also wasn't able to test this yet due to various build and > deploy issues. OK. I'll do the cleanup. I have tested this atop the mempolicy version by working around the build issues that I thought were just temporary glitches in the mmotm series. In my [limited] experience, one can interleave numactl+hugeadm with setting values via the per node attributes and it does the right thing. No heavy testing with racing tasks, tho'. Lee -- To unsubscribe from this list: send the line "unsubscribe linux-numa" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Date: Tue, 25 Aug 2009 16:49:34 -0400 Message-ID: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: David Rientjes Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, 2009-08-25 at 01:16 -0700, David Rientjes wrote: > On Mon, 24 Aug 2009, Lee Schermerhorn wrote: > > > [PATCH 2/4] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns > > > > Against: 2.6.31-rc6-mmotm-090820-1918 > > > > V3: > > + moved this patch to after the "rework" of hstate_next_node_to_... > > functions as this patch is more specific to using task mempolicy > > to control huge page allocation and freeing. > > > > In preparation for constraining huge page allocation and freeing by the > > controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer > > to the allocate, free and surplus adjustment functions. For now, pass > > NULL to indicate default behavior--i.e., use node_online_map. A > > subsequent patch will derive a non-default mask from the controlling > > task's numa mempolicy. > > > > Reviewed-by: Mel Gorman > > Signed-off-by: Lee Schermerhorn > > > > mm/hugetlb.c | 102 ++++++++++++++++++++++++++++++++++++++--------------------- > > 1 file changed, 67 insertions(+), 35 deletions(-) > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:46.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400 > > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > > } > > > > /* > > - * common helper function for hstate_next_node_to_{alloc|free}. > > - * return next node in node_online_map, wrapping at end. > > + * common helper functions for hstate_next_node_to_{alloc|free}. > > + * We may have allocated or freed a huge page based on a different > > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > > + * be outside of *nodes_allowed. Ensure that we use the next > > + * allowed node for alloc or free.
> > */ > > -static int next_node_allowed(int nid) > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > > { > > - nid = next_node(nid, node_online_map); > > + nid = next_node(nid, *nodes_allowed); > > if (nid == MAX_NUMNODES) > > - nid = first_node(node_online_map); > > + nid = first_node(*nodes_allowed); > > VM_BUG_ON(nid >= MAX_NUMNODES); > > > > return nid; > > } > > > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > > +{ > > + if (!node_isset(nid, *nodes_allowed)) > > + nid = next_node_allowed(nid, nodes_allowed); > > + return nid; > > +} > > Awkward name considering this doesn't simply return true or false as > expected, it returns a nid. Well, it's not a predicate function so I wouldn't expect true or false return, but I can see how the trailing "allowed" can sound like we're asking the question "Is this node allowed?". Maybe, "get_this_node_allowed()" or "get_start_node_allowed" [we return the nid to "startnid"], ... Or, do you have a suggestion? > > > + > > /* > > * Use a helper variable to find the next node and then > > * copy it back to next_nid_to_alloc afterwards: > > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > * But we don't need to use a spin_lock here: it really > > * doesn't matter if occasionally a racer chooses the > > - * same nid as we do. Move nid forward in the mask even > > - * if we just successfully allocated a hugepage so that > > - * the next caller gets hugepages on the next node. > > + * same nid as we do. Move nid forward in the mask whether > > + * or not we just successfully allocated a hugepage so that > > + * the next allocation addresses the next node. > > */ > > -static int hstate_next_node_to_alloc(struct hstate *h) > > +static int hstate_next_node_to_alloc(struct hstate *h, > > + nodemask_t *nodes_allowed) > > { > > int nid, next_nid; > > > > - nid = h->next_nid_to_alloc; > > - next_nid = next_node_allowed(nid); > > + if (!nodes_allowed) > > + nodes_allowed = &node_online_map; > > + > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > + > > + next_nid = next_node_allowed(nid, nodes_allowed); > > h->next_nid_to_alloc = next_nid; > > + > > return nid; > > } > > Don't need next_nid. Well, the pre-existing comment block indicated that the use of the apparently spurious next_nid variable is necessary to close a race. Not sure whether that comment still applies with this rework. What do you think? > > > -static int alloc_fresh_huge_page(struct hstate *h) > > +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) > > { > > struct page *page; > > int start_nid; > > int next_nid; > > int ret = 0; > > > > - start_nid = hstate_next_node_to_alloc(h); > > + start_nid = hstate_next_node_to_alloc(h, nodes_allowed); > > next_nid = start_nid; > > > > do { > > @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct > > ret = 1; > > break; > > } > > - next_nid = hstate_next_node_to_alloc(h); > > + next_nid = hstate_next_node_to_alloc(h, nodes_allowed); > > } while (next_nid != start_nid); > > > > if (ret) > > @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct > > * whether or not we find a free huge page to free so that the > > * next attempt to free addresses the next node. 
> > */ > > -static int hstate_next_node_to_free(struct hstate *h) > > +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) > > { > > int nid, next_nid; > > > > - nid = h->next_nid_to_free; > > - next_nid = next_node_allowed(nid); > > + if (!nodes_allowed) > > + nodes_allowed = &node_online_map; > > + > > + nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); > > + > > + next_nid = next_node_allowed(nid, nodes_allowed); > > h->next_nid_to_free = next_nid; > > + > > return nid; > > } > > Same. Yes, and I modeled this on "next to alloc", with the extra next_nid for the same reason. Do we dare remove it? Lee From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Tue, 25 Aug 2009 16:49:40 -0400 Message-ID: <1251233380.16229.3.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825133516.GE21335@csn.ul.ie> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090825133516.GE21335@csn.ul.ie> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote: > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > =================================================================== > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > @@ -21,9 +21,12 @@ > > > > #include > > #include > > +#include > > > > Is this header inclusion necessary? It does not appear to be required by > the structure modification (which is iffy in itself as discussed in the > earlier mail) and it breaks build on x86-64. Hi, Mel: I recall that it is necessary to build. You can try w/o it. > > CC arch/x86/kernel/setup_percpu.o > In file included from include/linux/pagemap.h:10, > from include/linux/mempolicy.h:62, > from include/linux/hugetlb.h:8, > from include/linux/node.h:24, > from include/linux/cpu.h:23, > from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, > from arch/x86/kernel/setup_percpu.c:19: > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 > make[1]: *** [arch/x86/kernel] Error 2 I saw this. I've been testing on x86_64. 
I *thought* that it only started showing up in a recent mmotm from changes in the linux-next patch--e.g., a failure to set ARCH_HAS_KMAP or to handle !ARCH_HAS_KMAP appropriately in highmem.h. But maybe that was coincidental with my adding the include. Lee > > > > > struct node { > > struct sys_device sysdev; > > + struct kobject *hugepages_kobj; > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > }; > > > > struct memory_block; > > > From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Date: Tue, 25 Aug 2009 14:59:11 -0700 (PDT) Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1251237561; bh=t/ncZzaWq9T3eTwFteUzJXwcD7I=; h=DomainKey-Signature:Date:From:X-X-Sender:To:cc:Subject: In-Reply-To:Message-ID:References:User-Agent:MIME-Version: Content-Type:X-System-Of-Record; b=ppmLfxVUxgqqhnd0xB16gfdoZfF4xP2 8VnFezu70A+nnTdXN2BwaSE78OHWV1LDsG8nqhKw7ytKp5MX24f6Gkw== In-Reply-To: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, 25 Aug 2009, Lee Schermerhorn wrote: > > > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag > > > } > > > > > > /* > > > - * common helper function for hstate_next_node_to_{alloc|free}. > > > - * return next node in node_online_map, wrapping at end. > > > + * common helper functions for hstate_next_node_to_{alloc|free}. > > > + * We may have allocated or freed a huge page based on a different > > > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might > > > + * be outside of *nodes_allowed. Ensure that we use the next > > > + * allowed node for alloc or free. > > > */ > > > -static int next_node_allowed(int nid) > > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed) > > > { > > > - nid = next_node(nid, node_online_map); > > > + nid = next_node(nid, *nodes_allowed); > > > if (nid == MAX_NUMNODES) > > > - nid = first_node(node_online_map); > > > + nid = first_node(*nodes_allowed); > > > VM_BUG_ON(nid >= MAX_NUMNODES); > > > > > > return nid; > > > } > > > > > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed) > > > +{ > > > + if (!node_isset(nid, *nodes_allowed)) > > > + nid = next_node_allowed(nid, nodes_allowed); > > > + return nid; > > > +} > > > > Awkward name considering this doesn't simply return true or false as > > expected, it returns a nid. > > Well, it's not a predicate function so I wouldn't expect true or false > return, but I can see how the trailing "allowed" can sound like we're > asking the question "Is this node allowed?". Maybe, > "get_this_node_allowed()" or "get_start_node_allowed" [we return the nid > to "startnid"], ... Or, do you have a suggestion? > this_node_allowed() just seemed like a very similar name to cpuset_zone_allowed() in the cpuset code, which does return true or false depending on whether the zone is allowed by current's cpuset.
As usual with the mempolicy discussions, I come from a biased cpuset perspective :) > > > > > + > > > /* > > > * Use a helper variable to find the next node and then > > > * copy it back to next_nid_to_alloc afterwards: > > > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid) > > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > > * But we don't need to use a spin_lock here: it really > > > * doesn't matter if occasionally a racer chooses the > > > - * same nid as we do. Move nid forward in the mask even > > > - * if we just successfully allocated a hugepage so that > > > - * the next caller gets hugepages on the next node. > > > + * same nid as we do. Move nid forward in the mask whether > > > + * or not we just successfully allocated a hugepage so that > > > + * the next allocation addresses the next node. > > > */ > > > -static int hstate_next_node_to_alloc(struct hstate *h) > > > +static int hstate_next_node_to_alloc(struct hstate *h, > > > + nodemask_t *nodes_allowed) > > > { > > > int nid, next_nid; > > > > > > - nid = h->next_nid_to_alloc; > > > - next_nid = next_node_allowed(nid); > > > + if (!nodes_allowed) > > > + nodes_allowed = &node_online_map; > > > + > > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > > + > > > + next_nid = next_node_allowed(nid, nodes_allowed); > > > h->next_nid_to_alloc = next_nid; > > > + > > > return nid; > > > } > > > > Don't need next_nid. > > Well, the pre-existing comment block indicated that the use of the > apparently spurious next_nid variable is necessary to close a race. Not > sure whether that comment still applies with this rework. What do you > think? > What race is it closing exactly if gcc is going to optimize it out anyways? I think you can safely fold the following into your patch. 
--- diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -659,15 +659,14 @@ static int this_node_allowed(int nid, nodemask_t *nodes_allowed) static int hstate_next_node_to_alloc(struct hstate *h, nodemask_t *nodes_allowed) { - int nid, next_nid; + int nid; if (!nodes_allowed) nodes_allowed = &node_online_map; nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); - next_nid = next_node_allowed(nid, nodes_allowed); - h->next_nid_to_alloc = next_nid; + h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed); return nid; } @@ -707,15 +706,14 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) */ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) { - int nid, next_nid; + int nid; if (!nodes_allowed) nodes_allowed = &node_online_map; nid = this_node_allowed(h->next_nid_to_free, nodes_allowed); - next_nid = next_node_allowed(nid, nodes_allowed); - h->next_nid_to_free = next_nid; + h->next_nid_to_free = next_node_allowed(nid, nodes_allowed); return nid; } From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Date: Wed, 26 Aug 2009 10:58:35 +0100 Message-ID: <20090826095835.GB10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192637.10317.31039.sendpatchset@localhost.localdomain> <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1251233374.16229.2.camel@useless.americas.hpqcorp.net> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: David Rientjes , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, Aug 25, 2009 at 04:49:34PM -0400, Lee Schermerhorn wrote: > > > > > > +static int hstate_next_node_to_alloc(struct hstate *h, > > > + nodemask_t *nodes_allowed) > > > { > > > int nid, next_nid; > > > > > > - nid = h->next_nid_to_alloc; > > > - next_nid = next_node_allowed(nid); > > > + if (!nodes_allowed) > > > + nodes_allowed = &node_online_map; > > > + > > > + nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); > > > + > > > + next_nid = next_node_allowed(nid, nodes_allowed); > > > h->next_nid_to_alloc = next_nid; > > > + > > > return nid; > > > } > > > > Don't need next_nid. > > Well, the pre-existing comment block indicated that the use of the > apparently spurious next_nid variable is necessary to close a race. Not > sure whether that comment still applies with this rework. What do you > think? > The original intention was not to return h->next_nid_to_alloc because there is a race window where it's MAX_NUMNODES. nid is a stack-local variable here; it should not become MAX_NUMNODES by accident, because this_node_allowed() and next_node_allowed() both take care not to return MAX_NUMNODES, so it is safe as a return value even in the presence of races with the code structure you currently have. I think it's safe to have nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed); h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed); return nid; because at worst, in the presence of races, h->next_nid_to_alloc gets assigned the same value twice, but never MAX_NUMNODES.
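Spelled out as an interleaving (illustrative only; two unsynchronized callers A and B running the simplified form above):

	A: nid = this_node_allowed(h->next_nid_to_alloc, mask);	/* in-range nid */
	B: nid = this_node_allowed(h->next_nid_to_alloc, mask);	/* may pick the same nid */
	A: h->next_nid_to_alloc = next_node_allowed(nid, mask);
	B: h->next_nid_to_alloc = next_node_allowed(nid, mask);	/* same in-range store */

Both helpers wrap before returning, so every value stored to h->next_nid_to_alloc is a valid node id; the worst case is two racers choosing the same node, which the pre-existing comment already tolerates.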
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Wed, 26 Aug 2009 11:11:22 +0100 Message-ID: <20090826101122.GD10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1251233369.16229.1.camel@useless.americas.hpqcorp.net> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote: > > > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > > +{ > > > > This name is a bit weird. It's creating a nodemask with just a single > > node allowed. > > > > Is there something wrong with using the existing function > > nodemask_of_node()? If stack is the problem, prehaps there is some macro > > magic that would allow a nodemask to be either declared on the stack or > > kmalloc'd. > > Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a > block nested inside the context where it's invoked. I would be > declaring the nodemask in the compound else clause and don't want to > access it [via the nodes_allowed pointer] from outside of there. > So, the existance of the mask on the stack is the problem. I can understand that, they are potentially quite large. Would it be possible to add a helper along side it like init_nodemask_of_node() that does the same work as nodemask_of_node() but takes a nodemask parameter? nodemask_of_node() would reuse the init_nodemask_of_node() except it declares the nodemask on the stack. > > > > > + nodemask_t *nodes_allowed; > > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL); > > > + if (!nodes_allowed) { > > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask " > > > + "for huge page allocation.\nFalling back to default.\n", > > > + current->comm); > > > + } else { > > > + nodes_clear(*nodes_allowed); > > > + node_set(nid, *nodes_allowed); > > > + } > > > + return nodes_allowed; > > > +} > > > + > > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) > > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) > > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, > > > + int nid) > > > { > > > unsigned long min_count, ret; > > > nodemask_t *nodes_allowed; > > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages( > > > if (h->order >= MAX_ORDER) > > > return h->max_huge_pages; > > > > > > - nodes_allowed = huge_mpol_nodes_allowed(); > > > + if (nid < 0) > > > + nodes_allowed = huge_mpol_nodes_allowed(); > > > > hugetlb is a bit littered with magic numbers been passed into functions. > > Attempts have been made to clear them up as according as patches change > > that area. Would it be possible to define something like > > > > #define HUGETLB_OBEY_MEMPOLICY -1 > > > > for the nid here as opposed to passing in -1? 
I know -1 is used in the page > > allocator functions but there it means "current node" and here it means > > "obey mempolicies". > > Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a > per node attribute". It means "derive nodes allowed from memory policy, > if non-default, else use nodes_online_map" [which is not exactly the > same as obeying memory policy]. > > But, I can see defining a symbolic constant such as > NO_NODE[_ID_SPECIFIED]. I'll try next spin. > That NO_NODE_ID_SPECIFIED was the underlying definition I was looking for. It makes sense at both sites. > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj) > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + struct node *node = &node_devices[nid]; > > > + int hi; > > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++) > > > > Does that hi mean hello, high, nid or hstate_idx? > > > > hstate_idx would appear to be the appropriate name here. > > Or just plain 'i', like in the following, pre-existing function? > Whichever suits you best. If hstate_idx is really what it is, I see no harm in using it but 'i' is an index and I'd sooner recognise that than the less meaningful "hi". > > > > > + if (node->hstate_kobjs[hi] == kobj) { > > > + if (nidp) > > > + *nidp = nid; > > > + return &hstates[hi]; > > > + } > > > + } > > > > Ok.... so, there is a struct node array for the sysdev and this patch adds > > references to the "hugepages" directory kobject and the subdirectories for > > each page size. We walk all the objects until we find a match. Obviously, > > this adds a dependency of base node support on hugetlbfs which feels backwards > > and you call that out in your leader. > > > > Can this be the other way around? i.e. The struct hstate has an array of > > kobjects arranged by nid that is filled in when the node is registered? > > There will only be one kobject-per-pagesize-per-node so it seems like it > > would work. I confess, I haven't prototyped this to be 100% sure. > > This will take a bit longer to sort out. I do want to change the > registration, tho', so that hugetlb.c registers it's single node > register/unregister functions with base/node.c to remove the source > level dependency in that direction. node.c will only register nodes on > hot plug as it's initialized too early, relative to hugetlb.c to > register them at init time. This should break the call dependency of > base/node.c on the hugetlb module. > > As far as moving the per node attributes' kobjects to the hugetlb global > hstate arrays... Have to think about that. I agree that it would be > nice to remove the source level [header] dependency. > FWIW, I see no problem with the mempolicy stuff going ahead separately from this patch after the few relatively minor cleanups highlighted in the thread and tackling this patch as a separate cycle. It's up to you really. 
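For reference, the shape of the cleanup being agreed on above, folded into set_max_huge_pages() (a sketch only -- the constant's final spelling was still unsettled at this point in the thread):

	#define NO_NODE_ID_SPECIFIED	(-1)	/* not arrived at via a per node attribute */

	if (nid == NO_NODE_ID_SPECIFIED)
		nodes_allowed = huge_mpol_nodes_allowed();
	else {
		/* incoming 'count' is for node 'nid' only; adjust to global */
		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
		nodes_allowed = nodes_allowed_from_node(nid);
	}

with the sysctl handler then passing NO_NODE_ID_SPECIFIED instead of a bare -1.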
> > > > > + > > > + BUG(); > > > + return NULL; > > > +} > > > + > > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp) > > > { > > > int i; > > > + > > > for (i = 0; i < HUGE_MAX_HSTATE; i++) > > > - if (hstate_kobjs[i] == kobj) > > > + if (hstate_kobjs[i] == kobj) { > > > + if (nidp) > > > + *nidp = -1; > > > return &hstates[i]; > > > - BUG(); > > > - return NULL; > > > + } > > > + > > > + return kobj_to_node_hstate(kobj, nidp); > > > } > > > > > > static ssize_t nr_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->nr_huge_pages); > > > + struct hstate *h; > > > + unsigned long nr_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + nr_huge_pages = h->nr_huge_pages; > > > > Here is another magic number except it means something slightly > > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would > > be nice if these different special nid values could be named, preferably > > collapsed to being one "core" thing. > > Again, it means "NO NODE ID specified" [via per node attribute]. Again, > I'll address this with a single constant. > > > > > > + else > > > + nr_huge_pages = h->nr_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", nr_huge_pages); > > > } > > > + > > > static ssize_t nr_hugepages_store(struct kobject *kobj, > > > struct kobj_attribute *attr, const char *buf, size_t count) > > > { > > > - int err; > > > unsigned long input; > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h; > > > + int nid; > > > + int err; > > > > > > err = strict_strtoul(buf, 10, &input); > > > if (err) > > > return 0; > > > > > > - h->max_huge_pages = set_max_huge_pages(h, input); > > > > "input" is a bit meaningless. The function you are passing to calls this > > parameter "count". Can you match the naming please? Otherwise, I might > > guess that this is a "delta" which occurs elsewhere in the hugetlb code. > > I guess I can change that. It's the pre-exiting name, and 'count' was > already used. Guess I can change 'count' to 'len' and 'input' to > 'count' Makes sense. 
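A sketch of the store function with the agreed renaming applied (untested; just the naming change discussed above):

	static ssize_t nr_hugepages_store(struct kobject *kobj,
		       struct kobj_attribute *attr, const char *buf, size_t len)
	{
		struct hstate *h;
		unsigned long count;	/* was 'input' */
		int nid;
		int err;

		err = strict_strtoul(buf, 10, &count);
		if (err)
			return 0;

		h = kobj_to_hstate(kobj, &nid);
		h->max_huge_pages = set_max_huge_pages(h, count, nid);

		return len;	/* was 'count' */
	}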
> > > > > + h = kobj_to_hstate(kobj, &nid); > > > + h->max_huge_pages = set_max_huge_pages(h, input, nid); > > > > > > return count; > > > } > > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages); > > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > + > > > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); > > > } > > > + > > > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, > > > struct kobj_attribute *attr, const char *buf, size_t count) > > > { > > > int err; > > > unsigned long input; > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > > > > err = strict_strtoul(buf, 10, &input); > > > if (err) > > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages); > > > static ssize_t free_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->free_huge_pages); > > > + struct hstate *h; > > > + unsigned long free_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + free_huge_pages = h->free_huge_pages; > > > + else > > > + free_huge_pages = h->free_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", free_huge_pages); > > > } > > > HSTATE_ATTR_RO(free_hugepages); > > > > > > static ssize_t resv_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > + struct hstate *h = kobj_to_hstate(kobj, NULL); > > > return sprintf(buf, "%lu\n", h->resv_huge_pages); > > > } > > > HSTATE_ATTR_RO(resv_hugepages); > > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages); > > > static ssize_t surplus_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > - return sprintf(buf, "%lu\n", h->surplus_huge_pages); > > > + struct hstate *h; > > > + unsigned long surplus_huge_pages; > > > + int nid; > > > + > > > + h = kobj_to_hstate(kobj, &nid); > > > + if (nid < 0) > > > + surplus_huge_pages = h->surplus_huge_pages; > > > + else > > > + surplus_huge_pages = h->surplus_huge_pages_node[nid]; > > > + > > > + return sprintf(buf, "%lu\n", surplus_huge_pages); > > > } > > > HSTATE_ATTR_RO(surplus_hugepages); > > > > > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att > > > .attrs = hstate_attrs, > > > }; > > > > > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h) > > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h, > > > + struct kobject *parent, > > > + struct kobject **hstate_kobjs, > > > + struct attribute_group *hstate_attr_group) > > > { > > > int retval; > > > + int hi = h - hstates; > > > > > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name, > > > - hugepages_kobj); > > > - if (!hstate_kobjs[h - hstates]) > > > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); > > > + if (!hstate_kobjs[hi]) > > > return -ENOMEM; > > > > > > - retval = sysfs_create_group(hstate_kobjs[h - hstates], > > > - &hstate_attr_group); > > > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group); > > > if (retval) > > > - kobject_put(hstate_kobjs[h - hstates]); > > > + kobject_put(hstate_kobjs[hi]); > > > > > > return 
retval; > > > } > > > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo > > > return; > > > > > > for_each_hstate(h) { > > > - err = hugetlb_sysfs_add_hstate(h); > > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > > > + hstate_kobjs, &hstate_attr_group); > > > if (err) > > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > > > h->name); > > > } > > > } > > > > > > +#ifdef CONFIG_NUMA > > > +static struct attribute *per_node_hstate_attrs[] = { > > > + &nr_hugepages_attr.attr, > > > + &free_hugepages_attr.attr, > > > + &surplus_hugepages_attr.attr, > > > + NULL, > > > +}; > > > + > > > +static struct attribute_group per_node_hstate_attr_group = { > > > + .attrs = per_node_hstate_attrs, > > > +}; > > > + > > > + > > > +void hugetlb_unregister_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + > > > + for_each_hstate(h) { > > > + kobject_put(node->hstate_kobjs[h - hstates]); > > > + node->hstate_kobjs[h - hstates] = NULL; > > > + } > > > + > > > + kobject_put(node->hugepages_kobj); > > > + node->hugepages_kobj = NULL; > > > +} > > > + > > > +static void hugetlb_unregister_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) > > > + hugetlb_unregister_node(&node_devices[nid]); > > > +} > > > + > > > +void hugetlb_register_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + int err; > > > + > > > + if (!hugepages_kobj) > > > + return; /* too early */ > > > + > > > + node->hugepages_kobj = kobject_create_and_add("hugepages", > > > + &node->sysdev.kobj); > > > + if (!node->hugepages_kobj) > > > + return; > > > + > > > + for_each_hstate(h) { > > > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj, > > > + node->hstate_kobjs, > > > + &per_node_hstate_attr_group); > > > + if (err) > > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > > > + " for node %d\n", > > > + h->name, node->sysdev.id); > > > + } > > > +} > > > + > > > +static void hugetlb_register_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + struct node *node = &node_devices[nid]; > > > + if (node->sysdev.id == nid && !node->hugepages_kobj) > > > + hugetlb_register_node(node); > > > + } > > > +} > > > +#endif > > > + > > > static void __exit hugetlb_exit(void) > > > { > > > struct hstate *h; > > > > > > + hugetlb_unregister_all_nodes(); > > > + > > > for_each_hstate(h) { > > > kobject_put(hstate_kobjs[h - hstates]); > > > } > > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void) > > > > > > hugetlb_sysfs_init(); > > > > > > + hugetlb_register_all_nodes(); > > > + > > > return 0; > > > } > > > module_init(hugetlb_init); > > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta > > > proc_doulongvec_minmax(table, write, file, buffer, length, ppos); > > > > > > if (write) > > > - h->max_huge_pages = set_max_huge_pages(h, tmp); > > > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1); > > > > > > return 0; > > > } > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > > @@ -21,9 +21,12 @@ > > > > > > #include > > > #include > > > +#include > > > > > > struct node { > > > struct sys_device sysdev; > > > + struct kobject *hugepages_kobj; > > > + struct 
kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > > }; > > > > > > struct memory_block; > > > > > > > I'm not against this idea and think it can work side-by-side with the memory > > policies. I believe it does need a bit more cleaning up before merging > > though. I also wasn't able to test this yet due to various build and > > deploy issues. > > OK. I'll do the cleanup. I have tested this atop the mempolicy > version by working around the build issues that I thought were just > temporary glitches in the mmotm series. In my [limited] experience, one > can interleave numactl+hugeadm with setting values via the per node > attributes and it does the right thing. No heavy testing with racing > tasks, tho'. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Wed, 26 Aug 2009 11:12:03 +0100 Message-ID: <20090826101202.GE10955@csn.ul.ie> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825133516.GE21335@csn.ul.ie> <1251233380.16229.3.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1251233380.16229.3.camel@useless.americas.hpqcorp.net> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Tue, Aug 25, 2009 at 04:49:40PM -0400, Lee Schermerhorn wrote: > On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote: > > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote: > > > > > > > > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h > > > =================================================================== > > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400 > > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400 > > > @@ -21,9 +21,12 @@ > > > > > > #include > > > #include > > > +#include > > > > > > > Is this header inclusion necessary? It does not appear to be required by > > the structure modification (which is iffy in itself as discussed in the > > earlier mail) and it breaks build on x86-64. > > Hi, Mel: > > I recall that it is necessary to build. You can try w/o it. > I did, it appeared to work but I didn't dig deep as to why. 
> > CC arch/x86/kernel/setup_percpu.o > > In file included from include/linux/pagemap.h:10, > > from include/linux/mempolicy.h:62, > > from include/linux/hugetlb.h:8, > > from include/linux/node.h:24, > > from include/linux/cpu.h:23, > > from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5, > > from arch/x86/kernel/setup_percpu.c:19: > > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here > > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here > > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration > > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here > > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1 > > make[1]: *** [arch/x86/kernel] Error 2 > > I saw this. I've been testing on x86_64. I *thought* that it only > started showing up in a recent mmotm from changes in the linux-next > patch--e.g., a failure to set ARCH_HAS_KMAP or to handle !ARCH_HAS_KMAP > appropriately in highmem.h. But maybe that was coincidental with my > adding the include. > Maybe we were looking at different mmotm's -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Wed, 26 Aug 2009 14:02:27 -0400 Message-ID: <1251309747.4409.45.camel@useless.americas.hpqcorp.net> References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090826101122.GD10955@csn.ul.ie> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Mel Gorman Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , David Rientjes , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Wed, 2009-08-26 at 11:11 +0100, Mel Gorman wrote: > On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote: > > > > > > > > +static nodemask_t *nodes_allowed_from_node(int nid) > > > > +{ > > > > > > This name is a bit weird. It's creating a nodemask with just a single > > > node allowed. > > > > > > Is there something wrong with using the existing function > > > nodemask_of_node()? If stack is the problem, perhaps there is some macro > > > magic that would allow a nodemask to be either declared on the stack or > > > kmalloc'd. > > > > Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a > > block nested inside the context where it's invoked. I would be > > declaring the nodemask in the compound else clause and don't want to > > access it [via the nodes_allowed pointer] from outside of there. > > > > So, the existence of the mask on the stack is the problem. I can understand that, they are potentially quite large.
> > Would it be possible to add a helper along side it like > init_nodemask_of_node() that does the same work as nodemask_of_node() > but takes a nodemask parameter? nodemask_of_node() would reuse the > init_nodemask_of_node() except it declares the nodemask on the stack. > Here's the patch that introduces the helper function that I propose. I'll send an update of the subject patch that uses this macro and, I think, addresses your other issues via a separate message. This patch applies just before the "register per node attributes" patch. Once we can agree on these [or subsequent] changes, I'll repost the entire updated series. Lee --- PATCH 4/6 - hugetlb: introduce alloc_nodemask_of_node() Against: 2.6.31-rc6-mmotm-090820-1918 Introduce nodemask macro to allocate a nodemask and initialize it to contain a single node, using existing nodemask_of_node() macro. Coded as a macro to avoid header dependency hell. This will be used to construct the huge pages "nodes_allowed" nodemask for a single node when a persistent huge page pool page count is modified via a per node sysfs attribute. Signed-off-by: Lee Schermerhorn include/linux/nodemask.h | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h =================================================================== --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h 2009-08-24 10:16:56.000000000 -0400 +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h 2009-08-26 12:38:31.000000000 -0400 @@ -257,6 +257,16 @@ static inline int __next_node(int n, con m; \ }) +#define alloc_nodemask_of_node(node) \ +({ \ + typeof(_unused_nodemask_arg_) *nmp; \ + nmp = kmalloc(sizeof(*nmp), GFP_KERNEL); \ + if (nmp) \ + *nmp = nodemask_of_node(node); \ + nmp; \ +}) + + #define first_unset_node(mask) __first_unset_node(&(mask)) static inline int __first_unset_node(const nodemask_t *maskp) { From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes Date: Wed, 26 Aug 2009 12:47:57 -0700 (PDT) Message-ID: References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain> <20090824192902.10317.94512.sendpatchset@localhost.localdomain> <20090825101906.GB4427@csn.ul.ie> <1251233369.16229.1.camel@useless.americas.hpqcorp.net> <20090826101122.GD10955@csn.ul.ie> <1251309747.4409.45.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1251316082; bh=VpC4ZXLNBOdhUU6axlq9AD7RDCo=; h=DomainKey-Signature:Date:From:X-X-Sender:To:cc:Subject: In-Reply-To:Message-ID:References:User-Agent:MIME-Version: Content-Type:X-System-Of-Record; b=Zr4uL2UzffYOG8PycRC/MEsjgaTackA pwzRtHNqXfFeJj90RQ6AVAbs5CJIKgRGY9fq3vvS5vhIXqCqmxS+naw== In-Reply-To: <1251309747.4409.45.camel@useless.americas.hpqcorp.net> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com On Wed, 26 Aug 2009, Lee Schermerhorn wrote: > Against: 2.6.31-rc6-mmotm-090820-1918 > > Introduce nodemask macro to allocate a nodemask and > initialize it to contain a single node, using existing > nodemask_of_node() macro. Coded as a macro to avoid header > dependency hell. 
>
> This will be used to construct the huge pages "nodes_allowed" nodemask
> for a single node when a persistent huge page pool page count is
> modified via a per node sysfs attribute.
>
> Signed-off-by: Lee Schermerhorn
>
>  include/linux/nodemask.h |   10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
> 	m;							\
> })
>
> +#define alloc_nodemask_of_node(node)				\
> +({								\
> +	typeof(_unused_nodemask_arg_) *nmp;			\
> +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);		\
> +	if (nmp)						\
> +		*nmp = nodemask_of_node(node);			\
> +	nmp;							\
> +})
> +
> +
> #define first_unset_node(mask) __first_unset_node(&(mask))
> static inline int __first_unset_node(const nodemask_t *maskp)
> {

I think it would probably be better to use the generic NODEMASK_ALLOC()
interface by requiring it to pass the entire type (including "struct")
as part of the first parameter.  Then it automatically takes care of
dynamically allocating large nodemasks vs. allocating them on the stack.

Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case
to be this:

	#define NODEMASK_ALLOC(x, m)	x *m = kmalloc(sizeof(*m), GFP_KERNEL);

and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct
nodemask_scratch, x), and then doing this in your code:

	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
	if (nodes_allowed)
		*nodes_allowed = nodemask_of_node(node);

The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy, so it can
probably be made more general to handle cases like this.

From mboxrd@z Thu Jan  1 00:00:00 1970
From: Lee Schermerhorn
Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes
Date: Wed, 26 Aug 2009 16:46:43 -0400
Message-ID: <1251319603.4409.92.camel@useless.americas.hpqcorp.net>
References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain>
	 <20090824192902.10317.94512.sendpatchset@localhost.localdomain>
	 <20090825101906.GB4427@csn.ul.ie>
	 <1251233369.16229.1.camel@useless.americas.hpqcorp.net>
	 <20090826101122.GD10955@csn.ul.ie>
	 <1251309747.4409.45.camel@useless.americas.hpqcorp.net>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path: 
In-Reply-To: 
Sender: linux-numa-owner@vger.kernel.org
List-ID: 
Content-Type: text/plain; charset="us-ascii"
To: David Rientjes
Cc: Mel Gorman, linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney@hp.com

On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote:
> On Wed, 26 Aug 2009, Lee Schermerhorn wrote:
>
> > Against: 2.6.31-rc6-mmotm-090820-1918
> >
> > Introduce nodemask macro to allocate a nodemask and initialize it to
> > contain a single node, using the existing nodemask_of_node() macro.
> > Coded as a macro to avoid header dependency hell.
> >
> > This will be used to construct the huge pages "nodes_allowed"
> > nodemask for a single node when a persistent huge page pool page
> > count is modified via a per node sysfs attribute.
> >
> > Signed-off-by: Lee Schermerhorn
> >
> >  include/linux/nodemask.h |   10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
> > 	m;							\
> > })
> >
> > +#define alloc_nodemask_of_node(node)				\
> > +({								\
> > +	typeof(_unused_nodemask_arg_) *nmp;			\
> > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);		\
> > +	if (nmp)						\
> > +		*nmp = nodemask_of_node(node);			\
> > +	nmp;							\
> > +})
> > +
> > +
> > #define first_unset_node(mask) __first_unset_node(&(mask))
> > static inline int __first_unset_node(const nodemask_t *maskp)
> > {
>
> I think it would probably be better to use the generic NODEMASK_ALLOC()
> interface by requiring it to pass the entire type (including "struct")
> as part of the first parameter.  Then it automatically takes care of
> dynamically allocating large nodemasks vs. allocating them on the stack.
>
> Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case
> to be this:
>
> 	#define NODEMASK_ALLOC(x, m)	x *m = kmalloc(sizeof(*m), GFP_KERNEL);
>
> and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct
> nodemask_scratch, x), and then doing this in your code:
>
> 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> 	if (nodes_allowed)
> 		*nodes_allowed = nodemask_of_node(node);
>
> The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy, so it can
> probably be made more general to handle cases like this.

I just don't know what that would accomplish.  Heck, I'm not all that
happy with alloc_nodemask_of_node() because it's allocating both a
hidden nodemask_t and a pointer thereto on the stack just to return a
pointer to a kmalloc()ed nodemask_t--which is what I want/need here.

One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al.]
is that it declares the pointer variable as well as initializing it,
perhaps with kmalloc().  Indeed, its purpose is to replace on-stack
nodemask declarations.

So, to use it at the start of, e.g., set_max_huge_pages(), where I can
safely use it throughout the function, I'll end up allocating the
nodes_allowed mask on every call, whether or not a node is specified or
there is a non-default mempolicy.  If it turns out that no node was
specified and we have default policy, we need to free the mask and NULL
out nodes_allowed up front so that we get default behavior.  That seems
uglier to me than only allocating the nodemask when we know we need one.

I'm not opposed to using a generic function/macro where one exists that
suits my purposes; I just don't see one.  I tried to create
one--alloc_nodemask_of_node()--and, to keep Mel happy, I tried to reuse
nodemask_of_node() to initialize it.  I'm really not happy with the
results because of those extra, hidden stack variables.  I could
eliminate those by creating an out-of-line function, but there's no good
place to put a generic nodemask function--there is no nodemask.c.

I'm leaning towards going back to my original hugetlb-private
"nodes_allowed_from_node()" or such.  I can use nodemask_of_node() to
initialize it, if that will make Mel happy, but trying to force fit an
existing "generic" function just because it's generic seems pointless.
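For concreteness, a minimal sketch of what that hugetlb-private helper
might look like (illustrative only; the name, placement, and details are
not final):

	/*
	 * Sketch:  kmalloc() a nodemask and initialize it to contain
	 * just 'nid', reusing the nodemask_of_node() initializer.
	 * Being out of line, the hidden on-stack temporary lives only
	 * in this small frame.  A NULL return means the caller falls
	 * back to default behavior [all online nodes].
	 */
	static nodemask_t *nodes_allowed_from_node(int nid)
	{
		nodemask_t *nodes_allowed;

		nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
		if (nodes_allowed)
			*nodes_allowed = nodemask_of_node(nid);
		return nodes_allowed;
	}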
So, I'm going to let this series rest until I hear back from you and Mel
on how to proceed with this.

Lee

From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mel Gorman
Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes
Date: Thu, 27 Aug 2009 10:52:10 +0100
Message-ID: <20090827095210.GB21183@csn.ul.ie>
References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain>
	 <20090824192902.10317.94512.sendpatchset@localhost.localdomain>
	 <20090825101906.GB4427@csn.ul.ie>
	 <1251233369.16229.1.camel@useless.americas.hpqcorp.net>
	 <20090826101122.GD10955@csn.ul.ie>
	 <1251309747.4409.45.camel@useless.americas.hpqcorp.net>
	 <1251319603.4409.92.camel@useless.americas.hpqcorp.net>
Mime-Version: 1.0
Return-path: 
Content-Disposition: inline
In-Reply-To: <1251319603.4409.92.camel@useless.americas.hpqcorp.net>
Sender: linux-numa-owner@vger.kernel.org
List-ID: 
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Lee Schermerhorn
Cc: David Rientjes, linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney@hp.com

On Wed, Aug 26, 2009 at 04:46:43PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote:
> > On Wed, 26 Aug 2009, Lee Schermerhorn wrote:
> >
> > > Against: 2.6.31-rc6-mmotm-090820-1918
> > >
> > > Introduce nodemask macro to allocate a nodemask and initialize it
> > > to contain a single node, using the existing nodemask_of_node()
> > > macro.  Coded as a macro to avoid header dependency hell.
> > >
> > > This will be used to construct the huge pages "nodes_allowed"
> > > nodemask for a single node when a persistent huge page pool page
> > > count is modified via a per node sysfs attribute.
> > >
> > > Signed-off-by: Lee Schermerhorn
> > >
> > >  include/linux/nodemask.h |   10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> > > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
> > > 	m;							\
> > > })
> > >
> > > +#define alloc_nodemask_of_node(node)				\
> > > +({								\
> > > +	typeof(_unused_nodemask_arg_) *nmp;			\
> > > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);		\
> > > +	if (nmp)						\
> > > +		*nmp = nodemask_of_node(node);			\
> > > +	nmp;							\
> > > +})
> > > +
> > > +
> > > #define first_unset_node(mask) __first_unset_node(&(mask))
> > > static inline int __first_unset_node(const nodemask_t *maskp)
> > > {
> >
> > I think it would probably be better to use the generic NODEMASK_ALLOC()
> > interface by requiring it to pass the entire type (including "struct")
> > as part of the first parameter.  Then it automatically takes care of
> > dynamically allocating large nodemasks vs. allocating them on the stack.
> >
> > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case
> > to be this:
> >
> > 	#define NODEMASK_ALLOC(x, m)	x *m = kmalloc(sizeof(*m), GFP_KERNEL);
> >
> > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct
> > nodemask_scratch, x), and then doing this in your code:
> >
> > 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> > 	if (nodes_allowed)
> > 		*nodes_allowed = nodemask_of_node(node);
> >
> > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy, so it can
> > probably be made more general to handle cases like this.
>
> I just don't know what that would accomplish.  Heck, I'm not all that
> happy with alloc_nodemask_of_node() because it's allocating both a
> hidden nodemask_t and a pointer thereto on the stack just to return a
> pointer to a kmalloc()ed nodemask_t--which is what I want/need here.
>
> One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al.]
> is that it declares the pointer variable as well as initializing it,
> perhaps with kmalloc().  Indeed, its purpose is to replace on-stack
> nodemask declarations.
>
> So, to use it at the start of, e.g., set_max_huge_pages(), where I can
> safely use it throughout the function, I'll end up allocating the
> nodes_allowed mask on every call, whether or not a node is specified or
> there is a non-default mempolicy.  If it turns out that no node was
> specified and we have default policy, we need to free the mask and NULL
> out nodes_allowed up front so that we get default behavior.  That seems
> uglier to me than only allocating the nodemask when we know we need one.
>
> I'm not opposed to using a generic function/macro where one exists that
> suits my purposes; I just don't see one.  I tried to create
> one--alloc_nodemask_of_node()--and, to keep Mel happy, I tried to reuse
> nodemask_of_node() to initialize it.  I'm really not happy with the
> results because of those extra, hidden stack variables.  I could
> eliminate those by creating an out-of-line function, but there's no good
> place to put a generic nodemask function--there is no nodemask.c.
>

Ok.  When I brought the subject up, it looked like you were creating a
hugetlbfs-specific helper that could have been generic.  While that is
still the case, it now looks like the generic helpers would make things
worse and hide side-effects in helper functions that might cause greater
difficulty in the future.  I'm happier to go with the existing code than
I was before, so consider my objection dropped.

> I'm leaning towards going back to my original hugetlb-private
> "nodes_allowed_from_node()" or such.  I can use nodemask_of_node() to
> initialize it, if that will make Mel happy, but trying to force fit an
> existing "generic" function just because it's generic seems pointless.
>
> So, I'm going to let this series rest until I hear back from you and Mel
> on how to proceed with this.
>

I hate to do it to you, but at this point, I'm leaning towards your
current approach.
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Rientjes
Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy
Date: Thu, 27 Aug 2009 12:40:44 -0700 (PDT)
Message-ID: 
References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain>
	 <20090824192752.10317.96125.sendpatchset@localhost.localdomain>
	 <1251233347.16229.0.camel@useless.americas.hpqcorp.net>
Mime-Version: 1.0
Return-path: 
In-Reply-To: <1251233347.16229.0.camel@useless.americas.hpqcorp.net>
Sender: linux-numa-owner@vger.kernel.org
List-ID: 
Content-Type: TEXT/PLAIN; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Lee Schermerhorn
Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Mel Gorman, Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney@hp.com

On Tue, 25 Aug 2009, Lee Schermerhorn wrote:

> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> > > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
> > >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > >  {
> > >  	unsigned long min_count, ret;
> > > +	nodemask_t *nodes_allowed;
> > >
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >
> >
> > Why can't you simply do this?
> >
> > 	struct mempolicy *pol = NULL;
> > 	nodemask_t *nodes_allowed = &node_online_map;
> >
> > 	local_irq_disable();
> > 	pol = current->mempolicy;
> > 	mpol_get(pol);
> > 	local_irq_enable();
> > 	if (pol) {
> > 		switch (pol->mode) {
> > 		case MPOL_BIND:
> > 		case MPOL_INTERLEAVE:
> > 			nodes_allowed = &pol->v.nodes;
> > 			break;
> > 		case MPOL_PREFERRED:
> > 			... use NODEMASK_SCRATCH() ...
> > 		default:
> > 			BUG();
> > 		}
> > 	}
> > 	mpol_put(pol);
> >
> > and then use nodes_allowed throughout set_max_huge_pages()?
>
> Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages().

Yeah, the above code would all be in set_max_huge_pages() and
huge_mpol_nodes_allowed() would be removed.

> NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure
> it will return a kmalloc()'d nodemask, which I need because a NULL
> nodemask pointer means "all online nodes" [really all nodes with memory,
> I suppose] and I need a pointer to a kmalloc()'d nodemask to return from
> huge_mpol_nodes_allowed().  I want to keep the access to the internals
> of mempolicy in mempolicy.[ch], thus the call out to
> huge_mpol_nodes_allowed(), instead of open coding it.

Ok, so you could add a mempolicy.c helper function that returns
nodemask_t * and either points to mpol->v.nodes for most cases, after
getting a reference on mpol with mpol_get(), or points to a dynamically
allocated nodemask, via NODEMASK_ALLOC(), created for MPOL_PREFERRED.
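Something along these lines, as a sketch only (the name
mpol_nodes_allowed() and the details are illustrative, not final; the
v.preferred_node field is assumed from the mempolicy internals):

	/*
	 * Sketch:  return the nodemask to iterate over for the given
	 * (already-referenced) mempolicy.  MPOL_BIND and MPOL_INTERLEAVE
	 * already carry a mask; MPOL_PREFERRED records a single node id,
	 * so kmalloc() a one-node mask for it.
	 */
	nodemask_t *mpol_nodes_allowed(struct mempolicy *mpol)
	{
		nodemask_t *nodes_allowed = NULL;

		switch (mpol->mode) {
		case MPOL_BIND:
		case MPOL_INTERLEAVE:
			nodes_allowed = &mpol->v.nodes;
			break;
		case MPOL_PREFERRED:
			nodes_allowed = kmalloc(sizeof(*nodes_allowed),
						GFP_KERNEL);
			if (nodes_allowed)
				*nodes_allowed =
					nodemask_of_node(mpol->v.preferred_node);
			break;
		default:
			break;
		}
		return nodes_allowed;
	}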
This works nicely because either way you still have a reference to mpol,
so you'll need to call into an mpol_nodemask_free() function which can
use the same switch statement:

	void mpol_nodemask_free(struct mempolicy *mpol,
				nodemask_t *nodes_allowed)
	{
		switch (mpol->mode) {
		case MPOL_PREFERRED:
			kfree(nodes_allowed);
			break;
		default:
			break;
		}
		mpol_put(mpol);
	}

From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mel Gorman
Subject: Re: [PATCH 4/5] hugetlb: add per node hstate attributes
Date: Fri, 28 Aug 2009 11:09:20 +0100
Message-ID: <20090828100919.GC5054@csn.ul.ie>
References: <20090824192437.10317.77172.sendpatchset@localhost.localdomain>
	 <20090824192902.10317.94512.sendpatchset@localhost.localdomain>
	 <20090825101906.GB4427@csn.ul.ie>
	 <1251233369.16229.1.camel@useless.americas.hpqcorp.net>
	 <20090826101122.GD10955@csn.ul.ie>
	 <1251309843.4409.48.camel@useless.americas.hpqcorp.net>
	 <20090827102338.GC21183@csn.ul.ie>
	 <1251391930.4374.89.camel@useless.americas.hpqcorp.net>
Mime-Version: 1.0
Return-path: 
Content-Disposition: inline
In-Reply-To: <1251391930.4374.89.camel@useless.americas.hpqcorp.net>
Sender: linux-numa-owner@vger.kernel.org
List-ID: 
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Lee Schermerhorn
Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, akpm@linux-foundation.org, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney@hp.com

On Thu, Aug 27, 2009 at 12:52:10PM -0400, Lee Schermerhorn wrote:
> > > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >
> > > -	nodes_allowed = huge_mpol_nodes_allowed();
> > > +	if (nid == NO_NODEID_SPECIFIED)
> > > +		nodes_allowed = huge_mpol_nodes_allowed();
> > > +	else {
> > > +		/*
> > > +		 * incoming 'count' is for node 'nid' only, so
> > > +		 * adjust count to global, but restrict alloc/free
> > > +		 * to the specified node.
> > > +		 */
> > > +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > > +		nodes_allowed = alloc_nodemask_of_node(nid);
> >
> > alloc_nodemask_of_node() isn't defined anywhere.
>
> Well, that's because the patch that defines it is in a message that I
> meant to send before this one.  I see it's in my Drafts folder.  I'll
> attach that patch below.  I'm rebasing against the 0827 mmotm, and I'll
> resend the rebased series.  However, I wanted to get your opinion of
> the nodemask patch below.

It looks very reasonable to my eye.  The caller must know that kfree()
is used to free it instead of a free_nodemask_of_node(), but it's not
worth getting into a twist over.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
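For reference, the intended pairing allocates with the macro and frees
with kfree(), roughly like this sketch (hypothetical caller; the
three-argument set_max_huge_pages() signature is illustrative only, as
the posted patch derives nodes_allowed inside that function):

	/*
	 * Sketch:  build a single-node nodes_allowed mask for a per
	 * node sysfs count adjustment.  A NULL mask [e.g., on
	 * allocation failure] falls back to default behavior: all
	 * online nodes.
	 */
	nodemask_t *nodes_allowed = alloc_nodemask_of_node(nid);

	set_max_huge_pages(h, count, nodes_allowed);
	kfree(nodes_allowed);		/* kfree(NULL) is a no-op */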