From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 0C1686B0055 for ; Tue, 16 Jun 2009 09:50:55 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:52:28 -0400 Message-Id: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 0/5] Huge Pages Nodes Allowed Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: Because of assymmetries in some NUMA platforms, and "interesting" topologies emerging in the "scale up x86" world, we have need for better control over the placement of "fresh huge pages". A while back Nish Aravamundan floated a series of patches to add per node controls for allocating pages to the hugepage pool and removing them. Nish apparently moved on to other tasks before those patches were accepted. I have kept a copy of Nish's patches and have intended to rebase and test them and resubmit. In an [off-list] exchange with Mel Gorman, who admits to knowledge in the huge pages area, I asked his opinion of per node controls for huge pages and he suggested another approach: using the mempolicy of the task that changes nr_hugepages to constrain the fresh huge page allocations. I considered this approach but it seemed to me to be a misuse of mempolicy for populating the huge pages free pool. Interleave policy doesn't have same "this node" semantics that we want and bind policy would require constructing a custom node mask for node as well as addressing OOM, which we don't want during fresh huge page allocation. One could derive a node mask of allowed nodes for huge pages from the mempolicy of the task that is modifying nr_hugepages and use that for fresh huge pages with GFP_THISNODE. However, if we're not going to use mempolicy directly--e.g., via alloc_page_current() or alloc_page_vma() [with yet another on-stack pseudo-vma :(]--I thought it cleaner to define a "nodes allowed" nodemask for populating the [persistent] huge pages free pool. This patch series introduces a [per hugepage size] "sysctl", hugepages_nodes_allowed, that specifies a nodemask to constrain the allocation of persistent, fresh huge pages. The nodemask may be specified by a sysctl, a sysfs huge pages attribute and on the kernel boot command line. The series includes a patch to free hugepages from the pool in a "round robin" fashion, interleaved across all on-line nodes to balance the hugepage pool across nodes. Nish had a patch to do this, too. Together, these changes don't provide the fine grain of control that per node attributes would. Specifically, there is no easy way to reduce the persistent huge page count for a specific node. I think the degree of control provided by these patches is the minimal necessary and sufficient for managing the persistent the huge page pool. However, with a bit more reorganization, we could implement per node controls if others would find that useful. For more info, see the patch descriptions and the updated kernel hugepages documentation. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 1177B6B005A for ; Tue, 16 Jun 2009 09:51:19 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:52:53 -0400 Message-Id: <20090616135253.25248.96346.sendpatchset@lts-notebook> In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: [PATCH 2/5] add nodes_allowed members to hugepages hstate struct Against: 17may09 mmotm This patch adds a nodes_allowed nodemask_t pointer and a __nodes_allowed nodemask_t to the hstate struct for constraining fresh hugepage allocations. It then adds sysfs attributes and boot command line options to set and [for sysfs attributes] query the allowed nodes mask. A separate patch will hook up this nodes_allowed mask/pointer to fresh huge page allocation and promoting of surplus pages to persistent. Another patch will add a 'sysctl' hugepages_nodes_allowed to /proc/sys/vm. Signed-off-by: Lee Schermerhorn include/linux/hugetlb.h | 2 + mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 88 insertions(+) Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 @@ -193,6 +193,8 @@ struct hstate { unsigned long resv_huge_pages; unsigned long surplus_huge_pages; unsigned long nr_overcommit_huge_pages; + nodemask_t *nodes_allowed; + nodemask_t __nodes_allowed; struct list_head hugepage_freelists[MAX_NUMNODES]; unsigned int nr_huge_pages_node[MAX_NUMNODES]; unsigned int free_huge_pages_node[MAX_NUMNODES]; Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages); static struct hstate * __initdata parsed_hstate; static unsigned long __initdata default_hstate_max_huge_pages; static unsigned long __initdata default_hstate_size; +static struct hstate __initdata default_boot_hstate = { + .nodes_allowed = &node_online_map, +}; #define for_each_hstate(h) \ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++) @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates( struct hstate *h; for_each_hstate(h) { + if (!h->nodes_allowed) + h->nodes_allowed = &node_online_map; + /* oversize hugepages were init'ed in early boot */ if (h->order < MAX_ORDER) hugetlb_hstate_alloc_pages(h); @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s } HSTATE_ATTR(nr_overcommit_hugepages); +static ssize_t nodes_allowed_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct hstate *h = kobj_to_hstate(kobj); + int len = 3; + + if (h->nodes_allowed == &node_online_map) + strcpy(buf, "all"); + else + len = nodelist_scnprintf(buf, PAGE_SIZE, + *h->nodes_allowed); + + if (len) + buf[len++] = '\n'; + + return len; +} + +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf, + bool lock) +{ + nodemask_t nodes_allowed; + int ret = 1; + bool all = !strncasecmp(buf, "all", 3); + + if (!all) + ret = !nodelist_parse(buf, nodes_allowed); + if (ret) { + if (lock) + spin_lock(&hugetlb_lock); + + if (all) { + h->nodes_allowed = &node_online_map; + } else { + h->__nodes_allowed = nodes_allowed; + h->nodes_allowed = &h->__nodes_allowed; + } + + if (lock) + spin_unlock(&hugetlb_lock); + } + return ret; +} + +static ssize_t nodes_allowed_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct hstate *h = kobj_to_hstate(kobj); + + if (set_hstate_nodes_allowed(h, buf, 1)) + return count; + else + return 0; +} +HSTATE_ATTR(nodes_allowed); + static ssize_t free_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages); static struct attribute *hstate_attrs[] = { &nr_hugepages_attr.attr, &nr_overcommit_hugepages_attr.attr, + &nodes_allowed_attr.attr, &free_hugepages_attr.attr, &resv_hugepages_attr.attr, &surplus_hugepages_attr.attr, @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void) if (default_hstate_max_huge_pages) default_hstate.max_huge_pages = default_hstate_max_huge_pages; + if (default_boot_hstate.nodes_allowed != &node_online_map) { + default_hstate.__nodes_allowed = + default_boot_hstate.__nodes_allowed; + default_hstate.nodes_allowed = + &default_hstate.__nodes_allowed; + } + hugetlb_init_hstates(); gather_bootmem_prealloc(); @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", huge_page_size(h)/1024); + h->nodes_allowed = &node_online_map; parsed_hstate = h; } @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup( } __setup("default_hugepagesz=", hugetlb_default_setup); +static int __init hugetlb_nodes_allowed_setup(char *s) +{ + struct hstate *h = &default_boot_hstate; + + /* + * max_hstate means we've parsed a hugepagesz= parameter, so + * use the [most recently] parsed_hstate. Else use default. + */ + if (max_hstate) + h = parsed_hstate; + + return set_hstate_nodes_allowed(h, s, 0); +} +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup); + static unsigned int cpuset_mems_nr(unsigned int *array) { int node; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id C2FBA6B005A for ; Tue, 16 Jun 2009 09:51:25 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:53:01 -0400 Message-Id: <20090616135301.25248.91276.sendpatchset@lts-notebook> In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Against: 17may09 mmotm Select only nodes from the per hstate nodes_allowed mask when promoting surplus pages to persistent or when allocating fresh huge pages to the pool. Note that alloc_buddy_huge_page() still uses task policy to allocate surplus huge pages. This could be changed. Signed-off-by: Lee Schermerhorn mm/hugetlb.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag static int hstate_next_node(struct hstate *h) { int next_nid; - next_nid = next_node(h->hugetlb_next_nid, node_online_map); + next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed); if (next_nid == MAX_NUMNODES) - next_nid = first_node(node_online_map); + next_nid = first_node(*h->nodes_allowed); h->hugetlb_next_nid = next_nid; return next_nid; } @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct int ret = 0; start_nid = h->hugetlb_next_nid; + /* + * we may have allocated with a different nodes_allowed previously + */ + if (!node_isset(start_nid, *h->nodes_allowed)) + start_nid = hstate_next_node(h); do { page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc /* * Increment or decrement surplus_huge_pages. Keep node-specific counters - * balanced by operating on them in a round-robin fashion. + * balanced by operating on them in a round-robin fashion. Use nodes_allowed + * mask when decreasing suplus pages as we're "promoting" them to persistent. + * Use node_online_map for increment surplus pages as we're demoting previously + * persistent huge pages. + * Called holding the hugetlb_lock. * Returns 1 if an adjustment was made. */ static int adjust_pool_surplus(struct hstate *h, int delta) { + nodemask_t *nodemask = &node_online_map; static int prev_nid; int nid = prev_nid; int ret = 0; VM_BUG_ON(delta != -1 && delta != 1); + if (delta < 0) + nodemask = h->nodes_allowed; + do { - nid = next_node(nid, node_online_map); + nid = next_node(nid, *nodemask); if (nid == MAX_NUMNODES) - nid = first_node(node_online_map); + nid = first_node(*nodemask); /* To shrink on this node, there must be a surplus page */ if (delta < 0 && !h->surplus_huge_pages_node[nid]) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 8749D6B005C for ; Tue, 16 Jun 2009 09:51:35 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:53:08 -0400 Message-Id: <20090616135308.25248.57593.sendpatchset@lts-notebook> In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 4/5] Add sysctl for default hstate nodes_allowed. Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: [PATCH 4/5] add sysctl for default hstate nodes_allowed. Against: 17may09 mmotm This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed -- to set/query the default hstate's nodes_allowed. I don't know that this patch is required, given that we have the per hstate controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls for other recent hugepages controls, like nr_overcommit_hugepages, so I've followed that convention. Factor the formatting of the nodes_allowed mask out of nodes_allowed_show() for use by both that function and the hugetlb_nodes_allowed_handler(). Signed-off-by: Lee Schermerhorn include/linux/hugetlb.h | 1 + kernel/sysctl.c | 8 ++++++++ mm/hugetlb.c | 43 ++++++++++++++++++++++++++++++++++++++----- 3 files changed, 47 insertions(+), 5 deletions(-) Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:35.000000000 -0400 @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); void unmap_hugepage_range(struct vm_area_struct *, Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c 2009-06-04 12:59:26.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c 2009-06-04 12:59:35.000000000 -0400 @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = { .extra1 = (void *)&hugetlb_zero, .extra2 = (void *)&hugetlb_infinity, }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "hugepages_nodes_allowed", + .data = NULL, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &hugetlb_nodes_allowed_handler, + }, #endif { .ctl_name = VM_LOWMEM_RESERVE_RATIO, Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:35.000000000 -0400 @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s } HSTATE_ATTR(nr_overcommit_hugepages); -static ssize_t nodes_allowed_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) +static int format_hstate_nodes_allowed(struct hstate *h, char *buf, + size_t buflen) { - struct hstate *h = kobj_to_hstate(kobj); int len = 3; if (h->nodes_allowed == &node_online_map) strcpy(buf, "all"); else - len = nodelist_scnprintf(buf, PAGE_SIZE, + len = nodelist_scnprintf(buf, buflen, *h->nodes_allowed); + return len; + +} + +static ssize_t nodes_allowed_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct hstate *h = kobj_to_hstate(kobj); + int len = format_hstate_nodes_allowed(h, buf, PAGE_SIZE); - if (len) + if (len && (len +1) < PAGE_SIZE) buf[len++] = '\n'; return len; @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct return 0; } +#define NODES_ALLOWED_MAX 64 +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write, + struct file *file, void __user *buffer, + size_t *length, loff_t *ppos) +{ + struct hstate *h = &default_hstate; + int ret = 0; + + if (write) { + (void)set_hstate_nodes_allowed(h, buffer, 1); + } else { + char buf[NODES_ALLOWED_MAX]; + struct ctl_table tbl = { + .data = buf, + .maxlen = NODES_ALLOWED_MAX, + }; + int len = format_hstate_nodes_allowed(h, buf, sizeof(buf)); + + if (len) + ret = proc_dostring(&tbl, write, file, buffer, + length, ppos); + } + return ret; +} + #endif /* CONFIG_SYSCTL */ void hugetlb_report_meminfo(struct seq_file *m) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id A9C226B005C for ; Tue, 16 Jun 2009 09:51:54 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:53:15 -0400 Message-Id: <20090616135315.25248.7893.sendpatchset@lts-notebook> In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 5/5] Update huge pages kernel documentation Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: PATCH 5/5 - update huge pages kernel documentation Against: 17may09 mmotm Add description of nodes_allowed sysctl and boot parameter and adjust surrounding context to accommodate the description. Signed-off-by: Lee Schermerhorn Documentation/vm/hugetlbpage.txt | 105 ++++++++++++++++++++++++++++----------- 1 file changed, 77 insertions(+), 28 deletions(-) Index: linux-2.6.30-rc8-mmotm-090603-1633/Documentation/vm/hugetlbpage.txt =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/Documentation/vm/hugetlbpage.txt 2009-06-04 11:05:11.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/Documentation/vm/hugetlbpage.txt 2009-06-04 12:59:52.000000000 -0400 @@ -18,13 +18,13 @@ First the Linux kernel needs to be built automatically when CONFIG_HUGETLBFS is selected) configuration options. -The kernel built with hugepage support should show the number of configured -hugepages in the system by running the "cat /proc/meminfo" command. +The kernel built with huge page support should show the number of configured +huge pages in the system by running the "cat /proc/meminfo" command. /proc/meminfo also provides information about the total number of hugetlb pages configured in the kernel. It also displays information about the number of free hugetlb pages at any time. It also displays information about -the configured hugepage size - this is needed for generating the proper +the configured huge page size - this is needed for generating the proper alignment and size of the arguments to the above system calls. The output of "cat /proc/meminfo" will have lines like: @@ -37,25 +37,23 @@ HugePages_Surp: yyy Hugepagesize: zzz kB where: -HugePages_Total is the size of the pool of hugepages. -HugePages_Free is the number of hugepages in the pool that are not yet -allocated. -HugePages_Rsvd is short for "reserved," and is the number of hugepages -for which a commitment to allocate from the pool has been made, but no -allocation has yet been made. It's vaguely analogous to overcommit. -HugePages_Surp is short for "surplus," and is the number of hugepages in -the pool above the value in /proc/sys/vm/nr_hugepages. The maximum -number of surplus hugepages is controlled by -/proc/sys/vm/nr_overcommit_hugepages. +HugePages_Total is the size of the pool of huge pages. +HugePages_Free is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. It's vaguely analogous + to overcommit. +HugePages_Surp is short for "surplus," and is the number of huge pages in + the pool above the value in /proc/sys/vm/nr_hugepages. The + maximum number of surplus huge pages is controlled by + /proc/sys/vm/nr_overcommit_hugepages. /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb -pages in the kernel. Super user can dynamically request more (or free some -pre-configured) hugepages. The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of hugepages is +enough physically contiguous free pages in system (freeing of huge pages is possible only if there are enough hugetlb pages free that can be transferred back to regular memory pool). @@ -67,26 +65,76 @@ use either the mmap system call or share the huge pages. It is required that the system administrator preallocate enough memory for huge page purposes. -Use the following command to dynamically allocate/deallocate hugepages: +The administrator can preallocate huge pages on the kernel boot command line by +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages +requested. This is the most reliable method for preallocating huge pages as +memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To preallocate huge pages +of a specific size, one must preceed the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=". must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=" boot parameter. + +/proc/sys/vm/nr_hugepages indicates the current number of configured [default +size] hugetlb pages in the kernel. Super user can dynamically request more +(or free some pre-configured) hugepages. + +Use the following command to dynamically allocate/deallocate default sized +hugepages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 hugepages in the system. The success -or failure of allocation depends on the amount of physically contiguous -memory that is preset in system at this time. System administrators may want -to put this command in one of the local rc init files. This will enable the -kernel to request huge pages early in the boot process (when the possibility -of getting physical contiguous pages is still very high). In either -case, administrators will want to verify the number of hugepages actually -allocated by checking the sysctl or meminfo. +This command will try to configure 20 default sized hugepages in the system. +On a NUMA platform, the kernel will attempt to distribute the hugepage pool +over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed node mask. +hugepages_nodes_allowed defaults to all on-line nodes. + +To control the nodes on which huge pages are preallocated, the administrator +may set the hugepages_nodes_allowed for the default huge page size using: + + echo >/proc/sys/vm/hugepages_nodes_allowed + +where is a comma separated list of one or more node ranges. For +example, "1,3-5" specifies nodes 1, 3, 4 and 5. Specify "all" to request +huge page preallocation on all on-line nodes. The hugepages_nodes_allowed +parameter may be specified on the kernel boot command line. + +The success or failure of allocation depends on the amount of physically +contiguous memory that is preset in system at this time. If the kernel is +unable to allocate hugepages from some nodes in a NUMA system, it will +attempt to make up the difference by allocating extra pages on other nodes +with available contiguous memory, if any, within the constraints of the +allowed nodes. + +System administrators may want to put this command in one of the local rc init +files. This will enable the kernel to request hugepages early in the boot +process (when the possibility of getting physical contiguous pages is still +very high). In either case, administrators will want to verify the number of +hugepages actually allocated by checking the sysctl or meminfo. To check the +per node distribution of huge pages, use: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge /proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are requested by applications. echo'ing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain +indicates that the hugetlb subsystem is allowed to try to obtain "surplus" hugepages from the buddy allocator, if the normal pool is exhausted. As these surplus hugepages go out of use, they are freed back to the buddy -allocator. +allocator. Note that surplus hugepages are not constrained by the +hugepages_nodes_allowed mask. + +When increasing the huge page pool size via nr_hugepages, any surplus +pages on the allowed nodes will first be promoted to persistent huge +pages. Then, additional huge pages will be allocated from the allowed +nodes, if necessary and if possible, to fulfil the new pool size. + +The administrator may shrink the pool of preallocated huge pages for +the default huge page size by setting the nr_hugepages sysctl to a +smaller value. The kernel will attempt to free huge pages "round robin" +across all on-line nodes, ignoring the nodes_allowed mask. Any free huge +pages on the selected nodes will be freed back to the buddy allocator. Caveat: Shrinking the pool via nr_hugepages such that it becomes less than the number of hugepages in use will convert the balance to surplus @@ -112,6 +160,7 @@ Inside each of these directories, the sa nr_hugepages nr_overcommit_hugepages + nodes_allowed free_hugepages resv_hugepages surplus_hugepages -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 1AF116B005D for ; Tue, 16 Jun 2009 09:51:56 -0400 (EDT) From: Lee Schermerhorn Date: Tue, 16 Jun 2009 09:52:36 -0400 Message-Id: <20090616135236.25248.93692.sendpatchset@lts-notebook> In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> Subject: [PATCH 1/5] Free huge pages round robin to balance across nodes Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: [PATCH 1/5] Free huge pages round robin to balance across nodes Against: 17may09 mmotm Currently, altho' increasing nr_hugepages will [attempt to] distribute the new huge pages across all nodes in the system, reducing nr_hugepages will free or surplus all free pages from nodes in node id order. This patch frees huges pages from nodes in round robin fashion in an attempt to keep [persistent] hugepage allocates balanced across the nodes. New function free_pool_huge_page() is modeled on and performs roughly the inverse of alloc_fresh_huge_page(). Replaces dequeue_huge_page() which now has no callers and can be removed. Helper function hstate_next_to_free_node() uses new hstate member next_to_free_nid to distribute "frees" across all nodes with huge pages. I placed this patch first in the series because I think it [or something similar] should be applied independent of the rest of the series. Signed-off-by: Lee Schermerhorn include/linux/hugetlb.h | 1 mm/hugetlb.c | 68 +++++++++++++++++++++++++++++++++--------------- 2 files changed, 48 insertions(+), 21 deletions(-) Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:29.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 @@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area( /* Defines one hugetlb page size */ struct hstate { int hugetlb_next_nid; + int next_to_free_nid; unsigned int order; unsigned long mask; unsigned long max_huge_pages; Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c =================================================================== --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:29.000000000 -0400 +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst h->free_huge_pages_node[nid]++; } -static struct page *dequeue_huge_page(struct hstate *h) -{ - int nid; - struct page *page = NULL; - - for (nid = 0; nid < MAX_NUMNODES; ++nid) { - if (!list_empty(&h->hugepage_freelists[nid])) { - page = list_entry(h->hugepage_freelists[nid].next, - struct page, lru); - list_del(&page->lru); - h->free_huge_pages--; - h->free_huge_pages_node[nid]--; - break; - } - } - return page; -} - static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address, int avoid_reserve) @@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct return ret; } +/* + * helper for free_pool_huge_page() - find next node + * from which to free a huge page + */ +static int hstate_next_to_free_node(struct hstate *h) +{ + int next_nid; + next_nid = next_node(h->next_to_free_nid, node_online_map); + if (next_nid == MAX_NUMNODES) + next_nid = first_node(node_online_map); + h->next_to_free_nid = next_nid; + return next_nid; +} + +/* + * Free huge page from pool from next node to free. + * Attempt to keep persistent huge pages more or less + * balanced over allowed nodes. + * Called with hugetlb_lock locked. + */ +static int free_pool_huge_page(struct hstate *h) +{ + int start_nid; + int nid; + int ret = 0; + + start_nid = h->next_to_free_nid; + nid = h->next_to_free_nid; + + do { + if (!list_empty(&h->hugepage_freelists[nid])) { + struct page *page = + list_entry(h->hugepage_freelists[nid].next, + struct page, lru); + list_del(&page->lru); + h->free_huge_pages--; + h->free_huge_pages_node[nid]--; + update_and_free_page(h, page); + ret = 1; + } + nid = hstate_next_to_free_node(h); + } while (!ret && nid != start_nid); + + return ret; +} + static struct page *alloc_buddy_huge_page(struct hstate *h, struct vm_area_struct *vma, unsigned long address) { @@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages( min_count = max(count, min_count); try_to_free_low(h, min_count); while (min_count < persistent_huge_pages(h)) { - struct page *page = dequeue_huge_page(h); - if (!page) + if (!free_pool_huge_page(h)) break; - update_and_free_page(h, page); } while (count < persistent_huge_pages(h)) { if (!adjust_pool_surplus(h, 1)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 043DC6B0055 for ; Wed, 17 Jun 2009 09:02:18 -0400 (EDT) Date: Wed, 17 Jun 2009 14:02:16 +0100 From: Mel Gorman Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed Message-ID: <20090617130216.GF28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090616135228.25248.22018.sendpatchset@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote: > Because of assymmetries in some NUMA platforms, and "interesting" > topologies emerging in the "scale up x86" world, we have need for > better control over the placement of "fresh huge pages". A while > back Nish Aravamundan floated a series of patches to add per node > controls for allocating pages to the hugepage pool and removing > them. Nish apparently moved on to other tasks before those patches > were accepted. I have kept a copy of Nish's patches and have > intended to rebase and test them and resubmit. > > In an [off-list] exchange with Mel Gorman, who admits to knowledge > in the huge pages area, I asked his opinion of per node controls > for huge pages and he suggested another approach: using the mempolicy > of the task that changes nr_hugepages to constrain the fresh huge > page allocations. I considered this approach but it seemed to me > to be a misuse of mempolicy for populating the huge pages free > pool. Why would it be a misuse? Fundamentally, the huge page pools are being filled by the current process when nr_hugepages is being used. Or are you concerned about the specification of hugepages on the kernel command line? > Interleave policy doesn't have same "this node" semantics > that we want By "this node" semantics, do you mean allocating from one specific node? In that case, why would specifying a nodemask of just one node not be sufficient? > and bind policy would require constructing a custom > node mask for node as well as addressing OOM, which we don't want > during fresh huge page allocation. Would the required mask not already be setup when the process set the policy? OOM is not a major concern, it doesn't trigger for failed hugepage allocations. > One could derive a node mask > of allowed nodes for huge pages from the mempolicy of the task > that is modifying nr_hugepages and use that for fresh huge pages > with GFP_THISNODE. However, if we're not going to use mempolicy > directly--e.g., via alloc_page_current() or alloc_page_vma() [with > yet another on-stack pseudo-vma :(]--I thought it cleaner to > define a "nodes allowed" nodemask for populating the [persistent] > huge pages free pool. > How about adding alloc_page_mempolicy() that takes the explicit mempolicy you need? > This patch series introduces a [per hugepage size] "sysctl", > hugepages_nodes_allowed, that specifies a nodemask to constrain > the allocation of persistent, fresh huge pages. The nodemask > may be specified by a sysctl, a sysfs huge pages attribute and > on the kernel boot command line. > > The series includes a patch to free hugepages from the pool in a > "round robin" fashion, interleaved across all on-line nodes to > balance the hugepage pool across nodes. Nish had a patch to do > this, too. > > Together, these changes don't provide the fine grain of control > that per node attributes would. I'm failing to understand at the moment why mem policies set by numactl would not do the job for allocation at least. Freeing is a different problem. > Specifically, there is no easy > way to reduce the persistent huge page count for a specific node. > I think the degree of control provided by these patches is the > minimal necessary and sufficient for managing the persistent the > huge page pool. However, with a bit more reorganization, we > could implement per node controls if others would find that > useful. > > For more info, see the patch descriptions and the updated kernel > hugepages documentation. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 355B86B005A for ; Wed, 17 Jun 2009 09:18:05 -0400 (EDT) Date: Wed, 17 Jun 2009 14:18:33 +0100 From: Mel Gorman Subject: Re: [PATCH 1/5] Free huge pages round robin to balance across nodes Message-ID: <20090617131833.GG28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135236.25248.93692.sendpatchset@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090616135236.25248.93692.sendpatchset@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 09:52:36AM -0400, Lee Schermerhorn wrote: > [PATCH 1/5] Free huge pages round robin to balance across nodes > > Against: 17may09 mmotm > > Currently, altho' increasing nr_hugepages will [attempt to] > distribute the new huge pages across all nodes in the system, > reducing nr_hugepages will free or surplus all free pages > from nodes in node id order. This patch frees huges pages > from nodes in round robin fashion in an attempt to keep > [persistent] hugepage allocates balanced across the nodes. > > New function free_pool_huge_page() is modeled on and > performs roughly the inverse of alloc_fresh_huge_page(). > Replaces dequeue_huge_page() which now has no callers > and can be removed. > > Helper function hstate_next_to_free_node() uses new hstate > member next_to_free_nid to distribute "frees" across all > nodes with huge pages. > > I placed this patch first in the series because I think it > [or something similar] should be applied independent of the > rest of the series. > Agreed. Reading though, I can't see any problems with the patch and it does make the freeing symmetric with the allocation. For clarity though, would it be worth renaming hugetlb_next_nid to next_to_alloc_nid so that there is a clear relationship in the round-robin allocation and freeing of pages amoung online nodes? > Signed-off-by: Lee Schermerhorn > > include/linux/hugetlb.h | 1 > mm/hugetlb.c | 68 +++++++++++++++++++++++++++++++++--------------- > 2 files changed, 48 insertions(+), 21 deletions(-) > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:29.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 > @@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area( > /* Defines one hugetlb page size */ > struct hstate { > int hugetlb_next_nid; > + int next_to_free_nid; > unsigned int order; > unsigned long mask; > unsigned long max_huge_pages; > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:29.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst > h->free_huge_pages_node[nid]++; > } > > -static struct page *dequeue_huge_page(struct hstate *h) > -{ > - int nid; > - struct page *page = NULL; > - > - for (nid = 0; nid < MAX_NUMNODES; ++nid) { > - if (!list_empty(&h->hugepage_freelists[nid])) { > - page = list_entry(h->hugepage_freelists[nid].next, > - struct page, lru); > - list_del(&page->lru); > - h->free_huge_pages--; > - h->free_huge_pages_node[nid]--; > - break; > - } > - } > - return page; > -} > - > static struct page *dequeue_huge_page_vma(struct hstate *h, > struct vm_area_struct *vma, > unsigned long address, int avoid_reserve) > @@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct > return ret; > } > > +/* > + * helper for free_pool_huge_page() - find next node > + * from which to free a huge page > + */ > +static int hstate_next_to_free_node(struct hstate *h) > +{ > + int next_nid; > + next_nid = next_node(h->next_to_free_nid, node_online_map); > + if (next_nid == MAX_NUMNODES) > + next_nid = first_node(node_online_map); > + h->next_to_free_nid = next_nid; > + return next_nid; > +} > + > +/* > + * Free huge page from pool from next node to free. > + * Attempt to keep persistent huge pages more or less > + * balanced over allowed nodes. > + * Called with hugetlb_lock locked. > + */ > +static int free_pool_huge_page(struct hstate *h) > +{ > + int start_nid; > + int nid; > + int ret = 0; > + > + start_nid = h->next_to_free_nid; > + nid = h->next_to_free_nid; > + > + do { > + if (!list_empty(&h->hugepage_freelists[nid])) { > + struct page *page = > + list_entry(h->hugepage_freelists[nid].next, > + struct page, lru); > + list_del(&page->lru); > + h->free_huge_pages--; > + h->free_huge_pages_node[nid]--; > + update_and_free_page(h, page); > + ret = 1; > + } > + nid = hstate_next_to_free_node(h); > + } while (!ret && nid != start_nid); > + > + return ret; > +} > + > static struct page *alloc_buddy_huge_page(struct hstate *h, > struct vm_area_struct *vma, unsigned long address) > { > @@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages( > min_count = max(count, min_count); > try_to_free_low(h, min_count); > while (min_count < persistent_huge_pages(h)) { > - struct page *page = dequeue_huge_page(h); > - if (!page) > + if (!free_pool_huge_page(h)) > break; > - update_and_free_page(h, page); > } > while (count < persistent_huge_pages(h)) { > if (!adjust_pool_surplus(h, 1)) > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 79F2D6B0082 for ; Wed, 17 Jun 2009 09:34:40 -0400 (EDT) Date: Wed, 17 Jun 2009 14:35:41 +0100 From: Mel Gorman Subject: Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Message-ID: <20090617133541.GH28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135253.25248.96346.sendpatchset@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090616135253.25248.96346.sendpatchset@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote: > [PATCH 2/5] add nodes_allowed members to hugepages hstate struct > > Against: 17may09 mmotm > > This patch adds a nodes_allowed nodemask_t pointer and a > __nodes_allowed nodemask_t to the hstate struct for constraining > fresh hugepage allocations. It then adds sysfs attributes and > boot command line options to set and [for sysfs attributes] query > the allowed nodes mask. > > A separate patch will hook up this nodes_allowed mask/pointer to > fresh huge page allocation and promoting of surplus pages to > persistent. Another patch will add a 'sysctl' hugepages_nodes_allowed > to /proc/sys/vm. > > Signed-off-by: Lee Schermerhorn > > include/linux/hugetlb.h | 2 + > mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 88 insertions(+) > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > @@ -193,6 +193,8 @@ struct hstate { > unsigned long resv_huge_pages; > unsigned long surplus_huge_pages; > unsigned long nr_overcommit_huge_pages; > + nodemask_t *nodes_allowed; > + nodemask_t __nodes_allowed; Can you add a comment as to why a nodemask pointer and a nodemask itself with very similar field names are both needed? > struct list_head hugepage_freelists[MAX_NUMNODES]; > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > unsigned int free_huge_pages_node[MAX_NUMNODES]; > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages); > static struct hstate * __initdata parsed_hstate; > static unsigned long __initdata default_hstate_max_huge_pages; > static unsigned long __initdata default_hstate_size; > +static struct hstate __initdata default_boot_hstate = { > + .nodes_allowed = &node_online_map, > +}; > > #define for_each_hstate(h) \ > for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++) > @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates( > struct hstate *h; > > for_each_hstate(h) { > + if (!h->nodes_allowed) > + h->nodes_allowed = &node_online_map; > + The reasoning for two nodes_allowed would appear to be to allow nodes_allowed to be some predefined mash or a hstate-private mask. As the mask has to exist anyway, can you not just copy it in rather than having a pointer and a mask? It would make the check below as to whether all nodes allowed or not a bit more expensive but it's not a big deal. > /* oversize hugepages were init'ed in early boot */ > if (h->order < MAX_ORDER) > hugetlb_hstate_alloc_pages(h); > @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s > } > HSTATE_ATTR(nr_overcommit_hugepages); > > +static ssize_t nodes_allowed_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + struct hstate *h = kobj_to_hstate(kobj); > + int len = 3; > + > + if (h->nodes_allowed == &node_online_map) > + strcpy(buf, "all"); > + else > + len = nodelist_scnprintf(buf, PAGE_SIZE, > + *h->nodes_allowed); > + > + if (len) > + buf[len++] = '\n'; > + buf doesn't get NULL terminated. > + return len; > +} Can print_nodes_state() be extended a little to print "all" and then shared with here? > + > +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf, > + bool lock) > +{ > + nodemask_t nodes_allowed; > + int ret = 1; > + bool all = !strncasecmp(buf, "all", 3); > + > + if (!all) > + ret = !nodelist_parse(buf, nodes_allowed); > + if (ret) { > + if (lock) > + spin_lock(&hugetlb_lock); > + ick. Convention for something like this is to have set_hstate_nodes_allowed that takes a spinlock and __set_hstate_nodes_allowed that is the lock-free version and call as appropriate. Can the same be done here? For that matter, does taking spinlocks from __setup() break that you avoid taking it in that case? > + if (all) { > + h->nodes_allowed = &node_online_map; > + } else { > + h->__nodes_allowed = nodes_allowed; > + h->nodes_allowed = &h->__nodes_allowed; > + } > + > + if (lock) > + spin_unlock(&hugetlb_lock); > + } > + return ret; > +} > + > +static ssize_t nodes_allowed_store(struct kobject *kobj, > + struct kobj_attribute *attr, const char *buf, size_t count) > +{ > + struct hstate *h = kobj_to_hstate(kobj); > + > + if (set_hstate_nodes_allowed(h, buf, 1)) > + return count; > + else > + return 0; > +} > +HSTATE_ATTR(nodes_allowed); > + > static ssize_t free_hugepages_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages); > static struct attribute *hstate_attrs[] = { > &nr_hugepages_attr.attr, > &nr_overcommit_hugepages_attr.attr, > + &nodes_allowed_attr.attr, > &free_hugepages_attr.attr, > &resv_hugepages_attr.attr, > &surplus_hugepages_attr.attr, > @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void) > if (default_hstate_max_huge_pages) > default_hstate.max_huge_pages = default_hstate_max_huge_pages; > > + if (default_boot_hstate.nodes_allowed != &node_online_map) { > + default_hstate.__nodes_allowed = > + default_boot_hstate.__nodes_allowed; > + default_hstate.nodes_allowed = > + &default_hstate.__nodes_allowed; > + } > + > hugetlb_init_hstates(); > > gather_bootmem_prealloc(); > @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > huge_page_size(h)/1024); > > + h->nodes_allowed = &node_online_map; > parsed_hstate = h; > } > > @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup( > } > __setup("default_hugepagesz=", hugetlb_default_setup); > > +static int __init hugetlb_nodes_allowed_setup(char *s) > +{ > + struct hstate *h = &default_boot_hstate; > + > + /* > + * max_hstate means we've parsed a hugepagesz= parameter, so > + * use the [most recently] parsed_hstate. Else use default. > + */ > + if (max_hstate) > + h = parsed_hstate; > + > + return set_hstate_nodes_allowed(h, s, 0); > +} > +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup); > + > static unsigned int cpuset_mems_nr(unsigned int *array) > { > int node; > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 853DA6B0055 for ; Wed, 17 Jun 2009 09:38:03 -0400 (EDT) Date: Wed, 17 Jun 2009 14:39:11 +0100 From: Mel Gorman Subject: Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Message-ID: <20090617133911.GI28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135301.25248.91276.sendpatchset@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090616135301.25248.91276.sendpatchset@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote: > [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation > > Against: 17may09 mmotm > > Select only nodes from the per hstate nodes_allowed mask when > promoting surplus pages to persistent or when allocating fresh > huge pages to the pool. > > Note that alloc_buddy_huge_page() still uses task policy to allocate > surplus huge pages. This could be changed. > > Signed-off-by: Lee Schermerhorn > > mm/hugetlb.c | 23 ++++++++++++++++++----- > 1 file changed, 18 insertions(+), 5 deletions(-) > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag > static int hstate_next_node(struct hstate *h) > { > int next_nid; > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > + next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed); > if (next_nid == MAX_NUMNODES) > - next_nid = first_node(node_online_map); > + next_nid = first_node(*h->nodes_allowed); > h->hugetlb_next_nid = next_nid; > return next_nid; > } > @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct > int ret = 0; > > start_nid = h->hugetlb_next_nid; > + /* > + * we may have allocated with a different nodes_allowed previously > + */ > + if (!node_isset(start_nid, *h->nodes_allowed)) > + start_nid = hstate_next_node(h); > > do { > page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc > > /* > * Increment or decrement surplus_huge_pages. Keep node-specific counters > - * balanced by operating on them in a round-robin fashion. > + * balanced by operating on them in a round-robin fashion. Use nodes_allowed > + * mask when decreasing suplus pages as we're "promoting" them to persistent. s/suplus/surplus/ > + * Use node_online_map for increment surplus pages as we're demoting previously > + * persistent huge pages. > + * Called holding the hugetlb_lock. > * Returns 1 if an adjustment was made. > */ > static int adjust_pool_surplus(struct hstate *h, int delta) > { > + nodemask_t *nodemask = &node_online_map; > static int prev_nid; > int nid = prev_nid; > int ret = 0; > > VM_BUG_ON(delta != -1 && delta != 1); > + if (delta < 0) > + nodemask = h->nodes_allowed; > + Please spell out why nodes_allowed is only used when decreasing the surplus count. > do { > - nid = next_node(nid, node_online_map); > + nid = next_node(nid, *nodemask); > if (nid == MAX_NUMNODES) > - nid = first_node(node_online_map); > + nid = first_node(*nodemask); > > /* To shrink on this node, there must be a surplus page */ > if (delta < 0 && !h->surplus_huge_pages_node[nid]) > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id C9C536B0062 for ; Wed, 17 Jun 2009 09:39:58 -0400 (EDT) Date: Wed, 17 Jun 2009 14:41:07 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed. Message-ID: <20090617134107.GJ28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135308.25248.57593.sendpatchset@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20090616135308.25248.57593.sendpatchset@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote: > [PATCH 4/5] add sysctl for default hstate nodes_allowed. > > Against: 17may09 mmotm > > This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed -- > to set/query the default hstate's nodes_allowed. I don't know > that this patch is required, given that we have the per hstate > controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls > for other recent hugepages controls, like nr_overcommit_hugepages, > so I've followed that convention. > Yeah, it's somewhat expected that what is in /proc/sys/vm is information on the default hugepage size. > Factor the formatting of the nodes_allowed mask out of nodes_allowed_show() > for use by both that function and the hugetlb_nodes_allowed_handler(). > > Signed-off-by: Lee Schermerhorn > > include/linux/hugetlb.h | 1 + > kernel/sysctl.c | 8 ++++++++ > mm/hugetlb.c | 43 ++++++++++++++++++++++++++++++++++++++----- > 3 files changed, 47 insertions(+), 5 deletions(-) > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:35.000000000 -0400 > @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm > int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); > int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); > void unmap_hugepage_range(struct vm_area_struct *, > Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c 2009-06-04 12:59:26.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c 2009-06-04 12:59:35.000000000 -0400 > @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = { > .extra1 = (void *)&hugetlb_zero, > .extra2 = (void *)&hugetlb_infinity, > }, > + { > + .ctl_name = CTL_UNNUMBERED, > + .procname = "hugepages_nodes_allowed", > + .data = NULL, > + .maxlen = sizeof(unsigned long), > + .mode = 0644, > + .proc_handler = &hugetlb_nodes_allowed_handler, > + }, > #endif > { > .ctl_name = VM_LOWMEM_RESERVE_RATIO, > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > =================================================================== > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:35.000000000 -0400 > @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s > } > HSTATE_ATTR(nr_overcommit_hugepages); > > -static ssize_t nodes_allowed_show(struct kobject *kobj, > - struct kobj_attribute *attr, char *buf) > +static int format_hstate_nodes_allowed(struct hstate *h, char *buf, > + size_t buflen) > { > - struct hstate *h = kobj_to_hstate(kobj); > int len = 3; > > if (h->nodes_allowed == &node_online_map) > strcpy(buf, "all"); > else > - len = nodelist_scnprintf(buf, PAGE_SIZE, > + len = nodelist_scnprintf(buf, buflen, > *h->nodes_allowed); > + return len; > + > +} This looks like unnecessary churn and could have been done in the earlier patch introducing nodes_allowed_show() > + > +static ssize_t nodes_allowed_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + struct hstate *h = kobj_to_hstate(kobj); > + int len = format_hstate_nodes_allowed(h, buf, PAGE_SIZE); > > - if (len) > + if (len && (len +1) < PAGE_SIZE) > buf[len++] = '\n'; > > return len; > @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct > return 0; > } > > +#define NODES_ALLOWED_MAX 64 > +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write, > + struct file *file, void __user *buffer, > + size_t *length, loff_t *ppos) > +{ > + struct hstate *h = &default_hstate; > + int ret = 0; > + > + if (write) { > + (void)set_hstate_nodes_allowed(h, buffer, 1); > + } else { > + char buf[NODES_ALLOWED_MAX]; > + struct ctl_table tbl = { > + .data = buf, > + .maxlen = NODES_ALLOWED_MAX, > + }; > + int len = format_hstate_nodes_allowed(h, buf, sizeof(buf)); > + > + if (len) > + ret = proc_dostring(&tbl, write, file, buffer, > + length, ppos); > + } > + return ret; > +} > + > #endif /* CONFIG_SYSCTL */ > > void hugetlb_report_meminfo(struct seq_file *m) > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id E3FCA6B0055 for ; Wed, 17 Jun 2009 13:15:31 -0400 (EDT) Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed From: Lee Schermerhorn In-Reply-To: <20090617130216.GF28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> Content-Type: text/plain Date: Wed, 17 Jun 2009 13:15:54 -0400 Message-Id: <1245258954.6235.58.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote: > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote: > > Because of assymmetries in some NUMA platforms, and "interesting" > > topologies emerging in the "scale up x86" world, we have need for > > better control over the placement of "fresh huge pages". A while > > back Nish Aravamundan floated a series of patches to add per node > > controls for allocating pages to the hugepage pool and removing > > them. Nish apparently moved on to other tasks before those patches > > were accepted. I have kept a copy of Nish's patches and have > > intended to rebase and test them and resubmit. > > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge > > in the huge pages area, I asked his opinion of per node controls > > for huge pages and he suggested another approach: using the mempolicy > > of the task that changes nr_hugepages to constrain the fresh huge > > page allocations. I considered this approach but it seemed to me > > to be a misuse of mempolicy for populating the huge pages free > > pool. > > Why would it be a misuse? Fundamentally, the huge page pools are being > filled by the current process when nr_hugepages is being used. Or are > you concerned about the specification of hugepages on the kernel command > line? Well, "misuse" might be too strong [and I already softened it from "abuse" :)]--more like "not a good fit, IMO". I agree that it could be made to work, but it just didn't "feel" right. I suppose that we could use alloc_pages_current() with the necessary gfp flags to specify the "thisnode"/"exact_node" semantics [no fallback], making sure we didn't OOM kill the task if a node couldn't satisfy the huge page allocation, ... I think it would require major restructuring of set_max_huge_pages(), ... Maybe that would be the right thing to do and maybe we'll end up there, but I was looking for a simpler approach. And, yes, the kernel command line was a consideration. Because of that, I considered specifying a boot time "nodes_allowed" mask and constructing a "nodes_allowed" mask at run time from the current task's policy. But, it seemed more straightforward to my twisted way of thinking to make the nodes_allowed mask a bona fide hstate attribute. > > > Interleave policy doesn't have same "this node" semantics > > that we want > > By "this node" semantics, do you mean allocating from one specific node? Yes. > In that case, why would specifying a nodemask of just one node not be > sufficient? Maybe along with GFP_THISNODE. In general, we want to interleave the persistent huge pages over multiple nodes, but we don't want the allocations to succeed by falling back to another node. Then we end up with unbalanced allocations. This is how fresh huge page allocation worked before it was changed to use "THISNODE". > > > and bind policy would require constructing a custom > > node mask for node as well as addressing OOM, which we don't want > > during fresh huge page allocation. > > Would the required mask not already be setup when the process set the > policy? OOM is not a major concern, it doesn't trigger for failed > hugepage allocations. Perhaps not with the current implementation. However, a certain, currently shipping, enterprise distro added a "quick and dirty" [well, "dirty", anyway :)] implementation of "thisnode" behavior to elimiate the unwanted fallback behavior and resulting huge pages imbalance by constructing an ad hoc "MPOL_BIND" policy for allocating huge pages. We found that this DOES result in oom kill of the task increasing nr_hugepages if it encountered a node w/ no available huge pages. If one used "echo" [usually a shell builtin] to increase nr_hugepages, it killed off your shell. sysctl was somewhat better--only killed sysctl. The good news was that oom kill just sets a thread info flag, so the nr_hugepages handler continued to allocate pages around the nodes and further oom kills were effective no-ops. When the allocations was done, the victim task would die. End users might not consider this acceptable behavior. I wasn't sure we'd avoid this situation if we dropped back to using task mempolicy via, e.g., alloc_pages_current(). So, I thought I'd run this proposal [nodes_allowed] by you. > > > One could derive a node mask > > of allowed nodes for huge pages from the mempolicy of the task > > that is modifying nr_hugepages and use that for fresh huge pages > > with GFP_THISNODE. However, if we're not going to use mempolicy > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with > > yet another on-stack pseudo-vma :(]--I thought it cleaner to > > define a "nodes allowed" nodemask for populating the [persistent] > > huge pages free pool. > > > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy > you need? Interesting you should mention this. Somewhat off topic: I have a "cleanup" and reorg of shared policy and vma policy that separates policy lookup from allocation, and adds an "alloc_page_pol()" function. This is part of my series to generalize shared policy and extend it to shared, mmap()ed regular files. That aspect of the series [shared policy on shared mmaped files] got a lot of push back without any consideration of the technical details of the patches themselves. Besides having need/requests for this capability, the resulting cleanup, removal of all on-stack pseudo-vmas, ..., the series actually seems to perform [slightly] better on the testing I've done. I keep this series up to date and hope to repost again sometime with benchmark results. > > > This patch series introduces a [per hugepage size] "sysctl", > > hugepages_nodes_allowed, that specifies a nodemask to constrain > > the allocation of persistent, fresh huge pages. The nodemask > > may be specified by a sysctl, a sysfs huge pages attribute and > > on the kernel boot command line. > > > > The series includes a patch to free hugepages from the pool in a > > "round robin" fashion, interleaved across all on-line nodes to > > balance the hugepage pool across nodes. Nish had a patch to do > > this, too. > > > > Together, these changes don't provide the fine grain of control > > that per node attributes would. > > I'm failing to understand at the moment why mem policies set by numactl > would not do the job for allocation at least. Freeing is a different problem. They could be made to work. I actually started coding up a patch to extract a "nodes allowed" mask from the policy for use with alloc_fresh_huge_page[_node]() so that I could maintain the overall structure and use alloc_page_exact_node() with nodes in the allowed mask. And, as you say, with some investigation, we may find that we can use alloc_pages_current() with appropriate flags to achieve the exact node semantics on each node in the policy. > > > Specifically, there is no easy > > way to reduce the persistent huge page count for a specific node. > > I think the degree of control provided by these patches is the > > minimal necessary and sufficient for managing the persistent the > > huge page pool. However, with a bit more reorganization, we > > could implement per node controls if others would find that > > useful. > > > > For more info, see the patch descriptions and the updated kernel > > hugepages documentation. > > > More in response to your comments on the individual patches. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 142E16B005C for ; Wed, 17 Jun 2009 13:16:33 -0400 (EDT) Subject: Re: [PATCH 1/5] Free huge pages round robin to balance across nodes From: Lee Schermerhorn In-Reply-To: <20090617131833.GG28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135236.25248.93692.sendpatchset@lts-notebook> <20090617131833.GG28529@csn.ul.ie> Content-Type: text/plain Date: Wed, 17 Jun 2009 13:16:58 -0400 Message-Id: <1245259018.6235.59.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-06-17 at 14:18 +0100, Mel Gorman wrote: > On Tue, Jun 16, 2009 at 09:52:36AM -0400, Lee Schermerhorn wrote: > > [PATCH 1/5] Free huge pages round robin to balance across nodes > > > > Against: 17may09 mmotm > > > > Currently, altho' increasing nr_hugepages will [attempt to] > > distribute the new huge pages across all nodes in the system, > > reducing nr_hugepages will free or surplus all free pages > > from nodes in node id order. This patch frees huges pages > > from nodes in round robin fashion in an attempt to keep > > [persistent] hugepage allocates balanced across the nodes. > > > > New function free_pool_huge_page() is modeled on and > > performs roughly the inverse of alloc_fresh_huge_page(). > > Replaces dequeue_huge_page() which now has no callers > > and can be removed. > > > > Helper function hstate_next_to_free_node() uses new hstate > > member next_to_free_nid to distribute "frees" across all > > nodes with huge pages. > > > > I placed this patch first in the series because I think it > > [or something similar] should be applied independent of the > > rest of the series. > > > > Agreed. Reading though, I can't see any problems with the patch and it > does make the freeing symmetric with the allocation. > > For clarity though, would it be worth renaming hugetlb_next_nid to > next_to_alloc_nid so that there is a clear relationship in the > round-robin allocation and freeing of pages amoung online nodes? OK. I'll do that in the next version of the series. > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/hugetlb.h | 1 > > mm/hugetlb.c | 68 +++++++++++++++++++++++++++++++++--------------- > > 2 files changed, 48 insertions(+), 21 deletions(-) > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:29.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 > > @@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area( > > /* Defines one hugetlb page size */ > > struct hstate { > > int hugetlb_next_nid; > > + int next_to_free_nid; > > unsigned int order; > > unsigned long mask; > > unsigned long max_huge_pages; > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:29.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 > > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst > > h->free_huge_pages_node[nid]++; > > } > > > > -static struct page *dequeue_huge_page(struct hstate *h) > > -{ > > - int nid; > > - struct page *page = NULL; > > - > > - for (nid = 0; nid < MAX_NUMNODES; ++nid) { > > - if (!list_empty(&h->hugepage_freelists[nid])) { > > - page = list_entry(h->hugepage_freelists[nid].next, > > - struct page, lru); > > - list_del(&page->lru); > > - h->free_huge_pages--; > > - h->free_huge_pages_node[nid]--; > > - break; > > - } > > - } > > - return page; > > -} > > - > > static struct page *dequeue_huge_page_vma(struct hstate *h, > > struct vm_area_struct *vma, > > unsigned long address, int avoid_reserve) > > @@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct > > return ret; > > } > > > > +/* > > + * helper for free_pool_huge_page() - find next node > > + * from which to free a huge page > > + */ > > +static int hstate_next_to_free_node(struct hstate *h) > > +{ > > + int next_nid; > > + next_nid = next_node(h->next_to_free_nid, node_online_map); > > + if (next_nid == MAX_NUMNODES) > > + next_nid = first_node(node_online_map); > > + h->next_to_free_nid = next_nid; > > + return next_nid; > > +} > > + > > +/* > > + * Free huge page from pool from next node to free. > > + * Attempt to keep persistent huge pages more or less > > + * balanced over allowed nodes. > > + * Called with hugetlb_lock locked. > > + */ > > +static int free_pool_huge_page(struct hstate *h) > > +{ > > + int start_nid; > > + int nid; > > + int ret = 0; > > + > > + start_nid = h->next_to_free_nid; > > + nid = h->next_to_free_nid; > > + > > + do { > > + if (!list_empty(&h->hugepage_freelists[nid])) { > > + struct page *page = > > + list_entry(h->hugepage_freelists[nid].next, > > + struct page, lru); > > + list_del(&page->lru); > > + h->free_huge_pages--; > > + h->free_huge_pages_node[nid]--; > > + update_and_free_page(h, page); > > + ret = 1; > > + } > > + nid = hstate_next_to_free_node(h); > > + } while (!ret && nid != start_nid); > > + > > + return ret; > > +} > > + > > static struct page *alloc_buddy_huge_page(struct hstate *h, > > struct vm_area_struct *vma, unsigned long address) > > { > > @@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages( > > min_count = max(count, min_count); > > try_to_free_low(h, min_count); > > while (min_count < persistent_huge_pages(h)) { > > - struct page *page = dequeue_huge_page(h); > > - if (!page) > > + if (!free_pool_huge_page(h)) > > break; > > - update_and_free_page(h, page); > > } > > while (count < persistent_huge_pages(h)) { > > if (!adjust_pool_surplus(h, 1)) > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id AD0FD6B006A for ; Wed, 17 Jun 2009 13:37:12 -0400 (EDT) Subject: Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct From: Lee Schermerhorn In-Reply-To: <20090617133541.GH28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135253.25248.96346.sendpatchset@lts-notebook> <20090617133541.GH28529@csn.ul.ie> Content-Type: text/plain Date: Wed, 17 Jun 2009 13:38:16 -0400 Message-Id: <1245260296.6235.81.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-06-17 at 14:35 +0100, Mel Gorman wrote: > On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote: > > [PATCH 2/5] add nodes_allowed members to hugepages hstate struct > > > > Against: 17may09 mmotm > > > > This patch adds a nodes_allowed nodemask_t pointer and a > > __nodes_allowed nodemask_t to the hstate struct for constraining > > fresh hugepage allocations. It then adds sysfs attributes and > > boot command line options to set and [for sysfs attributes] query > > the allowed nodes mask. > > > > A separate patch will hook up this nodes_allowed mask/pointer to > > fresh huge page allocation and promoting of surplus pages to > > persistent. Another patch will add a 'sysctl' hugepages_nodes_allowed > > to /proc/sys/vm. > > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/hugetlb.h | 2 + > > mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 88 insertions(+) > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > > @@ -193,6 +193,8 @@ struct hstate { > > unsigned long resv_huge_pages; > > unsigned long surplus_huge_pages; > > unsigned long nr_overcommit_huge_pages; > > + nodemask_t *nodes_allowed; > > + nodemask_t __nodes_allowed; > > Can you add a comment as to why a nodemask pointer and a nodemask itself > with very similar field names are both needed? Hmmm, thought I did that. Guess I just thought about doing it :(. It's sort of tied up with why I want to use "all" rather than a literal node mask to specify allocation [attempts] across all nodes. The default and current behavior is to use the node_online_mask. My understanding is that this tracks node hot add/remove, and I wanted to preserve that behavior by pointing at node_online_mask directly, in the default and nodes_allowed=all cases. > > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > > @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages); > > static struct hstate * __initdata parsed_hstate; > > static unsigned long __initdata default_hstate_max_huge_pages; > > static unsigned long __initdata default_hstate_size; > > +static struct hstate __initdata default_boot_hstate = { > > + .nodes_allowed = &node_online_map, > > +}; > > > > #define for_each_hstate(h) \ > > for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++) > > @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates( > > struct hstate *h; > > > > for_each_hstate(h) { > > + if (!h->nodes_allowed) > > + h->nodes_allowed = &node_online_map; > > + > > The reasoning for two nodes_allowed would appear to be to allow nodes_allowed > to be some predefined mash or a hstate-private mask. As the mask has to exist > anyway, can you not just copy it in rather than having a pointer and a mask? > > It would make the check below as to whether all nodes allowed or not a bit > more expensive but it's not a big deal. but, it wouldn't track node hot add/remove. I suppose we could add handlers to fix up the hstates, but that seemed unnecessary work. My thinking was that allocation of persistent hugepages [as opposed to overcommitted/surplus h.p.] is an administrative action and, thus, not a fast path. Also, I didn't envision [my myopia, perhaps] a large number of hstates, so I didn't think the extra pointer per hstate was a big burden. > > > /* oversize hugepages were init'ed in early boot */ > > if (h->order < MAX_ORDER) > > hugetlb_hstate_alloc_pages(h); > > @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s > > } > > HSTATE_ATTR(nr_overcommit_hugepages); > > > > +static ssize_t nodes_allowed_show(struct kobject *kobj, > > + struct kobj_attribute *attr, char *buf) > > +{ > > + struct hstate *h = kobj_to_hstate(kobj); > > + int len = 3; > > + > > + if (h->nodes_allowed == &node_online_map) > > + strcpy(buf, "all"); > > + else > > + len = nodelist_scnprintf(buf, PAGE_SIZE, > > + *h->nodes_allowed); > > + > > + if (len) > > + buf[len++] = '\n'; > > + > > buf doesn't get NULL terminated. Ah, OK. Not sure what I was using as an example here. Maybe I thunk that up all on my own... > > > + return len; > > +} > > Can print_nodes_state() be extended a little to print "all" and then shared > with here? Well, I don't know that we'd want to see "all" for the sysfs "online" nodes attribute. Some might not find this too informative. Of course, even for huge pages nodes allowed, it's not all that informative, but IMO fits the usage. One can always look at the sysfs online nodes attribute to see what "all" means for nodes_allowed. Couldn't do that if we changed print_nodes_state(). I guess we could add a flag to indicate whether we wanted "all" or the actual nodemask. And, we have to make print_nodes_state() visible. Do you think that would be worth while? > > > + > > +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf, > > + bool lock) > > +{ > > + nodemask_t nodes_allowed; > > + int ret = 1; > > + bool all = !strncasecmp(buf, "all", 3); > > + > > + if (!all) > > + ret = !nodelist_parse(buf, nodes_allowed); > > + if (ret) { > > + if (lock) > > + spin_lock(&hugetlb_lock); > > + > > ick. > > Convention for something like this is to have set_hstate_nodes_allowed > that takes a spinlock and __set_hstate_nodes_allowed that is the > lock-free version and call as appropriate. Can the same be done here? Sorry. Yeah, I can do that. > > For that matter, does taking spinlocks from __setup() break that you > avoid taking it in that case? I wasn't sure, but noticed that we weren't locking in the boot path for, e.g., [nr_]hugepages, so I avoided it here. I suppose I can just test whether or not it breaks and, if not, just lock. > > > + if (all) { > > + h->nodes_allowed = &node_online_map; > > + } else { > > + h->__nodes_allowed = nodes_allowed; > > + h->nodes_allowed = &h->__nodes_allowed; > > + } > > + > > + if (lock) > > + spin_unlock(&hugetlb_lock); > > + } > > + return ret; > > +} > > + > > +static ssize_t nodes_allowed_store(struct kobject *kobj, > > + struct kobj_attribute *attr, const char *buf, size_t count) > > +{ > > + struct hstate *h = kobj_to_hstate(kobj); > > + > > + if (set_hstate_nodes_allowed(h, buf, 1)) > > + return count; > > + else > > + return 0; > > +} > > +HSTATE_ATTR(nodes_allowed); > > + > > static ssize_t free_hugepages_show(struct kobject *kobj, > > struct kobj_attribute *attr, char *buf) > > { > > @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages); > > static struct attribute *hstate_attrs[] = { > > &nr_hugepages_attr.attr, > > &nr_overcommit_hugepages_attr.attr, > > + &nodes_allowed_attr.attr, > > &free_hugepages_attr.attr, > > &resv_hugepages_attr.attr, > > &surplus_hugepages_attr.attr, > > @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void) > > if (default_hstate_max_huge_pages) > > default_hstate.max_huge_pages = default_hstate_max_huge_pages; > > > > + if (default_boot_hstate.nodes_allowed != &node_online_map) { > > + default_hstate.__nodes_allowed = > > + default_boot_hstate.__nodes_allowed; > > + default_hstate.nodes_allowed = > > + &default_hstate.__nodes_allowed; > > + } > > + > > hugetlb_init_hstates(); > > > > gather_bootmem_prealloc(); > > @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned > > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > > huge_page_size(h)/1024); > > > > + h->nodes_allowed = &node_online_map; > > parsed_hstate = h; > > } > > > > @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup( > > } > > __setup("default_hugepagesz=", hugetlb_default_setup); > > > > +static int __init hugetlb_nodes_allowed_setup(char *s) > > +{ > > + struct hstate *h = &default_boot_hstate; > > + > > + /* > > + * max_hstate means we've parsed a hugepagesz= parameter, so > > + * use the [most recently] parsed_hstate. Else use default. > > + */ > > + if (max_hstate) > > + h = parsed_hstate; > > + > > + return set_hstate_nodes_allowed(h, s, 0); > > +} > > +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup); > > + > > static unsigned int cpuset_mems_nr(unsigned int *array) > > { > > int node; > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 2077C6B0062 for ; Wed, 17 Jun 2009 13:45:42 -0400 (EDT) Subject: Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation From: Lee Schermerhorn In-Reply-To: <20090617133911.GI28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135301.25248.91276.sendpatchset@lts-notebook> <20090617133911.GI28529@csn.ul.ie> Content-Type: text/plain Date: Wed, 17 Jun 2009 13:47:03 -0400 Message-Id: <1245260823.6235.91.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-06-17 at 14:39 +0100, Mel Gorman wrote: > On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote: > > [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation > > > > Against: 17may09 mmotm > > > > Select only nodes from the per hstate nodes_allowed mask when > > promoting surplus pages to persistent or when allocating fresh > > huge pages to the pool. > > > > Note that alloc_buddy_huge_page() still uses task policy to allocate > > surplus huge pages. This could be changed. > > > > Signed-off-by: Lee Schermerhorn > > > > mm/hugetlb.c | 23 ++++++++++++++++++----- > > 1 file changed, 18 insertions(+), 5 deletions(-) > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > > @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag > > static int hstate_next_node(struct hstate *h) > > { > > int next_nid; > > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > > + next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed); > > if (next_nid == MAX_NUMNODES) > > - next_nid = first_node(node_online_map); > > + next_nid = first_node(*h->nodes_allowed); > > h->hugetlb_next_nid = next_nid; > > return next_nid; > > } > > @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct > > int ret = 0; > > > > start_nid = h->hugetlb_next_nid; > > + /* > > + * we may have allocated with a different nodes_allowed previously > > + */ > > + if (!node_isset(start_nid, *h->nodes_allowed)) > > + start_nid = hstate_next_node(h); > > > > do { > > page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > > @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc > > > > /* > > * Increment or decrement surplus_huge_pages. Keep node-specific counters > > - * balanced by operating on them in a round-robin fashion. > > + * balanced by operating on them in a round-robin fashion. Use nodes_allowed > > + * mask when decreasing suplus pages as we're "promoting" them to persistent. > > s/suplus/surplus/ ACK > > > + * Use node_online_map for increment surplus pages as we're demoting previously > > + * persistent huge pages. > > + * Called holding the hugetlb_lock. > > * Returns 1 if an adjustment was made. > > */ > > static int adjust_pool_surplus(struct hstate *h, int delta) > > { > > + nodemask_t *nodemask = &node_online_map; > > static int prev_nid; > > int nid = prev_nid; > > int ret = 0; > > > > VM_BUG_ON(delta != -1 && delta != 1); > > + if (delta < 0) > > + nodemask = h->nodes_allowed; > > + > > Please spell out why nodes_allowed is only used when decreasing the surplus > count. I thought my addition to the comment block did that. My thinking: surplus pages are, by definition, in use, so the only time we decrease them via adjust_pool_surplus() is when we're increasing nr_hugepages. New "persistent" huge pages are masked by nodes_allowed, so promoting surplus pages to persistent should also be so masked. Conversely, since this series uses the node_online_mask for freeing available persistent huge pages, I mention that we use it for increasing the surplus page count. Should I add more to the comment? > > > do { > > - nid = next_node(nid, node_online_map); > > + nid = next_node(nid, *nodemask); > > if (nid == MAX_NUMNODES) > > - nid = first_node(node_online_map); > > + nid = first_node(*nodemask); > > > > /* To shrink on this node, there must be a surplus page */ > > if (delta < 0 && !h->surplus_huge_pages_node[nid]) > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 5D9B86B005C for ; Wed, 17 Jun 2009 13:50:54 -0400 (EDT) Subject: Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed. From: Lee Schermerhorn In-Reply-To: <20090617134107.GJ28529@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135308.25248.57593.sendpatchset@lts-notebook> <20090617134107.GJ28529@csn.ul.ie> Content-Type: text/plain Date: Wed, 17 Jun 2009 13:52:02 -0400 Message-Id: <1245261122.6235.96.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 2009-06-17 at 14:41 +0100, Mel Gorman wrote: > On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote: > > [PATCH 4/5] add sysctl for default hstate nodes_allowed. > > > > Against: 17may09 mmotm > > > > This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed -- > > to set/query the default hstate's nodes_allowed. I don't know > > that this patch is required, given that we have the per hstate > > controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls > > for other recent hugepages controls, like nr_overcommit_hugepages, > > so I've followed that convention. > > > > Yeah, it's somewhat expected that what is in /proc/sys/vm is information > on the default hugepage size. > > > Factor the formatting of the nodes_allowed mask out of nodes_allowed_show() > > for use by both that function and the hugetlb_nodes_allowed_handler(). > > > > Signed-off-by: Lee Schermerhorn > > > > include/linux/hugetlb.h | 1 + > > kernel/sysctl.c | 8 ++++++++ > > mm/hugetlb.c | 43 ++++++++++++++++++++++++++++++++++++++----- > > 3 files changed, 47 insertions(+), 5 deletions(-) > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:35.000000000 -0400 > > @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm > > int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); > > int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); > > void unmap_hugepage_range(struct vm_area_struct *, > > Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c 2009-06-04 12:59:26.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c 2009-06-04 12:59:35.000000000 -0400 > > @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = { > > .extra1 = (void *)&hugetlb_zero, > > .extra2 = (void *)&hugetlb_infinity, > > }, > > + { > > + .ctl_name = CTL_UNNUMBERED, > > + .procname = "hugepages_nodes_allowed", > > + .data = NULL, > > + .maxlen = sizeof(unsigned long), > > + .mode = 0644, > > + .proc_handler = &hugetlb_nodes_allowed_handler, > > + }, > > #endif > > { > > .ctl_name = VM_LOWMEM_RESERVE_RATIO, > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:35.000000000 -0400 > > @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s > > } > > HSTATE_ATTR(nr_overcommit_hugepages); > > > > -static ssize_t nodes_allowed_show(struct kobject *kobj, > > - struct kobj_attribute *attr, char *buf) > > +static int format_hstate_nodes_allowed(struct hstate *h, char *buf, > > + size_t buflen) > > { > > - struct hstate *h = kobj_to_hstate(kobj); > > int len = 3; > > > > if (h->nodes_allowed == &node_online_map) > > strcpy(buf, "all"); > > else > > - len = nodelist_scnprintf(buf, PAGE_SIZE, > > + len = nodelist_scnprintf(buf, buflen, > > *h->nodes_allowed); > > + return len; > > + > > +} > > This looks like unnecessary churn and could have been done in the earlier > patch introducing nodes_allowed_show() Yes. I couldn't see creating the separate function if we weren't going to have the sysctl also formatting the nodes_allowed. If all agree that we want the sysctl, I can fold this patch in with patch 2/5. > > > + > > +static ssize_t nodes_allowed_show(struct kobject *kobj, > > + struct kobj_attribute *attr, char *buf) > > +{ > > + struct hstate *h = kobj_to_hstate(kobj); > > + int len = format_hstate_nodes_allowed(h, buf, PAGE_SIZE); > > > > - if (len) > > + if (len && (len +1) < PAGE_SIZE) > > buf[len++] = '\n'; > > > > return len; > > @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct > > return 0; > > } > > > > +#define NODES_ALLOWED_MAX 64 > > +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write, > > + struct file *file, void __user *buffer, > > + size_t *length, loff_t *ppos) > > +{ > > + struct hstate *h = &default_hstate; > > + int ret = 0; > > + > > + if (write) { > > + (void)set_hstate_nodes_allowed(h, buffer, 1); > > + } else { > > + char buf[NODES_ALLOWED_MAX]; > > + struct ctl_table tbl = { > > + .data = buf, > > + .maxlen = NODES_ALLOWED_MAX, > > + }; > > + int len = format_hstate_nodes_allowed(h, buf, sizeof(buf)); > > + > > + if (len) > > + ret = proc_dostring(&tbl, write, file, buffer, > > + length, ppos); > > + } > > + return ret; > > +} > > + > > #endif /* CONFIG_SYSCTL */ > > > > void hugetlb_report_meminfo(struct seq_file *m) > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id A1A216B005A for ; Thu, 18 Jun 2009 05:16:52 -0400 (EDT) Date: Thu, 18 Jun 2009 10:17:13 +0100 From: Mel Gorman Subject: Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Message-ID: <20090618091713.GA14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135253.25248.96346.sendpatchset@lts-notebook> <20090617133541.GH28529@csn.ul.ie> <1245260296.6235.81.camel@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1245260296.6235.81.camel@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, Jun 17, 2009 at 01:38:16PM -0400, Lee Schermerhorn wrote: > On Wed, 2009-06-17 at 14:35 +0100, Mel Gorman wrote: > > On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote: > > > [PATCH 2/5] add nodes_allowed members to hugepages hstate struct > > > > > > Against: 17may09 mmotm > > > > > > This patch adds a nodes_allowed nodemask_t pointer and a > > > __nodes_allowed nodemask_t to the hstate struct for constraining > > > fresh hugepage allocations. It then adds sysfs attributes and > > > boot command line options to set and [for sysfs attributes] query > > > the allowed nodes mask. > > > > > > A separate patch will hook up this nodes_allowed mask/pointer to > > > fresh huge page allocation and promoting of surplus pages to > > > persistent. Another patch will add a 'sysctl' hugepages_nodes_allowed > > > to /proc/sys/vm. > > > > > > Signed-off-by: Lee Schermerhorn > > > > > > include/linux/hugetlb.h | 2 + > > > mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 88 insertions(+) > > > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:31.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > > > @@ -193,6 +193,8 @@ struct hstate { > > > unsigned long resv_huge_pages; > > > unsigned long surplus_huge_pages; > > > unsigned long nr_overcommit_huge_pages; > > > + nodemask_t *nodes_allowed; > > > + nodemask_t __nodes_allowed; > > > > Can you add a comment as to why a nodemask pointer and a nodemask itself > > with very similar field names are both needed? > > Hmmm, thought I did that. Guess I just thought about doing it :(. > > It's sort of tied up with why I want to use "all" rather than a literal > node mask to specify allocation [attempts] across all nodes. The > default and current behavior is to use the node_online_mask. My > understanding is that this tracks node hot add/remove, and I wanted to > preserve that behavior by pointing at node_online_mask directly, in the > default and nodes_allowed=all cases. > Good point, I had not considered node hot add/remove. Any chance you could kmalloc() the nodes_allowed() when the mask is set? It just feels remarkably untidy to have two nodes_allowed like this, particularly as the __nodes_allowed will not be used in the majority of cases. > > > > > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:31.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > > > @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages); > > > static struct hstate * __initdata parsed_hstate; > > > static unsigned long __initdata default_hstate_max_huge_pages; > > > static unsigned long __initdata default_hstate_size; > > > +static struct hstate __initdata default_boot_hstate = { > > > + .nodes_allowed = &node_online_map, > > > +}; > > > > > > #define for_each_hstate(h) \ > > > for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++) > > > @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates( > > > struct hstate *h; > > > > > > for_each_hstate(h) { > > > + if (!h->nodes_allowed) > > > + h->nodes_allowed = &node_online_map; > > > + > > > > The reasoning for two nodes_allowed would appear to be to allow nodes_allowed > > to be some predefined mash or a hstate-private mask. As the mask has to exist > > anyway, can you not just copy it in rather than having a pointer and a mask? > > > > It would make the check below as to whether all nodes allowed or not a bit > > more expensive but it's not a big deal. > > but, it wouldn't track node hot add/remove. I suppose we could add > handlers to fix up the hstates, but that seemed unnecessary work. > Right. > My thinking was that allocation of persistent hugepages [as opposed to > overcommitted/surplus h.p.] is an administrative action and, thus, not a > fast path. Also, I didn't envision [my myopia, perhaps] a large number > of hstates, so I didn't think the extra pointer per hstate was a big > burden. > It's not a big memory burden, it just seems untidy and there are a lot of untidy anomolies in hugetlbfs as it is. While I'm far from innocent, I hate to see it getting even odder looking :) > > > > > /* oversize hugepages were init'ed in early boot */ > > > if (h->order < MAX_ORDER) > > > hugetlb_hstate_alloc_pages(h); > > > @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s > > > } > > > HSTATE_ATTR(nr_overcommit_hugepages); > > > > > > +static ssize_t nodes_allowed_show(struct kobject *kobj, > > > + struct kobj_attribute *attr, char *buf) > > > +{ > > > + struct hstate *h = kobj_to_hstate(kobj); > > > + int len = 3; > > > + > > > + if (h->nodes_allowed == &node_online_map) > > > + strcpy(buf, "all"); > > > + else > > > + len = nodelist_scnprintf(buf, PAGE_SIZE, > > > + *h->nodes_allowed); > > > + > > > + if (len) > > > + buf[len++] = '\n'; > > > + > > > > buf doesn't get NULL terminated. > > Ah, OK. Not sure what I was using as an example here. Maybe I thunk > that up all on my own... > > > > > > + return len; > > > +} > > > > Can print_nodes_state() be extended a little to print "all" and then shared > > with here? > > Well, I don't know that we'd want to see "all" for the sysfs "online" > nodes attribute. Some might not find this too informative. Of course, > even for huge pages nodes allowed, it's not all that informative, but > IMO fits the usage. One can always look at the sysfs online nodes > attribute to see what "all" means for nodes_allowed. Couldn't do that > if we changed print_nodes_state(). I guess we could add a flag to > indicate whether we wanted "all" or the actual nodemask. And, we have > to make print_nodes_state() visible. Do you think that would be worth > while? > I see your point. Maybe, maybe not. The suggestion was due to the bugs in the new version than anything else. I figured it would be easier to get one version right than have multiple slightly different versions. How attached are you to having "all" printed? There is an arguement for having nodemasks reported via sysfs all looking the same > > > > > + > > > +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf, > > > + bool lock) > > > +{ > > > + nodemask_t nodes_allowed; > > > + int ret = 1; > > > + bool all = !strncasecmp(buf, "all", 3); > > > + > > > + if (!all) > > > + ret = !nodelist_parse(buf, nodes_allowed); > > > + if (ret) { > > > + if (lock) > > > + spin_lock(&hugetlb_lock); > > > + > > > > ick. > > > > Convention for something like this is to have set_hstate_nodes_allowed > > that takes a spinlock and __set_hstate_nodes_allowed that is the > > lock-free version and call as appropriate. Can the same be done here? > > Sorry. Yeah, I can do that. > > > > > For that matter, does taking spinlocks from __setup() break that you > > avoid taking it in that case? > > I wasn't sure, but noticed that we weren't locking in the boot path for, > e.g., [nr_]hugepages, so I avoided it here. I suppose I can just test > whether or not it breaks and, if not, just lock. > A lack of locking in the bootpath is more likely due to laziness than with any problems taking the locks. > > > > > + if (all) { > > > + h->nodes_allowed = &node_online_map; > > > + } else { > > > + h->__nodes_allowed = nodes_allowed; > > > + h->nodes_allowed = &h->__nodes_allowed; > > > + } > > > + > > > + if (lock) > > > + spin_unlock(&hugetlb_lock); > > > + } > > > + return ret; > > > +} > > > + > > > +static ssize_t nodes_allowed_store(struct kobject *kobj, > > > + struct kobj_attribute *attr, const char *buf, size_t count) > > > +{ > > > + struct hstate *h = kobj_to_hstate(kobj); > > > + > > > + if (set_hstate_nodes_allowed(h, buf, 1)) > > > + return count; > > > + else > > > + return 0; > > > +} > > > +HSTATE_ATTR(nodes_allowed); > > > + > > > static ssize_t free_hugepages_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > { > > > @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages); > > > static struct attribute *hstate_attrs[] = { > > > &nr_hugepages_attr.attr, > > > &nr_overcommit_hugepages_attr.attr, > > > + &nodes_allowed_attr.attr, > > > &free_hugepages_attr.attr, > > > &resv_hugepages_attr.attr, > > > &surplus_hugepages_attr.attr, > > > @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void) > > > if (default_hstate_max_huge_pages) > > > default_hstate.max_huge_pages = default_hstate_max_huge_pages; > > > > > > + if (default_boot_hstate.nodes_allowed != &node_online_map) { > > > + default_hstate.__nodes_allowed = > > > + default_boot_hstate.__nodes_allowed; > > > + default_hstate.nodes_allowed = > > > + &default_hstate.__nodes_allowed; > > > + } > > > + > > > hugetlb_init_hstates(); > > > > > > gather_bootmem_prealloc(); > > > @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned > > > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > > > huge_page_size(h)/1024); > > > > > > + h->nodes_allowed = &node_online_map; > > > parsed_hstate = h; > > > } > > > > > > @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup( > > > } > > > __setup("default_hugepagesz=", hugetlb_default_setup); > > > > > > +static int __init hugetlb_nodes_allowed_setup(char *s) > > > +{ > > > + struct hstate *h = &default_boot_hstate; > > > + > > > + /* > > > + * max_hstate means we've parsed a hugepagesz= parameter, so > > > + * use the [most recently] parsed_hstate. Else use default. > > > + */ > > > + if (max_hstate) > > > + h = parsed_hstate; > > > + > > > + return set_hstate_nodes_allowed(h, s, 0); > > > +} > > > +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup); > > > + > > > static unsigned int cpuset_mems_nr(unsigned int *array) > > > { > > > int node; > > > > > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id A68496B005D for ; Thu, 18 Jun 2009 05:17:57 -0400 (EDT) Date: Thu, 18 Jun 2009 10:18:26 +0100 From: Mel Gorman Subject: Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Message-ID: <20090618091826.GB14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135301.25248.91276.sendpatchset@lts-notebook> <20090617133911.GI28529@csn.ul.ie> <1245260823.6235.91.camel@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1245260823.6235.91.camel@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, Jun 17, 2009 at 01:47:03PM -0400, Lee Schermerhorn wrote: > On Wed, 2009-06-17 at 14:39 +0100, Mel Gorman wrote: > > On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote: > > > [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation > > > > > > Against: 17may09 mmotm > > > > > > Select only nodes from the per hstate nodes_allowed mask when > > > promoting surplus pages to persistent or when allocating fresh > > > huge pages to the pool. > > > > > > Note that alloc_buddy_huge_page() still uses task policy to allocate > > > surplus huge pages. This could be changed. > > > > > > Signed-off-by: Lee Schermerhorn > > > > > > mm/hugetlb.c | 23 ++++++++++++++++++----- > > > 1 file changed, 18 insertions(+), 5 deletions(-) > > > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:32.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > > > @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag > > > static int hstate_next_node(struct hstate *h) > > > { > > > int next_nid; > > > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > > > + next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed); > > > if (next_nid == MAX_NUMNODES) > > > - next_nid = first_node(node_online_map); > > > + next_nid = first_node(*h->nodes_allowed); > > > h->hugetlb_next_nid = next_nid; > > > return next_nid; > > > } > > > @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct > > > int ret = 0; > > > > > > start_nid = h->hugetlb_next_nid; > > > + /* > > > + * we may have allocated with a different nodes_allowed previously > > > + */ > > > + if (!node_isset(start_nid, *h->nodes_allowed)) > > > + start_nid = hstate_next_node(h); > > > > > > do { > > > page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > > > @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc > > > > > > /* > > > * Increment or decrement surplus_huge_pages. Keep node-specific counters > > > - * balanced by operating on them in a round-robin fashion. > > > + * balanced by operating on them in a round-robin fashion. Use nodes_allowed > > > + * mask when decreasing suplus pages as we're "promoting" them to persistent. > > > > s/suplus/surplus/ > > ACK > > > > > > + * Use node_online_map for increment surplus pages as we're demoting previously > > > + * persistent huge pages. > > > + * Called holding the hugetlb_lock. > > > * Returns 1 if an adjustment was made. > > > */ > > > static int adjust_pool_surplus(struct hstate *h, int delta) > > > { > > > + nodemask_t *nodemask = &node_online_map; > > > static int prev_nid; > > > int nid = prev_nid; > > > int ret = 0; > > > > > > VM_BUG_ON(delta != -1 && delta != 1); > > > + if (delta < 0) > > > + nodemask = h->nodes_allowed; > > > + > > > > Please spell out why nodes_allowed is only used when decreasing the surplus > > count. > > I thought my addition to the comment block did that. > > My thinking: surplus pages are, by definition, in use, so the only time > we decrease them via adjust_pool_surplus() is when we're increasing > nr_hugepages. New "persistent" huge pages are masked by nodes_allowed, > so promoting surplus pages to persistent should also be so masked. > This in addition to the existing comment is more than adequate. > Conversely, since this series uses the node_online_mask for freeing > available persistent huge pages, I mention that we use it for increasing > the surplus page count. > > Should I add more to the comment? > Do please. > > > > > do { > > > - nid = next_node(nid, node_online_map); > > > + nid = next_node(nid, *nodemask); > > > if (nid == MAX_NUMNODES) > > > - nid = first_node(node_online_map); > > > + nid = first_node(*nodemask); > > > > > > /* To shrink on this node, there must be a surplus page */ > > > if (delta < 0 && !h->surplus_huge_pages_node[nid]) > > > > > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 0030D6B006A for ; Thu, 18 Jun 2009 05:18:55 -0400 (EDT) Date: Thu, 18 Jun 2009 10:19:22 +0100 From: Mel Gorman Subject: Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed. Message-ID: <20090618091921.GC14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135308.25248.57593.sendpatchset@lts-notebook> <20090617134107.GJ28529@csn.ul.ie> <1245261122.6235.96.camel@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1245261122.6235.96.camel@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, Jun 17, 2009 at 01:52:02PM -0400, Lee Schermerhorn wrote: > On Wed, 2009-06-17 at 14:41 +0100, Mel Gorman wrote: > > On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote: > > > [PATCH 4/5] add sysctl for default hstate nodes_allowed. > > > > > > Against: 17may09 mmotm > > > > > > This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed -- > > > to set/query the default hstate's nodes_allowed. I don't know > > > that this patch is required, given that we have the per hstate > > > controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls > > > for other recent hugepages controls, like nr_overcommit_hugepages, > > > so I've followed that convention. > > > > > > > Yeah, it's somewhat expected that what is in /proc/sys/vm is information > > on the default hugepage size. > > > > > Factor the formatting of the nodes_allowed mask out of nodes_allowed_show() > > > for use by both that function and the hugetlb_nodes_allowed_handler(). > > > > > > Signed-off-by: Lee Schermerhorn > > > > > > include/linux/hugetlb.h | 1 + > > > kernel/sysctl.c | 8 ++++++++ > > > mm/hugetlb.c | 43 ++++++++++++++++++++++++++++++++++++++----- > > > 3 files changed, 47 insertions(+), 5 deletions(-) > > > > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h 2009-06-04 12:59:32.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h 2009-06-04 12:59:35.000000000 -0400 > > > @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm > > > int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > > int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > > int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > > +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); > > > int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); > > > int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); > > > void unmap_hugepage_range(struct vm_area_struct *, > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c 2009-06-04 12:59:26.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c 2009-06-04 12:59:35.000000000 -0400 > > > @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = { > > > .extra1 = (void *)&hugetlb_zero, > > > .extra2 = (void *)&hugetlb_infinity, > > > }, > > > + { > > > + .ctl_name = CTL_UNNUMBERED, > > > + .procname = "hugepages_nodes_allowed", > > > + .data = NULL, > > > + .maxlen = sizeof(unsigned long), > > > + .mode = 0644, > > > + .proc_handler = &hugetlb_nodes_allowed_handler, > > > + }, > > > #endif > > > { > > > .ctl_name = VM_LOWMEM_RESERVE_RATIO, > > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c > > > =================================================================== > > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c 2009-06-04 12:59:33.000000000 -0400 > > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c 2009-06-04 12:59:35.000000000 -0400 > > > @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s > > > } > > > HSTATE_ATTR(nr_overcommit_hugepages); > > > > > > -static ssize_t nodes_allowed_show(struct kobject *kobj, > > > - struct kobj_attribute *attr, char *buf) > > > +static int format_hstate_nodes_allowed(struct hstate *h, char *buf, > > > + size_t buflen) > > > { > > > - struct hstate *h = kobj_to_hstate(kobj); > > > int len = 3; > > > > > > if (h->nodes_allowed == &node_online_map) > > > strcpy(buf, "all"); > > > else > > > - len = nodelist_scnprintf(buf, PAGE_SIZE, > > > + len = nodelist_scnprintf(buf, buflen, > > > *h->nodes_allowed); > > > + return len; > > > + > > > +} > > > > This looks like unnecessary churn and could have been done in the earlier > > patch introducing nodes_allowed_show() > > Yes. I couldn't see creating the separate function if we weren't going > to have the sysctl also formatting the nodes_allowed. If all agree that > we want the sysctl, I can fold this patch in with patch 2/5. > I think we want the sysctl to be consistent with nr_hugepages and the other hugepages parameters. > > > > > + > > > +static ssize_t nodes_allowed_show(struct kobject *kobj, > > > + struct kobj_attribute *attr, char *buf) > > > +{ > > > + struct hstate *h = kobj_to_hstate(kobj); > > > + int len = format_hstate_nodes_allowed(h, buf, PAGE_SIZE); > > > > > > - if (len) > > > + if (len && (len +1) < PAGE_SIZE) > > > buf[len++] = '\n'; > > > > > > return len; > > > @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct > > > return 0; > > > } > > > > > > +#define NODES_ALLOWED_MAX 64 > > > +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write, > > > + struct file *file, void __user *buffer, > > > + size_t *length, loff_t *ppos) > > > +{ > > > + struct hstate *h = &default_hstate; > > > + int ret = 0; > > > + > > > + if (write) { > > > + (void)set_hstate_nodes_allowed(h, buffer, 1); > > > + } else { > > > + char buf[NODES_ALLOWED_MAX]; > > > + struct ctl_table tbl = { > > > + .data = buf, > > > + .maxlen = NODES_ALLOWED_MAX, > > > + }; > > > + int len = format_hstate_nodes_allowed(h, buf, sizeof(buf)); > > > + > > > + if (len) > > > + ret = proc_dostring(&tbl, write, file, buffer, > > > + length, ppos); > > > + } > > > + return ret; > > > +} > > > + > > > #endif /* CONFIG_SYSCTL */ > > > > > > void hugetlb_report_meminfo(struct seq_file *m) > > > > > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 380186B004F for ; Thu, 18 Jun 2009 05:32:06 -0400 (EDT) Date: Thu, 18 Jun 2009 10:33:02 +0100 From: Mel Gorman Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed Message-ID: <20090618093301.GD14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1245258954.6235.58.camel@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote: > On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote: > > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote: > > > Because of assymmetries in some NUMA platforms, and "interesting" > > > topologies emerging in the "scale up x86" world, we have need for > > > better control over the placement of "fresh huge pages". A while > > > back Nish Aravamundan floated a series of patches to add per node > > > controls for allocating pages to the hugepage pool and removing > > > them. Nish apparently moved on to other tasks before those patches > > > were accepted. I have kept a copy of Nish's patches and have > > > intended to rebase and test them and resubmit. > > > > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge > > > in the huge pages area, I asked his opinion of per node controls > > > for huge pages and he suggested another approach: using the mempolicy > > > of the task that changes nr_hugepages to constrain the fresh huge > > > page allocations. I considered this approach but it seemed to me > > > to be a misuse of mempolicy for populating the huge pages free > > > pool. > > > > Why would it be a misuse? Fundamentally, the huge page pools are being > > filled by the current process when nr_hugepages is being used. Or are > > you concerned about the specification of hugepages on the kernel command > > line? > > Well, "misuse" might be too strong [and I already softened it from > "abuse" :)]--more like "not a good fit, IMO". I agree that it could be > made to work, but it just didn't "feel" right. I suppose that we could > use alloc_pages_current() with the necessary gfp flags to specify the > "thisnode"/"exact_node" semantics [no fallback], making sure we didn't > OOM kill the task if a node couldn't satisfy the huge page > allocation, ... I think it would require major restructuring of > set_max_huge_pages(), ... Maybe that would be the right thing to do and > maybe we'll end up there, but I was looking for a simpler approach. > Ultimately I think it's the right way of doing things. hugepages in the past had a large number of "special" considerations in comparison to the core VM. We've removed a fair few of the special cases in the last year and I'd rather avoid introducing more of them. If I was as an administrator, I would prefer being able to do allocate hugepages only on two nodes with # numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128 instead of # echo some_magic > /proc/sys/vm/hugepage_nodes_allowed # hugeadm --pool-pages-min=2M:128 hugeadm would have to be taught about the new sysctl of course if it's created but it still feels more awkward than it should be and numactl should already be familar. > And, yes, the kernel command line was a consideration. Because of that, > I considered specifying a boot time "nodes_allowed" mask and > constructing a "nodes_allowed" mask at run time from the current task's > policy. But, it seemed more straightforward to my twisted way of > thinking to make the nodes_allowed mask a bona fide hstate attribute. > I see scope for allowing the nodemask to be specified at boot-time all right, heck it might even be required! However, I am somewhat attached to the idea of static hugepage allocation obeying memory policies. > > > > > Interleave policy doesn't have same "this node" semantics > > > that we want > > > > By "this node" semantics, do you mean allocating from one specific node? > > Yes. > > > In that case, why would specifying a nodemask of just one node not be > > sufficient? > > Maybe along with GFP_THISNODE. But a nodemask of just one node might as well be GFP_THISNODE, right? > In general, we want to interleave the > persistent huge pages over multiple nodes, but we don't want the > allocations to succeed by falling back to another node. Then we end up > with unbalanced allocations. This is how fresh huge page allocation > worked before it was changed to use "THISNODE". > > > > > > > and bind policy would require constructing a custom > > > node mask for node as well as addressing OOM, which we don't want > > > during fresh huge page allocation. > > > > Would the required mask not already be setup when the process set the > > policy? OOM is not a major concern, it doesn't trigger for failed > > hugepage allocations. > > Perhaps not with the current implementation. However, a certain, > currently shipping, enterprise distro added a "quick and dirty" [well, > "dirty", anyway :)] implementation of "thisnode" behavior to elimiate > the unwanted fallback behavior and resulting huge pages imbalance by > constructing an ad hoc "MPOL_BIND" policy for allocating huge pages. We > found that this DOES result in oom kill of the task increasing > nr_hugepages if it encountered a node w/ no available huge pages. > Fun. > If one used "echo" [usually a shell builtin] to increase nr_hugepages, > it killed off your shell. sysctl was somewhat better--only killed > sysctl. The good news was that oom kill just sets a thread info flag, > so the nr_hugepages handler continued to allocate pages around the nodes > and further oom kills were effective no-ops. When the allocations was > done, the victim task would die. End users might not consider this > acceptable behavior. > No. My current expectation is that we've handled the vast majority of cases where processes could get unexpectedly killed just because they were looking funny at hugepages. > I wasn't sure we'd avoid this situation if we dropped back to using task > mempolicy via, e.g., alloc_pages_current(). So, I thought I'd run this > proposal [nodes_allowed] by you. > I'm not wholesale against it. Minimally, I see scope for specifying the nodemask at boot-time but I'm less sold on the sysctl at the moment because I'm not convinced that applying memory policies when sizing the hugepage pool cannot be made work to solve your problem. > > > > > One could derive a node mask > > > of allowed nodes for huge pages from the mempolicy of the task > > > that is modifying nr_hugepages and use that for fresh huge pages > > > with GFP_THISNODE. However, if we're not going to use mempolicy > > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with > > > yet another on-stack pseudo-vma :(]--I thought it cleaner to > > > define a "nodes allowed" nodemask for populating the [persistent] > > > huge pages free pool. > > > > > > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy > > you need? > > Interesting you should mention this. Somewhat off topic: I have a > "cleanup" and reorg of shared policy and vma policy that separates > policy lookup from allocation, and adds an "alloc_page_pol()" function. > This is part of my series to generalize shared policy and extend it to > shared, mmap()ed regular files. That aspect of the series [shared > policy on shared mmaped files] got a lot of push back without any > consideration of the technical details of the patches themselves. Those arguements feel vaguely familiar. I think I read the thread although the specifics of the objections escape me. Think there was something about non-determinism if two processes shared a mapping with different policies and no idea which should take precedence. > Besides having need/requests for this capability, the resulting cleanup, > removal of all on-stack pseudo-vmas, ..., the series actually seems to > perform [slightly] better on the testing I've done. > > I keep this series up to date and hope to repost again sometime with > benchmark results. > Sounds like a good idea. If the cleanup yielded an alloc_page_pol() function, it might make the patch for hugetlbfs more straight-forward. > > > > > > This patch series introduces a [per hugepage size] "sysctl", > > > hugepages_nodes_allowed, that specifies a nodemask to constrain > > > the allocation of persistent, fresh huge pages. The nodemask > > > may be specified by a sysctl, a sysfs huge pages attribute and > > > on the kernel boot command line. > > > > > > The series includes a patch to free hugepages from the pool in a > > > "round robin" fashion, interleaved across all on-line nodes to > > > balance the hugepage pool across nodes. Nish had a patch to do > > > this, too. > > > > > > Together, these changes don't provide the fine grain of control > > > that per node attributes would. > > > > I'm failing to understand at the moment why mem policies set by numactl > > would not do the job for allocation at least. Freeing is a different problem. > > They could be made to work. I actually started coding up a patch to > extract a "nodes allowed" mask from the policy for use with > alloc_fresh_huge_page[_node]() so that I could maintain the overall > structure and use alloc_page_exact_node() with nodes in the allowed > mask. And, as you say, with some investigation, we may find that we can > use alloc_pages_current() with appropriate flags to achieve the exact > node semantics on each node in the policy. > I'd like to see it investigated more please. > > > > > Specifically, there is no easy > > > way to reduce the persistent huge page count for a specific node. > > > I think the degree of control provided by these patches is the > > > minimal necessary and sufficient for managing the persistent the > > > huge page pool. However, with a bit more reorganization, we > > > could implement per node controls if others would find that > > > useful. > > > > > > For more info, see the patch descriptions and the updated kernel > > > hugepages documentation. > > > > > > > More in response to your comments on the individual patches. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id A43956B005A for ; Thu, 18 Jun 2009 10:45:15 -0400 (EDT) Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed From: Lee Schermerhorn In-Reply-To: <20090618093301.GD14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <20090618093301.GD14903@csn.ul.ie> Content-Type: text/plain Date: Thu, 18 Jun 2009 10:46:34 -0400 Message-Id: <1245336394.1025.65.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Thu, 2009-06-18 at 10:33 +0100, Mel Gorman wrote: > On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote: > > On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote: > > > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote: > > > > Because of assymmetries in some NUMA platforms, and "interesting" > > > > topologies emerging in the "scale up x86" world, we have need for > > > > better control over the placement of "fresh huge pages". A while > > > > back Nish Aravamundan floated a series of patches to add per node > > > > controls for allocating pages to the hugepage pool and removing > > > > them. Nish apparently moved on to other tasks before those patches > > > > were accepted. I have kept a copy of Nish's patches and have > > > > intended to rebase and test them and resubmit. > > > > > > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge > > > > in the huge pages area, I asked his opinion of per node controls > > > > for huge pages and he suggested another approach: using the mempolicy > > > > of the task that changes nr_hugepages to constrain the fresh huge > > > > page allocations. I considered this approach but it seemed to me > > > > to be a misuse of mempolicy for populating the huge pages free > > > > pool. > > > > > > Why would it be a misuse? Fundamentally, the huge page pools are being > > > filled by the current process when nr_hugepages is being used. Or are > > > you concerned about the specification of hugepages on the kernel command > > > line? > > > > Well, "misuse" might be too strong [and I already softened it from > > "abuse" :)]--more like "not a good fit, IMO". I agree that it could be > > made to work, but it just didn't "feel" right. I suppose that we could > > use alloc_pages_current() with the necessary gfp flags to specify the > > "thisnode"/"exact_node" semantics [no fallback], making sure we didn't > > OOM kill the task if a node couldn't satisfy the huge page > > allocation, ... I think it would require major restructuring of > > set_max_huge_pages(), ... Maybe that would be the right thing to do and > > maybe we'll end up there, but I was looking for a simpler approach. > > > > Ultimately I think it's the right way of doing things. hugepages in the > past had a large number of "special" considerations in comparison to the > core VM. We've removed a fair few of the special cases in the last year > and I'd rather avoid introducing more of them. > > If I was as an administrator, I would prefer being able to do allocate > hugepages only on two nodes with > > # numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128 Of course, this means that hugeadm itself and any library pages it pulls in will be interleaved across those nodes. I don't know that that's an issue. hugeadm or any command used to resize the pool will need to run somewhere. > > instead of > > # echo some_magic > /proc/sys/vm/hugepage_nodes_allowed > # hugeadm --pool-pages-min=2M:128 > > hugeadm would have to be taught about the new sysctl of course if it's created > but it still feels more awkward than it should be and numactl should already > be familar. > > > And, yes, the kernel command line was a consideration. Because of that, > > I considered specifying a boot time "nodes_allowed" mask and > > constructing a "nodes_allowed" mask at run time from the current task's > > policy. But, it seemed more straightforward to my twisted way of > > thinking to make the nodes_allowed mask a bona fide hstate attribute. > > > > I see scope for allowing the nodemask to be specified at boot-time all right, > heck it might even be required! Yeah, the hugepages doc does recommend that one allocate the pool at boot time or in an init script when the allocations still have a good chance of succeeding. > However, I am somewhat attached to the idea > of static hugepage allocation obeying memory policies. Really? I couldn't tell ;). > > > > > > > > Interleave policy doesn't have same "this node" semantics > > > > that we want > > > > > > By "this node" semantics, do you mean allocating from one specific node? > > > > Yes. > > > > > In that case, why would specifying a nodemask of just one node not be > > > sufficient? > > > > Maybe along with GFP_THISNODE. > > But a nodemask of just one node might as well be GFP_THISNODE, right? No. w/o GFP_THISNODE, the allocation will use the generic zonelist and will fallback to another node if the target node doesn't have any available huge pages. Looks just like any other higher order page allocation. And w/o 'THISNODE, I think we'll OOM kill the task increasing max hugepages if the allocation fails--need to test this. We may want/accept this behavior for satisfying tasks' page faults, but not for allocating static huge pages into the pool--IMO, anyway. So, my approach to using task mempolicy would be to construct a "nodes_allowed" mask from the policy and then use the same alloc_pages_exact_node() with the same flags we currently use. This is what I mean about it being a "misfit". What I really want is a node mask over which I'll distribute the huge pages. The policy mode [bind, pref, interleave, local] is irrelevant other than in constructing this mask. That's what led me to the nodes_allowed attribute/sysctl. > > > In general, we want to interleave the > > persistent huge pages over multiple nodes, but we don't want the > > allocations to succeed by falling back to another node. Then we end up > > with unbalanced allocations. This is how fresh huge page allocation > > worked before it was changed to use "THISNODE". > > > > > > > > > > > and bind policy would require constructing a custom > > > > node mask for node as well as addressing OOM, which we don't want > > > > during fresh huge page allocation. > > > > > > Would the required mask not already be setup when the process set the > > > policy? OOM is not a major concern, it doesn't trigger for failed > > > hugepage allocations. > > > > Perhaps not with the current implementation. However, a certain, > > currently shipping, enterprise distro added a "quick and dirty" [well, > > "dirty", anyway :)] implementation of "thisnode" behavior to elimiate > > the unwanted fallback behavior and resulting huge pages imbalance by > > constructing an ad hoc "MPOL_BIND" policy for allocating huge pages. We > > found that this DOES result in oom kill of the task increasing > > nr_hugepages if it encountered a node w/ no available huge pages. > > > > Fun. > > > If one used "echo" [usually a shell builtin] to increase nr_hugepages, > > it killed off your shell. sysctl was somewhat better--only killed > > sysctl. The good news was that oom kill just sets a thread info flag, > > so the nr_hugepages handler continued to allocate pages around the nodes > > and further oom kills were effective no-ops. When the allocations was > > done, the victim task would die. End users might not consider this > > acceptable behavior. > > > > No. My current expectation is that we've handled the vast majority of > cases where processes could get unexpectedly killed just because they > were looking funny at hugepages. Yes, doesn't seem to happen with current mainline kernel with or without this series. > > > I wasn't sure we'd avoid this situation if we dropped back to using task > > mempolicy via, e.g., alloc_pages_current(). So, I thought I'd run this > > proposal [nodes_allowed] by you. > > > > I'm not wholesale against it. Minimally, I see scope for specifying the > nodemask at boot-time but I'm less sold on the sysctl at the moment because > I'm not convinced that applying memory policies when sizing the hugepage pool > cannot be made work to solve your problem. Oh, I'm sure they can be made to work. Even for the boot command line, one could specify a policy and we have the mechanism to parse that. However, it still feels to me like the wrong tool for the job. Populating the pool is, IMO, quite different from handling page faults. I do worry about the proverbial "junior admin", charged with increasing the huge page pool, forgetting to use numactl and the correct policy as determined by, say, the system architect. I suppose one could provide wrappers around hugeadm or whatever to always use some predefined I think we have a couple of models at play here: 1) a more static, architected system layout, where I want to place the static huge pages on a specific set of nodes [perhaps at boot time] where the will be used for some long-running "enterprise" application[s] to which the system is more or less dedicated. Alternatively, one may just want to avoid [ever] having static huge pages allocated on some set of nodes. 2) a more dynamic model using libhugetlbfs [which which I have ZERO experience :(] in which we are using overcommitted/surplus huge pages in addition to a fairly uniformly distributed pool of static huge pages, counting on defragmentation and lumpy reclaim to satisfy requests for huge pages in excess of the static pool. Perhaps you see things differently. However, if you agree, I'm thinking we might want to describe these models in the hugepage kernel doc. Even if you don't agree, I think it would be a good idea to mention the use cases for huge pages in that doc. > > > > > > > > One could derive a node mask > > > > of allowed nodes for huge pages from the mempolicy of the task > > > > that is modifying nr_hugepages and use that for fresh huge pages > > > > with GFP_THISNODE. However, if we're not going to use mempolicy > > > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with > > > > yet another on-stack pseudo-vma :(]--I thought it cleaner to > > > > define a "nodes allowed" nodemask for populating the [persistent] > > > > huge pages free pool. > > > > > > > > > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy > > > you need? > > > > Interesting you should mention this. Somewhat off topic: I have a > > "cleanup" and reorg of shared policy and vma policy that separates > > policy lookup from allocation, and adds an "alloc_page_pol()" function. > > This is part of my series to generalize shared policy and extend it to > > shared, mmap()ed regular files. That aspect of the series [shared > > policy on shared mmaped files] got a lot of push back without any > > consideration of the technical details of the patches themselves. > > Those arguements feel vaguely familiar. I think I read the thread although > the specifics of the objections escape me. Think there was something about > non-determinism if two processes shared a mapping with different policies > and no idea which should take precedence. Actually, that last sentence describes the situation I'm trying to avoid. But, that's another conversation... > > Besides having need/requests for this capability, the resulting cleanup, > > removal of all on-stack pseudo-vmas, ..., the series actually seems to > > perform [slightly] better on the testing I've done. > > > > I keep this series up to date and hope to repost again sometime with > > benchmark results. > > > > Sounds like a good idea. If the cleanup yielded an alloc_page_pol() > function, it might make the patch for hugetlbfs more straight-forward. > > > > > > > > > > This patch series introduces a [per hugepage size] "sysctl", > > > > hugepages_nodes_allowed, that specifies a nodemask to constrain > > > > the allocation of persistent, fresh huge pages. The nodemask > > > > may be specified by a sysctl, a sysfs huge pages attribute and > > > > on the kernel boot command line. > > > > > > > > The series includes a patch to free hugepages from the pool in a > > > > "round robin" fashion, interleaved across all on-line nodes to > > > > balance the hugepage pool across nodes. Nish had a patch to do > > > > this, too. > > > > > > > > Together, these changes don't provide the fine grain of control > > > > that per node attributes would. > > > > > > I'm failing to understand at the moment why mem policies set by numactl > > > would not do the job for allocation at least. Freeing is a different problem. > > > > They could be made to work. I actually started coding up a patch to > > extract a "nodes allowed" mask from the policy for use with > > alloc_fresh_huge_page[_node]() so that I could maintain the overall > > structure and use alloc_page_exact_node() with nodes in the allowed > > mask. And, as you say, with some investigation, we may find that we can > > use alloc_pages_current() with appropriate flags to achieve the exact > > node semantics on each node in the policy. > > > > I'd like to see it investigated more please. OK. That will take a bit longer. :) I do think we'll need to reorganize set_max_huge_pages to "drive" the nodes allowed [from the policy] from the top loop, rather than bury it down in alloc_fresh_huge_page(). Might even end up eliminating that function and call the per node version directly. If we drive from the top, we can use the policy mask for freeing static huge pages, as well, altho' we'd have to decide the sense of the mask: does is specify the nodes from which to free or the nodes where we want static huge pages to remain. I had planned anyway to go back and look at this, even with the nodes_allowed attribute/sysctl so that, when decreasing nr_hugepages, we free unused huge pages and demote to surplus in-use pages on the non-allowed nodes before those on the allowed nodes. So, I think I'll clean up this series, based on your feed back, so we'll have something to compare with the mempolicy version and then look into the alternate implementation. It'll be a while [probably] before I can spend the time to do the latter. Thanks for your efforts to review these. Lee > > > > > > > > Specifically, there is no easy > > > > way to reduce the persistent huge page count for a specific node. > > > > I think the degree of control provided by these patches is the > > > > minimal necessary and sufficient for managing the persistent the > > > > huge page pool. However, with a bit more reorganization, we > > > > could implement per node controls if others would find that > > > > useful. > > > > > > > > For more info, see the patch descriptions and the updated kernel > > > > hugepages documentation. > > > > > > > > > > > More in response to your comments on the individual patches. > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 3A5BB6B005A for ; Thu, 18 Jun 2009 10:58:52 -0400 (EDT) Date: Thu, 18 Jun 2009 16:00:33 +0100 From: Mel Gorman Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed Message-ID: <20090618150033.GE14903@csn.ul.ie> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <20090618093301.GD14903@csn.ul.ie> <1245336394.1025.65.camel@lts-notebook> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1245336394.1025.65.camel@lts-notebook> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Thu, Jun 18, 2009 at 10:46:34AM -0400, Lee Schermerhorn wrote: > On Thu, 2009-06-18 at 10:33 +0100, Mel Gorman wrote: > > On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote: > > > On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote: > > > > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote: > > > > > Because of assymmetries in some NUMA platforms, and "interesting" > > > > > topologies emerging in the "scale up x86" world, we have need for > > > > > better control over the placement of "fresh huge pages". A while > > > > > back Nish Aravamundan floated a series of patches to add per node > > > > > controls for allocating pages to the hugepage pool and removing > > > > > them. Nish apparently moved on to other tasks before those patches > > > > > were accepted. I have kept a copy of Nish's patches and have > > > > > intended to rebase and test them and resubmit. > > > > > > > > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge > > > > > in the huge pages area, I asked his opinion of per node controls > > > > > for huge pages and he suggested another approach: using the mempolicy > > > > > of the task that changes nr_hugepages to constrain the fresh huge > > > > > page allocations. I considered this approach but it seemed to me > > > > > to be a misuse of mempolicy for populating the huge pages free > > > > > pool. > > > > > > > > Why would it be a misuse? Fundamentally, the huge page pools are being > > > > filled by the current process when nr_hugepages is being used. Or are > > > > you concerned about the specification of hugepages on the kernel command > > > > line? > > > > > > Well, "misuse" might be too strong [and I already softened it from > > > "abuse" :)]--more like "not a good fit, IMO". I agree that it could be > > > made to work, but it just didn't "feel" right. I suppose that we could > > > use alloc_pages_current() with the necessary gfp flags to specify the > > > "thisnode"/"exact_node" semantics [no fallback], making sure we didn't > > > OOM kill the task if a node couldn't satisfy the huge page > > > allocation, ... I think it would require major restructuring of > > > set_max_huge_pages(), ... Maybe that would be the right thing to do and > > > maybe we'll end up there, but I was looking for a simpler approach. > > > > > > > Ultimately I think it's the right way of doing things. hugepages in the > > past had a large number of "special" considerations in comparison to the > > core VM. We've removed a fair few of the special cases in the last year > > and I'd rather avoid introducing more of them. > > > > If I was as an administrator, I would prefer being able to do allocate > > hugepages only on two nodes with > > > > # numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128 > > Of course, this means that hugeadm itself and any library pages it pulls > in will be interleaved across those nodes. I don't know that that's an > issue. hugeadm or any command used to resize the pool will need to run > somewhere. > Pretty much. The amount of memory you are talking about for the hugeadm binary is not worth worrying about. I guess there is a very small corner case whereby a full node couldn't be allocated because the binary text could not be loaded but that's pretty specific. > > > > instead of > > > > # echo some_magic > /proc/sys/vm/hugepage_nodes_allowed > > # hugeadm --pool-pages-min=2M:128 > > > > hugeadm would have to be taught about the new sysctl of course if it's created > > but it still feels more awkward than it should be and numactl should already > > be familar. > > > > > And, yes, the kernel command line was a consideration. Because of that, > > > I considered specifying a boot time "nodes_allowed" mask and > > > constructing a "nodes_allowed" mask at run time from the current task's > > > policy. But, it seemed more straightforward to my twisted way of > > > thinking to make the nodes_allowed mask a bona fide hstate attribute. > > > > > > > I see scope for allowing the nodemask to be specified at boot-time all right, > > heck it might even be required! > > Yeah, the hugepages doc does recommend that one allocate the pool at > boot time or in an init script when the allocations still have a good > chance of succeeding. > Yep, while this has improved a lot, it still doesn't hurt. > > However, I am somewhat attached to the idea > > of static hugepage allocation obeying memory policies. > > Really? I couldn't tell ;). > It must be my innate subtlety > > > > > > > > > > > Interleave policy doesn't have same "this node" semantics > > > > > that we want > > > > > > > > By "this node" semantics, do you mean allocating from one specific node? > > > > > > Yes. > > > > > > > In that case, why would specifying a nodemask of just one node not be > > > > sufficient? > > > > > > Maybe along with GFP_THISNODE. > > > > But a nodemask of just one node might as well be GFP_THISNODE, right? > > No. w/o GFP_THISNODE, the allocation will use the generic zonelist and > will fallback to another node if the target node doesn't have any > available huge pages. Looks just like any other higher order page > allocation. And w/o 'THISNODE, I think we'll OOM kill the task > increasing max hugepages if the allocation fails--need to test this. Test this, because OOM should not be triggering for high order allocations like this. It was too easy to trigger OOM before by just writing large numbers to nr_hugepages. > We > may want/accept this behavior for satisfying tasks' page faults, but not > for allocating static huge pages into the pool--IMO, anyway. > > So, my approach to using task mempolicy would be to construct a > "nodes_allowed" mask from the policy and then use the same > alloc_pages_exact_node() with the same flags we currently use. This is > what I mean about it being a "misfit". What I really want is a node > mask over which I'll distribute the huge pages. The policy mode [bind, > pref, interleave, local] is irrelevant other than in constructing this > mask. That's what led me to the nodes_allowed attribute/sysctl. > but it does sound like alloc_pages_pol() which takes a policy and nodemask might also do the job. > > > > > In general, we want to interleave the > > > persistent huge pages over multiple nodes, but we don't want the > > > allocations to succeed by falling back to another node. Then we end up > > > with unbalanced allocations. This is how fresh huge page allocation > > > worked before it was changed to use "THISNODE". > > > > > > > > > > > > > > > and bind policy would require constructing a custom > > > > > node mask for node as well as addressing OOM, which we don't want > > > > > during fresh huge page allocation. > > > > > > > > Would the required mask not already be setup when the process set the > > > > policy? OOM is not a major concern, it doesn't trigger for failed > > > > hugepage allocations. > > > > > > Perhaps not with the current implementation. However, a certain, > > > currently shipping, enterprise distro added a "quick and dirty" [well, > > > "dirty", anyway :)] implementation of "thisnode" behavior to elimiate > > > the unwanted fallback behavior and resulting huge pages imbalance by > > > constructing an ad hoc "MPOL_BIND" policy for allocating huge pages. We > > > found that this DOES result in oom kill of the task increasing > > > nr_hugepages if it encountered a node w/ no available huge pages. > > > > > > > Fun. > > > > > If one used "echo" [usually a shell builtin] to increase nr_hugepages, > > > it killed off your shell. sysctl was somewhat better--only killed > > > sysctl. The good news was that oom kill just sets a thread info flag, > > > so the nr_hugepages handler continued to allocate pages around the nodes > > > and further oom kills were effective no-ops. When the allocations was > > > done, the victim task would die. End users might not consider this > > > acceptable behavior. > > > > > > > No. My current expectation is that we've handled the vast majority of > > cases where processes could get unexpectedly killed just because they > > were looking funny at hugepages. > > Yes, doesn't seem to happen with current mainline kernel with or without > this series. > Good, that's what I was expecting. > > > > > I wasn't sure we'd avoid this situation if we dropped back to using task > > > mempolicy via, e.g., alloc_pages_current(). So, I thought I'd run this > > > proposal [nodes_allowed] by you. > > > > > > > I'm not wholesale against it. Minimally, I see scope for specifying the > > nodemask at boot-time but I'm less sold on the sysctl at the moment because > > I'm not convinced that applying memory policies when sizing the hugepage pool > > cannot be made work to solve your problem. > > Oh, I'm sure they can be made to work. Even for the boot command line, > one could specify a policy and we have the mechanism to parse that. > However, it still feels to me like the wrong tool for the job. > Populating the pool is, IMO, quite different from handling page faults. > Yeah, what I'm suggesting is a bit of a hack, but it's in line with how NUMA is controlled in other situations. > I do worry about the proverbial "junior admin", charged with increasing > the huge page pool, forgetting to use numactl and the correct policy as > determined by, say, the system architect. I suppose one could provide > wrappers around hugeadm or whatever to always use some predefined > Yeah. However, I would point out that if he's getting numactl wrong, there is no guarantee that he would get the sysctl right. > > I think we have a couple of models at play here: > > 1) a more static, architected system layout, where I want to place the > static huge pages on a specific set of nodes [perhaps at boot time] > where the will be used for some long-running "enterprise" application[s] > to which the system is more or less dedicated. Alternatively, one may > just want to avoid [ever] having static huge pages allocated on some set > of nodes. > > 2) a more dynamic model using libhugetlbfs [which which I have ZERO > experience :(] in which we are using overcommitted/surplus huge pages > in addition to a fairly uniformly distributed pool of static huge pages, > counting on defragmentation and lumpy reclaim to satisfy requests for > huge pages in excess of the static pool. > The more dynamic model should use the normal NUMA policies and what exists already. > Perhaps you see things differently. However, if you agree, I'm thinking > we might want to describe these models in the hugepage kernel doc. Even > if you don't agree, I think it would be a good idea to mention the use > cases for huge pages in that doc. > It's been on my TODO list to describe this for I don't know how long :( . I'll take this as another nudge that I really should get around to it. > > > > > > > > > > > One could derive a node mask > > > > > of allowed nodes for huge pages from the mempolicy of the task > > > > > that is modifying nr_hugepages and use that for fresh huge pages > > > > > with GFP_THISNODE. However, if we're not going to use mempolicy > > > > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with > > > > > yet another on-stack pseudo-vma :(]--I thought it cleaner to > > > > > define a "nodes allowed" nodemask for populating the [persistent] > > > > > huge pages free pool. > > > > > > > > > > > > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy > > > > you need? > > > > > > Interesting you should mention this. Somewhat off topic: I have a > > > "cleanup" and reorg of shared policy and vma policy that separates > > > policy lookup from allocation, and adds an "alloc_page_pol()" function. > > > This is part of my series to generalize shared policy and extend it to > > > shared, mmap()ed regular files. That aspect of the series [shared > > > policy on shared mmaped files] got a lot of push back without any > > > consideration of the technical details of the patches themselves. > > > > Those arguements feel vaguely familiar. I think I read the thread although > > the specifics of the objections escape me. Think there was something about > > non-determinism if two processes shared a mapping with different policies > > and no idea which should take precedence. > > Actually, that last sentence describes the situation I'm trying to > avoid. But, that's another conversation... > With shared policies for regular files sure, but I'm not sure it directly applies to the problem we are trying to solve here - be more specific what nodes are used to fill the static hugepage pool. > > > Besides having need/requests for this capability, the resulting cleanup, > > > removal of all on-stack pseudo-vmas, ..., the series actually seems to > > > perform [slightly] better on the testing I've done. > > > > > > I keep this series up to date and hope to repost again sometime with > > > benchmark results. > > > > > > > Sounds like a good idea. If the cleanup yielded an alloc_page_pol() > > function, it might make the patch for hugetlbfs more straight-forward. > > > > > > > > > > > > > > This patch series introduces a [per hugepage size] "sysctl", > > > > > hugepages_nodes_allowed, that specifies a nodemask to constrain > > > > > the allocation of persistent, fresh huge pages. The nodemask > > > > > may be specified by a sysctl, a sysfs huge pages attribute and > > > > > on the kernel boot command line. > > > > > > > > > > The series includes a patch to free hugepages from the pool in a > > > > > "round robin" fashion, interleaved across all on-line nodes to > > > > > balance the hugepage pool across nodes. Nish had a patch to do > > > > > this, too. > > > > > > > > > > Together, these changes don't provide the fine grain of control > > > > > that per node attributes would. > > > > > > > > I'm failing to understand at the moment why mem policies set by numactl > > > > would not do the job for allocation at least. Freeing is a different problem. > > > > > > They could be made to work. I actually started coding up a patch to > > > extract a "nodes allowed" mask from the policy for use with > > > alloc_fresh_huge_page[_node]() so that I could maintain the overall > > > structure and use alloc_page_exact_node() with nodes in the allowed > > > mask. And, as you say, with some investigation, we may find that we can > > > use alloc_pages_current() with appropriate flags to achieve the exact > > > node semantics on each node in the policy. > > > > > > > I'd like to see it investigated more please. > > OK. That will take a bit longer. :) I do think we'll need to > reorganize set_max_huge_pages to "drive" the nodes allowed [from the > policy] from the top loop, rather than bury it down in > alloc_fresh_huge_page(). Might even end up eliminating that function > and call the per node version directly. Sounds reasonable. > If we drive from the top, we > can use the policy mask for freeing static huge pages, as well, altho' > we'd have to decide the sense of the mask: does is specify the nodes > from which to free or the nodes where we want static huge pages to > remain. > I'd interpret it to mean the nodes from which we are freeing from - i.e. the nodes_allowed specifies the nodes we are operating on whether we are allocating or freeing. > I had planned anyway to go back and look at this, even with the > nodes_allowed attribute/sysctl so that, when decreasing nr_hugepages, we > free unused huge pages and demote to surplus in-use pages on the > non-allowed nodes before those on the allowed nodes. > > So, I think I'll clean up this series, based on your feed back, so we'll > have something to compare with the mempolicy version and then look into > the alternate implementation. It'll be a while [probably] before I can > spend the time to do the latter. > Sounds good. > Thanks for your efforts to review these. > Thanks for investigating this. > Lee > > > > > > > > > > > > Specifically, there is no easy > > > > > way to reduce the persistent huge page count for a specific node. > > > > > I think the degree of control provided by these patches is the > > > > > minimal necessary and sufficient for managing the persistent the > > > > > huge page pool. However, with a bit more reorganization, we > > > > > could implement per node controls if others would find that > > > > > useful. > > > > > > > > > > For more info, see the patch descriptions and the updated kernel > > > > > hugepages documentation. > > > > > > > > > > > > > > > More in response to your comments on the individual patches. > > > > > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 56C406B004F for ; Thu, 18 Jun 2009 14:48:26 -0400 (EDT) Received: from spaceape11.eur.corp.google.com (spaceape11.eur.corp.google.com [172.28.16.145]) by smtp-out.google.com with ESMTP id n5IIntEW028580 for ; Thu, 18 Jun 2009 19:49:55 +0100 Received: from pzk30 (pzk30.prod.google.com [10.243.19.158]) by spaceape11.eur.corp.google.com with ESMTP id n5IInqce014028 for ; Thu, 18 Jun 2009 11:49:53 -0700 Received: by pzk30 with SMTP id 30so1144414pzk.18 for ; Thu, 18 Jun 2009 11:49:51 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20090616135315.25248.7893.sendpatchset@lts-notebook> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135315.25248.7893.sendpatchset@lts-notebook> Date: Thu, 18 Jun 2009 11:49:51 -0700 Message-ID: <9ec263480906181149t1aac592o57ce517bdd749cf5@mail.gmail.com> Subject: Re: [PATCH 5/5] Update huge pages kernel documentation From: David Rientjes Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Tue, Jun 16, 2009 at 6:53 AM, Lee Schermerhorn wrote: > @@ -67,26 +65,76 @@ use either the mmap system call or share > =C2=A0the huge pages. =C2=A0It is required that the system administrator = preallocate > =C2=A0enough memory for huge page purposes. > > -Use the following command to dynamically allocate/deallocate hugepages: > +The administrator can preallocate huge pages on the kernel boot command = line by > +specifying the "hugepages=3DN" parameter, where 'N' =3D the number of hu= ge pages > +requested. =C2=A0This is the most reliable method for preallocating huge= pages as > +memory has not yet become fragmented. > + > +Some platforms support multiple huge page sizes. =C2=A0To preallocate hu= ge pages > +of a specific size, one must preceed the huge pages boot command paramet= ers > +with a huge page size selection parameter "hugepagesz=3D". =C2=A0<= size> must > +be specified in bytes with optional scale suffix [kKmMgG]. =C2=A0The def= ault huge > +page size may be selected with the "default_hugepagesz=3D" boot pa= rameter. > + > +/proc/sys/vm/nr_hugepages indicates the current number of configured [de= fault > +size] hugetlb pages in the kernel. =C2=A0Super user can dynamically requ= est more > +(or free some pre-configured) hugepages. > + > +Use the following command to dynamically allocate/deallocate default siz= ed > +hugepages: > > =C2=A0 =C2=A0 =C2=A0 =C2=A0echo 20 > /proc/sys/vm/nr_hugepages > > -This command will try to configure 20 hugepages in the system. =C2=A0The= success > -or failure of allocation depends on the amount of physically contiguous > -memory that is preset in system at this time. =C2=A0System administrator= s may want > -to put this command in one of the local rc init files. =C2=A0This will e= nable the > -kernel to request huge pages early in the boot process (when the possibi= lity > -of getting physical contiguous pages is still very high). In either > -case, administrators will want to verify the number of hugepages actuall= y > -allocated by checking the sysctl or meminfo. > +This command will try to configure 20 default sized hugepages in the sys= tem. > +On a NUMA platform, the kernel will attempt to distribute the hugepage p= ool > +over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed nod= e mask. > +hugepages_nodes_allowed defaults to all on-line nodes. > + > +To control the nodes on which huge pages are preallocated, the administr= ator > +may set the hugepages_nodes_allowed for the default huge page size using= : > + > + =C2=A0 =C2=A0 =C2=A0 echo >/proc/sys/vm/hugepages_nodes_allo= wed > + This probably also needs an update to Documentation/ABI/testing/sysfs-kernel-mm-hugepages for the non-default hstate nodes_allowed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 787C86B004F for ; Thu, 18 Jun 2009 15:06:56 -0400 (EDT) Subject: Re: [PATCH 5/5] Update huge pages kernel documentation From: Lee Schermerhorn In-Reply-To: <9ec263480906181149t1aac592o57ce517bdd749cf5@mail.gmail.com> References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135315.25248.7893.sendpatchset@lts-notebook> <9ec263480906181149t1aac592o57ce517bdd749cf5@mail.gmail.com> Content-Type: text/plain Date: Thu, 18 Jun 2009 15:06:58 -0400 Message-Id: <1245352018.1025.86.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Mel Gorman , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Thu, 2009-06-18 at 11:49 -0700, David Rientjes wrote: > On Tue, Jun 16, 2009 at 6:53 AM, Lee > Schermerhorn wrote: > > @@ -67,26 +65,76 @@ use either the mmap system call or share > > the huge pages. It is required that the system administrator preallocate > > enough memory for huge page purposes. > > > > -Use the following command to dynamically allocate/deallocate hugepages: > > +The administrator can preallocate huge pages on the kernel boot command line by > > +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages > > +requested. This is the most reliable method for preallocating huge pages as > > +memory has not yet become fragmented. > > + > > +Some platforms support multiple huge page sizes. To preallocate huge pages > > +of a specific size, one must preceed the huge pages boot command parameters > > +with a huge page size selection parameter "hugepagesz=". must > > +be specified in bytes with optional scale suffix [kKmMgG]. The default huge > > +page size may be selected with the "default_hugepagesz=" boot parameter. > > + > > +/proc/sys/vm/nr_hugepages indicates the current number of configured [default > > +size] hugetlb pages in the kernel. Super user can dynamically request more > > +(or free some pre-configured) hugepages. > > + > > +Use the following command to dynamically allocate/deallocate default sized > > +hugepages: > > > > echo 20 > /proc/sys/vm/nr_hugepages > > > > -This command will try to configure 20 hugepages in the system. The success > > -or failure of allocation depends on the amount of physically contiguous > > -memory that is preset in system at this time. System administrators may want > > -to put this command in one of the local rc init files. This will enable the > > -kernel to request huge pages early in the boot process (when the possibility > > -of getting physical contiguous pages is still very high). In either > > -case, administrators will want to verify the number of hugepages actually > > -allocated by checking the sysctl or meminfo. > > +This command will try to configure 20 default sized hugepages in the system. > > +On a NUMA platform, the kernel will attempt to distribute the hugepage pool > > +over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed node mask. > > +hugepages_nodes_allowed defaults to all on-line nodes. > > + > > +To control the nodes on which huge pages are preallocated, the administrator > > +may set the hugepages_nodes_allowed for the default huge page size using: > > + > > + echo >/proc/sys/vm/hugepages_nodes_allowed > > + > > This probably also needs an update to > Documentation/ABI/testing/sysfs-kernel-mm-hugepages for the > non-default hstate nodes_allowed. Thanks, David. I'll take a look and address that in the next respin of the series. If you've been following the exchange with Mel, you'll know that the approach may change quite a bit. However it ends up, I'll update the abi testing doc or yell for help. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 6146C6B005A for ; Thu, 18 Jun 2009 15:08:15 -0400 (EDT) Received: from zps76.corp.google.com (zps76.corp.google.com [172.25.146.76]) by smtp-out.google.com with ESMTP id n5IJ8OGU029072 for ; Thu, 18 Jun 2009 20:08:25 +0100 Received: from wf-out-1314.google.com (wfg24.prod.google.com [10.142.7.24]) by zps76.corp.google.com with ESMTP id n5IJ8LQB004615 for ; Thu, 18 Jun 2009 12:08:22 -0700 Received: by wf-out-1314.google.com with SMTP id 24so455749wfg.20 for ; Thu, 18 Jun 2009 12:08:21 -0700 (PDT) Date: Thu, 18 Jun 2009 12:08:18 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed In-Reply-To: <1245258954.6235.58.camel@lts-notebook> Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, akpm@linux-foundation.org, Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: On Wed, 17 Jun 2009, Lee Schermerhorn wrote: > > > Specifically, there is no easy > > > way to reduce the persistent huge page count for a specific node. > > > I think the degree of control provided by these patches is the > > > minimal necessary and sufficient for managing the persistent the > > > huge page pool. However, with a bit more reorganization, we > > > could implement per node controls if others would find that > > > useful. > > > Thanks for looking at this, it's going to be very useful. >>From a cpusets perspective, control over the hugepage allocations on various nodes is essential when the number of nodes on the system is not trivially small. While cpusets are generally used to assign applications to a set of nodes to which they have affinity, they are also hierarchial so that within a cpuset, nodes can be further divided as a means of resource management. With your extensions, this could potentially include hugepage management in addition to strict memory isolation. Manipulating hugepages via a nodemask seems less ideal than, as you mentioned, per-node hugepage controls, probably via /sys/kernel/system/node/node*/nr_hugepages. This type of interface provides all the functionality that this patchset does, including hugepage allocation and freeing, but with more power to explicitly allocate and free on targeted nodes. /proc/sys/vm/nr_hugepages would remain to round-robin the allocation (and freeing, with your patch 1/5 which I ack'd). Such an interface would also automatically deal with all memory hotplug/remove issues without storing or keeping a nodemask updated. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 016B76B006A for ; Thu, 18 Jun 2009 15:08:34 -0400 (EDT) Received: from zps75.corp.google.com (zps75.corp.google.com [172.25.146.75]) by smtp-out.google.com with ESMTP id n5IJ8kil007791 for ; Thu, 18 Jun 2009 20:08:47 +0100 Received: from pxi2 (pxi2.prod.google.com [10.243.27.2]) by zps75.corp.google.com with ESMTP id n5IJ8h0N028444 for ; Thu, 18 Jun 2009 12:08:44 -0700 Received: by pxi2 with SMTP id 2so1156180pxi.6 for ; Thu, 18 Jun 2009 12:08:43 -0700 (PDT) Date: Thu, 18 Jun 2009 12:08:42 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 1/5] Free huge pages round robin to balance across nodes In-Reply-To: <1245259018.6235.59.camel@lts-notebook> Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090616135236.25248.93692.sendpatchset@lts-notebook> <20090617131833.GG28529@csn.ul.ie> <1245259018.6235.59.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com List-ID: Acked-by: David Rientjes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 3E2616B005A for ; Wed, 24 Jun 2009 03:11:09 -0400 (EDT) Received: from spaceape7.eur.corp.google.com (spaceape7.eur.corp.google.com [172.28.16.141]) by smtp-out.google.com with ESMTP id n5O7BMLE011384 for ; Wed, 24 Jun 2009 08:11:22 +0100 Received: from wa-out-1112.google.com (wagm34.prod.google.com [10.114.214.34]) by spaceape7.eur.corp.google.com with ESMTP id n5O7BJFw018808 for ; Wed, 24 Jun 2009 00:11:20 -0700 Received: by wa-out-1112.google.com with SMTP id m34so113679wag.3 for ; Wed, 24 Jun 2009 00:11:19 -0700 (PDT) Date: Wed, 24 Jun 2009 00:11:17 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed In-Reply-To: Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Thu, 18 Jun 2009, David Rientjes wrote: > Manipulating hugepages via a nodemask seems less ideal than, as you > mentioned, per-node hugepage controls, probably via > /sys/kernel/system/node/node*/nr_hugepages. This type of interface > provides all the functionality that this patchset does, including hugepage > allocation and freeing, but with more power to explicitly allocate and > free on targeted nodes. /proc/sys/vm/nr_hugepages would remain to > round-robin the allocation (and freeing, with your patch 1/5 which I > ack'd). > > Such an interface would also automatically deal with all memory > hotplug/remove issues without storing or keeping a nodemask updated. > Expanding this proposal out a little bit, we'd want all the power of the /sys/kernel/mm/hugepages tunables for each node. The best way of doing that is probably to keep the current /sys/kernel/mm/hugepages directory as is (already published Documentation/ABI/testing/sysfs-kernel-mm-hugepages) for the system-wide hugepage state and then add individual `hugepages-kB' directories to each /sys/devices/system/node/node* to target allocations and freeing for the per-node hugepage state. Otherwise, we lack node targeted support for multiple hugepagesz= users. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 7CF9A6B004F for ; Wed, 24 Jun 2009 07:24:44 -0400 (EDT) Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed From: Lee Schermerhorn In-Reply-To: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> Content-Type: text/plain Date: Wed, 24 Jun 2009 07:25:24 -0400 Message-Id: <1245842724.6439.19.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Wed, 2009-06-24 at 00:11 -0700, David Rientjes wrote: > On Thu, 18 Jun 2009, David Rientjes wrote: > > > Manipulating hugepages via a nodemask seems less ideal than, as you > > mentioned, per-node hugepage controls, probably via > > /sys/kernel/system/node/node*/nr_hugepages. This type of interface > > provides all the functionality that this patchset does, including hugepage > > allocation and freeing, but with more power to explicitly allocate and > > free on targeted nodes. /proc/sys/vm/nr_hugepages would remain to > > round-robin the allocation (and freeing, with your patch 1/5 which I > > ack'd). > > > > Such an interface would also automatically deal with all memory > > hotplug/remove issues without storing or keeping a nodemask updated. > > > > Expanding this proposal out a little bit, we'd want all the power of the > /sys/kernel/mm/hugepages tunables for each node. The best way of doing > that is probably to keep the current /sys/kernel/mm/hugepages directory as > is (already published Documentation/ABI/testing/sysfs-kernel-mm-hugepages) > for the system-wide hugepage state and then add individual > `hugepages-kB' directories to each /sys/devices/system/node/node* to > target allocations and freeing for the per-node hugepage state. > Otherwise, we lack node targeted support for multiple hugepagesz= users. David: Nish mentioned this to me a while back when I asked about his patches. That's one of my reasons for seeing if the simpler [IMO] nodes_allowed would be sufficient. I'm currently updating the nodes_allowed series per Mel's cleanup suggestions. I'll then prototype Mel's preferred method of using the task's mempolicy. I still have reservations about this: static huge page allocation is currently not constrained by policy nor cpusets, and I can't tell whether the task's mempolicy was set explicitly to contstrain the huge pages or just inherited from the parent shell. Next I'll also dust off Nish's old per node hugetlb control patches and see what it task to update them for the multiple sizes. It will look pretty much as you suggest. Do you have any suggestions for a boot command line syntax to specify per node huge page counts at boot time [assuming we still want this]? Currently, for default huge page size, distributed across nodes, we have: hugepages= I was thinking something like: hugepages=(node:count,...) using the '(' as a flag for per node counts, w/o needing to prescan for ':' Thoughts? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 174126B004D for ; Wed, 24 Jun 2009 18:26:08 -0400 (EDT) Received: from zps36.corp.google.com (zps36.corp.google.com [172.25.146.36]) by smtp-out.google.com with ESMTP id n5OMQrm2031242 for ; Wed, 24 Jun 2009 23:26:54 +0100 Received: from pzk41 (pzk41.prod.google.com [10.243.19.169]) by zps36.corp.google.com with ESMTP id n5OMQoPQ032649 for ; Wed, 24 Jun 2009 15:26:51 -0700 Received: by pzk41 with SMTP id 41so773941pzk.19 for ; Wed, 24 Jun 2009 15:26:50 -0700 (PDT) Date: Wed, 24 Jun 2009 15:26:46 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed In-Reply-To: <1245842724.6439.19.camel@lts-notebook> Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <1245842724.6439.19.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Wed, 24 Jun 2009, Lee Schermerhorn wrote: > David: > > Nish mentioned this to me a while back when I asked about his patches. > That's one of my reasons for seeing if the simpler [IMO] nodes_allowed > would be sufficient. I'm currently updating the nodes_allowed series > per Mel's cleanup suggestions. The /proc/sys/vm/hugepages_nodes_allowed support is troublesome because it's global and can race with other writers, such as for tasks in two disjoint cpusets attempting to allocate hugepages concurrently. > I'll then prototype Mel's preferred > method of using the task's mempolicy. This proposal eliminates the aforementioned race, but now has the opposite problem: if a single task is allocating hugepages for multiple cpusets, it must setup the correct mempolicies to allocate (or free) for the new cpuset mems. > I still have reservations about > this: static huge page allocation is currently not constrained by > policy nor cpusets, and I can't tell whether the task's mempolicy was > set explicitly to contstrain the huge pages or just inherited from the > parent shell. > Agreed. I do think that we used to constrain hugepage allocations by cpuset in the past, though, when the allocation was simply done via alloc_pages_node(). We'd fallback to using nodes outside of cpuset_current_mems_allowed when the nodes were too full or fragmented. > Next I'll also dust off Nish's old per node hugetlb control patches and > see what it task to update them for the multiple sizes. It will look > pretty much as you suggest. Do you have any suggestions for a boot > command line syntax to specify per node huge page counts at boot time > [assuming we still want this]? Currently, for default huge page size, > distributed across nodes, we have: > > hugepages= > > I was thinking something like: > > hugepages=(node:count,...) > > using the '(' as a flag for per node counts, w/o needing to prescan for > ':' > The hugepages=(node:count,...) option would still need to be interleaved with hugepagesz= for the various sizes. This could become pretty cryptic: hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G \ hugepages=(2:10,3:10) and I assume we'd use `count' of 99999 for nodes of unknown sizes where we simply want to allocate as many hugepages as possible. We'd still need to support hugepages=N for large NUMA machines so we don't have to specify the same number of hugepages per node for a true interleave, which would require an extremely large command line. And then the behavior of hugepagesz=1G hugepages=(0:10,1:20) hugepages=30 needs to be defined. In that case, does hugepages=30 override the previous settings if this system only has dual nodes? If so, for SGI's 1K node systems it's going to be difficult to specify many nodes with 10 hugepages and a few with 20. So perhaps hugepages=(node:count,...) should increment or decrement the hugepages= value, if specified? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 6F9096B004F for ; Wed, 24 Jun 2009 22:14:11 -0400 (EDT) Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed From: Lee Schermerhorn In-Reply-To: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <1245842724.6439.19.camel@lts-notebook> Content-Type: text/plain Date: Wed, 24 Jun 2009 22:14:20 -0400 Message-Id: <1245896060.6439.159.camel@lts-notebook> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Wed, 2009-06-24 at 15:26 -0700, David Rientjes wrote: > On Wed, 24 Jun 2009, Lee Schermerhorn wrote: > > > David: > > > > Nish mentioned this to me a while back when I asked about his patches. > > That's one of my reasons for seeing if the simpler [IMO] nodes_allowed > > would be sufficient. I'm currently updating the nodes_allowed series > > per Mel's cleanup suggestions. > > The /proc/sys/vm/hugepages_nodes_allowed support is troublesome because > it's global and can race with other writers, such as for tasks in two > disjoint cpusets attempting to allocate hugepages concurrently. Agreed. If one has multiple administrators or privileged tasks trying to modify the huge page pool simultaneously, this is problematic. For "single administrator" system, it might be workable. And it seemed a fairly minimal change at a time where I didn't see much interest in this area in the community. > > > I'll then prototype Mel's preferred > > method of using the task's mempolicy. > > This proposal eliminates the aforementioned race, but now has the opposite > problem: if a single task is allocating hugepages for multiple cpusets, it > must setup the correct mempolicies to allocate (or free) for the new > cpuset mems. Agreed. To support this proposal, we'll need to construct a "nodes allowed" mask from the policy [and I don't think we want default policy to mean "local" in this case--big change in behavior!] and pass that to the allocation functions. Racing allocators can then use different masks. > > > I still have reservations about > > this: static huge page allocation is currently not constrained by > > policy nor cpusets, and I can't tell whether the task's mempolicy was > > set explicitly to contstrain the huge pages or just inherited from the > > parent shell. > > > > Agreed. I do think that we used to constrain hugepage allocations by > cpuset in the past, though, when the allocation was simply done via > alloc_pages_node(). We'd fallback to using nodes outside of > cpuset_current_mems_allowed when the nodes were too full or fragmented. I do recall a discussion about huge page allocation being constrained by cpusets or not. I don't recall whether they were and this was changed, or they weren't [as is currently the case] and someone [Nish?] proposed to change that. I'd need to search for that exchange. Would having cpusets constrain huge page pool allocation meet your needs? > > > Next I'll also dust off Nish's old per node hugetlb control patches and > > see what it task to update them for the multiple sizes. It will look > > pretty much as you suggest. Do you have any suggestions for a boot > > command line syntax to specify per node huge page counts at boot time > > [assuming we still want this]? Currently, for default huge page size, > > distributed across nodes, we have: > > > > hugepages= > > > > I was thinking something like: > > > > hugepages=(node:count,...) > > > > using the '(' as a flag for per node counts, w/o needing to prescan for > > ':' > > > > The hugepages=(node:count,...) option would still need to be interleaved > with hugepagesz= for the various sizes. Well, yes. I'd assumed that. > This could become pretty cryptic: > > hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G \ > hugepages=(2:10,3:10) > > and I assume we'd use `count' of 99999 for nodes of unknown sizes where we > simply want to allocate as many hugepages as possible. If one needed that capability--"allocate as many as possible"--then, yes, I guess any ridiculously large count would do the trick. > > We'd still need to support hugepages=N for large NUMA machines so we don't > have to specify the same number of hugepages per node for a true > interleave, which would require an extremely large command line. And then > the behavior of > > hugepagesz=1G hugepages=(0:10,1:20) hugepages=30 > > needs to be defined. In that case, does hugepages=30 override the > previous settings if this system only has dual nodes? If so, for SGI's 1K > node systems it's going to be difficult to specify many nodes with 10 > hugepages and a few with 20. So perhaps hugepages=(node:count,...) should > increment or decrement the hugepages= value, if specified? Mel mentioned that we probably don't need boot command line hugepage allocation all that much with lumpy reclaim, etc. I can see his point. If we can't allocate all the hugepages we need from an early init script or similar, we probably don't have enough memory anyway. For compatibility, I supposed we need to retain the hugepages= parameter. And, we've added the hugepagesz parameter, so we need to retain that. But, maybe we should initially limit per node allocations to sysfs node attributes post boot? ------------- Related question: do you think we need per node overcommit limits? I'm having difficulty understanding what the semantics of the global limit would be with per node limits--i.e., how would one distribute the global limit across nodes [for backwards compatibility]. With nr_hugepages, today we just do a best effort to distribute the requested number of pages over the on-line nodes. If we fail to allocate that many, we don't remember the initial request, just how many we actually allocated where ever they landed. But, I don't see how that works with limits. I suppose we could arrange that if you don't specify a per node limit, the global limit applies when attempting to allocate a surplus page on a given node. If you do [so specify], then the respective node limit applies, whether or not the sum of per node surplus pages exceeds the global limit. Thoughts? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 48B786B005A for ; Thu, 25 Jun 2009 15:21:42 -0400 (EDT) Received: from wpaz21.hot.corp.google.com (wpaz21.hot.corp.google.com [172.24.198.85]) by smtp-out.google.com with ESMTP id n5PJMGVU002445 for ; Thu, 25 Jun 2009 20:22:17 +0100 Received: from pxi9 (pxi9.prod.google.com [10.243.27.9]) by wpaz21.hot.corp.google.com with ESMTP id n5PJMDsw030866 for ; Thu, 25 Jun 2009 12:22:14 -0700 Received: by pxi9 with SMTP id 9so311631pxi.1 for ; Thu, 25 Jun 2009 12:22:13 -0700 (PDT) Date: Thu, 25 Jun 2009 12:22:12 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed In-Reply-To: <1245896060.6439.159.camel@lts-notebook> Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <1245842724.6439.19.camel@lts-notebook> <1245896060.6439.159.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Wed, 24 Jun 2009, Lee Schermerhorn wrote: > Would having cpusets constrain huge page pool allocation meet your > needs? > It would, but it seems like an unnecessary inconvenience. It would require the admin task to join a cpuset to allocate hugepages for an application while allocating them. It also is more difficult to expand a cpuset to include a new node that has a specific threshold of hugepages available if writing to /sys/devices/system/node/node*/hugepages-kB/nr_hugepages is used to preallocate hugepages to determine which node has the least fragmentation to support such an allocation in the first place (and then freeing them if they cannot be allocated). It also doesn't support users who don't have CONFIG_CPUSETS, but do mbind their memory to a subset of nodes that need hugepages while others do not. > > This could become pretty cryptic: > > > > hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G \ > > hugepages=(2:10,3:10) > > > > and I assume we'd use `count' of 99999 for nodes of unknown sizes where we > > simply want to allocate as many hugepages as possible. > > If one needed that capability--"allocate as many as possible"--then, > yes, I guess any ridiculously large count would do the trick. > I remember on older kernels that large hugepages= values would cause the system not to boot because subsequent kernel allocations would fail because it's oom. We need to do hugepages= allocation as early as possible to avoid additional fragmentation later, but not enough to completely oom the kernel. The only way to prevent that is with a maximum hugepage watermark (which would no longer be global but per zone with the hugepages=(node:count,...) support to allow at least a certain threshold of memory to be free to the kernel for boot. > > We'd still need to support hugepages=N for large NUMA machines so we don't > > have to specify the same number of hugepages per node for a true > > interleave, which would require an extremely large command line. And then > > the behavior of > > > > hugepagesz=1G hugepages=(0:10,1:20) hugepages=30 > > > > needs to be defined. In that case, does hugepages=30 override the > > previous settings if this system only has dual nodes? If so, for SGI's 1K > > node systems it's going to be difficult to specify many nodes with 10 > > hugepages and a few with 20. So perhaps hugepages=(node:count,...) should > > increment or decrement the hugepages= value, if specified? > > Mel mentioned that we probably don't need boot command line hugepage > allocation all that much with lumpy reclaim, etc. I can see his point. Understood, especially with the complexity of specifying them on the command line in the first place :) The only concern I have is for users who want hugepages preallocated on a specific set of nodes at boot and are required to use much higher hugepages= values to allocate on all system nodes and then just free the pages on the nodes they aren't interested in. > If we can't allocate all the hugepages we need from an early init script > or similar, we probably don't have enough memory anyway. For > compatibility, I supposed we need to retain the hugepages= parameter. > And, we've added the hugepagesz parameter, so we need to retain that. > But, maybe we should initially limit per node allocations to sysfs node > attributes post boot? > Agreed, it solves the early boot oom failures as well. > ------------- > Related question: do you think we need per node overcommit limits? I do, because applications constrained to an exclusive cpuset will only be able to allocate from its set of allowable nodes anyway, so the global overcommit limits aren't in effect. There needs to be a mechanism to allow such allocations to take place for such constrained tasks. The only reason I've proposed these hugepage tunables to be attributes of the system's nodes and not attributes of the individual cpusets is because non-exclusive cpusets may share nodes among siblings and parents will always share nodes with children, so the tunables could become inconsistent with one another. For all other purposes, they really are a characteristic of the cpuset's job, however. > I'm > having difficulty understanding what the semantics of the global limit > would be with per node limits--i.e., how would one distribute the global > limit across nodes [for backwards compatibility]. With nr_hugepages, > today we just do a best effort to distribute the requested number of > pages over the on-line nodes. If we fail to allocate that many, we > don't remember the initial request, just how many we actually allocated > where ever they landed. But, I don't see how that works with limits. I > suppose we could arrange that if you don't specify a per node limit, the > global limit applies when attempting to allocate a surplus page on a > given node. If you do [so specify], then the respective node limit > applies, whether or not the sum of per node surplus pages exceeds the > global limit. > Hmm, yes, that's a concern. I think the easiest way to do it would be to respect both the global and node surplus limits when the node limit is 0, and respect the node surplus limit when it is positive. The setting of either the global limit or the node limit does not change the other. This deals with the cpuset case where applications constrained to a cpuset may only allocate from their own nodes. This would allow the global limit to exceed the sum of the node limits and for hugepage allocation to fail, but that would have required a node limit to be set on each online node. In such a situation, the global limit has, in effect, been obsoleted and its value no longer matters. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org