* [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
@ 2013-07-12  2:03 ` Robin Holt
  0 siblings, 0 replies; 153+ messages in thread
From: Robin Holt @ 2013-07-12  2:03 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
      Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
      Mel Gorman

We have been working on this since we returned from shutdown and have
something to discuss now. We restricted ourselves to 2MiB initialization
to keep the patch set smaller and clearer.

First, I want to propose getting rid of the page flag. If I knew of a
concrete way to determine that a page has not been initialized, this
patch series would look different. If there is no definitive way to
determine that a struct page has been initialized, short of checking
that the entire struct is zero, then I would suggest we change the page
flag to indicate that the page *has* been initialized.

The heart of the problem, as I see it, comes from expand(). We nearly
always see the first reference to a struct page in the middle of the
2MiB region. Because of that access, the unlikely() check that was
originally proposed ends up referencing a different page entirely. We
therefore did not introduce an unlikely() or refactor the patches to put
that check inside a static inline function. Also, given the strong
warning at the head of expand(), we did not feel experienced enough to
refactor it to always reference the 2MiB head page first.

With these patches, we booted a 16TiB machine. Without the patches, the
v3.10 kernel with the same configuration took 407 seconds in
free_all_bootmem(). With the patches, operating on 2MiB chunks instead
of 1GiB, it took 26 seconds, a substantial improvement. I have no feel
for how a 1GiB chunk size would perform.

I am on vacation for the next three days, so I apologize in advance for
infrequent or non-existent responses.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
To: "H. Peter Anvin" <hpa@zytor.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Linux MM <linux-mm@kvack.org>
Cc: Rob Landley <rob@landley.net>
Cc: Mike Travis <travis@sgi.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>

^ permalink raw reply	[flat|nested] 153+ messages in thread
* [RFC 1/4] memblock: Introduce a for_each_reserved_mem_region iterator.
  2013-07-12  2:03 ` Robin Holt
@ 2013-07-12  2:03   ` Robin Holt
  -1 siblings, 0 replies; 153+ messages in thread
From: Robin Holt @ 2013-07-12  2:03 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
      Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
      Mel Gorman

As part of initializing struct pages in 2MiB chunks, we noticed that at
the end of free_all_bootmem() there was nothing that forced the
reserved/allocated 4KiB pages to be initialized. This helper function
will be used for that expansion.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
To: "H. Peter Anvin" <hpa@zytor.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Linux MM <linux-mm@kvack.org>
Cc: Rob Landley <rob@landley.net>
Cc: Mike Travis <travis@sgi.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/memblock.h | 18 ++++++++++++++++++
 mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+				phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end)			\
+	for (i = 0UL,							\
+	     __next_reserved_mem_region(&i, p_start, p_end);		\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_reserved_mem_region(&i, p_start, p_end))
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+						phys_addr_t *out_start,
+						phys_addr_t *out_end)
+{
+	struct memblock_type *rsv = &memblock.reserved;
+
+	if (*idx >= 0 && *idx < rsv->cnt) {
+		struct memblock_region *r = &rsv->regions[*idx];
+		phys_addr_t base = r->base;
+		phys_addr_t size = r->size;
+
+		if (out_start)
+			*out_start = base;
+		if (out_end)
+			*out_end = base + size - 1;
+
+		*idx += 1;
+		return;
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
-- 
1.8.2.1
* [RFC 2/4] Have __free_pages_memory() free in larger chunks.
  2013-07-12  2:03 ` Robin Holt
@ 2013-07-12  2:03   ` Robin Holt
  -1 siblings, 0 replies; 153+ messages in thread
From: Robin Holt @ 2013-07-12  2:03 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
      Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
      Mel Gorman

Currently, when free_all_bootmem() calls __free_pages_memory(), the
number of contiguous pages that __free_pages_memory() passes to the
buddy allocator is limited to BITS_PER_LONG. In order to be able to
free only the first page of a 2MiB chunk, we need that to be increased
to PTRS_PER_PMD.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
To: "H. Peter Anvin" <hpa@zytor.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Cc: Linux MM <linux-mm@kvack.org>
Cc: Rob Landley <rob@landley.net>
Cc: Mike Travis <travis@sgi.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
---
 mm/nobootmem.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..3b512ca 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
 static void __init __free_pages_memory(unsigned long start, unsigned long end)
 {
 	unsigned long i, start_aligned, end_aligned;
-	int order = ilog2(BITS_PER_LONG);
+	int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
 
-	start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
-	end_aligned = end & ~(BITS_PER_LONG - 1);
+	start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
+	end_aligned = end & ~((1UL << order) - 1);
 
 	if (end_aligned <= start_aligned) {
 		for (i = start; i < end; i++)
@@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
 	for (i = start; i < start_aligned; i++)
 		__free_pages_bootmem(pfn_to_page(i), 0);
 
-	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
+	for (i = start_aligned; i < end_aligned; i += 1 << order)
 		__free_pages_bootmem(pfn_to_page(i), order);
 
 	for (i = end_aligned; i < end; i++)
-- 
1.8.2.1
* Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.
  2013-07-12  2:03 ` Robin Holt
@ 2013-07-12  7:45   ` Robin Holt
  -1 siblings, 0 replies; 153+ messages in thread
From: Robin Holt @ 2013-07-12  7:45 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
      Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
      Mel Gorman

After sleeping on this, why can't we change __free_pages_bootmem() to
not take an order, but rather nr_pages? If we did that, then
__free_pages_memory() could just calculate nr_pages and call
__free_pages_bootmem() one time. I don't see why any of the callers of
__free_pages_bootmem() would not easily support that change. Maybe I
will work that up as part of a -v2 and see if it boots/runs.

At the very least, I think we could change to:

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
	int order;

	while (start < end) {
		order = ffs(start);

		while (start + (1UL << order) > end)
			order--;

		__free_pages_bootmem(start, order);

		start += (1UL << order);
	}
}

Robin

On Thu, Jul 11, 2013 at 09:03:53PM -0500, Robin Holt wrote:
> Currently, when free_all_bootmem() calls __free_pages_memory(), the
> number of contiguous pages that __free_pages_memory() passes to the
> buddy allocator is limited to BITS_PER_LONG. In order to be able to
> free only the first page of a 2MiB chunk, we need that to be increased
> to PTRS_PER_PMD.
>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
> To: "H. Peter Anvin" <hpa@zytor.com>
> To: Ingo Molnar <mingo@kernel.org>
> Cc: Linux Kernel <linux-kernel@vger.kernel.org>
> Cc: Linux MM <linux-mm@kvack.org>
> Cc: Rob Landley <rob@landley.net>
> Cc: Mike Travis <travis@sgi.com>
> Cc: Daniel J Blueman <daniel@numascale-asia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> ---
>  mm/nobootmem.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bdd3fa2..3b512ca 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
>  static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  {
>  	unsigned long i, start_aligned, end_aligned;
> -	int order = ilog2(BITS_PER_LONG);
> +	int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
>
> -	start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
> -	end_aligned = end & ~(BITS_PER_LONG - 1);
> +	start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
> +	end_aligned = end & ~((1UL << order) - 1);
>
>  	if (end_aligned <= start_aligned) {
>  		for (i = start; i < end; i++)
> @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  	for (i = start; i < start_aligned; i++)
>  		__free_pages_bootmem(pfn_to_page(i), 0);
>
> -	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
> +	for (i = start_aligned; i < end_aligned; i += 1 << order)
>  		__free_pages_bootmem(pfn_to_page(i), order);
>
>  	for (i = end_aligned; i < end; i++)
> --
> 1.8.2.1
* Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.
  2013-07-12  7:45 ` Robin Holt
@ 2013-07-13  3:08   ` Yinghai Lu
  -1 siblings, 0 replies; 153+ messages in thread
From: Yinghai Lu @ 2013-07-13  3:08 UTC (permalink / raw)
  To: Robin Holt
  Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM,
      Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH,
      Mel Gorman

On Fri, Jul 12, 2013 at 12:45 AM, Robin Holt <holt@sgi.com> wrote:
> At the very least, I think we could change to:
> static void __init __free_pages_memory(unsigned long start, unsigned long end)
> {
> 	int order;
>
> 	while (start < end) {
> 		order = ffs(start);
>
> 		while (start + (1UL << order) > end)
> 			order--;
>
> 		__free_pages_bootmem(start, order);
>
> 		start += (1UL << order);
> 	}
> }

should work, but need to make sure order < MAX_ORDER.

Yinghai
* Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.
  2013-07-12  2:03 ` Robin Holt
@ 2013-07-16 13:02   ` Sam Ben
  -1 siblings, 0 replies; 153+ messages in thread
From: Sam Ben @ 2013-07-16 13:02 UTC (permalink / raw)
  To: Robin Holt
  Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM,
      Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH,
      Yinghai Lu, Mel Gorman

Hi Robin,

On 07/12/2013 10:03 AM, Robin Holt wrote:
> Currently, when free_all_bootmem() calls __free_pages_memory(), the
> number of contiguous pages that __free_pages_memory() passes to the
> buddy allocator is limited to BITS_PER_LONG. In order to be able to

I fail to understand this. Why is the original page number BITS_PER_LONG?

> free only the first page of a 2MiB chunk, we need that to be increased
> to PTRS_PER_PMD.
>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
> To: "H. Peter Anvin" <hpa@zytor.com>
> To: Ingo Molnar <mingo@kernel.org>
> Cc: Linux Kernel <linux-kernel@vger.kernel.org>
> Cc: Linux MM <linux-mm@kvack.org>
> Cc: Rob Landley <rob@landley.net>
> Cc: Mike Travis <travis@sgi.com>
> Cc: Daniel J Blueman <daniel@numascale-asia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> ---
>  mm/nobootmem.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bdd3fa2..3b512ca 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
>  static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  {
>  	unsigned long i, start_aligned, end_aligned;
> -	int order = ilog2(BITS_PER_LONG);
> +	int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
>
> -	start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
> -	end_aligned = end & ~(BITS_PER_LONG - 1);
> +	start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
> +	end_aligned = end & ~((1UL << order) - 1);
>
>  	if (end_aligned <= start_aligned) {
>  		for (i = start; i < end; i++)
> @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  	for (i = start; i < start_aligned; i++)
>  		__free_pages_bootmem(pfn_to_page(i), 0);
>
> -	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
> +	for (i = start_aligned; i < end_aligned; i += 1 << order)
>  		__free_pages_bootmem(pfn_to_page(i), order);
>
>  	for (i = end_aligned; i < end; i++)
* Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. 2013-07-16 13:02 ` Sam Ben @ 2013-07-23 15:32 ` Johannes Weiner -1 siblings, 0 replies; 153+ messages in thread From: Johannes Weiner @ 2013-07-23 15:32 UTC (permalink / raw) To: Sam Ben Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman On Tue, Jul 16, 2013 at 09:02:53PM +0800, Sam Ben wrote: > Hi Robin, > On 07/12/2013 10:03 AM, Robin Holt wrote: > >Currently, when free_all_bootmem() calls __free_pages_memory(), the > >number of contiguous pages that __free_pages_memory() passes to the > >buddy allocator is limited to BITS_PER_LONG. In order to be able to > > I fail to understand this. Why the original page number is BITS_PER_LONG? The mm/bootmem.c implementation uses a bitmap to keep track of free/reserved pages. It walks that bitmap in BITS_PER_LONG steps because it is the biggest chunk that is still trivial and cheap to check if all pages are free in it (chunk == ~0UL). nobootmem.c was written based on the bootmem.c interface, so it was probably adapted to keep things similar between the two, short of a pressing reason not to. ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC 3/4] Separate page initialization into a separate function. 2013-07-12 2:03 ` Robin Holt @ 2013-07-12 2:03 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-12 2:03 UTC (permalink / raw) To: H. Peter Anvin, Ingo Molnar Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nate Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/mm_init.c | 2 +- mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ 2 files changed, 45 insertions(+), 32 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c3edb62..635b131 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); }
+static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) +{ + unsigned long pfn = page_to_pfn(page); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + if (reserved) { + SetPageReserved(page); + } else { + ClearPageReserved(page); + set_page_count(page, 0); + } + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, continue; } page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. 
This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(page, zone, nid, 1); } } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [RFC 3/4] Separate page initialization into a separate function. 2013-07-12 2:03 ` Robin Holt @ 2013-07-13 3:06 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-13 3:06 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > Currently, memmap_init_zone() has all the smarts for initializing a > single page. When we convert to initializing pages in a 2MiB chunk, > we will need to do this equivalent work from two separate places > so we are breaking out a helper function. > > Signed-off-by: Robin Holt <holt@sgi.com> > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > To: "H. Peter Anvin" <hpa@zytor.com> > To: Ingo Molnar <mingo@kernel.org> > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > Cc: Linux MM <linux-mm@kvack.org> > Cc: Rob Landley <rob@landley.net> > Cc: Mike Travis <travis@sgi.com> > Cc: Daniel J Blueman <daniel@numascale-asia.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Greg KH <gregkh@linuxfoundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > Cc: Mel Gorman <mgorman@suse.de> > --- > mm/mm_init.c | 2 +- > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > 2 files changed, 45 insertions(+), 32 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index c280a02..be8a539 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > BUG_ON(or_mask != add_mask); > } > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > unsigned long nid, unsigned long pfn) > { > BUG_ON(page_to_nid(page) != nid); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c3edb62..635b131 100644 > --- a/mm/page_alloc.c > +++
b/mm/page_alloc.c > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > +{ > + unsigned long pfn = page_to_pfn(page); > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > + > + set_page_links(page, zone, nid, pfn); > + mminit_verify_page_links(page, zone, nid, pfn); > + init_page_count(page); > + page_mapcount_reset(page); > + page_nid_reset_last(page); > + if (reserved) { > + SetPageReserved(page); > + } else { > + ClearPageReserved(page); > + set_page_count(page, 0); > + } > + /* > + * Mark the block movable so that blocks are reserved for > + * movable at startup. This will force kernel allocations > + * to reserve their blocks rather than leaking throughout > + * the address space during boot when many long-lived > + * kernel allocations are made. Later some blocks near > + * the start are marked MIGRATE_RESERVE by > + * setup_zone_migrate_reserve() > + * > + * bitmap is created for zone's valid pfn range. but memmap > + * can be created for invalid pages (for alignment) > + * check here not to call set_pageblock_migratetype() against > + * pfn out of zone. > + */ > + if ((z->zone_start_pfn <= pfn) > + && (pfn < zone_end_pfn(z)) > + && !(pfn & (pageblock_nr_pages - 1))) > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > + > + INIT_LIST_HEAD(&page->lru); > +#ifdef WANT_PAGE_VIRTUAL > + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ > + if (!is_highmem_idx(zone)) > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > +#endif > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > continue; > } > page = pfn_to_page(pfn); > - set_page_links(page, zone, nid, pfn); > - mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > - page_mapcount_reset(page); > - page_nid_reset_last(page); > - SetPageReserved(page); > - /* > - * Mark the block movable so that blocks are reserved for > - * movable at startup. This will force kernel allocations > - * to reserve their blocks rather than leaking throughout > - * the address space during boot when many long-lived > - * kernel allocations are made. Later some blocks near > - * the start are marked MIGRATE_RESERVE by > - * setup_zone_migrate_reserve() > - * > - * bitmap is created for zone's valid pfn range. but memmap > - * can be created for invalid pages (for alignment) > - * check here not to call set_pageblock_migratetype() against > - * pfn out of zone. > - */ > - if ((z->zone_start_pfn <= pfn) > - && (pfn < zone_end_pfn(z)) > - && !(pfn & (pageblock_nr_pages - 1))) > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > - > - INIT_LIST_HEAD(&page->lru); > -#ifdef WANT_PAGE_VIRTUAL > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > - if (!is_highmem_idx(zone)) > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > -#endif > + __init_single_page(page, zone, nid, 1); Can you move page = pfn_to_page(pfn) into __init_single_page and pass pfn directly? Yinghai ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 3/4] Separate page initialization into a separate function. 2013-07-13 3:06 ` Yinghai Lu @ 2013-07-15 3:19 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-15 3:19 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 08:06:52PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > Currently, memmap_init_zone() has all the smarts for initializing a > > single page. When we convert to initializing pages in a 2MiB chunk, > > we will need to do this equivalent work from two separate places > > so we are breaking out a helper function. > > > > Signed-off-by: Robin Holt <holt@sgi.com> > > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > > To: "H. Peter Anvin" <hpa@zytor.com> > > To: Ingo Molnar <mingo@kernel.org> > > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > > Cc: Linux MM <linux-mm@kvack.org> > > Cc: Rob Landley <rob@landley.net> > > Cc: Mike Travis <travis@sgi.com> > > Cc: Daniel J Blueman <daniel@numascale-asia.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Greg KH <gregkh@linuxfoundation.org> > > Cc: Yinghai Lu <yinghai@kernel.org> > > Cc: Mel Gorman <mgorman@suse.de> > > --- > > mm/mm_init.c | 2 +- > > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > > 2 files changed, 45 insertions(+), 32 deletions(-) > > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > index c280a02..be8a539 100644 > > --- a/mm/mm_init.c > > +++ b/mm/mm_init.c > > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > > BUG_ON(or_mask != add_mask); > > } > > > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > > unsigned long nid, unsigned long pfn) > > { > >
BUG_ON(page_to_nid(page) != nid); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c3edb62..635b131 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > > spin_unlock(&zone->lock); > > } > > > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > > +{ > > + unsigned long pfn = page_to_pfn(page); > > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > + > > + set_page_links(page, zone, nid, pfn); > > + mminit_verify_page_links(page, zone, nid, pfn); > > + init_page_count(page); > > + page_mapcount_reset(page); > > + page_nid_reset_last(page); > > + if (reserved) { > > + SetPageReserved(page); > > + } else { > > + ClearPageReserved(page); > > + set_page_count(page, 0); > > + } > > + /* > > + * Mark the block movable so that blocks are reserved for > > + * movable at startup. This will force kernel allocations > > + * to reserve their blocks rather than leaking throughout > > + * the address space during boot when many long-lived > > + * kernel allocations are made. Later some blocks near > > + * the start are marked MIGRATE_RESERVE by > > + * setup_zone_migrate_reserve() > > + * > > + * bitmap is created for zone's valid pfn range. but memmap > > + * can be created for invalid pages (for alignment) > > + * check here not to call set_pageblock_migratetype() against > > + * pfn out of zone. > > + */ > > + if ((z->zone_start_pfn <= pfn) > > + && (pfn < zone_end_pfn(z)) > > + && !(pfn & (pageblock_nr_pages - 1))) > > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > + > > + INIT_LIST_HEAD(&page->lru); > > +#ifdef WANT_PAGE_VIRTUAL > > + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ > > + if (!is_highmem_idx(zone)) > > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > > +#endif > > +} > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > continue; > > } > > page = pfn_to_page(pfn); > > - set_page_links(page, zone, nid, pfn); > > - mminit_verify_page_links(page, zone, nid, pfn); > > - init_page_count(page); > > - page_mapcount_reset(page); > > - page_nid_reset_last(page); > > - SetPageReserved(page); > > - /* > > - * Mark the block movable so that blocks are reserved for > > - * movable at startup. This will force kernel allocations > > - * to reserve their blocks rather than leaking throughout > > - * the address space during boot when many long-lived > > - * kernel allocations are made. Later some blocks near > > - * the start are marked MIGRATE_RESERVE by > > - * setup_zone_migrate_reserve() > > - * > > - * bitmap is created for zone's valid pfn range. but memmap > > - * can be created for invalid pages (for alignment) > > - * check here not to call set_pageblock_migratetype() against > > - * pfn out of zone. > > - */ > > - if ((z->zone_start_pfn <= pfn) > > - && (pfn < zone_end_pfn(z)) > > - && !(pfn & (pageblock_nr_pages - 1))) > > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > - > > - INIT_LIST_HEAD(&page->lru); > > -#ifdef WANT_PAGE_VIRTUAL > > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > - if (!is_highmem_idx(zone)) > > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > > -#endif > > + __init_single_page(page, zone, nid, 1); > > Can you > move page = pfn_to_page(pfn) into __init_single_page > and pass pfn directly? Sure, but then I don't care for the name so much, but I can think on that some too. I think the feedback I was most hoping to receive was pertaining to a means for removing the PG_uninitialized2Mib flag entirely. 
If I could get rid of that and have some page-local way of knowing if it has not been initialized, I think this patch set would be much better. Robin ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 3/4] Seperate page initialization into a separate function. @ 2013-07-15 3:19 ` Robin Holt 0 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-15 3:19 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 08:06:52PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > Currently, memmap_init_zone() has all the smarts for initializing a > > single page. When we convert to initializing pages in a 2MiB chunk, > > we will need to do this equivalent work from two separate places > > so we are breaking out a helper function. > > > > Signed-off-by: Robin Holt <holt@sgi.com> > > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > > To: "H. Peter Anvin" <hpa@zytor.com> > > To: Ingo Molnar <mingo@kernel.org> > > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > > Cc: Linux MM <linux-mm@kvack.org> > > Cc: Rob Landley <rob@landley.net> > > Cc: Mike Travis <travis@sgi.com> > > Cc: Daniel J Blueman <daniel@numascale-asia.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Greg KH <gregkh@linuxfoundation.org> > > Cc: Yinghai Lu <yinghai@kernel.org> > > Cc: Mel Gorman <mgorman@suse.de> > > --- > > mm/mm_init.c | 2 +- > > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > > 2 files changed, 45 insertions(+), 32 deletions(-) > > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > index c280a02..be8a539 100644 > > --- a/mm/mm_init.c > > +++ b/mm/mm_init.c > > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > > BUG_ON(or_mask != add_mask); > > } > > > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > > unsigned long nid, unsigned long pfn) > > { > > BUG_ON(page_to_nid(page) != nid); > 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c3edb62..635b131 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > > spin_unlock(&zone->lock); > > } > > > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > > +{ > > + unsigned long pfn = page_to_pfn(page); > > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > + > > + set_page_links(page, zone, nid, pfn); > > + mminit_verify_page_links(page, zone, nid, pfn); > > + init_page_count(page); > > + page_mapcount_reset(page); > > + page_nid_reset_last(page); > > + if (reserved) { > > + SetPageReserved(page); > > + } else { > > + ClearPageReserved(page); > > + set_page_count(page, 0); > > + } > > + /* > > + * Mark the block movable so that blocks are reserved for > > + * movable at startup. This will force kernel allocations > > + * to reserve their blocks rather than leaking throughout > > + * the address space during boot when many long-lived > > + * kernel allocations are made. Later some blocks near > > + * the start are marked MIGRATE_RESERVE by > > + * setup_zone_migrate_reserve() > > + * > > + * bitmap is created for zone's valid pfn range. but memmap > > + * can be created for invalid pages (for alignment) > > + * check here not to call set_pageblock_migratetype() against > > + * pfn out of zone. > > + */ > > + if ((z->zone_start_pfn <= pfn) > > + && (pfn < zone_end_pfn(z)) > > + && !(pfn & (pageblock_nr_pages - 1))) > > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > + > > + INIT_LIST_HEAD(&page->lru); > > +#ifdef WANT_PAGE_VIRTUAL > > + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ > > + if (!is_highmem_idx(zone)) > > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > > +#endif > > +} > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > continue; > > } > > page = pfn_to_page(pfn); > > - set_page_links(page, zone, nid, pfn); > > - mminit_verify_page_links(page, zone, nid, pfn); > > - init_page_count(page); > > - page_mapcount_reset(page); > > - page_nid_reset_last(page); > > - SetPageReserved(page); > > - /* > > - * Mark the block movable so that blocks are reserved for > > - * movable at startup. This will force kernel allocations > > - * to reserve their blocks rather than leaking throughout > > - * the address space during boot when many long-lived > > - * kernel allocations are made. Later some blocks near > > - * the start are marked MIGRATE_RESERVE by > > - * setup_zone_migrate_reserve() > > - * > > - * bitmap is created for zone's valid pfn range. but memmap > > - * can be created for invalid pages (for alignment) > > - * check here not to call set_pageblock_migratetype() against > > - * pfn out of zone. > > - */ > > - if ((z->zone_start_pfn <= pfn) > > - && (pfn < zone_end_pfn(z)) > > - && !(pfn & (pageblock_nr_pages - 1))) > > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > - > > - INIT_LIST_HEAD(&page->lru); > > -#ifdef WANT_PAGE_VIRTUAL > > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > - if (!is_highmem_idx(zone)) > > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > > -#endif > > + __init_single_page(page, zone, nid, 1); > > Can you > move page = pfn_to_page(pfn) into __init_single_page > and pass pfn directly? Sure, but then I don't care for the name so much, but I can think on that some too. I think the feedback I was most hoping to receive was pertaining to a means for removing the PG_uninitialized2Mib flag entirely. 
If I could get rid of that and have some page-local way of knowing whether it has been initialized, I think this patch set would be much better. Robin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC 4/4] Sparse initialization of struct page array. 2013-07-12 2:03 ` Robin Holt @ 2013-07-12 2:03 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-12 2:03 UTC (permalink / raw) To: H. Peter Anvin, Ingo Molnar Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nate Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/mm.h | 11 +++++ include/linux/page-flags.h | 5 +- mm/nobootmem.c | 5 ++ mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- 4 files changed, 132 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..3de08b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) __free_page(page); } +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); + +static inline void __reserve_bootmem_page(struct page *page) +{ + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; + 
phys_addr_t end = start + PAGE_SIZE; + + __reserve_bootmem_region(start, end); +} + static inline void free_reserved_page(struct page *page) { + __reserve_bootmem_page(page); __free_reserved_page(page); adjust_managed_page_count(page, 1); } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..79e8eb7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. 
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 3b512ca..e3a386d 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + __reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) { struct pglist_data *pgdat; + memblock_dump_all(); + for_each_online_pgdat(pgdat) reset_node_lowmem_managed_pages(pgdat); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 635b131..fe51eb9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i #endif } +static void expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int reserved = PageReserved(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2Mib(basepage); + + for( pfn++; pfn < end_pfn; pfn++ ) + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); +} + +void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += 
PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if(PageUninitialized2Mib(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + ensure_pages_are_initialized(start_pfn, end_pfn); +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + __reserve_bootmem_page(page); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2Mib(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2Mib(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2Mib(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2Mib(p)) + 
expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; } @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2Mib(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) } } +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) + { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + /* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. 
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + page = pfn_to_page(pfn); __init_single_page(page, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2Mib(page); + + pfn += pfns; } } @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-12 2:03 ` Robin Holt @ 2013-07-13 4:19 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-13 4:19 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > Signed-off-by: Robin Holt <holt@sgi.com> > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > To: "H. 
Peter Anvin" <hpa@zytor.com> > To: Ingo Molnar <mingo@kernel.org> > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > Cc: Linux MM <linux-mm@kvack.org> > Cc: Rob Landley <rob@landley.net> > Cc: Mike Travis <travis@sgi.com> > Cc: Daniel J Blueman <daniel@numascale-asia.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Greg KH <gregkh@linuxfoundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > Cc: Mel Gorman <mgorman@suse.de> > --- > include/linux/mm.h | 11 +++++ > include/linux/page-flags.h | 5 +- > mm/nobootmem.c | 5 ++ > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > 4 files changed, 132 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..3de08b5 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} > + > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > PG_private, /* If pagecache, has fs-private data */ > PG_private_2, /* If pagecache, has fs aux data */ > PG_writeback, /* Page is under writeback */ > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > __PAGEFLAG(SlobFree, slob_free) > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > + > /* > * Private page markings that may be used by the filesystem that owns the page > * for its own purposes. > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > #define PAGE_FLAGS_CHECK_AT_FREE \ > (1 << PG_lru | 1 << PG_locked | \ > 1 << PG_private | 1 << PG_private_2 | \ > - 1 << PG_writeback | 1 << PG_reserved | \ > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > __PG_COMPOUND_LOCK) > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 3b512ca..e3a386d 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + __reserve_bootmem_region(start, end); > + How about holes that is not in memblock.reserved? before this patch: free_area_init_node/free_area_init_core/memmap_init_zone will mark all page in node range to Reserved in struct page, at first. but those holes is not mapped via kernel low mapping. so it should be ok not touch "struct page" for them. Now you only mark reserved for memblock.reserved at first, and later mark {memblock.memory} - { memblock.reserved} to be available. And that is ok. but should split that change to another patch and add some comment and change log for the change. in case there is some user like UEFI etc do weird thing. 
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > { > struct pglist_data *pgdat; > > + memblock_dump_all(); > + Not needed. > for_each_online_pgdat(pgdat) > reset_node_lowmem_managed_pages(pgdat); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 635b131..fe51eb9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} > + > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > +{ > + unsigned long start_pfn = PFN_DOWN(start); > + unsigned long end_pfn = PFN_UP(end); > + > + ensure_pages_are_initialized(start_pfn, end_pfn); > +} that name is confusing, actually it is setting to struct page to Reserved only. maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? 
* Re: [RFC 4/4] Sparse initialization of struct page array. @ 2013-07-13 4:19 ` Yinghai Lu 0 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-13 4:19 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > Signed-off-by: Robin Holt <holt@sgi.com> > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > To: "H. Peter Anvin" <hpa@zytor.com> > To: Ingo Molnar <mingo@kernel.org> > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > Cc: Linux MM <linux-mm@kvack.org> > Cc: Rob Landley <rob@landley.net> > Cc: Mike Travis <travis@sgi.com> > Cc: Daniel J Blueman <daniel@numascale-asia.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Greg KH <gregkh@linuxfoundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > Cc: Mel Gorman <mgorman@suse.de> > --- > include/linux/mm.h | 11 +++++ > include/linux/page-flags.h | 5 +- > mm/nobootmem.c | 5 ++ > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > 4 files changed, 132 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..3de08b5 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline 
void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} > + > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ > PG_private, /* If pagecache, has fs-private data */ > PG_private_2, /* If pagecache, has fs aux data */ > PG_writeback, /* Page is under writeback */ > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > __PAGEFLAG(SlobFree, slob_free) > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > + > /* > * Private page markings that may be used by the filesystem that owns the page > * for its own purposes. 
> @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > #define PAGE_FLAGS_CHECK_AT_FREE \ > (1 << PG_lru | 1 << PG_locked | \ > 1 << PG_private | 1 << PG_private_2 | \ > - 1 << PG_writeback | 1 << PG_reserved | \ > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > __PG_COMPOUND_LOCK) > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 3b512ca..e3a386d 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + __reserve_bootmem_region(start, end); > + How about holes that is not in memblock.reserved? before this patch: free_area_init_node/free_area_init_core/memmap_init_zone will mark all page in node range to Reserved in struct page, at first. but those holes is not mapped via kernel low mapping. so it should be ok not touch "struct page" for them. Now you only mark reserved for memblock.reserved at first, and later mark {memblock.memory} - { memblock.reserved} to be available. And that is ok. but should split that change to another patch and add some comment and change log for the change. in case there is some user like UEFI etc do weird thing. > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > { > struct pglist_data *pgdat; > > + memblock_dump_all(); > + Not needed. 
> for_each_online_pgdat(pgdat) > reset_node_lowmem_managed_pages(pgdat); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 635b131..fe51eb9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} > + > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > +{ > + unsigned long start_pfn = PFN_DOWN(start); > + unsigned long end_pfn = PFN_UP(end); > + > + ensure_pages_are_initialized(start_pfn, end_pfn); > +} that name is confusing, actually it is setting to struct page to Reserved only. maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? > + > +static inline void ensure_page_is_initialized(struct page *page) > +{ > + __reserve_bootmem_page(page); > +} how about use __reserve_page_bootmem directly and add comment in callers site ? 
> + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > if (PageAnon(page)) > page->mapping = NULL; > for (i = 0; i < (1 << order); i++) > - bad += free_pages_check(page + i); > + if (PageUninitialized2Mib(page + i)) > + i += PTRS_PER_PMD - 1; > + else > + bad += free_pages_check(page + i); > if (bad) > return false; > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > unsigned int loop; > > prefetchw(page); > - for (loop = 0; loop < nr_pages; loop++) { > + for (loop = 0; loop < nr_pages; ) { > struct page *p = &page[loop]; > > if (loop + 1 < nr_pages) > prefetchw(p + 1); > + > + if ((PageUninitialized2Mib(p)) && > + ((loop + PTRS_PER_PMD) > nr_pages)) > + ensure_page_is_initialized(p); > + > __ClearPageReserved(p); > set_page_count(p, 0); > + if (PageUninitialized2Mib(p)) > + loop += PTRS_PER_PMD; > + else > + loop += 1; > } > > page_zone(page)->managed_pages += 1 << order; > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > area--; > high--; > size >>= 1; > + ensure_page_is_initialized(page); > VM_BUG_ON(bad_range(zone, &page[size])); > > #ifdef CONFIG_DEBUG_PAGEALLOC > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > for (i = 0; i < (1 << order); i++) { > struct page *p = page + i; > + > + if (PageUninitialized2Mib(p)) > + expand_page_initialization(page); > + > if (unlikely(check_new_page(p))) > return 1; > } > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > unsigned long order; > int pages_moved = 0; > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > #ifndef CONFIG_HOLES_IN_ZONE > /* > * page_zone is not safe to call in this context when > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > for (pfn = 
start_pfn; pfn < end_pfn; pfn++) { > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > return 1; > + > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > + pfn += PTRS_PER_PMD; > } > return 0; > } > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > } > } > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) why not use static ? it seems there is not outside user. > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} > + > /* > * Initially all pages are reserved - free ones are freed > * up by free_all_bootmem() once the early boot process is > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > highest_memmap_pfn = end_pfn - 1; > > z = &NODE_DATA(nid)->node_zones[zone]; > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > + for (pfn = start_pfn; pfn < end_pfn; ) { > /* > * There can be holes in boot-time mem_map[]s > * handed to this function. They do not > * exist on hotplugged memory. > */ > + int pfns = 1; > if (context == MEMMAP_EARLY) { > - if (!early_pfn_valid(pfn)) > + if (!early_pfn_valid(pfn)) { > + pfn++; > continue; > - if (!early_pfn_in_nid(pfn, nid)) > + } > + if (!early_pfn_in_nid(pfn, nid)) { > + pfn++; > continue; > + } > + > + pfns = pfn_range_init_avail(pfn, end_pfn, > + PTRS_PER_PMD, nid); > } maybe could update memmap_init_zone() only iterate {memblock.memory} - {memblock.reserved}, so you do need to check avail .... as memmap_init_zone do not need to handle holes to mark reserve for them. 
> + > page = pfn_to_page(pfn); > __init_single_page(page, zone, nid, 1); > + > + if (pfns > 1) > + SetPageUninitialized2Mib(page); > + > + pfn += pfns; > } > } > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, PG_uninitialized_2m ? > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, Yinghai
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 4:19 ` Yinghai Lu @ 2013-07-13 4:39 ` H. Peter Anvin -1 siblings, 0 replies; 153+ messages in thread From: H. Peter Anvin @ 2013-07-13 4:39 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On 07/12/2013 09:19 PM, Yinghai Lu wrote: >> PG_reserved, >> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >> PG_private, /* If pagecache, has fs-private data */ The comment here is WTF... -hpa
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 4:39 ` H. Peter Anvin @ 2013-07-13 5:31 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-13 5:31 UTC (permalink / raw) To: H. Peter Anvin Cc: Robin Holt, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>> PG_reserved, >>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>> PG_private, /* If pagecache, has fs-private data */ > > The comment here is WTF... ntz: Nate Zimmer? rmh: Robin Holt?
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 5:31 ` Yinghai Lu @ 2013-07-13 5:38 ` H. Peter Anvin -1 siblings, 0 replies; 153+ messages in thread From: H. Peter Anvin @ 2013-07-13 5:38 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On 07/12/2013 10:31 PM, Yinghai Lu wrote: > On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>>> PG_reserved, >>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>>> PG_private, /* If pagecache, has fs-private data */ >> >> The comment here is WTF... > > ntz: Nate Zimmer? > rmh: Robin Holt? > This kind of conversation doesn't really belong in a code comment, though. -hpa
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 5:31 ` Yinghai Lu @ 2013-07-15 14:08 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-07-15 14:08 UTC (permalink / raw) To: Yinghai Lu Cc: H. Peter Anvin, Robin Holt, Ingo Molnar, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On 07/13/2013 12:31 AM, Yinghai Lu wrote: > On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>>> PG_reserved, >>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>>> PG_private, /* If pagecache, has fs-private data */ >> The comment here is WTF... > ntz: Nate Zimmer? > rmh: Robin Holt? Yea that comment was supposed to be removed. Sorry about that. Nate
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 4:19 ` Yinghai Lu @ 2013-07-15 17:45 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-07-15 17:45 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? > > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai I hadn't actually been very happy with having a PG_uninitialized2mib flag. It implies if we want to jump to 1Gb pages we would need a second flag, PG_uninitialized1gb, for that. I was thinking of changing it to PG_uninitialized and setting page->private to the correct order. Thoughts? Nate
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-15 17:45 ` Nathan Zimmer @ 2013-07-15 17:54 ` H. Peter Anvin -1 siblings, 0 replies; 153+ messages in thread From: H. Peter Anvin @ 2013-07-15 17:54 UTC (permalink / raw) To: Nathan Zimmer Cc: Yinghai Lu, Robin Holt, Ingo Molnar, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > It implies if we want to jump to 1Gb pages we would need a second flag, > PG_uninitialized1gb, for that. I was thinking of changing it to > PG_uninitialized and setting page->private to the correct order. > Thoughts? > Seems straightforward. The bigger issue is the amount of overhead we cause by having to check upstack for the initialization status of the superpages. I'm concerned, obviously, about lingering overhead that is "forever". That being said, in the absolutely worst case we could have a counter to the number of uninitialized pages which when it hits zero we do a static switch and switch out the initialization code (would have to be undone on memory hotplug, of course.) -hpa
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-15 17:54 ` H. Peter Anvin @ 2013-07-15 18:26 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-15 18:26 UTC (permalink / raw) To: H. Peter Anvin Cc: Nathan Zimmer, Yinghai Lu, Robin Holt, Ingo Molnar, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Mon, Jul 15, 2013 at 10:54:38AM -0700, H. Peter Anvin wrote: > On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > > It implies if we want to jump to 1Gb pages we would need a second flag, > > PG_uninitialized1gb, for that. I was thinking of changing it to > > PG_uninitialized and setting page->private to the correct order. > > Thoughts? > > > > Seems straightforward. The bigger issue is the amount of overhead we > cause by having to check upstack for the initialization status of the > superpages. > > I'm concerned, obviously, about lingering overhead that is "forever". > That being said, in the absolutely worst case we could have a counter to > the number of uninitialized pages which when it hits zero we do a static > switch and switch out the initialization code (would have to be undone > on memory hotplug, of course.) Is there a fairly cheap way to determine definitively that the struct page is not initialized? I think this patch set can change fairly drastically if we have that. I think I will start working up those changes and code a heavy-handed check until I hear of an alternative way to cheaply check. Thanks, Robin
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-15 18:26 ` Robin Holt @ 2013-07-15 18:29 ` H. Peter Anvin -1 siblings, 0 replies; 153+ messages in thread From: H. Peter Anvin @ 2013-07-15 18:29 UTC (permalink / raw) To: Robin Holt Cc: Nathan Zimmer, Yinghai Lu, Ingo Molnar, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On 07/15/2013 11:26 AM, Robin Holt wrote: > Is there a fairly cheap way to determine definitively that the struct > page is not initialized? By definition I would assume no. The only way I can think of would be to unmap the memory associated with the struct page in the TLB and initialize the struct pages at trap time. > I think this patch set can change fairly drastically if we have that. > I think I will start working up those changes and code a heavy-handed > check until I hear of an alternative way to cheaply check. -hpa
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-15 18:29 ` H. Peter Anvin @ 2013-07-23 8:32 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-07-23 8:32 UTC (permalink / raw) To: H. Peter Anvin Cc: Robin Holt, Nathan Zimmer, Yinghai Lu, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman * H. Peter Anvin <hpa@zytor.com> wrote: > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > Is there a fairly cheap way to determine definitively that the struct > > page is not initialized? > > By definition I would assume no. The only way I can think of would be > to unmap the memory associated with the struct page in the TLB and > initialize the struct pages at trap time. But ... the only fastpath impact I can see of delayed initialization right now is this piece of logic in prep_new_page(): @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2Mib(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; That is where I think it can be made zero overhead in the already-initialized case, because page-flags are already used in check_new_page(): static inline int check_new_page(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (atomic_read(&page->_count) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | (mem_cgroup_bad_page_check(page)))) { bad_page(page); return 1; see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every struct page on allocation. We can micro-optimize that low overhead to zero-overhead, by integrating the PageUninitialized2Mib() check into check_new_page(). 
This can be done by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { if (PageUninitialized2Mib(p)) expand_page_initialization(page); ... } if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (atomic_read(&page->_count) != 0) | (mem_cgroup_bad_page_check(page)))) { bad_page(page); return 1; this will result in making it essentially zero-overhead, the expand_page_initialization() logic is now in a slowpath. Am I missing anything here? Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-23 8:32 ` Ingo Molnar @ 2013-07-23 11:09 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-23 11:09 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Robin Holt, Nathan Zimmer, Yinghai Lu, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > Is there a fairly cheap way to determine definitively that the struct > > > page is not initialized? > > > > By definition I would assume no. The only way I can think of would be > > to unmap the memory associated with the struct page in the TLB and > > initialize the struct pages at trap time. > > But ... the only fastpath impact I can see of delayed initialization right > now is this piece of logic in prep_new_page(): > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > for (i = 0; i < (1 << order); i++) { > struct page *p = page + i; > + > + if (PageUninitialized2Mib(p)) > + expand_page_initialization(page); > + > if (unlikely(check_new_page(p))) > return 1; > > That is where I think it can be made zero overhead in the > already-initialized case, because page-flags are already used in > check_new_page(): The problem I see here is that the page flags we need to check for the uninitialized flag are in the "other" page for the page aligned at the 2MiB virtual address, not the page currently being referenced. Let me try a version of the patch where we set the PG_unintialized_2m flag on all pages, including the aligned pages and see what that does to performance. 
Robin > > static inline int check_new_page(struct page *page) > { > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > return 1; > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > struct page on allocation. > > We can micro-optimize that low overhead to zero-overhead, by integrating > the PageUninitialized2Mib() check into check_new_page(). This can be done > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > if (PageUninitialized2Mib(p)) > expand_page_initialization(page); > ... > } > > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > > return 1; > > this will result in making it essentially zero-overhead, the > expand_page_initialization() logic is now in a slowpath. > > Am I missing anything here? > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-23 11:09 ` Robin Holt @ 2013-07-23 11:15 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-23 11:15 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Robin Holt, Nathan Zimmer, Yinghai Lu, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman I think the other critical path which is affected is in expand(). There, we just call ensure_page_is_initialized() blindly which does the check against the other page. The below is a nearly zero addition. Sorry for the confusion. My morning coffee has not kicked in yet. Robin On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > page is not initialized? > > > > > > By definition I would assume no. The only way I can think of would be > > > to unmap the memory associated with the struct page in the TLB and > > > initialize the struct pages at trap time. > > > > But ... 
the only fastpath impact I can see of delayed initialization right > > now is this piece of logic in prep_new_page(): > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > > > That is where I think it can be made zero overhead in the > > already-initialized case, because page-flags are already used in > > check_new_page(): > > The problem I see here is that the page flags we need to check for the > uninitialized flag are in the "other" page for the page aligned at the > 2MiB virtual address, not the page currently being referenced. > > Let me try a version of the patch where we set the PG_unintialized_2m > flag on all pages, including the aligned pages and see what that does > to performance. > > Robin > > > > > static inline int check_new_page(struct page *page) > > { > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > return 1; > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > struct page on allocation. > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > if (PageUninitialized2Mib(p)) > > expand_page_initialization(page); > > ... 
> > } > > > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > > > return 1; > > > > this will result in making it essentially zero-overhead, the > > expand_page_initialization() logic is now in a slowpath. > > > > Am I missing anything here? > > > > Thanks, > > > > Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-23 11:15 ` Robin Holt @ 2013-07-23 11:41 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-23 11:41 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Robin Holt, Nathan Zimmer, Yinghai Lu, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > I think the other critical path which is affected is in expand(). > There, we just call ensure_page_is_initialized() blindly which does > the check against the other page. The below is a nearly zero addition. > Sorry for the confusion. My morning coffee has not kicked in yet. I don't have access to the 16TiB system until Thursday unless the other testing on it fails early. I did boot a 2TiB system with a change which set the Uninitialized_2m flag on all pages in that 2MiB range during memmap_init_zone. That makes the expand check test against the referenced page instead of having to go back to the 2MiB page. It appears to have added less than a second to the 2TiB boot so I hope it has equally little impact to the 16TiB boot. I will clean up this patch some more and resend the currently untested set later today. Thanks, Robin > > Robin > > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > > page is not initialized? > > > > > > > > By definition I would assume no. The only way I can think of would be > > > > to unmap the memory associated with the struct page in the TLB and > > > > initialize the struct pages at trap time. > > > > > > But ... 
the only fastpath impact I can see of delayed initialization right > > > now is this piece of logic in prep_new_page(): > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > for (i = 0; i < (1 << order); i++) { > > > struct page *p = page + i; > > > + > > > + if (PageUninitialized2Mib(p)) > > > + expand_page_initialization(page); > > > + > > > if (unlikely(check_new_page(p))) > > > return 1; > > > > > > That is where I think it can be made zero overhead in the > > > already-initialized case, because page-flags are already used in > > > check_new_page(): > > > > The problem I see here is that the page flags we need to check for the > > uninitialized flag are in the "other" page for the page aligned at the > > 2MiB virtual address, not the page currently being referenced. > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > flag on all pages, including the aligned pages and see what that does > > to performance. > > > > Robin > > > > > > > > static inline int check_new_page(struct page *page) > > > { > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > return 1; > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > > struct page on allocation. > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > if (PageUninitialized2Mib(p)) > > > expand_page_initialization(page); > > > ... 
> > > } > > > > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > > > > return 1; > > > > > > this will result in making it essentially zero-overhead, the > > > expand_page_initialization() logic is now in a slowpath. > > > > > > Am I missing anything here? > > > > > > Thanks, > > > > > > Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-23 11:41 ` Robin Holt @ 2013-07-23 11:50 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-23 11:50 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Robin Holt, Nathan Zimmer, Yinghai Lu, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Tue, Jul 23, 2013 at 06:41:50AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > > I think the other critical path which is affected is in expand(). > > There, we just call ensure_page_is_initialized() blindly which does > > the check against the other page. The below is a nearly zero addition. > > Sorry for the confusion. My morning coffee has not kicked in yet. > > I don't have access to the 16TiB system until Thursday unless the other > testing on it fails early. I did boot a 2TiB system with a change > which set the Uninitialized_2m flag on all pages in that 2MiB range > during memmap_init_zone. That makes the expand check test against > the referenced page instead of having to go back to the 2MiB page. > It appears to have added less than a second to the 2TiB boot so I hope > it has equally little impact to the 16TiB boot. I was wrong. One of the two logs I looked at was the wrong one. Setting that Uninitialized_2m flag on all pages added 30 seconds to the 2TiB boot's memmap_init_zone(). Please disregard. That brings me back to the belief we need a better solution for the expand() path. Robin > > I will clean up this patch some more and resend the currently untested > set later today. > > Thanks, > Robin > > > > > Robin > > > > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > > > > > * H. 
Peter Anvin <hpa@zytor.com> wrote: > > > > > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > > > page is not initialized? > > > > > > > > > > By definition I would assume no. The only way I can think of would be > > > > > to unmap the memory associated with the struct page in the TLB and > > > > > initialize the struct pages at trap time. > > > > > > > > But ... the only fastpath impact I can see of delayed initialization right > > > > now is this piece of logic in prep_new_page(): > > > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > > > for (i = 0; i < (1 << order); i++) { > > > > struct page *p = page + i; > > > > + > > > > + if (PageUninitialized2Mib(p)) > > > > + expand_page_initialization(page); > > > > + > > > > if (unlikely(check_new_page(p))) > > > > return 1; > > > > > > > > That is where I think it can be made zero overhead in the > > > > already-initialized case, because page-flags are already used in > > > > check_new_page(): > > > > > > The problem I see here is that the page flags we need to check for the > > > uninitialized flag are in the "other" page for the page aligned at the > > > 2MiB virtual address, not the page currently being referenced. > > > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > > flag on all pages, including the aligned pages and see what that does > > > to performance. > > > > > > Robin > > > > > > > > > > > static inline int check_new_page(struct page *page) > > > > { > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > return 1; > > > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? 
That always gets checked for every > > > > struct page on allocation. > > > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > > if (PageUninitialized2Mib(p)) > > > > expand_page_initialization(page); > > > > ... > > > > } > > > > > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > > > > > return 1; > > > > > > > > this will result in making it essentially zero-overhead, the > > > > expand_page_initialization() logic is now in a slowpath. > > > > > > > > Am I missing anything here? > > > > > > > > Thanks, > > > > > > > > Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 4:19 ` Yinghai Lu @ 2013-07-16 10:26 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-16 10:26 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt <holt@sgi.com> > > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > > To: "H. 
Peter Anvin" <hpa@zytor.com> > > To: Ingo Molnar <mingo@kernel.org> > > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > > Cc: Linux MM <linux-mm@kvack.org> > > Cc: Rob Landley <rob@landley.net> > > Cc: Mike Travis <travis@sgi.com> > > Cc: Daniel J Blueman <daniel@numascale-asia.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Greg KH <gregkh@linuxfoundation.org> > > Cc: Yinghai Lu <yinghai@kernel.org> > > Cc: Mel Gorman <mgorman@suse.de> > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
I will split out a separate patch for this. > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. Left over debug in the rush to ask our question. > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > 
> + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? Done. > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? I really dislike that. The inline function makes the need for a comment unnecessary in my opinion and leaves the implementation localized to the one-line function. Those wanting to understand why can quickly see that the intended functionality is accomplished by the other function. I would really like to leave this as-is. > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > > page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > 
> area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. Left over from early patch series where we were using this from mm/nobootmem.c. Fixed. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. Maybe I can change pfn_range_init_avail in such a way that the __reserve_pages_bootmem() work above is not needed. I will dig into that some more before the next patch submission. 
> > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? Done. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-13 4:19 ` Yinghai Lu @ 2013-07-25 2:25 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-25 2:25 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt <holt@sgi.com> > > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > > To: "H. 
Peter Anvin" <hpa@zytor.com> > > To: Ingo Molnar <mingo@kernel.org> > > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > > Cc: Linux MM <linux-mm@kvack.org> > > Cc: Rob Landley <rob@landley.net> > > Cc: Mike Travis <travis@sgi.com> > > Cc: Daniel J Blueman <daniel@numascale-asia.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Greg KH <gregkh@linuxfoundation.org> > > Cc: Yinghai Lu <yinghai@kernel.org> > > Cc: Mel Gorman <mgorman@suse.de> > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
* Re: [RFC 4/4] Sparse initialization of struct page array. @ 2013-07-25 2:25 ` Robin Holt 0 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-25 2:25 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <holt@sgi.com> wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt <holt@sgi.com> > > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > > To: "H. 
Peter Anvin" <hpa@zytor.com> > > To: Ingo Molnar <mingo@kernel.org> > > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > > Cc: Linux MM <linux-mm@kvack.org> > > Cc: Rob Landley <rob@landley.net> > > Cc: Mike Travis <travis@sgi.com> > > Cc: Daniel J Blueman <daniel@numascale-asia.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Greg KH <gregkh@linuxfoundation.org> > > Cc: Yinghai Lu <yinghai@kernel.org> > > Cc: Mel Gorman <mgorman@suse.de> > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
Nate and I talked this over today. Sorry for the delay, but it was the first time we were both free. Neither of us quite understands what you are asking for here. My interpretation is that you would like us to change the use of the PageReserved flag such that during boot, we do not set the flag at all from memmap_init_zone, and then only set it on pages which, at the time of free_all_bootmem, have been allocated/reserved in the memblock allocator. Is that correct? I will start to work that up on the assumption that is what you are asking for. Robin > > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. > > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + 
while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > > + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? > > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? > > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > 
> page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > > area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. > > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? 
> > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-25 2:25 ` Robin Holt @ 2013-07-25 12:50 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-25 12:50 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <holt@sgi.com> wrote: >> >> How about holes that is not in memblock.reserved? >> >> before this patch: >> free_area_init_node/free_area_init_core/memmap_init_zone >> will mark all page in node range to Reserved in struct page, at first. >> >> but those holes is not mapped via kernel low mapping. >> so it should be ok not touch "struct page" for them. >> >> Now you only mark reserved for memblock.reserved at first, and later >> mark {memblock.memory} - { memblock.reserved} to be available. >> And that is ok. >> >> but should split that change to another patch and add some comment >> and change log for the change. >> in case there is some user like UEFI etc do weird thing. > > Nate and I talked this over today. Sorry for the delay, but it was the > first time we were both free. Neither of us quite understands what you > are asking for here. My interpretation is that you would like us to > change the use of the PageReserved flag such that during boot, we do not > set the flag at all from memmap_init_zone, and then only set it on pages > which, at the time of free_all_bootmem, have been allocated/reserved in > the memblock allocator. Is that correct? I will start to work that up > on the assumption that is what you are asking for. Not exactly. your change should be right, but there is some subtle change about holes handling. before mem holes between memory ranges in memblock.memory, get struct page, and initialized with to Reserved in memmap_init_zone. 
Those holes are not in memblock.reserved; with your patches, those holes' struct pages will still be all zero. Please separate the change that sets pages to Reserved according to memblock.reserved into another patch. Thanks Yinghai ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-25 12:50 ` Yinghai Lu @ 2013-07-25 13:42 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-25 13:42 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: > On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <holt@sgi.com> wrote: > >> > >> How about holes that is not in memblock.reserved? > >> > >> before this patch: > >> free_area_init_node/free_area_init_core/memmap_init_zone > >> will mark all page in node range to Reserved in struct page, at first. > >> > >> but those holes is not mapped via kernel low mapping. > >> so it should be ok not touch "struct page" for them. > >> > >> Now you only mark reserved for memblock.reserved at first, and later > >> mark {memblock.memory} - { memblock.reserved} to be available. > >> And that is ok. > >> > >> but should split that change to another patch and add some comment > >> and change log for the change. > >> in case there is some user like UEFI etc do weird thing. > > > > Nate and I talked this over today. Sorry for the delay, but it was the > > first time we were both free. Neither of us quite understands what you > > are asking for here. My interpretation is that you would like us to > > change the use of the PageReserved flag such that during boot, we do not > > set the flag at all from memmap_init_zone, and then only set it on pages > > which, at the time of free_all_bootmem, have been allocated/reserved in > > the memblock allocator. Is that correct? I will start to work that up > > on the assumption that is what you are asking for. > > Not exactly. > > your change should be right, but there is some subtle change about > holes handling. 
> > before mem holes between memory ranges in memblock.memory, get struct page, > and initialized with to Reserved in memmap_init_zone. > Those holes is not in memory.reserved, with your patches, those hole's > struct page > will still have all 0. > > Please separate change about set page to reserved according to memory.reserved > to another patch. Just want to make sure this is where you want me to go. Here is my currently untested patch. Is that what you were expecting to have done? One thing I don't like about this patch is it seems to slow down boot in my simulator environment. I think I am going to look at restructuring things a bit to see if I can eliminate that performance penalty. Otherwise, I think I am following your directions. Thanks, Robin Holt >From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001 From: Robin Holt <holt@sgi.com> Date: Thu, 25 Jul 2013 04:25:15 -0500 Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved pages. --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 18 +++++++++++------- 3 files changed, 16 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. 
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 048e166..3aa30b7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order, } static void __init_single_page(unsigned long pfn, unsigned long zone, - int nid, int reserved) + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - if (reserved) { - SetPageReserved(page); - } else { - ClearPageReserved(page); - set_page_count(page, 0); - } + ClearPageReserved(page); + set_page_count(page, page_count); + /* * Mark the block movable so that blocks are reserved for * movable at startup. This will force kernel allocations @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + for (; start < end; start++) + if (pfn_valid(start)) + SetPageReserved(pfn_to_page(start)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-25 13:42 ` Robin Holt @ 2013-07-25 13:52 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-07-25 13:52 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Thu, Jul 25, 2013 at 6:42 AM, Robin Holt <holt@sgi.com> wrote: > On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: >> On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <holt@sgi.com> wrote: >> >> >> >> How about holes that is not in memblock.reserved? >> >> >> >> before this patch: >> >> free_area_init_node/free_area_init_core/memmap_init_zone >> >> will mark all page in node range to Reserved in struct page, at first. >> >> >> >> but those holes is not mapped via kernel low mapping. >> >> so it should be ok not touch "struct page" for them. >> >> >> >> Now you only mark reserved for memblock.reserved at first, and later >> >> mark {memblock.memory} - { memblock.reserved} to be available. >> >> And that is ok. >> >> >> >> but should split that change to another patch and add some comment >> >> and change log for the change. >> >> in case there is some user like UEFI etc do weird thing. >> > >> > Nate and I talked this over today. Sorry for the delay, but it was the >> > first time we were both free. Neither of us quite understands what you >> > are asking for here. My interpretation is that you would like us to >> > change the use of the PageReserved flag such that during boot, we do not >> > set the flag at all from memmap_init_zone, and then only set it on pages >> > which, at the time of free_all_bootmem, have been allocated/reserved in >> > the memblock allocator. Is that correct? I will start to work that up >> > on the assumption that is what you are asking for. >> >> Not exactly. 
>> >> your change should be right, but there is some subtle change about >> holes handling. >> >> before mem holes between memory ranges in memblock.memory, get struct page, >> and initialized with to Reserved in memmap_init_zone. >> Those holes is not in memory.reserved, with your patches, those hole's >> struct page >> will still have all 0. >> >> Please separate change about set page to reserved according to memory.reserved >> to another patch. > > > Just want to make sure this is where you want me to go. Here is my > currently untested patch. Is that what you were expecting to have done? > One thing I don't like about this patch is it seems to slow down boot in > my simulator environment. I think I am going to look at restructuring > things a bit to see if I can eliminate that performance penalty. > Otherwise, I think I am following your directions. > > From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001 > From: Robin Holt <holt@sgi.com> > Date: Thu, 25 Jul 2013 04:25:15 -0500 > Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved > pages. yes. > > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 18 +++++++++++------- > 3 files changed, 16 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. 
*/ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 048e166..3aa30b7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > } > > static void __init_single_page(unsigned long pfn, unsigned long zone, > - int nid, int reserved) > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - if (reserved) { > - SetPageReserved(page); > - } else { > - ClearPageReserved(page); > - set_page_count(page, 0); > - } > + ClearPageReserved(page); > + set_page_count(page, page_count); > + > /* > * Mark the block movable so that blocks are reserved for > * movable at startup. This will force kernel allocations > @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > -- > 1.8.2.1 > ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-12 2:03 ` Robin Holt @ 2013-07-15 21:30 ` Andrew Morton -1 siblings, 0 replies; 153+ messages in thread From: Andrew Morton @ 2013-07-15 21:30 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Greg KH, Yinghai Lu, Mel Gorman On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt <holt@sgi.com> wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} It isn't obvious that this needed to be inlined? > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ "mib" creeps me out too. And it makes me think of SNMP, which I'd prefer not to think about. We've traditionally had fears of running out of page flags, but I've lost track of how close we are to that happening. IIRC the answer depends on whether you believe there is such a thing as a 32-bit NUMA system. Can this be avoided anyway? I suspect there's some idiotic combination of flags we could use to indicate the state. PG_reserved|PG_lru or something. "2MB" sounds terribly arch-specific. Shouldn't we make it more generic for when the hexagon64 port wants to use 4MB? That conversational code comment was already commented on, but it's still there? > > ... > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) I think this can be made static. I hope so, as it's a somewhat odd-sounding identifier for a global. > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) checkpatch them, please. 
> + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} Some nice code comments for the above two functions would be helpful. > > ... > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} Document it, please. The return value semantics look odd, so don't forget to explain all that as well. > > ... > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, It would be better if the name which is visible in procfs matches the name in the kernel source code. > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 4/4] Sparse initialization of struct page array. 2013-07-15 21:30 ` Andrew Morton @ 2013-07-16 10:38 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-16 10:38 UTC (permalink / raw) To: Andrew Morton Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Greg KH, Yinghai Lu, Mel Gorman On Mon, Jul 15, 2013 at 02:30:37PM -0700, Andrew Morton wrote: > On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt <holt@sgi.com> wrote: > > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > It isn't obvious that this needed to be inlined? It is being declared in a header file. All the other functions I came across in that header file are declared as inline (or __always_inline). It feels to me like this is right. Can I leave it as-is? 
> > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ > > "mib" creeps me out too. And it makes me think of SNMP, which I'd > prefer not to think about. > > We've traditionally had fears of running out of page flags, but I've > lost track of how close we are to that happening. IIRC the answer > depends on whether you believe there is such a thing as a 32-bit NUMA > system. > > Can this be avoided anyway? I suspect there's some idiotic combination > of flags we could use to indicate the state. PG_reserved|PG_lru or > something. > > "2MB" sounds terribly arch-specific. Shouldn't we make it more generic > for when the hexagon64 port wants to use 4MB? > > That conversational code comment was already commented on, but it's > still there? I am going to work on making it non-2m based over the course of this week, so expect the _2m (current name based on Yinghai's comments) to go away entirely. > > > > ... 
> > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > I think this can be made static. I hope so, as it's a somewhat > odd-sounding identifier for a global. Done. > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > checkpatch them, please. Will certainly do. > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > Some nice code comments for the above two functions would be helpful. Will do. > > > > ... 
> > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > Document it, please. The return value semantics look odd, so don't > forget to explain all that as well. Will do. Will also work on the name to make it more clear what we are returning. > > > > ... > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > It would be better if the name which is visible in procfs matches the > name in the kernel source code. Done and will try to maintain the consistency. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, Robin ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 2:03 ` Robin Holt @ 2013-07-12 8:27 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-07-12 8:27 UTC (permalink / raw) To: Robin Holt, Borislav Petkov, Robert Richter Cc: H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra * Robin Holt <holt@sgi.com> wrote: > [...] > > With this patch, we did boot a 16TiB machine. Without the patches, the > v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. I have no feel > for how the 1GiB chunk size will perform. That's pretty impressive. It's still a 15x speedup instead of a 512x speedup, so I'd say there's something else being the current bottleneck, besides page init granularity. Can you boot with just a few gigs of RAM and stuff the rest into hotplug memory, and then hot-add that memory? That would allow easy profiling of remaining overhead. Side note: Robert Richter and Boris Petkov are working on 'persistent events' support for perf, which will eventually allow boot time profiling - I'm not sure if the patches and the tooling support is ready enough yet for your purposes. Robert, Boris, the following workflow would be pretty intuitive: - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB - we'd get a single (cycles?) event running once the perf subsystem is up and running, with a sampling frequency of 1 KHz, sending profiling trace events to a sufficiently sized profiling buffer of 16 MB per CPU. - once the system reaches SYSTEM_RUNNING, profiling is stopped either automatically - or the user stops it via a new tooling command. 
- the profiling buffer is extracted into a regular perf.data via a special 'perf record' call or some other, new perf tooling solution/variant. [ Alternatively the kernel could attempt to construct a 'virtual' perf.data from the persistent buffer, available via /sys/debug or elsewhere in /sys - just like the kernel constructs a 'virtual' /proc/kcore, etc. That file could be copied or used directly. ] - from that point on this workflow joins the regular profiling workflow: perf report, perf script et al can be used to analyze the resulting boot profile. Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* boot tracing 2013-07-12 8:27 ` Ingo Molnar @ 2013-07-12 8:47 ` Borislav Petkov -1 siblings, 0 replies; 153+ messages in thread From: Borislav Petkov @ 2013-07-12 8:47 UTC (permalink / raw) To: Ingo Molnar Cc: Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > Robert Richter and Boris Petkov are working on 'persistent events' > support for perf, which will eventually allow boot time profiling - > I'm not sure if the patches and the tooling support is ready enough > yet for your purposes. Nope, not yet but we're getting there. > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB What does perf=boot mean? I assume boot tracing. If so, does it mean we want to enable *all* tracepoints and collect whatever hits us? What makes more sense to me is to hijack what the function tracer does - i.e. simply collect all function calls. > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. Right, what would those trace events be? > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. Ok. > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] Yeah, that. 
> - from that point on this workflow joins the regular profiling workflow: > perf report, perf script et al can be used to analyze the resulting > boot profile. Agreed. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: boot tracing 2013-07-12 8:47 ` Borislav Petkov @ 2013-07-12 8:53 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-07-12 8:53 UTC (permalink / raw) To: Borislav Petkov Cc: Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra * Borislav Petkov <bp@alien8.de> wrote: > On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > Robert Richter and Boris Petkov are working on 'persistent events' > > support for perf, which will eventually allow boot time profiling - > > I'm not sure if the patches and the tooling support is ready enough > > yet for your purposes. > > Nope, not yet but we're getting there. > > > Robert, Boris, the following workflow would be pretty intuitive: > > > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > What does perf=boot mean? I assume boot tracing. In this case it would mean boot profiling - i.e. a cycles hardware-PMU event collecting into a perf trace buffer as usual. Essentially a 'perf record -a' work-alike, just one that gets activated as early as practical, and which would allow the profiling of memory initialization. Now, one extra complication here is that to be able to profile buddy allocator this persistent event would have to work before the buddy allocator is active :-/ So this sort of profiling would have to use memblock_alloc(). Just wanted to highlight this usecase, we might eventually want to support it. [ Note that this is different from boot tracing of one or more trace events - but it's a conceptually pretty close cousin. ] Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: boot tracing 2013-07-12 8:53 ` Ingo Molnar @ 2013-07-15 1:38 ` Sam Ben -1 siblings, 0 replies; 153+ messages in thread From: Sam Ben @ 2013-07-15 1:38 UTC (permalink / raw) To: Ingo Molnar Cc: Borislav Petkov, Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On 07/12/2013 04:53 PM, Ingo Molnar wrote: > * Borislav Petkov <bp@alien8.de> wrote: > >> On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: >>> Robert Richter and Boris Petkov are working on 'persistent events' >>> support for perf, which will eventually allow boot time profiling - >>> I'm not sure if the patches and the tooling support is ready enough >>> yet for your purposes. >> Nope, not yet but we're getting there. >> >>> Robert, Boris, the following workflow would be pretty intuitive: >>> >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB >> What does perf=boot mean? I assume boot tracing. > In this case it would mean boot profiling - i.e. a cycles hardware-PMU > event collecting into a perf trace buffer as usual. > > Essentially a 'perf record -a' work-alike, just one that gets activated as > early as practical, and which would allow the profiling of memory > initialization. > > Now, one extra complication here is that to be able to profile buddy > allocator this persistent event would have to work before the buddy > allocator is active :-/ So this sort of profiling would have to use > memblock_alloc(). Could perf=boot be used to sample the performance of memblock subsystem? I think the perf subsystem is too late to be initialized and monitor this. > > Just wanted to highlight this usecase, we might eventually want to support > it. > > [ Note that this is different from boot tracing of one or more trace > events - but it's a conceptually pretty close cousin. 
] > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: boot tracing 2013-07-15 1:38 ` Sam Ben @ 2013-07-23 8:18 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-07-23 8:18 UTC (permalink / raw) To: Sam Ben Cc: Borislav Petkov, Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra * Sam Ben <sam.bennn@gmail.com> wrote: > On 07/12/2013 04:53 PM, Ingo Molnar wrote: > >* Borislav Petkov <bp@alien8.de> wrote: > > > >>On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > >>>Robert Richter and Boris Petkov are working on 'persistent events' > >>>support for perf, which will eventually allow boot time profiling - > >>>I'm not sure if the patches and the tooling support is ready enough > >>>yet for your purposes. > >>Nope, not yet but we're getting there. > >> > >>>Robert, Boris, the following workflow would be pretty intuitive: > >>> > >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > >>What does perf=boot mean? I assume boot tracing. > >In this case it would mean boot profiling - i.e. a cycles hardware-PMU > >event collecting into a perf trace buffer as usual. > > > >Essentially a 'perf record -a' work-alike, just one that gets activated as > >early as practical, and which would allow the profiling of memory > >initialization. > > > >Now, one extra complication here is that to be able to profile buddy > >allocator this persistent event would have to work before the buddy > >allocator is active :-/ So this sort of profiling would have to use > >memblock_alloc(). > > Could perf=boot be used to sample the performance of memblock subsystem? > I think the perf subsystem is too late to be initialized and monitor > this. Yes, that would be a useful facility as well, for things with many events where printk is not necessarily practical. Any tracepoint could be utilized.
Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 8:27 ` Ingo Molnar @ 2013-07-12 9:19 ` Robert Richter -1 siblings, 0 replies; 153+ messages in thread From: Robert Richter @ 2013-07-12 9:19 UTC (permalink / raw) To: Ingo Molnar Cc: Robin Holt, Borislav Petkov, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On 12.07.13 10:27:56, Ingo Molnar wrote: > > * Robin Holt <holt@sgi.com> wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. > > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. The latest patch set is still this: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2 It requires the perf subsystem to be initialized first which might be too late, see perf_event_init() in start_kernel(). The patch set is currently also limited to tracepoints only. If this is sufficient for you, you might register persistent events with the function perf_add_persistent_event_by_id(), see mcheck_init_tp() how to do this. 
Later you can fetch all samples with: # perf record -e persistent/<tracepoint>/ sleep 1 > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. I am not sure about the event you want to set up here; if it is a tracepoint, the sample_period should always be 1. The buffer size parameter looks interesting; for now it is 512kB per cpu by default (as the perf tools set up the buffer). > > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. See the perf-record command above... > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] > > - from that point on this workflow joins the regular profiling workflow: > perf report, perf script et al can be used to analyze the resulting > boot profile. Ingo, thanks for outlining this workflow. We will look at how this could fit into the new version of persistent events we are currently working on. Thanks, -Robert ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 8:27 ` Ingo Molnar @ 2013-07-15 15:16 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-15 15:16 UTC (permalink / raw) To: Ingo Molnar Cc: Robin Holt, Borislav Petkov, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt <holt@sgi.com> wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. And WRONG! That is a 15x speedup in the freeing of memory at the free_all_bootmem point. That is _NOT_ the speedup from memmap_init_zone. I forgot to take that into account as Nate pointed out this morning in a hallway discussion. Before, on the 16TiB machine, memmap_init_zone took 1152 seconds. After, it took 50. If it were a straight 1/512th, we would have expected that 1152 seconds to drop to something more on the order of 2-3, so there is still significant room for improvement. Sorry for the confusion. > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of
Nate and I will be working on other things for the next few hours hoping there is a better answer to the first question we asked about there being a way to test a page other than comparing against all zeroes to see if it has been initialized. Thanks, Robin ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 8:27 ` Ingo Molnar @ 2013-07-16 8:55 ` Joonsoo Kim -1 siblings, 0 replies; 153+ messages in thread From: Joonsoo Kim @ 2013-07-16 8:55 UTC (permalink / raw) To: Ingo Molnar Cc: Robin Holt, Borislav Petkov, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt <holt@sgi.com> wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. > > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. > > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. 
> > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] Hello, Robert, Boris, Ingo. How about executing a perf in usermodehelper and collecting output in tmpfs? Using this approach, we can start a perf after rootfs initialization, because we need a perf binary at least. But we can use almost all of perf's functionality. If anyone has interest in this approach, I will send patches implementing this idea. Thanks. ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-16 8:55 ` Joonsoo Kim @ 2013-07-16 9:08 ` Borislav Petkov -1 siblings, 0 replies; 153+ messages in thread From: Borislav Petkov @ 2013-07-16 9:08 UTC (permalink / raw) To: Joonsoo Kim Cc: Ingo Molnar, Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > How about executing a perf in usermodehelper and collecting output > in tmpfs? Using this approach, we can start a perf after rootfs > initialization, What for if we can start logging to buffers much earlier? *Reading* from those buffers can be done much later, at our own leisure with full userspace up. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-16 9:08 ` Borislav Petkov @ 2013-07-23 8:20 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-07-23 8:20 UTC (permalink / raw) To: Borislav Petkov Cc: Joonsoo Kim, Robin Holt, Robert Richter, H. Peter Anvin, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman, Peter Zijlstra * Borislav Petkov <bp@alien8.de> wrote: > On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > > How about executing a perf in usermodehelper and collecting output > > in tmpfs? Using this approach, we can start a perf after rootfs > > initialization, > > What for if we can start logging to buffers much earlier? *Reading* > from those buffers can be done much later, at our own leisure with full > userspace up. Yeah, agreed, I think this needs to be more integrated into the kernel, so that people don't have to worry about "when does userspace start up the earliest" details. Fundamentally all perf really needs here is some memory to initialize and buffer into. Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 2:03 ` Robin Holt @ 2013-07-15 15:00 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-15 15:00 UTC (permalink / raw) To: H. Peter Anvin, Ingo Molnar Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. Ingo or HPA, Did I implement this wrong, or is there a way to get rid of the page flag which is not going to impact normal operation? I don't want to put too much more effort into this until I know we are stuck going this direction. Currently, the expand() function has a relatively expensive check against the 2MiB aligned pfn's struct page. I do not know of a way to eliminate that check against the other page, as the first reference we see for a page is in the middle of that 2MiB aligned range. To identify this as an area of concern, we had booted with a simulator, setting watch points on the struct page array region once the Uninitialized flag was set and maintaining that until it was cleared. Thanks, Robin ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 2:03 ` Robin Holt @ 2013-07-17 5:17 ` Sam Ben -1 siblings, 0 replies; 153+ messages in thread From: Sam Ben @ 2013-07-17 5:17 UTC (permalink / raw) To: Robin Holt Cc: H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman On 07/12/2013 10:03 AM, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. > > The heart of the problem as I see it comes from expand(). We nearly > always see a first reference to a struct page which is in the middle > of the 2MiB region. Due to that access, the unlikely() check that was > originally proposed really ends up referencing a different page entirely. > We actually did not introduce an unlikely and refactor the patches to > make that unlikely inside a static inline function. Also, given the > strong warning at the head of expand(), we did not feel experienced > enough to refactor it to make things always reference the 2MiB page > first. > > With this patch, we did boot a 16TiB machine. Without the patches, > the v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. 
I have no feel > for how the 1GiB chunk size will perform. How do you test how much time is spent in free_all_bootmem? > > I am on vacation for the next three days so I am sorry in advance for > my infrequent or non-existant responses. > > > Signed-off-by: Robin Holt <holt@sgi.com> > Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > To: "H. Peter Anvin" <hpa@zytor.com> > To: Ingo Molnar <mingo@kernel.org> > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > Cc: Linux MM <linux-mm@kvack.org> > Cc: Rob Landley <rob@landley.net> > Cc: Mike Travis <travis@sgi.com> > Cc: Daniel J Blueman <daniel@numascale-asia.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Greg KH <gregkh@linuxfoundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > Cc: Mel Gorman <mgorman@suse.de> > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-17 5:17 ` Sam Ben @ 2013-07-17 9:30 ` Robin Holt -1 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-17 9:30 UTC (permalink / raw) To: Sam Ben Cc: Robin Holt, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu, Mel Gorman On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > On 07/12/2013 10:03 AM, Robin Holt wrote: > >We have been working on this since we returned from shutdown and have > >something to discuss now. We restricted ourselves to 2MiB initialization > >to keep the patch set a little smaller and more clear. > > > >First, I think I want to propose getting rid of the page flag. If I knew > >of a concrete way to determine that the page has not been initialized, > >this patch series would look different. If there is no definitive > >way to determine that the struct page has been initialized aside from > >checking the entire page struct is zero, then I think I would suggest > >we change the page flag to indicate the page has been initialized. > > > >The heart of the problem as I see it comes from expand(). We nearly > >always see a first reference to a struct page which is in the middle > >of the 2MiB region. Due to that access, the unlikely() check that was > >originally proposed really ends up referencing a different page entirely. > >We actually did not introduce an unlikely and refactor the patches to > >make that unlikely inside a static inline function. Also, given the > >strong warning at the head of expand(), we did not feel experienced > >enough to refactor it to make things always reference the 2MiB page > >first. > > > >With this patch, we did boot a 16TiB machine. Without the patches, > >the v3.10 kernel with the same configuration took 407 seconds for > >free_all_bootmem. 
With the patches and operating on 2MiB pages instead > >of 1GiB, it took 26 seconds so performance was improved. I have no feel > >for how the 1GiB chunk size will perform. > > How do you test how much time is spent in free_all_bootmem? We had put a pr_emerg at the beginning and end of free_all_bootmem and then used a modified version of script(1) which records the time in µsec at the beginning of each line of output. Robin > > > > >I am on vacation for the next three days so I am sorry in advance for > >my infrequent or non-existant responses. > > > > > >Signed-off-by: Robin Holt <holt@sgi.com> > >Signed-off-by: Nate Zimmer <nzimmer@sgi.com> > >To: "H. Peter Anvin" <hpa@zytor.com> > >To: Ingo Molnar <mingo@kernel.org> > >Cc: Linux Kernel <linux-kernel@vger.kernel.org> > >Cc: Linux MM <linux-mm@kvack.org> > >Cc: Rob Landley <rob@landley.net> > >Cc: Mike Travis <travis@sgi.com> > >Cc: Daniel J Blueman <daniel@numascale-asia.com> > >Cc: Andrew Morton <akpm@linux-foundation.org> > >Cc: Greg KH <gregkh@linuxfoundation.org> > >Cc: Yinghai Lu <yinghai@kernel.org> > >Cc: Mel Gorman <mgorman@suse.de> > >-- > >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >the body of a message to majordomo@vger.kernel.org > >More majordomo info at http://vger.kernel.org/majordomo-info.html > >Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-17 9:30 ` Robin Holt (?) @ 2013-07-19 23:51 ` Yinghai Lu 2013-07-22 6:13 ` Robin Holt -1 siblings, 1 reply; 153+ messages in thread From: Yinghai Lu @ 2013-07-19 23:51 UTC (permalink / raw) To: Robin Holt Cc: Sam Ben, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 1102 bytes --] On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt <holt@sgi.com> wrote: > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: >> >With this patch, we did boot a 16TiB machine. Without the patches, >> >the v3.10 kernel with the same configuration took 407 seconds for >> >free_all_bootmem. With the patches and operating on 2MiB pages instead >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel >> >for how the 1GiB chunk size will perform. >> >> How do you test how much time is spent in free_all_bootmem? > > We had put a pr_emerg at the beginning and end of free_all_bootmem and > then used a modified version of script(1) which records the time in µsec > at the beginning of each line of output. Using the two attached patches, I found a 3TiB system takes about 100s before slub is ready, in roughly three portions: 1. sparse vmemmap buffer allocation — it goes through the bootmem wrapper, so clearing those struct page areas takes about 30s; 2. memmap_init_zone: about 25s; 3. mem_init/free_all_bootmem: about 30s. So I still wonder why 16TiB would need hours. Also, your patches look like they only address 2 and 3. 
Yinghai [-- Attachment #2: printk_time_tsc_0.patch --] [-- Type: application/octet-stream, Size: 2624 bytes --] Subject: [PATCH] printk_time: prepare stub for using other than cpu_clock -v2: refresh to v3.10 Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> --- include/linux/printk.h | 3 +++ init/main.c | 1 + kernel/printk.c | 17 +++++++++++++++-- 3 files changed, 19 insertions(+), 2 deletions(-) Index: linux-2.6/init/main.c =================================================================== --- linux-2.6.orig/init/main.c +++ linux-2.6/init/main.c @@ -601,6 +601,7 @@ asmlinkage void __init start_kernel(void late_time_init(); sched_clock_init(); calibrate_delay(); + set_printk_time_clock(local_clock); pidmap_init(); anon_vma_init(); #ifdef CONFIG_X86 Index: linux-2.6/kernel/printk.c =================================================================== --- linux-2.6.orig/kernel/printk.c +++ linux-2.6/kernel/printk.c @@ -217,6 +217,19 @@ struct log { static DEFINE_RAW_SPINLOCK(logbuf_lock); #ifdef CONFIG_PRINTK + +static u64 default_printk_time_clock(void) +{ + return 0; +} + +static u64 (*printk_time_clock)(void) = default_printk_time_clock; + +void set_printk_time_clock(u64 (*fn)(void)) +{ + printk_time_clock = fn; +} + DECLARE_WAIT_QUEUE_HEAD(log_wait); /* the next printk record to read by syslog(READ) or /proc/kmsg */ static u64 syslog_seq; @@ -354,7 +367,7 @@ static void log_store(int facility, int if (ts_nsec > 0) msg->ts_nsec = ts_nsec; else - msg->ts_nsec = local_clock(); + msg->ts_nsec = printk_time_clock(); memset(log_dict(msg) + dict_len, 0, pad_len); msg->len = sizeof(struct log) + text_len + dict_len + pad_len; @@ -1457,7 +1470,7 @@ static bool cont_add(int facility, int l cont.facility = facility; cont.level = level; cont.owner = current; - cont.ts_nsec = local_clock(); + cont.ts_nsec = printk_time_clock(); cont.flags = 0; cont.cons = 0; cont.flushed = false; Index: linux-2.6/include/linux/printk.h 
=================================================================== --- linux-2.6.orig/include/linux/printk.h +++ linux-2.6/include/linux/printk.h @@ -107,6 +107,7 @@ void early_printk(const char *s, ...) { #endif #ifdef CONFIG_PRINTK +extern void set_printk_time_clock(u64 (*fn)(void)); asmlinkage __printf(5, 0) int vprintk_emit(int facility, int level, const char *dict, size_t dictlen, @@ -150,6 +151,8 @@ void dump_stack_set_arch_desc(const char void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); #else +static inline void set_printk_time_clock(u64 (*fn)(void)) { } + static inline __printf(1, 0) int vprintk(const char *s, va_list args) { [-- Attachment #3: printk_time_tsc_1.patch --] [-- Type: application/octet-stream, Size: 1201 bytes --] Subject: [PATCH] x86: printk_time to use tsc before cpu_clock is ready so can get tsc value on printk need to apply after [PATCH] printk_time: prepare stub for using other than cpu_clock -v2: refresh on v3.10 Signed-off-by: Yinghai Lu <yinghai@kernel.org> --- arch/x86/kernel/cpu/common.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) Index: linux-2.6/arch/x86/kernel/cpu/common.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/cpu/common.c +++ linux-2.6/arch/x86/kernel/cpu/common.c @@ -737,6 +737,25 @@ static void __init early_identify_cpu(st setup_force_cpu_cap(X86_FEATURE_ALWAYS); } +static u64 __initdata tsc_clock_base; + +static __init u64 tsc_clock_offset(void) +{ + unsigned long long t; + + rdtscll(t); + + return t - tsc_clock_base; +} + +static __init void early_printk_time_init(void) +{ + if (cpu_has_tsc) { + rdtscll(tsc_clock_base); + set_printk_time_clock(tsc_clock_offset); + } +} + void __init early_cpu_init(void) { const struct cpu_dev *const *cdev; @@ -761,6 +780,8 @@ void __init early_cpu_init(void) } early_identify_cpu(&boot_cpu_data); + + early_printk_time_init(); } /* ^ permalink raw reply [flat|nested] 
153+ messages in thread
* Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-19 23:51 ` Yinghai Lu @ 2013-07-22 6:13 ` Robin Holt 0 siblings, 0 replies; 153+ messages in thread From: Robin Holt @ 2013-07-22 6:13 UTC (permalink / raw) To: Yinghai Lu Cc: Robin Holt, Sam Ben, H. Peter Anvin, Ingo Molnar, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Mel Gorman On Fri, Jul 19, 2013 at 04:51:49PM -0700, Yinghai Lu wrote: > On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt <holt@sgi.com> wrote: > > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > >> >With this patch, we did boot a 16TiB machine. Without the patches, > >> >the v3.10 kernel with the same configuration took 407 seconds for > >> >free_all_bootmem. With the patches and operating on 2MiB pages instead > >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel > >> >for how the 1GiB chunk size will perform. > >> > >> How to test how much time spend on free_all_bootmem? > > > > We had put a pr_emerg at the beginning and end of free_all_bootmem and > > then used a modified version of script which record the time in uSecs > > at the beginning of each line of output. > > used two patches, found 3TiB system will take 100s before slub is ready. > > about three portions: > 1. sparse vmemap buf allocation, it is with bootmem wrapper, so clear those > struct page area take about 30s. > 2. memmap_init_zone: take about 25s > 3. mem_init/free_all_bootmem about 30s. > > so still wonder why 16TiB will need hours. I don't know where you got the figure of hours for memory initialization. That is likely for a 32TiB boot and includes the entire boot, not just getting the memory allocator initialized. For a 16 TiB boot: 1) 344 2) 1151 3) 407 I hope that illustrates why we chose to address the memmap_init_zone first which had the nice side effect of also impacting the free_all_bootmem slowdown. 
With these patches, those numbers are currently: 1) 344 2) 49 3) 26 > also your patches looks like only address 2 and 3. Right, but I thought that was the normal way to do things. Address one thing at a time and work toward a better kernel. I don't see a relationship between the work we are doing here and the sparse vmemmap buffer allocation. Have I missed something? Did you happen to time a boot with these patches applied to see how long it took and how much impact they had on a smaller config? Robin ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-07-12 2:03 ` Robin Holt @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman

We are still restricting ourselves to 2MiB initialization to keep the patch set a little smaller and more clear.

We are still struggling with expand(). Nearly always, the first reference to a struct page is in the middle of the 2MiB region, and we were unable to find a good solution. Also, given the strong warning at the head of expand(), we did not feel experienced enough to refactor it to make things always reference the 2MiB page first. The only other fastpath impact left is the expansion in prep_new_page.

With this patch set, we did boot a 16TiB machine. The two main areas that benefit are free_all_bootmem and memmap_init_zone. Without the patches they took 407 seconds and 1151 seconds respectively; with the patches they took 220 and 49 seconds respectively, a total savings of 1289 seconds (about 21 minutes). These times were acquired using a modified version of a script which records the time in usecs at the beginning of each line of output.

The previous patch set was faster through free_all_bootmem, but I wanted to include Yinghai's suggestion. Hopefully I didn't miss the mark too much with that patch, and yes, I do still need to optimize it. I know there are still some rough parts, but I wanted to check in with the full patch set.

Nathan Zimmer (1):
  Only set page reserved in the memblock region

Robin Holt (4):
  memblock: Introduce a for_each_reserved_mem_region iterator.
  Have __free_pages_memory() free in larger chunks.
  Move page initialization into a separate function.
  Sparse initialization of struct page array.
include/linux/memblock.h | 18 +++++ include/linux/mm.h | 2 + include/linux/page-flags.h | 5 +- mm/memblock.c | 32 ++++++++ mm/mm_init.c | 2 +- mm/nobootmem.c | 28 +++---- mm/page_alloc.c | 194 ++++++++++++++++++++++++++++++++++++--------- 7 files changed, 225 insertions(+), 56 deletions(-) -- 1.8.2.1 ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC v2 1/5] memblock: Introduce a for_each_reserved_mem_region iterator. 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> As part of initializing struct page's in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. This helper function will be used for that expansion. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index f388203..e99bbd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start, i != (u64)ULLONG_MAX; \ __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid)) +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + +/** + * for_earch_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * 
Walks over reserved areas of memblock in. Available as soon as memblock + * is initialized. + */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid); diff --git a/mm/memblock.c b/mm/memblock.c index c5fad93..0d7d6e7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next_free_mem_range - next function for for_each_free_mem_range() * @idx: pointer to u64 loop variable * @nid: nid: node selector, %MAX_NUMNODES for all nodes -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* [RFC v2 2/5] Have __free_pages_memory() free in larger chunks. 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 
0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* [RFC v2 3/5] Move page initialization into a separate function. 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void 
__init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* [RFC v2 3/5] Move page initialization into a separate function. @ 2013-08-02 17:44 ` Nathan Zimmer 0 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, 
unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 153+ messages in thread
* [RFC v2 4/5] Only set page reserved in the memblock region 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman Currently, each page struct is set as reserved upon initialization. This changes to starting with the reserved bit clear and then only setting the bit in the reserved region. I could restructure a bit to eliminate the performance hit, but I wanted to make sure I am on track first. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 16 ++++++++++++---- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed.
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..382223e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block movable so that blocks are reserved for @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + for (; start < end; start++) + if (pfn_valid(start)) + SetPageReserved(pfn_to_page(start)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, zone, nid, 1); } } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [RFC v2 4/5] Only set page reserved in the memblock region 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-03 20:04 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-03 20:04 UTC (permalink / raw) To: Nathan Zimmer Cc: hpa, mingo, linux-kernel, linux-mm, holt, rob, travis, daniel, akpm, gregkh, yinghai, mgorman On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote: > Currently we when we initialze each page struct is set as reserved upon > initialization. This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. > > I could restruture a bit to eliminate the perform hit. But I wanted to make > sure I am on track first. > > Signed-off-by: Robin Holt <holt@sgi.com> > Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> > To: "H. Peter Anvin" <hpa@zytor.com> > To: Ingo Molnar <mingo@kernel.org> > Cc: Linux Kernel <linux-kernel@vger.kernel.org> > Cc: Linux MM <linux-mm@kvack.org> > Cc: Rob Landley <rob@landley.net> > Cc: Mike Travis <travis@sgi.com> > Cc: Daniel J Blueman <daniel@numascale-asia.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Greg KH <gregkh@linuxfoundation.org> > Cc: Yinghai Lu <yinghai@kernel.org> > Cc: Mel Gorman <mgorman@suse.de> > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 16 ++++++++++++---- > 3 files changed, 17 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. 
*/ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index df3ec13..382223e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > +static void __init_single_page(unsigned long pfn, unsigned long zone, > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > set_page_links(page, zone, nid, pfn); > mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - SetPageReserved(page); > + set_page_count(page, page_count); > + ClearPageReserved(page); > > /* > * Mark the block movable so that blocks are reserved for > @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > if (!early_pfn_in_nid(pfn, nid)) > continue; > } > - __init_single_page(pfn, zone, nid); > + 
__init_single_page(pfn, zone, nid, 1); > } > } > > -- > 1.8.2.1 > Actually I believe reserve_bootmem_region is wrong. I am passing in phys_addr_t values, not pfns. It should be: void reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); for (; start_pfn < end_pfn; start_pfn++) if (pfn_valid(start_pfn)) SetPageReserved(pfn_to_page(start_pfn)); } That also brings the timings back in line with the previous patch set. Nate ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC v2 5/5] Sparse initialization of struct page array. 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-02 17:44 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-02 17:44 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 120 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 119 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 382223e..c2fd03a0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,8 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if 
(PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start < end; start++) if (pfn_valid(start)) SetPageReserved(pfn_to_page(start)); @@ -755,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -799,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -907,6 +965,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2m(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; } @@ -989,6 +1051,8 @@ int move_freepages(struct zone 
*zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), + page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3902,6 +3966,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2m(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3991,6 +4058,34 @@ static void setup_zone_migrate_reserve(struct zone *zone) } /* + * This function tells us if we have many pfns we have available. + * Available meaning valid and on the specified node. + * It return either size if that many pfns are available, 1 otherwise + */ +static int __meminit pfn_range_init_avail(unsigned long pfn, + unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4006,19 +4101,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. 
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6237,6 +6346,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
*zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), + page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3902,6 +3966,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2m(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3991,6 +4058,34 @@ static void setup_zone_migrate_reserve(struct zone *zone) } /* + * This function tells us if we have many pfns we have available. + * Available meaning valid and on the specified node. + * It return either size if that many pfns are available, 1 otherwise + */ +static int __meminit pfn_range_init_avail(unsigned long pfn, + unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4006,19 +4101,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. 
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6237,6 +6346,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 153+ messages in thread
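The lazy-initialization scheme the patch implements — fully initialize only each 2MiB chunk's head page at boot, and expand the rest of the chunk on first use — can be modeled in plain C. This is a hypothetical userspace sketch, not kernel code: the `state` array stands in for the struct page array, `CHUNK` for PTRS_PER_PMD, and the byte values for the PG_uninitialized_2m flag.

```c
#include <assert.h>

#define CHUNK  512   /* pages per 2MiB chunk with 4KiB pages (PTRS_PER_PMD) */
#define NPAGES 4096

enum { HOLE = 0, UNINIT_HEAD = 1, INITIALIZED = 2 };

static unsigned char state[NPAGES];  /* stand-in for the struct page array */

/* memmap_init_zone() analogue: mark only each chunk's head page at boot */
static void sparse_init(unsigned long start, unsigned long end)
{
    for (unsigned long pfn = start; pfn < end; pfn += CHUNK)
        state[pfn] = UNINIT_HEAD;
}

/* expand_page_initialization() analogue: init the head and all tail pages */
static void expand_chunk(unsigned long head)
{
    for (unsigned long pfn = head; pfn < head + CHUNK; pfn++)
        state[pfn] = INITIALIZED;
}

/* ensure_pages_are_initialized() analogue: visit every chunk overlapping
 * [start, end] and expand the ones still marked lazy */
static void ensure_initialized(unsigned long start, unsigned long end)
{
    unsigned long head = start & ~(unsigned long)(CHUNK - 1);
    unsigned long last = (end & ~(unsigned long)(CHUNK - 1)) + CHUNK;

    for (; head < last && head < NPAGES; head += CHUNK)
        if (state[head] == UNINIT_HEAD)
            expand_chunk(head);
}
```

Touching any single page in the middle of a chunk (say pfn 600) initializes the whole surrounding chunk, while neighboring chunks stay lazy — which is the behavior the patch relies on.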
* Re: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-05 9:58 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-08-05 9:58 UTC (permalink / raw) To: Nathan Zimmer Cc: hpa, linux-kernel, linux-mm, holt, rob, travis, daniel, akpm, gregkh, yinghai, mgorman * Nathan Zimmer <nzimmer@sgi.com> wrote: > We are still restricting ourselves ourselves to 2MiB initialization to > keep the patch set a little smaller and more clear. > > We are still struggling with the expand(). Nearly always the first > reference to a struct page which is in the middle of the 2MiB region. > We were unable to find a good solution. Also, given the strong warning > at the head of expand(), we did not feel experienced enough to refactor > it to make things always reference the 2MiB page first. The only other > fastpath impact left is the expansion in prep_new_page. I suppose it's about this chunk: @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); where ensure_page_is_initialized() does, in essence: + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if (PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } where aligned_start_pfn is 2MB rounded down. which looks like an expensive loop to execute for a single page: there are 512 pages in a 2MB range, so on average this iterates 256 times, for every single page of allocation. Right? I might be missing something, but why not just represent the initialization state in 2MB chunks: it is either fully uninitialized, or fully initialized. If any page in the 'middle' gets allocated, all page heads have to get initialized. 
That should make the fast path test fairly cheap, basically just PageUninitialized2m(page) has to be tested - and that will fail in the post-initialization fastpath. Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
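Ingo's point is that the chunk head is computable from any pfn with a single mask, so the fast path can be one flag test on the head page rather than a loop over the 2MiB range. A minimal sketch of that lookup (assuming 512 pages per 2MiB chunk, i.e. 4KiB pages):

```c
#include <assert.h>

#define CHUNK 512  /* pages per 2MiB chunk (PTRS_PER_PMD with 4KiB pages) */

/* O(1) fast-path lookup: any pfn maps to its 2MiB head page with a mask,
 * so a single flag check on the head decides the chunk's init state */
static inline unsigned long chunk_head(unsigned long pfn)
{
    return pfn & ~(unsigned long)(CHUNK - 1);
}
```

After initialization, `PageUninitialized2m(head)` fails immediately, so the post-init fast path pays only the one test.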
* [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-02 17:44 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman We are still restricting ourselves to 2MiB initialization. This was initially to keep the patch set a little smaller and clearer. However, given how well it is currently performing, I don't see how much better it could be with 2GiB chunks. As for extra overhead: we incur an extra function call to ensure_page_is_initialized, but that is only really expensive when we find uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD pages. To get a better feel for this we ran two quick tests. The first was simply timing some memhogs. This showed no measurable difference, so we made a more granular test. We spawned N threads, start a timer, each thread mallocs totalmem/N, then each thread writes to its memory to induce page faults, stop the timer. In this case each thread had just under 4GB of RAM to fault in. This showed a measurable difference in the page faulting. The baseline took an average of 2.68 seconds; the new version took an average of 2.75 seconds, which is 0.07s (2.6%) slower. Are there some other tests I should run? With this patch set, we did boot a 16TiB machine. The two main areas that benefit from this patch are free_all_bootmem and memmap_init_zone. Without the patches they took 407 seconds and 1151 seconds respectively; with the patches they took 13 and 39 seconds respectively, a total savings of 1506 seconds (25 minutes). These times were acquired using a modified version of script(1) which records the time in microseconds at the beginning of each line of output. Overall I am fairly happy with the patch set at the moment. 
It improves boot times without noticeable runtime overhead. I am, as always, open to suggestions. v2: incorporated Yinghai's suggestion to not set the reserved bit until later. v3: Corrected my first attempt at moving the reserved bit; __expand_page_initialization should only be called by ensure_pages_are_initialized. Nathan Zimmer (1): Only set page reserved in the memblock region Robin Holt (4): memblock: Introduce a for_each_reserved_mem_region iterator. Have __free_pages_memory() free in larger chunks. Move page initialization into a separate function. Sparse initialization of struct page array. include/linux/memblock.h | 18 +++++ include/linux/mm.h | 2 + include/linux/page-flags.h | 5 +- mm/memblock.c | 32 ++++++++ mm/mm_init.c | 2 +- mm/nobootmem.c | 28 +++---- mm/page_alloc.c | 198 ++++++++++++++++++++++++++++++++++++--------- 7 files changed, 229 insertions(+), 56 deletions(-) -- 1.8.2.1 ^ permalink raw reply [flat|nested] 153+ messages in thread
* [RFC v3 1/5] memblock: Introduce a for_each_reserved_mem_region iterator. 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> As part of initializing struct page's in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. This helper function will be used for that expansion. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index f388203..e99bbd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start, i != (u64)ULLONG_MAX; \ __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid)) +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + +/** + * for_earch_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * 
Walks over reserved areas of memblock in. Available as soon as memblock + * is initialized. + */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid); diff --git a/mm/memblock.c b/mm/memblock.c index c5fad93..0d7d6e7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next_free_mem_range - next function for for_each_free_mem_range() * @idx: pointer to u64 loop variable * @nid: nid: node selector, %MAX_NUMNODES for all nodes -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
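The next-function-plus-macro iterator pattern used by for_each_reserved_mem_region can be demonstrated in userspace. A hypothetical sketch, assuming a fixed `reserved` array in place of memblock.reserved; the inclusive out_end (base + size - 1) mirrors the patch:

```c
#include <assert.h>

/* hypothetical regions standing in for memblock.reserved */
struct region { unsigned long base, size; };

static struct region reserved[] = { { 0x1000, 0x2000 }, { 0x10000, 0x800 } };
static const unsigned long nregions = sizeof(reserved) / sizeof(reserved[0]);

/* next-function: report region *idx (inclusive end, as in the patch) and
 * advance, or set *idx to -1 to signal the end of iteration */
static void next_reserved(unsigned long *idx,
                          unsigned long *out_start, unsigned long *out_end)
{
    if (*idx < nregions) {
        if (out_start)
            *out_start = reserved[*idx].base;
        if (out_end)
            *out_end = reserved[*idx].base + reserved[*idx].size - 1;
        (*idx)++;
        return;
    }
    *idx = (unsigned long)-1;  /* signal end of iteration */
}

#define for_each_reserved(i, ps, pe)               \
    for ((i) = 0, next_reserved(&(i), (ps), (pe)); \
         (i) != (unsigned long)-1;                 \
         next_reserved(&(i), (ps), (pe)))

/* example consumer: total bytes covered by all reserved regions */
static unsigned long total_reserved(void)
{
    unsigned long i, s, e, sum = 0;

    for_each_reserved(i, &s, &e)
        sum += e - s + 1;
    return sum;
}
```

The macro calls the next-function once before the first condition check and once per step, so the loop body only ever sees valid regions, exactly like the kernel's other memblock iterators.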
* [RFC v3 2/5] Have __free_pages_memory() free in larger chunks. 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 
0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
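The new __free_pages_memory() loop greedily picks, at each step, the largest power-of-two block that `start` is aligned to and that still fits before `end`. A userspace sketch of that order selection (using `__builtin_ctzl` as a stand-in for the kernel's `__ffs`, and guarding the start==0 case where a count-trailing-zeros of zero is undefined):

```c
#include <assert.h>

#define MAX_ORDER 11  /* buddy limit: largest block is 2^(MAX_ORDER-1) pages */

/* order chosen at each step: largest power of two that start is aligned
 * to, capped at MAX_ORDER - 1, shrunk until the block fits before end */
static int pick_order(unsigned long start, unsigned long end)
{
    int order = start ? __builtin_ctzl(start) : MAX_ORDER - 1;

    if (order > MAX_ORDER - 1)
        order = MAX_ORDER - 1;
    while (start + (1UL << order) > end)
        order--;
    return order;
}

/* walk [start, end) in maximal aligned blocks; return how many calls to
 * __free_pages_bootmem() the rewritten loop would make */
static int count_blocks(unsigned long start, unsigned long end)
{
    int blocks = 0;

    while (start < end) {
        start += 1UL << pick_order(start, end);
        blocks++;
    }
    return blocks;
}
```

An aligned 4MiB range collapses to a single max-order free, while an unaligned range like pfns [3, 64) decomposes into 1+4+8+16+32 pages — far fewer calls than the old one-page-or-BITS_PER_LONG scheme.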
* [RFC v3 3/5] Move page initialization into a separate function. 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void 
__init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
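One subtle condition carried into __init_single_page() is the pageblock guard: the migratetype is set only for the first pfn of a pageblock whose head actually lies inside the zone's pfn range, since the memmap can cover alignment pfns outside the zone. A small sketch of just that predicate (PAGEBLOCK_NR assumes 2MiB pageblocks with 4KiB pages):

```c
#include <assert.h>

#define PAGEBLOCK_NR 512  /* pageblock_nr_pages with 2MiB pageblocks */

/* the guard from __init_single_page(): set the migratetype only on the
 * first pfn of a pageblock, and only if that pfn is inside the zone */
static int should_set_migratetype(unsigned long pfn,
                                  unsigned long zone_start,
                                  unsigned long zone_end)
{
    return zone_start <= pfn && pfn < zone_end &&
           !(pfn & (PAGEBLOCK_NR - 1));
}
```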
* [RFC v3 3/5] Move page initialization into a separate function. @ 2013-08-12 21:54 ` Nathan Zimmer 0 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, 
unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 153+ messages in thread
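The zone-bounds-plus-alignment guard that the patch moves into __init_single_page() is easy to get wrong, so here is a minimal userspace model of just that predicate. This is an illustrative sketch, not kernel code: the function name is made up, and pageblock_nr_pages is assumed to be 512 (a 2MiB pageblock of 4KiB pages).

```c
#include <assert.h>

/* Hypothetical model of the check in __init_single_page(): set the
 * migratetype only for a pfn that (a) lies inside the zone and
 * (b) is the first page of its pageblock. 512 is an assumed value
 * of pageblock_nr_pages (2MiB blocks of 4KiB pages). */
#define PAGEBLOCK_NR_PAGES 512UL

static int should_set_migratetype(unsigned long zone_start_pfn,
                                  unsigned long zone_end_pfn,
                                  unsigned long pfn)
{
    return zone_start_pfn <= pfn &&
           pfn < zone_end_pfn &&
           !(pfn & (PAGEBLOCK_NR_PAGES - 1));
}
```

As the comment in the patch notes, the zone bounds check matters because the memmap may cover pfns outside the zone for alignment reasons, and those must never reach set_pageblock_migratetype().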
* [RFC v3 4/5] Only set page reserved in the memblock region 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman Currently, each page struct is set as reserved when it is initialized. This changes to starting with the reserved bit clear and then only setting the bit for pages in memblock reserved regions. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 19 +++++++++++++++---- 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. 
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..227bd39 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block movable so that blocks are reserved for @@ -736,6 +737,16 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + for (; start_pfn < end_pfn; start_pfn++) + if (pfn_valid(start_pfn)) + SetPageReserved(pfn_to_page(start_pfn)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4021,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, 
zone, nid, 1); } } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
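reserve_bootmem_region() above converts a byte range into a pfn range by rounding the start down and the end up, so a reservation that only partially covers a page still marks that whole page reserved. A small userspace sketch of that rounding, assuming 4KiB pages; the PFN_DOWN/PFN_UP definitions mirror the kernel macros but are reproduced here for illustration:

```c
#include <assert.h>

/* Userspace sketch of the rounding used by reserve_bootmem_region(),
 * assuming 4 KiB pages. PFN_DOWN/PFN_UP mirror the kernel macros. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* Pages touched by the byte range [start, end): any page that the
 * range overlaps at all ends up marked reserved. */
static unsigned long pages_reserved(unsigned long start, unsigned long end)
{
    return PFN_UP(end) - PFN_DOWN(start);
}
```

For example, a reservation covering bytes 0x1800 through 0x2800 spans only 4KiB of data but straddles two pages, so both pages get PG_reserved set.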
* [RFC v3 5/5] Sparse initialization of struct page array. 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-12 21:54 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-12 21:54 UTC (permalink / raw) To: hpa, mingo Cc: linux-kernel, linux-mm, holt, nzimmer, rob, travis, daniel, akpm, gregkh, yinghai, mgorman From: Robin Holt <holt@sgi.com> During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> To: "H. Peter Anvin" <hpa@zytor.com> To: Ingo Molnar <mingo@kernel.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Cc: Linux MM <linux-mm@kvack.org> Cc: Rob Landley <rob@landley.net> Cc: Mike Travis <travis@sgi.com> Cc: Daniel J Blueman <daniel@numascale-asia.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 115 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 227bd39..6c35a58 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,11 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void __expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if 
(PageUninitialized2m(page)) + __expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start_pfn < end_pfn; start_pfn++) if (pfn_valid(start_pfn)) SetPageReserved(pfn_to_page(start_pfn)); @@ -758,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -802,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -863,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(&page[size]); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -908,8 +963,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) { int i; + ensure_pages_are_initialized(page_to_pfn(page), + page_to_pfn(page+(1<<order)-1)); for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + if (unlikely(check_new_page(p))) return 1; } @@ 
-992,6 +1050,8 @@ int move_freepages(struct zone *zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), + page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3905,6 +3965,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2m(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3994,6 +4057,34 @@ static void setup_zone_migrate_reserve(struct zone *zone) } /* + * This function tells us if we have many pfns we have available. + * Available meaning valid and on the specified node. + * It return either size if that many pfns are available, 1 otherwise + */ +static int __meminit pfn_range_init_avail(unsigned long pfn, + unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4009,19 +4100,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. 
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6240,6 +6345,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 153+ messages in thread
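The 2MiB-block arithmetic that ensure_pages_are_initialized() relies on can be modeled in isolation. A userspace sketch, assuming x86_64 where PTRS_PER_PMD == 512 (512 x 4KiB pages == 2MiB); the helper names are invented for the example:

```c
#include <assert.h>

/* Sketch of the rounding in ensure_pages_are_initialized(), assuming
 * PTRS_PER_PMD == 512 as on x86_64. */
#define PTRS_PER_PMD 512UL

/* Head pfn of the 2MiB block containing pfn, i.e. the page that may
 * carry PG_uninitialized_2m for the whole block. */
static unsigned long block_head_pfn(unsigned long pfn)
{
    return pfn & ~(PTRS_PER_PMD - 1);
}

/* First head pfn strictly beyond end_pfn's block, mirroring the
 * patch's "aligned_end_pfn += PTRS_PER_PMD" step. */
static unsigned long block_end_pfn(unsigned long end_pfn)
{
    return (end_pfn & ~(PTRS_PER_PMD - 1)) + PTRS_PER_PMD;
}
```

This is why the series only needs to check a flag once per PTRS_PER_PMD pages: any pfn can be mapped to its block head with one mask operation, and only the head's flag is consulted.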
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-13 10:58 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-08-13 10:58 UTC (permalink / raw) To: Nathan Zimmer Cc: hpa, linux-kernel, linux-mm, holt, rob, travis, daniel, akpm, gregkh, yinghai, mgorman, Linus Torvalds

* Nathan Zimmer <nzimmer@sgi.com> wrote:

> We are still restricting ourselves to 2MiB initialization. This was
> initially to keep the patch set a little smaller and more clear.
> However, given how well it is currently performing, I don't see how
> much better it could be with 2GiB chunks.
>
> As far as extra overhead: we incur an extra function call to
> ensure_page_is_initialized, but that is only really expensive when we
> find uninitialized pages; otherwise it is a flag check once every
> PTRS_PER_PMD. [...]

Mind expanding on this in more detail? The main fastpath overhead we are really interested in is the 'memory is already fully initialized and we reallocate a second time' case - i.e. the *second* (and subsequent), post-initialization allocation of any page range. Those allocations are the ones that matter most: they will occur again and again, for the lifetime of the booted up system.

What extra overhead is there in that case? Only a flag check that is merged into an existing flag check (in free_pages_check()) and thus is essentially zero overhead? Or is it more involved - if yes, why?

One would naively think that nothing but the flags check is needed in this case: if all 512 pages in an aligned 2MB block are fully initialized, and marked as initialized in all the 512 page heads, then no other runtime check will be needed in the future.

Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
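Ingo's "merged into an existing flag check" point can be made concrete: free_pages_check() already tests a whole set of flags against one mask (PAGE_FLAGS_CHECK_AT_FREE), so folding PG_uninitialized_2m into that mask adds no extra branch to the hot path. A toy model of that pattern; the bit positions and names here are made up for illustration, not the kernel's:

```c
#include <assert.h>

/* Toy model of free_pages_check()'s mask test. Bit positions are
 * illustrative; the point is that checking N flags at once is still
 * a single AND + compare, no matter how many bits the mask covers. */
enum toy_pageflags {
    TPG_lru,
    TPG_reserved,
    TPG_uninitialized_2m,  /* the flag added by this series */
};

#define TOY_FLAGS_CHECK_AT_FREE \
    ((1UL << TPG_lru) | (1UL << TPG_reserved) | \
     (1UL << TPG_uninitialized_2m))

static int toy_page_flags_bad(unsigned long flags)
{
    return (flags & TOY_FLAGS_CHECK_AT_FREE) != 0;
}
```

Under this model the post-initialization fastpath cost of the new flag is indeed zero: the same AND instruction now rejects one more bad bit.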
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-13 17:09 ` Linus Torvalds -1 siblings, 0 replies; 153+ messages in thread From: Linus Torvalds @ 2013-08-13 17:09 UTC (permalink / raw) To: Nathan Zimmer Cc: Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer <nzimmer@sgi.com> wrote: > > As far as extra overhead. We incur an extra function call to > ensure_page_is_initialized but that is only really expensive when we find > uninitialized pages, otherwise it is a flag check once every PTRS_PER_PMD. > To get a better feel for this we ran two quick tests. Sorry for coming into this late and for this last version of the patch, but I have to say that I'd *much* rather see this delayed initialization using another data structure than hooking into the basic page allocation ones. I understand that you want to do delayed initialization on some TB+ memory machines, but what I don't understand is why it has to be done when the pages have already been added to the memory management free list. Could we not do this much more simply: make early boot insert the first few gigs of memory (initialized) synchronously into the free lists, and then have a background thread that goes through the rest? That way the MM layer would never see the uninitialized pages. And I bet that *nobody* cares if you "only" have a few gigs of ram during the first few minutes of boot, and you mysteriously end up getting more and more memory for a while until all the RAM has been initialized. IOW, just don't call __free_pages_bootmem() on all the pages all at once. If we have to remove a few __init markers to be able to do some of it later, does anybody really care?
I really really dislike this "let's check if memory is initialized at runtime" approach. Linus ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:09 ` Linus Torvalds @ 2013-08-13 17:23 ` H. Peter Anvin -1 siblings, 0 replies; 153+ messages in thread From: H. Peter Anvin @ 2013-08-13 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Nathan Zimmer, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On 08/13/2013 10:09 AM, Linus Torvalds wrote: > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > It does seem to be getting messy, doesn't it... The one potential serious concern is if that will end up mucking with NUMA affinity in a way that has lasting effects past boot. -hpa ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:09 ` Linus Torvalds @ 2013-08-13 17:33 ` Mike Travis -1 siblings, 0 replies; 153+ messages in thread From: Mike Travis @ 2013-08-13 17:33 UTC (permalink / raw) To: Linus Torvalds Cc: Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On 8/13/2013 10:09 AM, Linus Torvalds wrote: > On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer <nzimmer@sgi.com> wrote: >> >> As far as extra overhead. We incur an extra function call to >> ensure_page_is_initialized but that is only really expensive when we find >> uninitialized pages, otherwise it is a flag check once every PTRS_PER_PMD. >> To get a better feel for this we ran two quick tests. > > Sorry for coming into this late and for this last version of the > patch, but I have to say that I'd *much* rather see this delayed > initialization using another data structure than hooking into the > basic page allocation ones.. > > I understand that you want to do delayed initialization on some TB+ > memory machines, but what I don't understand is why it has to be done > when the pages have already been added to the memory management free > list. > > Could we not do this much simpler: make the early boot insert the > first few gigs of memory (initialized) synchronously into the free > lists, and then have a background thread that goes through the rest? > > That way the MM layer would never see the uninitialized pages. > > And I bet that *nobody* cares if you "only" have a few gigs of ram > during the first few minutes of boot, and you mysteriously end up > getting more and more memory for a while until all the RAM has been > initialized. > > IOW, just don't call __free_pages_bootmem() on all the pages all at > once. If we have to remove a few __init markers to be able to do some > of it later, does anybody really care? > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > > Linus > Initially this patch set consisted of diverting a major portion of the memory to an "absent" list during e820 processing. A very late initcall was then used to dispatch a cpu per node to add that node's absent memory. By nature these ran in parallel, so Nathan did the work to "parallelize" various global resource locks to become per-node locks. This sped up insertion considerably. And by disabling the "auto-start" of the insertion process and using a manual start command, you could monitor the insertion process and find hot spots in the memory initialization code. There were also small updates to the /sys/devices/{memory,node} drivers to display the amount of memory still "absent". -Mike ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:33 ` Mike Travis @ 2013-08-13 17:51 ` Linus Torvalds -1 siblings, 0 replies; 153+ messages in thread From: Linus Torvalds @ 2013-08-13 17:51 UTC (permalink / raw) To: Mike Travis Cc: Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On Tue, Aug 13, 2013 at 10:33 AM, Mike Travis <travis@sgi.com> wrote: > > Initially this patch set consisted of diverting a major portion of the > memory to an "absent" list during e820 processing. A very late initcall > was then used to dispatch a cpu per node to add that node's absent > memory. By nature these ran in parallel so Nathan did the work to > "parallelize" various global resource locks to become per node locks. So quite frankly, I'm not sure how worthwhile it even is to parallelize the thing. I realize that some environments may care about getting up to full memory population very quickly, but I think it would be very rare and specialized, and shouldn't necessarily be part of the initial patches. And it really doesn't have to be an initcall at all - at least not a synchronous one. A late initcall to get the process *started*, but the process itself could easily be done with a separate thread asynchronously, and let the machine boot up while that thread is going. And in fact, I'd argue that instead of trying to make it fast and parallelize things excessively, you might want to make the memory initialization *slow*, and make all the rest of the bootup have higher priority. At that point, who cares if it takes 400 seconds to get all memory initialized? In fact, who cares if it takes twice that? 
Let's assume that the rest of the boot takes 30s (which is pretty aggressive for some big server with terabytes of memory), even if the memory initialization was running in the background and only during idle time for probing, I'm sure you'd have a few hundred gigs of RAM initialized by the time you can log in. And if it then takes another ten minutes until you have the full 16TB initialized, and some things might be a tad slower early on, does anybody really care? The machine will be up and running with plenty of memory, even if it may not be *all* the memory yet. I realize that benchmarking cares, and yes, I also realize that some benchmarks actually want to reboot the machine between some runs just to get repeatability, but if you're benchmarking a 16TB machine I'm guessing any serious benchmark that actually uses that much memory is going to take many hours to a few days to run anyway? Having some way to wait until the memory is all done (which might even be just a silly shell script that does "ps" and waits for the kernel threads to all go away) isn't going to kill the benchmark - and the benchmark itself will then not have to worry about hitting the "oops, I need to initialize 2GB of RAM now because I hit an uninitialized page". Ok, so I don't know all the issues, and in many ways I don't even really care. You could do it other ways, I don't think this is a big deal. The part I hate is the runtime hook into the core MM page allocation code, so I'm just throwing out any random thing that comes to my mind that could be used to avoid that part. Linus ^ permalink raw reply [flat|nested] 153+ messages in thread
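The "silly shell script that does ps" idea could look something like the sketch below. The worker-thread name "meminit" is purely an assumption; the patches discussed here don't name their kernel threads, so substitute whatever the deferred-init workers would actually be called.

```shell
#!/bin/sh
# Poll until no kernel thread matching the given name remains, then the
# benchmark can start knowing all memory has been initialized.
# NOTE: "meminit" is a hypothetical thread name, not from the patch set.
wait_for_meminit() {
    name="${1:-meminit}"
    while ps -e -o comm= | grep -q "^${name}"; do
        sleep 5
    done
}

# usage: wait_for_meminit meminit && ./run_benchmark
```

Since the loop simply exits once no matching process is listed, a benchmark harness can call it unconditionally; on a machine where initialization already finished it returns immediately.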
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:51 ` Linus Torvalds @ 2013-08-13 18:04 ` Mike Travis -1 siblings, 0 replies; 153+ messages in thread From: Mike Travis @ 2013-08-13 18:04 UTC (permalink / raw) To: Linus Torvalds Cc: Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On 8/13/2013 10:51 AM, Linus Torvalds wrote: > by the time you can log in. And if it then takes another ten minutes > until you have the full 16TB initialized, and some things might be a > tad slower early on, does anybody really care? The machine will be up > and running with plenty of memory, even if it may not be *all* the > memory yet. Before the patches adding memory took ~45 mins for 16TB and almost 2 hours for 32TB. Adding it late sped up early boot but late insertion was still very slow, where the full 32TB was still not fully inserted after an hour. Doing it in parallel along with the memory hotplug lock per node, we got it down to the 10-15 minute range. ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 18:04 ` Mike Travis @ 2013-08-13 19:06 ` Mike Travis -1 siblings, 0 replies; 153+ messages in thread From: Mike Travis @ 2013-08-13 19:06 UTC (permalink / raw) To: Linus Torvalds Cc: Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On 8/13/2013 11:04 AM, Mike Travis wrote: > > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. > Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. > FYI, the system at this time had 128 nodes each with 256GB of memory. About 252GB was inserted into the absent list from nodes 1 .. 126. Memory on nodes 0 and 128 was left fully present. ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 19:06 ` Mike Travis @ 2013-08-13 20:24 ` Yinghai Lu -1 siblings, 0 replies; 153+ messages in thread From: Yinghai Lu @ 2013-08-13 20:24 UTC (permalink / raw) To: Mike Travis Cc: Linus Torvalds, Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Mel Gorman On Tue, Aug 13, 2013 at 12:06 PM, Mike Travis <travis@sgi.com> wrote: > > > On 8/13/2013 11:04 AM, Mike Travis wrote: >> >> >> On 8/13/2013 10:51 AM, Linus Torvalds wrote: >>> by the time you can log in. And if it then takes another ten minutes >>> until you have the full 16TB initialized, and some things might be a >>> tad slower early on, does anybody really care? The machine will be up >>> and running with plenty of memory, even if it may not be *all* the >>> memory yet. >> >> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours >> for 32TB. Adding it late sped up early boot but late insertion was still >> very slow, where the full 32TB was still not fully inserted after an hour. >> Doing it in parallel along with the memory hotplug lock per node, we got >> it down to the 10-15 minute range. >> > > FYI, the system at this time had 128 nodes each with 256GB of memory. > About 252GB was inserted into the absent list from nodes 1 .. 126. > Memory on nodes 0 and 128 was left fully present. Can we have a topic about these boot time issues at this year's kernel summit? There will be more 32-socket x86 systems with lots of memory, PCI chains, and CPU cores. In current kernel/smp.c::smp_init() we still have | /* FIXME: This should be done in userspace --RR */ | for_each_present_cpu(cpu) { | if (num_online_cpus() >= setup_max_cpus) | break; | if (!cpu_online(cpu)) | cpu_up(cpu); | } Solutions would be: 1. delay some memory, PCI chains, or CPU cores; 2. initialize them in parallel during boot; or 3. add them in parallel after booting. Thanks Yinghai ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 20:24 ` Yinghai Lu @ 2013-08-13 20:37 ` Mike Travis -1 siblings, 0 replies; 153+ messages in thread From: Mike Travis @ 2013-08-13 20:37 UTC (permalink / raw) To: Yinghai Lu Cc: Linus Torvalds, Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Mel Gorman On 8/13/2013 1:24 PM, Yinghai Lu wrote: >> > FYI, the system at this time had 128 nodes each with 256GB of memory. >> > About 252GB was inserted into the absent list from nodes 1 .. 126. >> > Memory on nodes 0 and 128 was left fully present. Actually, I was corrected, it was 256 nodes with 128GB (8 * 16GB dimms - which are just now coming out.) So there were 254 concurrent initialization processes running. ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 18:04 ` Mike Travis @ 2013-08-13 21:35 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-13 21:35 UTC (permalink / raw) To: Mike Travis Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On 08/13/2013 01:04 PM, Mike Travis wrote: > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. > Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. Yes, but to get it to the 10-15 minute range I had to change a number of system locks: system_sleep, memory_hotplug, and zonelist_mutex, and there was some general alteration to various wmark routines. Some of those fixes I don't know if they would stand up to proper scrutiny, but they were quick and dirty hacks to allow for progress. Nate ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:51 ` Linus Torvalds @ 2013-08-13 23:10 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-13 23:10 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Travis, Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On Tue, Aug 13, 2013 at 10:51:37AM -0700, Linus Torvalds wrote: > I realize that benchmarking cares, and yes, I also realize that some > benchmarks actually want to reboot the machine between some runs just > to get repeatability, but if you're benchmarking a 16TB machine I'm > guessing any serious benchmark that actually uses that much memory is > going to take many hours to a few days to run anyway? Having some way > to wait until the memory is all done (which might even be just a silly > shell script that does "ps" and waits for the kernel threads to all go > away) isn't going to kill the benchmark - and the benchmark itself > will then not have to worry about hitting the "oops, I need to > initialize 2GB of RAM now because I hit an uninitialized page". > I am not overly concerned with the cost of having to set up a page struct on first touch, but what I need to avoid is adding more permanent cost to page faults on a system that is already "primed". > Ok, so I don't know all the issues, and in many ways I don't even > really care. You could do it other ways, I don't think this is a big > deal. The part I hate is the runtime hook into the core MM page > allocation code, so I'm just throwing out any random thing that comes > to my mind that could be used to avoid that part. > The only mm structure we are adding to is a new flag in page->flags. That didn't seem too much. I had hoped to restrict the core mm changes to check_new_page and free_pages_check but I haven't gotten there yet. 
Not putting uninitialized pages onto the lru would work, but then I would be concerned about any calculations based on totalpages. I might be too paranoid there, but having that be incorrect until after a system is booted worries me. Nate ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 23:10 ` Nathan Zimmer @ 2013-08-13 23:55 ` Linus Torvalds -1 siblings, 0 replies; 153+ messages in thread From: Linus Torvalds @ 2013-08-13 23:55 UTC (permalink / raw) To: Nathan Zimmer Cc: Mike Travis, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer <nzimmer@sgi.com> wrote: > > The only mm structure we are adding to is a new flag in page->flags. > That didn't seem too much. I don't agree. I see only downsides, and no upsides. Doing the same thing *without* the downsides seems straightforward, so I simply see no reason for any extra flags or tests at runtime. Linus ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 23:55 ` Linus Torvalds @ 2013-08-14 11:27 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-08-14 11:27 UTC (permalink / raw) To: Linus Torvalds Cc: Nathan Zimmer, Mike Travis, Peter Anvin, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer <nzimmer@sgi.com> wrote: > > > > The only mm structure we are adding to is a new flag in page->flags. > > That didn't seem too much. > > I don't agree. > > I see only downsides, and no upsides. Doing the same thing *without* the > downsides seems straightforward, so I simply see no reason for any extra > flags or tests at runtime. The code as presented clearly looks more involved and neither simple nor zero-cost - I was hoping for a much simpler approach. I see three solutions: - Speed up the synchronous memory init code: live migrate to the node being set up via set_cpus_allowed(), to make sure the init is always fast and local. Pros: if it solves the problem then mem init is still synchronous, deterministic and essentially equivalent to what we do today - so relatively simple and well-tested, with no 'large machine' special path. Cons: it might not be enough and we might not have scheduling enabled on the affected nodes yet. - Speed up the synchronous memory init code by parallelizing the most expensive portion of initialization, setting up the page head arrays, per node via SMP function calls. Pros: by far the fastest synchronous option. (It will also test the power budget and the mains fuses right during bootup.) Cons: more complex and depends on SMP cross-calls being available at mem init time. Not necessarily hotplug friendly. - Avoid the problem by punting to async mem init. 
Pros: it gets us to a minimal working system quickly and leaves the memory code relatively untouched. Cons: makes memory state asynchronous and non-deterministic. Stats either fluctuate shortly after bootup or have to be faked. Thanks, Ingo ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-13 17:51 ` Linus Torvalds @ 2013-08-14 11:05 ` Ingo Molnar -1 siblings, 0 replies; 153+ messages in thread From: Ingo Molnar @ 2013-08-14 11:05 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Travis, Nathan Zimmer, Peter Anvin, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman * Linus Torvalds <torvalds@linux-foundation.org> wrote: > [...] > > Ok, so I don't know all the issues, and in many ways I don't even really > care. You could do it other ways, I don't think this is a big deal. The > part I hate is the runtime hook into the core MM page allocation code, > so I'm just throwing out any random thing that comes to my mind that > could be used to avoid that part. So, my hope was that it's possible to have a single, simple, zero-cost runtime check [zero cost for already initialized pages], because it can be merged into already existing page flag mask checks present here and executed for every freshly allocated page: static inline int check_new_page(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (atomic_read(&page->_count) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | (mem_cgroup_bad_page_check(page)))) { bad_page(page); return 1; } return 0; } We already run this for every new page allocated and the initialization check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion. I'd not do any of the ensure_page_is_initialized() or __expand_page_initialization() complications in this patch-set - each page head represents itself and gets iterated when check_new_page() is done. During regular bootup we'd initialize like before, except we don't set up the page heads but memset() them to zero. 
With each page head at 32 bytes, this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB that's 128 GB to clear - that ought to be possible to do rather quickly, perhaps with some smart SMP cross-call approach that makes sure that each memset is done in a node-local fashion. [*] Such an approach should IMO be far smaller and less invasive than the patches presented so far: it should be below 100 lines or so. I don't know why there's such a big difference between the theory I outlined and the invasive patch-set implemented so far in practice; perhaps I'm missing some complication. I was trying to probe that difference before giving up on the idea and punting back to the async hotplug-ish approach, which would obviously work well too. All in all, I think async init just hides the real problem - there's no way memory init should take this long. Thanks, Ingo [*] alternatively, maybe the main performance problem is that node-local memory is set up on a remote (boot) node? In that case I'd try to optimize it by migrating the memory init code's current node by using set_cpus_allowed() to live migrate from node to node, tracking the node whose struct page array is being initialized. ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-14 11:05 ` Ingo Molnar @ 2013-08-14 22:15 ` Nathan Zimmer -1 siblings, 0 replies; 153+ messages in thread From: Nathan Zimmer @ 2013-08-14 22:15 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Mike Travis, Nathan Zimmer, Peter Anvin, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman On Wed, Aug 14, 2013 at 01:05:56PM +0200, Ingo Molnar wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > [...] > > > > Ok, so I don't know all the issues, and in many ways I don't even really > > care. You could do it other ways, I don't think this is a big deal. The > > part I hate is the runtime hook into the core MM page allocation code, > > so I'm just throwing out any random thing that comes to my mind that > > could be used to avoid that part. > > So, my hope was that it's possible to have a single, simple, zero-cost > runtime check [zero cost for already initialized pages], because it can be > merged into already existing page flag mask checks present here and > executed for every freshly allocated page: > > static inline int check_new_page(struct page *page) > { > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > return 1; > } > return 0; > } > > We already run this for every new page allocated and the initialization > check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion. > > I'd not do any of the ensure_page_is_initialized() or > __expand_page_initialization() complications in this patch-set - each page > head represents itself and gets iterated when check_new_page() is done. 
> > During regular bootup we'd initialize like before, except we don't set up > the page heads but memset() them to zero. With each page head 32 bytes > this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB > that's 128 GB to clear - that ought to be possible to do rather quickly, > perhaps with some smart SMP cross-call approach that makes sure that each > memset is done in a node-local fashion. [*] > > Such an approach should IMO be far smaller and less invasive than the > patches presented so far: it should be below 100 lines or so. > > I don't know why there's such a big difference between the theory I > outlined and the invasive patch-set implemented so far in practice, > perhaps I'm missing some complication. I was trying to probe that > difference, before giving up on the idea and punting back to the async > hotplug-ish approach which would obviously work well too. > The reason, which I failed to mention, is that once we pull a page off the lru in either __rmqueue_fallback or __rmqueue_smallest, the first thing we do with it is expand() or sometimes move_freepages(). These then trip over some BUG_ON and VM_BUG_ON. Those BUG_ONs are what keep causing me to delve into the ensure/expand foolishness. Nate ^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 2013-08-12 21:54 ` Nathan Zimmer @ 2013-08-16 16:36 ` Dave Hansen -1 siblings, 0 replies; 153+ messages in thread From: Dave Hansen @ 2013-08-16 16:36 UTC (permalink / raw) To: Nathan Zimmer Cc: hpa, mingo, linux-kernel, linux-mm, holt, rob, travis, daniel, akpm, gregkh, yinghai, mgorman Hey Nathan, Could you post your boot timing patches? My machines are much smaller than yours, but I'm curious how things behave here as well. I did some very imprecise timings (strace -t on a telnet attached to the serial console). The 'struct page' initializations take about a minute of boot time for me to do 1TB across 8 NUMA nodes (this is a glueless QPI system[1]). My _quick_ calculations look like it's 2x as fast to initialize node0's memory vs. the other nodes, and boot time is increased by a second for about every 30G of memory we add. So even with nothing else fancy, we could get some serious improvements from just doing the initialization locally. [1] We call anything using pure QPI without any other circuitry for the NUMA interconnects to be "glueless" ^ permalink raw reply [flat|nested] 153+ messages in thread
end of thread, other threads:[~2013-08-16 16:36 UTC | newest] Thread overview: 153+ messages (download: mbox.gz / follow: Atom feed)