* [PATCH v2 0/1] mm: buddy page accessed before initialized @ 2017-11-02 17:02 Pavel Tatashin 2017-11-02 17:02 ` [PATCH v2 1/1] " Pavel Tatashin 0 siblings, 1 reply; 5+ messages in thread From: Pavel Tatashin @ 2017-11-02 17:02 UTC (permalink / raw) To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm, linux-kernel As discussed with Michal Hocko, I am sending a new version of the patch, where loops are split into two parts: initializing, and freeing. I also included compiler warning fixes from: mm-deferred_init_memmap-improvements-fix.patch So, this patch should replace two patches in mmots: mm-deferred_init_memmap-improvements-fix.patch and mm-deferred_init_memmap-improvements-fix-2.patch Again, I can send a new full version of mm-deferred_init_memmap-improvements.patch If that is better. Pavel Tatashin (1): mm: buddy page accessed before initialized mm/page_alloc.c | 66 +++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 23 deletions(-) -- 2.15.0 ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v2 1/1] mm: buddy page accessed before initialized 2017-11-02 17:02 [PATCH v2 0/1] mm: buddy page accessed before initialized Pavel Tatashin @ 2017-11-02 17:02 ` Pavel Tatashin 2017-11-03 9:27 ` Michal Hocko 0 siblings, 1 reply; 5+ messages in thread From: Pavel Tatashin @ 2017-11-02 17:02 UTC (permalink / raw) To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm, linux-kernel This problem is seen when machine is rebooted after kexec: A message like this is printed: ========================================================================== WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0 Modules linked in: CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90 Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 3016 node 1 initialised, 32444607 pages in 1679ms task: ffff880180e75a00 task.stack: ffffc9000cdb0000 RIP: 0010:__list_del_entry_valid+0x83/0xa0 RSP: 0000:ffffc9000cdb3d18 EFLAGS: 00010046 RAX: 0000000000000054 RBX: 0000000000000009 RCX: ffffffff81c5f3e8 RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000046 RBP: ffffc9000cdb3d18 R08: 00000000fffffffe R09: 0000000000000154 R10: 0000000000000005 R11: 0000000000000153 R12: 0000000001fcdc00 R13: 0000000001fcde00 R14: ffff88207ffded00 R15: ffffea007f370000 FS: 0000000000000000(0000) GS:ffff881fffac0000(0000) knlGS:0 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407ec09001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: free_one_page+0x103/0x390 __free_pages_ok+0x1cf/0x2d0 __free_pages+0x19/0x30 __free_pages_boot_core+0xae/0xba deferred_free_range+0x60/0x94 deferred_init_memmap+0x324/0x372 kthread+0x109/0x140 ? __free_pages_bootmem+0x2e/0x2e ? kthread_park+0x60/0x60 ret_from_fork+0x25/0x30 list_del corruption. next->prev should be ffffea007f428020, but was ffffea007f1d8020 ========================================================================== The problem happens in this path: page_alloc_init_late deferred_init_memmap deferred_init_range __def_free deferred_free_range __free_pages_boot_core(page, order) __free_pages() __free_pages_ok() free_one_page() __free_one_page(page, pfn, zone, order, migratetype); deferred_init_range() initializes one page at a time by calling __init_single_page(), once it initializes pageblock_nr_pages pages, it calls deferred_free_range() to free the initialized pages to the buddy allocator. Eventually, we reach __free_one_page(), where we compute buddy page: buddy_pfn = __find_buddy_pfn(pfn, order); buddy = page + (buddy_pfn - pfn); buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages. Thefore, buddy page becomes a page one after the range that currently was initialized, and we access this page in this function. Also, later when we return back to deferred_init_range(), the buddy page is initialized again. So, in order to avoid this issue, we must initialize the buddy page prior to calling deferred_free_range(). Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> --- mm/page_alloc.c | 66 +++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 23 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 97687b38da05..201bf67ce042 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1449,9 +1449,9 @@ static inline void __init pgdat_init_report_one_done(void) * Helper for deferred_init_range, free the given range, reset the counters, and * return number of pages freed. */ -static inline unsigned long __def_free(unsigned long *nr_free, - unsigned long *free_base_pfn, - struct page **page) +static inline unsigned long __init __def_free(unsigned long *nr_free, + unsigned long *free_base_pfn, + struct page **page) { unsigned long nr = *nr_free; @@ -1463,8 +1463,9 @@ static inline unsigned long __def_free(unsigned long *nr_free, return nr; } -static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn, - unsigned long end_pfn) +static unsigned long __init deferred_init_range(int nid, int zid, + unsigned long start_pfn, + unsigned long end_pfn) { struct mminit_pfnnid_cache nid_init_state = { }; unsigned long nr_pgmask = pageblock_nr_pages - 1; @@ -1472,23 +1473,44 @@ static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn, unsigned long nr_pages = 0; unsigned long nr_free = 0; struct page *page = NULL; + unsigned long pfn; - for (; pfn < end_pfn; pfn++) { - /* - * First we check if pfn is valid on architectures where it is - * possible to have holes within pageblock_nr_pages. On systems - * where it is not possible, this function is optimized out. - * - * Then, we check if a current large page is valid by only - * checking the validity of the head pfn. - * - * meminit_pfn_in_nid is checked on systems where pfns can - * interleave within a node: a pfn is between start and end - * of a node, but does not belong to this memory node. - * - * Finally, we minimize pfn page lookups and scheduler checks by - * performing it only once every pageblock_nr_pages. - */ + /* + * First we check if pfn is valid on architectures where it is possible + * to have holes within pageblock_nr_pages. On systems where it is not + * possible, this function is optimized out. + * + * Then, we check if a current large page is valid by only checking the + * validity of the head pfn. + * + * meminit_pfn_in_nid is checked on systems where pfns can interleave + * within a node: a pfn is between start and end of a node, but does not + * belong to this memory node. + * + * Finally, we minimize pfn page lookups and scheduler checks by + * performing it only once every pageblock_nr_pages. + * + * We do it in two loops: first we initialize struct page, than free to + * buddy allocator, becuse while we are freeing pages we can access + * pages that are ahead (computing buddy page in __free_one_page()). + */ + for (pfn = start_pfn; pfn < end_pfn; pfn++) { + if (!pfn_valid_within(pfn)) + continue; + if ((pfn & nr_pgmask) || pfn_valid(pfn)) { + if (meminit_pfn_in_nid(pfn, nid, &nid_init_state)) { + if (page && (pfn & nr_pgmask)) + page++; + else + page = pfn_to_page(pfn); + __init_single_page(page, pfn, zid, nid); + cond_resched(); + } + } + } + + page = NULL; + for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn)) { nr_pages += __def_free(&nr_free, &free_base_pfn, &page); } else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) { @@ -1497,12 +1519,10 @@ static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn, nr_pages += __def_free(&nr_free, &free_base_pfn, &page); } else if (page && (pfn & nr_pgmask)) { page++; - __init_single_page(page, pfn, zid, nid); nr_free++; } else { nr_pages += __def_free(&nr_free, &free_base_pfn, &page); page = pfn_to_page(pfn); - __init_single_page(page, pfn, zid, nid); free_base_pfn = pfn; nr_free = 1; cond_resched(); -- 2.15.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2 1/1] mm: buddy page accessed before initialized 2017-11-02 17:02 ` [PATCH v2 1/1] " Pavel Tatashin @ 2017-11-03 9:27 ` Michal Hocko 2017-11-03 13:47 ` Pavel Tatashin 0 siblings, 1 reply; 5+ messages in thread From: Michal Hocko @ 2017-11-03 9:27 UTC (permalink / raw) To: Pavel Tatashin Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm, linux-kernel On Thu 02-11-17 13:02:21, Pavel Tatashin wrote: > This problem is seen when machine is rebooted after kexec: > A message like this is printed: > ========================================================================== > WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0 > Modules linked in: > CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90 > Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, > BIOS 3016 > node 1 initialised, 32444607 pages in 1679ms > task: ffff880180e75a00 task.stack: ffffc9000cdb0000 > RIP: 0010:__list_del_entry_valid+0x83/0xa0 > RSP: 0000:ffffc9000cdb3d18 EFLAGS: 00010046 > RAX: 0000000000000054 RBX: 0000000000000009 RCX: ffffffff81c5f3e8 > RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000046 > RBP: ffffc9000cdb3d18 R08: 00000000fffffffe R09: 0000000000000154 > R10: 0000000000000005 R11: 0000000000000153 R12: 0000000001fcdc00 > R13: 0000000001fcde00 R14: ffff88207ffded00 R15: ffffea007f370000 > FS: 0000000000000000(0000) GS:ffff881fffac0000(0000) knlGS:0 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 000000407ec09001 CR4: 00000000003606e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > free_one_page+0x103/0x390 > __free_pages_ok+0x1cf/0x2d0 > __free_pages+0x19/0x30 > __free_pages_boot_core+0xae/0xba > deferred_free_range+0x60/0x94 > deferred_init_memmap+0x324/0x372 > kthread+0x109/0x140 > ? __free_pages_bootmem+0x2e/0x2e > ? kthread_park+0x60/0x60 > ret_from_fork+0x25/0x30 > > list_del corruption. next->prev should be ffffea007f428020, but was > ffffea007f1d8020 > ========================================================================== > > The problem happens in this path: > > page_alloc_init_late > deferred_init_memmap > deferred_init_range > __def_free > deferred_free_range > __free_pages_boot_core(page, order) > __free_pages() > __free_pages_ok() > free_one_page() > __free_one_page(page, pfn, zone, order, migratetype); > > deferred_init_range() initializes one page at a time by calling > __init_single_page(), once it initializes pageblock_nr_pages pages, it > calls deferred_free_range() to free the initialized pages to the buddy > allocator. Eventually, we reach __free_one_page(), where we compute buddy > page: > buddy_pfn = __find_buddy_pfn(pfn, order); > buddy = page + (buddy_pfn - pfn); > > buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages. > Thefore, buddy page becomes a page one after the range that currently was > initialized, and we access this page in this function. Also, later when we > return back to deferred_init_range(), the buddy page is initialized again. > > So, in order to avoid this issue, we must initialize the buddy page prior > to calling deferred_free_range(). Have you measured any negative performance impact with this change? > Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> The patch looks good to me otherwise. So if this doesn't introduce a noticeable overhead, which I whope it doesn't then feel free to add Acked-by: Michal Hocko <mhocko@suse.com> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2 1/1] mm: buddy page accessed before initialized 2017-11-03 9:27 ` Michal Hocko @ 2017-11-03 13:47 ` Pavel Tatashin 2017-11-03 13:53 ` Michal Hocko 0 siblings, 1 reply; 5+ messages in thread From: Pavel Tatashin @ 2017-11-03 13:47 UTC (permalink / raw) To: Michal Hocko Cc: Steve Sistare, Daniel Jordan, Andrew Morton, mgorman, Linux Memory Management List, linux-kernel Hi Michal, There is a small regression, on the largest x86 machine I have access to: Before: node 1 initialised, 32471632 pages in 901ms After: node 1 initialised, 32471632 pages in 1128ms One node contains 128G of memory (overal 1T in 8 nodes). This regression is going to be solved by this work: https://patchwork.kernel.org/patch/9920953/, other than that I do not know a better solution. The overall performance is still much better compared to before this project. Also, thinking about this problem some more, it is safer to split the initialization, and freeing parts into two functions: In deferred_init_memmap() 1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { 1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); 1576 epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa)); 1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn); 1578 } Replace with two loops: First loop, calls a function that initializes the given range, the 2nd loop calls a function that frees it. This way we won't get a potential problem where buddy page is computed from the next range that has not yet been initialized. And it is also going to be easier to multithread later: multi-thread the first loop, wait for it to finish, multi-thread the 2nd loop wait for it to finish. Pasha On Fri, Nov 3, 2017 at 5:27 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Thu 02-11-17 13:02:21, Pavel Tatashin wrote: >> This problem is seen when machine is rebooted after kexec: >> A message like this is printed: >> ========================================================================== >> WARNING: CPU: 21 PID: 249 at linux/lib/list_debug.c:53__listd+0x83/0xa0 >> Modules linked in: >> CPU: 21 PID: 249 Comm: pgdatinit0 Not tainted 4.14.0-rc6_pt_deferred #90 >> Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, >> BIOS 3016 >> node 1 initialised, 32444607 pages in 1679ms >> task: ffff880180e75a00 task.stack: ffffc9000cdb0000 >> RIP: 0010:__list_del_entry_valid+0x83/0xa0 >> RSP: 0000:ffffc9000cdb3d18 EFLAGS: 00010046 >> RAX: 0000000000000054 RBX: 0000000000000009 RCX: ffffffff81c5f3e8 >> RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000046 >> RBP: ffffc9000cdb3d18 R08: 00000000fffffffe R09: 0000000000000154 >> R10: 0000000000000005 R11: 0000000000000153 R12: 0000000001fcdc00 >> R13: 0000000001fcde00 R14: ffff88207ffded00 R15: ffffea007f370000 >> FS: 0000000000000000(0000) GS:ffff881fffac0000(0000) knlGS:0 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000000000000000 CR3: 000000407ec09001 CR4: 00000000003606e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> free_one_page+0x103/0x390 >> __free_pages_ok+0x1cf/0x2d0 >> __free_pages+0x19/0x30 >> __free_pages_boot_core+0xae/0xba >> deferred_free_range+0x60/0x94 >> deferred_init_memmap+0x324/0x372 >> kthread+0x109/0x140 >> ? __free_pages_bootmem+0x2e/0x2e >> ? kthread_park+0x60/0x60 >> ret_from_fork+0x25/0x30 >> >> list_del corruption. next->prev should be ffffea007f428020, but was >> ffffea007f1d8020 >> ========================================================================== >> >> The problem happens in this path: >> >> page_alloc_init_late >> deferred_init_memmap >> deferred_init_range >> __def_free >> deferred_free_range >> __free_pages_boot_core(page, order) >> __free_pages() >> __free_pages_ok() >> free_one_page() >> __free_one_page(page, pfn, zone, order, migratetype); >> >> deferred_init_range() initializes one page at a time by calling >> __init_single_page(), once it initializes pageblock_nr_pages pages, it >> calls deferred_free_range() to free the initialized pages to the buddy >> allocator. Eventually, we reach __free_one_page(), where we compute buddy >> page: >> buddy_pfn = __find_buddy_pfn(pfn, order); >> buddy = page + (buddy_pfn - pfn); >> >> buddy_pfn is computed as pfn ^ (1 << order), or pfn + pageblock_nr_pages. >> Thefore, buddy page becomes a page one after the range that currently was >> initialized, and we access this page in this function. Also, later when we >> return back to deferred_init_range(), the buddy page is initialized again. >> >> So, in order to avoid this issue, we must initialize the buddy page prior >> to calling deferred_free_range(). > > Have you measured any negative performance impact with this change? > >> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> > > The patch looks good to me otherwise. So if this doesn't introduce a > noticeable overhead, which I whope it doesn't then feel free to add > Acked-by: Michal Hocko <mhocko@suse.com> > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2 1/1] mm: buddy page accessed before initialized 2017-11-03 13:47 ` Pavel Tatashin @ 2017-11-03 13:53 ` Michal Hocko 0 siblings, 0 replies; 5+ messages in thread From: Michal Hocko @ 2017-11-03 13:53 UTC (permalink / raw) To: Pavel Tatashin Cc: Steve Sistare, Daniel Jordan, Andrew Morton, mgorman, Linux Memory Management List, linux-kernel On Fri 03-11-17 09:47:30, Pavel Tatashin wrote: > Hi Michal, > > There is a small regression, on the largest x86 machine I have access to: > Before: > node 1 initialised, 32471632 pages in 901ms > After: > node 1 initialised, 32471632 pages in 1128ms > > One node contains 128G of memory (overal 1T in 8 nodes). This > regression is going to be solved by this work: > https://patchwork.kernel.org/patch/9920953/, other than that I do not > know a better solution. The overall performance is still much better > compared to before this project. OK, I think that is completely acceptable for now. We can always optimize for a better result later. > Also, thinking about this problem some more, it is safer to split the > initialization, and freeing parts into two functions: > > In deferred_init_memmap() > 1574 for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) { > 1575 spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa)); > 1576 epfn = min_t(unsigned long, zone_end_pfn(zone), > PFN_DOWN(epa)); > 1577 nr_pages += deferred_init_range(nid, zid, spfn, epfn); > 1578 } > > Replace with two loops: > First loop, calls a function that initializes the given range, the 2nd > loop calls a function that frees it. This way we won't get a potential > problem where buddy page is computed from the next range that has not > yet been initialized. And it is also going to be easier to multithread > later: multi-thread the first loop, wait for it to finish, > multi-thread the 2nd loop wait for it to finish. OK, but let's do that as a separate patch. What you have here is good for now IMHO. My ack applies. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2017-11-03 13:53 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-11-02 17:02 [PATCH v2 0/1] mm: buddy page accessed before initialized Pavel Tatashin 2017-11-02 17:02 ` [PATCH v2 1/1] " Pavel Tatashin 2017-11-03 9:27 ` Michal Hocko 2017-11-03 13:47 ` Pavel Tatashin 2017-11-03 13:53 ` Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).