On 20 May 2022, at 9:43, Zi Yan wrote: > On 20 May 2022, at 7:30, Qian Cai wrote: > >> On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote: >>> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound >>> page caused the infinite loop, because the page was not migrated and the code kept >>> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the >>> code the page cannot be migrated and the code will goto failed without retrying. It will be >>> great you can share what exactly has run after boot, so that I can reproduce locally to >>> identify what makes __alloc_contig_migrate_range() return 0 without migrating the page. >> >> The reproducer is just to run the same script I shared with you previously >> multiple times instead. It is still quite reproducible here as it usually >> happens within a hour. >> >> $ for i in `seq 1 100`; do ./flip_mem.py; done Also, do you mind providing the page dump of the 512-page compound page? I would like to know what page caused the issue. Thanks. >> >>> Can you also try the patch below to see if it fixes the infinite loop? >>> >>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c >>> index b3f074d1682e..abde1877bbcb 100644 >>> --- a/mm/page_isolation.c >>> +++ b/mm/page_isolation.c >>> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags, >>> order = 0; >>> outer_pfn = pfn; >>> while (!PageBuddy(pfn_to_page(outer_pfn))) { >>> - if (++order >= MAX_ORDER) { >>> - outer_pfn = pfn; >>> - break; >>> - } >>> + /* abort if the free page cannot be found */ >>> + if (++order >= MAX_ORDER) >>> + goto failed; >>> outer_pfn &= ~0UL << order; >>> } >>> pfn = outer_pfn; >>> >> >> Can you explain a bit how this patch is the right thing to do here? I am a >> little bit worry about shooting into the dark. Otherwise, I'll be running >> the off-by-one part over the weekend to see if that helps. > > The code kept retrying to migrate a 512-page compound page, so it seems to me > that __alloc_contig_migrate_range() did not migrate the page but returned > 0 every time, otherwise, if (ret) goto failed; would bail out of the loop > already. The original code above assumed a free page can always be found after > __alloc_contig_migrate_range(), so it will retry if no free page is found. > But that assumption is not true from your infinite loop result, the new > code quits retrying when no free page can be found. > > I will dig into it deeper to make sure it is the correct fix. I will > update you when I am done. > > Thanks. > > -- > Best Regards, > Yan, Zi -- Best Regards, Yan, Zi