On 1/15/20 1:07 PM, Mike Kravetz wrote: > Naoya, would appreciate any comments you have as this seems to be related > to your changes to add huge page migration support. > > On 1/14/20 5:07 PM, Mike Kravetz wrote: >> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >> - Is that leftover from the the days when huge page migration was not >> supported? >> - Is it just because huge page migration is more likely to fail than >> base page migration. > > According to man page, mbind was added to v2.6.7 kernel. git history does > not go back that far, so I first looked at v2.6.12. > > Definitions from include/linux/mempolicy.h > ------------------------------------------ > /* Policies */ > #define MPOL_DEFAULT 0 > #define MPOL_PREFERRED 1 > #define MPOL_BIND 2 > #define MPOL_INTERLEAVE 3 > > /* Flags for mbind */ > #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ > > Note the MPOL_MF_STRICT would simply verify current page placement. At > this time, hugetlb pages were not part of this verification. > > From v2.6.12 routine check_range() > ... > for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { > if (!vma->vm_next && vma->vm_end < end) > return ERR_PTR(-EFAULT); > if (prev && prev->vm_end < vma->vm_start) > return ERR_PTR(-EFAULT); > if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) { > err = verify_pages(vma->vm_mm, > vma->vm_start, vma->vm_end, nodes); > if (err) { > first = ERR_PTR(err); > break; > } > } > prev = vma; > } > ... > > man page history does not go back this far. The first mbind man page has > some things that are interesting: > 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge > page mappings (even though the code does this). > 2) Support for huge page policy was added with 2.6.16 > 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to > move pages to the requested node with this flag. > > Next look at v2.6.16 kernel > ========================== > This kernel introduces the MPOL_MF_MOVE* flags > #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform > to mapping */ > #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ > > Once again, the up front check_range() routine will skip hugetlb vma's. > Note that check_range() will also isolate pages. So since hugetlb vma's > are skipped, no hugetlb pages will be isolated. Since no hugetlb pages > are isolated, none will be on the list to be migrate. Therefore, hugetlb > pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes > sense as huge page migration support was not added until the v3.12 kernel. > > Note that at when MPOL_MF_MOVE* flags were added to the mbind man page, > the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." > was added. This was later changed to "MPOL_MF_STRICT is ignored on huge > page mappings." > > Next look at v3.12 kernel > ========================= > The v3.12 code looks similiar to today's code. In the verify/isolation > phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar > to queue_pages_hugetlb(). Neither will cause errors at this point in the > process. And, hugetlb pages with a mapcount == 1 will potentially be > isolated and added to the list of pages to be migrated. In addition, if > the subsequent call to migrate_pages() fails to migrate ANY page, we > return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) > that failed to migrate were hugetlb pages. > > It should also be noted that no mbind man page updates were made WRT > hugetlb pages after migration support was added. > > Summary > ======= > It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page > mappings." is left over from the original mbind implementation. When > the huge page migration support was added, I can not be sure if ignoring > MPOL_MF_STRICT for huge pages during the verify/isolation phase was > intentional. It seems like it was as the return value from > isolate_huge_page() is ignored. Thanks a lot for digging into the history. > > What should we do? > ================== > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > seen as conflicting with man page has existed since v3.12 and I am > not aware of any complaints. > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > after a failure of migrate_pages(). We could simply traverse the list > of pages that were not migrated looking for any non-hugetlb page. I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. Other than that traversing page list to look for a certain type page doesn't sound scalable to me. > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > and modify code accordingly. > > My suggestion would be for 1 or 2. Thoughts? By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.