From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, HTML_MESSAGE,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 87784C32771 for ; Wed, 15 Jan 2020 21:31:08 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E0F2E2064C for ; Wed, 15 Jan 2020 21:31:07 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E0F2E2064C Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 4C9548E0008; Wed, 15 Jan 2020 16:31:07 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 47ABB8E0003; Wed, 15 Jan 2020 16:31:07 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3907B8E0008; Wed, 15 Jan 2020 16:31:07 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0084.hostedemail.com [216.40.44.84]) by kanga.kvack.org (Postfix) with ESMTP id 22BAB8E0003 for ; Wed, 15 Jan 2020 16:31:07 -0500 (EST) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id CFF38247F for ; Wed, 15 Jan 2020 21:31:06 +0000 (UTC) X-FDA: 76381164132.14.books52_7de62d82fd153 X-HE-Tag: books52_7de62d82fd153 X-Filterd-Recvd-Size: 14272 Received: from out30-133.freemail.mail.aliyun.com (out30-133.freemail.mail.aliyun.com [115.124.30.133]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Wed, 15 Jan 2020 21:31:04 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01f04397;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0TnpyQgN_1579123858; Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:SMTPD_---0TnpyQgN_1579123858) by smtp.aliyun-inc.com(127.0.0.1); Thu, 16 Jan 2020 05:31:01 +0800 Subject: Re: [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone To: Mike Kravetz , Li Xinhai , "linux-mm@kvack.org" Cc: akpm , mhocko , n-horiguchi References: <1578993378-10860-1-git-send-email-lixinhai.lxh@gmail.com> <1578993378-10860-2-git-send-email-lixinhai.lxh@gmail.com> <2020011422092314671410@gmail.com> From: Yang Shi Message-ID: <9fabdc16-fc31-7e06-e306-af38850de771@linux.alibaba.com> Date: Wed, 15 Jan 2020 13:30:58 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/alternative; boundary="------------748F6D3D444F72AB61F6CE07" Content-Language: en-US X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This is a multi-part message in MIME format. --------------748F6D3D444F72AB61F6CE07 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit On 1/15/20 1:07 PM, Mike Kravetz wrote: > Naoya, would appreciate any comments you have as this seems to be related > to your changes to add huge page migration support. > > On 1/14/20 5:07 PM, Mike Kravetz wrote: >> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >> - Is that leftover from the the days when huge page migration was not >> supported? >> - Is it just because huge page migration is more likely to fail than >> base page migration. > > According to man page, mbind was added to v2.6.7 kernel. git history does > not go back that far, so I first looked at v2.6.12. > > Definitions from include/linux/mempolicy.h > ------------------------------------------ > /* Policies */ > #define MPOL_DEFAULT 0 > #define MPOL_PREFERRED 1 > #define MPOL_BIND 2 > #define MPOL_INTERLEAVE 3 > > /* Flags for mbind */ > #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ > > Note the MPOL_MF_STRICT would simply verify current page placement. At > this time, hugetlb pages were not part of this verification. > > From v2.6.12 routine check_range() > ... > for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { > if (!vma->vm_next && vma->vm_end < end) > return ERR_PTR(-EFAULT); > if (prev && prev->vm_end < vma->vm_start) > return ERR_PTR(-EFAULT); > if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) { > err = verify_pages(vma->vm_mm, > vma->vm_start, vma->vm_end, nodes); > if (err) { > first = ERR_PTR(err); > break; > } > } > prev = vma; > } > ... > > man page history does not go back this far. The first mbind man page has > some things that are interesting: > 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge > page mappings (even though the code does this). > 2) Support for huge page policy was added with 2.6.16 > 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to > move pages to the requested node with this flag. > > Next look at v2.6.16 kernel > ========================== > This kernel introduces the MPOL_MF_MOVE* flags > #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform > to mapping */ > #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ > > Once again, the up front check_range() routine will skip hugetlb vma's. > Note that check_range() will also isolate pages. So since hugetlb vma's > are skipped, no hugetlb pages will be isolated. Since no hugetlb pages > are isolated, none will be on the list to be migrate. Therefore, hugetlb > pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes > sense as huge page migration support was not added until the v3.12 kernel. > > Note that at when MPOL_MF_MOVE* flags were added to the mbind man page, > the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." > was added. This was later changed to "MPOL_MF_STRICT is ignored on huge > page mappings." > > Next look at v3.12 kernel > ========================= > The v3.12 code looks similiar to today's code. In the verify/isolation > phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar > to queue_pages_hugetlb(). Neither will cause errors at this point in the > process. And, hugetlb pages with a mapcount == 1 will potentially be > isolated and added to the list of pages to be migrated. In addition, if > the subsequent call to migrate_pages() fails to migrate ANY page, we > return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) > that failed to migrate were hugetlb pages. > > It should also be noted that no mbind man page updates were made WRT > hugetlb pages after migration support was added. > > Summary > ======= > It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page > mappings." is left over from the original mbind implementation. When > the huge page migration support was added, I can not be sure if ignoring > MPOL_MF_STRICT for huge pages during the verify/isolation phase was > intentional. It seems like it was as the return value from > isolate_huge_page() is ignored. Thanks a lot for digging into the history. > > What should we do? > ================== > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > seen as conflicting with man page has existed since v3.12 and I am > not aware of any complaints. > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > after a failure of migrate_pages(). We could simply traverse the list > of pages that were not migrated looking for any non-hugetlb page. I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. Other than that traversing page list to look for a certain type page doesn't sound scalable to me. > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > and modify code accordingly. > > My suggestion would be for 1 or 2. Thoughts? By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change. --------------748F6D3D444F72AB61F6CE07 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: 7bit



On 1/15/20 1:07 PM, Mike Kravetz wrote:
Naoya, would appreciate any comments you have as this seems to be related
to your changes to add huge page migration support.

On 1/14/20 5:07 PM, Mike Kravetz wrote:
1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
   - Is that leftover from the the days when huge page migration was not
     supported?
   - Is it just because huge page migration is more likely to fail than
     base page migration.

According to man page, mbind was added to v2.6.7 kernel.  git history does
not go back that far, so I first looked at v2.6.12.

Definitions from include/linux/mempolicy.h
------------------------------------------
/* Policies */
#define MPOL_DEFAULT    0
#define MPOL_PREFERRED  1
#define MPOL_BIND       2
#define MPOL_INTERLEAVE 3

/* Flags for mbind */
#define MPOL_MF_STRICT  (1<<0)  /* Verify existing pages in the mapping */

Note the MPOL_MF_STRICT would simply verify current page placement.  At
this time, hugetlb pages were not part of this verification.

>From v2.6.12 routine check_range()
...
	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (!vma->vm_next && vma->vm_end < end)
			return ERR_PTR(-EFAULT);
		if (prev && prev->vm_end < vma->vm_start)
			return ERR_PTR(-EFAULT);
		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
			err = verify_pages(vma->vm_mm,
					   vma->vm_start, vma->vm_end, nodes);
			if (err) {
				first = ERR_PTR(err);
				break;
			}
		}
		prev = vma;
	}
...

man page history does not go back this far.  The first mbind man page has
some things that are interesting:
1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge
   page mappings (even though the code does this).
2) Support for huge page policy was added with 2.6.16
3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to
   move pages to the requested node with this flag.

Next look at v2.6.16 kernel
==========================
This kernel introduces the MPOL_MF_MOVE* flags
#define MPOL_MF_MOVE    (1<<1)  /* Move pages owned by this process to conform
				   to mapping */
#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */

Once again, the up front check_range() routine will skip hugetlb vma's.
Note that check_range() will also isolate pages.  So since hugetlb vma's
are skipped, no hugetlb pages will be isolated.  Since no hugetlb pages
are isolated, none will be on the list to be migrate.  Therefore, hugetlb
pages are effectively ignored for the new MPOL_MF_MOVE* flags.  This makes
sense as huge page migration support was not added until the v3.12 kernel.

Note that at when MPOL_MF_MOVE* flags were added to the mbind man page,
the statement "MPOL_MF_STRICT is ignored on huge page mappings right now."
was added.  This was later changed to "MPOL_MF_STRICT is ignored on huge
page mappings."

Next look at v3.12 kernel
=========================
The v3.12 code looks similiar to today's code.  In the verify/isolation
phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar
to queue_pages_hugetlb().  Neither will cause errors at this point in the
process.  And, hugetlb pages with a mapcount == 1 will potentially be
isolated and added to the list of pages to be migrated.  In addition, if
the subsequent call to migrate_pages() fails to migrate ANY page, we
return -EIO if MPOL_MF_STRICT is set.  This is true even if the only page(s)
that failed to migrate were hugetlb pages.

It should also be noted that no mbind man page updates were made WRT
hugetlb pages after migration support was added.

Summary
=======
It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
mappings." is left over from the original mbind implementation.  When
the huge page migration support was added, I can not be sure if ignoring
MPOL_MF_STRICT for huge pages during the verify/isolation phase was
intentional.  It seems like it was as the return value from
isolate_huge_page() is ignored.

Thanks a lot for digging into the history.


What should we do?
==================
1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
   seen as conflicting with man page has existed since v3.12 and I am
   not aware of any complaints.
2) In addition to optimizations by Li Xinhai, modify code to truly ignore
   MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
   after a failure of migrate_pages().  We could simply traverse the list
   of pages that were not migrated looking for any non-hugetlb page.

I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly.

Other than that traversing page list to look for a certain type page doesn't sound scalable to me.

3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
   and modify code accordingly.

My suggestion would be for 1 or 2.  Thoughts?

By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.


    

--------------748F6D3D444F72AB61F6CE07--