Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Li Xinhai <lixinhai.lxh@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: akpm <akpm@linux-foundation.org>, mhocko <mhocko@suse.com>,
	"yang.shi" <yang.shi@linux.alibaba.com>,
	n-horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: Re: [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone
Date: Wed, 15 Jan 2020 13:07:17 -0800
Message-ID: <c102d153-5093-94c9-7e6d-65fd6881831d@oracle.com> (raw)
In-Reply-To: <a2f0b0be-e9f1-bef6-1fab-c9d1a556d57a@oracle.com>

Naoya, would appreciate any comments you have as this seems to be related
to your changes to add huge page migration support.

On 1/14/20 5:07 PM, Mike Kravetz wrote:
> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>    - Is that leftover from the the days when huge page migration was not
>      supported?
>    - Is it just because huge page migration is more likely to fail than
>      base page migration.


According to man page, mbind was added to v2.6.7 kernel.  git history does
not go back that far, so I first looked at v2.6.12.

Definitions from include/linux/mempolicy.h
------------------------------------------
/* Policies */
#define MPOL_DEFAULT    0
#define MPOL_PREFERRED  1
#define MPOL_BIND       2
#define MPOL_INTERLEAVE 3

/* Flags for mbind */
#define MPOL_MF_STRICT  (1<<0)  /* Verify existing pages in the mapping */

Note the MPOL_MF_STRICT would simply verify current page placement.  At
this time, hugetlb pages were not part of this verification.

From v2.6.12 routine check_range()
...
	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (!vma->vm_next && vma->vm_end < end)
			return ERR_PTR(-EFAULT);
		if (prev && prev->vm_end < vma->vm_start)
			return ERR_PTR(-EFAULT);
		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
			err = verify_pages(vma->vm_mm,
					   vma->vm_start, vma->vm_end, nodes);
			if (err) {
				first = ERR_PTR(err);
				break;
			}
		}
		prev = vma;
	}
...

man page history does not go back this far.  The first mbind man page has
some things that are interesting:
1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge
   page mappings (even though the code does this).
2) Support for huge page policy was added with 2.6.16
3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to
   move pages to the requested node with this flag.

Next look at v2.6.16 kernel
==========================
This kernel introduces the MPOL_MF_MOVE* flags
#define MPOL_MF_MOVE    (1<<1)  /* Move pages owned by this process to conform
				   to mapping */
#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */

Once again, the up front check_range() routine will skip hugetlb vma's.
Note that check_range() will also isolate pages.  So since hugetlb vma's
are skipped, no hugetlb pages will be isolated.  Since no hugetlb pages
are isolated, none will be on the list to be migrate.  Therefore, hugetlb
pages are effectively ignored for the new MPOL_MF_MOVE* flags.  This makes
sense as huge page migration support was not added until the v3.12 kernel.

Note that at when MPOL_MF_MOVE* flags were added to the mbind man page,
the statement "MPOL_MF_STRICT is ignored on huge page mappings right now."
was added.  This was later changed to "MPOL_MF_STRICT is ignored on huge
page mappings."

Next look at v3.12 kernel
=========================
The v3.12 code looks similiar to today's code.  In the verify/isolation
phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar
to queue_pages_hugetlb().  Neither will cause errors at this point in the
process.  And, hugetlb pages with a mapcount == 1 will potentially be
isolated and added to the list of pages to be migrated.  In addition, if
the subsequent call to migrate_pages() fails to migrate ANY page, we
return -EIO if MPOL_MF_STRICT is set.  This is true even if the only page(s)
that failed to migrate were hugetlb pages.

It should also be noted that no mbind man page updates were made WRT
hugetlb pages after migration support was added.

Summary
=======
It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
mappings." is left over from the original mbind implementation.  When
the huge page migration support was added, I can not be sure if ignoring
MPOL_MF_STRICT for huge pages during the verify/isolation phase was
intentional.  It seems like it was as the return value from
isolate_huge_page() is ignored.

What should we do?
==================
1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
   seen as conflicting with man page has existed since v3.12 and I am
   not aware of any complaints.
2) In addition to optimizations by Li Xinhai, modify code to truly ignore
   MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
   after a failure of migrate_pages().  We could simply traverse the list
   of pages that were not migrated looking for any non-hugetlb page.
3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
   and modify code accordingly.

My suggestion would be for 1 or 2.  Thoughts?

-- 
Mike Kravetz


  parent reply index

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-14  9:16 [PATCH 1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable Li Xinhai
2020-01-14  9:16 ` [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone Li Xinhai
2020-01-14 14:09   ` Li Xinhai
2020-01-14 18:27     ` Yang Shi
2020-01-15  1:07     ` Mike Kravetz
2020-01-15  1:24       ` Yang Shi
2020-01-15  4:28         ` Mike Kravetz
2020-01-15  5:23           ` Yang Shi
2020-01-15  7:36             ` Li Xinhai
2020-01-15 17:16               ` Yang Shi
2020-01-15 21:07       ` Mike Kravetz [this message]
2020-01-15 21:30         ` Yang Shi
2020-01-15 21:45           ` Mike Kravetz
2020-01-15 21:59             ` Yang Shi
2020-01-16  8:07               ` HORIGUCHI NAOYA(堀口 直也)
2020-01-16 15:32                 ` Li Xinhai
2020-01-16  7:59         ` Michal Hocko
2020-01-16 19:22           ` Mike Kravetz
2020-01-17  2:32             ` Yang Shi
2020-01-17  2:38             ` Li Xinhai
2020-01-17  7:57               ` Michal Hocko
2020-01-17 12:05                 ` Li Xinhai
2020-01-17 15:20                   ` Michal Hocko
2020-01-17 15:46                     ` Li Xinhai
2020-01-20 12:45                       ` Michal Hocko
2020-01-21 14:15                         ` Li Xinhai
2020-01-21 14:53                           ` Michal Hocko
2020-01-22 13:55                             ` Li Xinhai
2020-01-14 19:12 ` [PATCH 1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable Mike Kravetz
2020-01-15  1:25 ` Andrew Morton

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c102d153-5093-94c9-7e6d-65fd6881831d@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=lixinhai.lxh@gmail.com \
    --cc=mhocko@suse.com \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git