Re: [PATCH] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind

From: Vlastimil Babka <vbabka@suse.cz>
To: Yang Shi <yang.shi@linux.alibaba.com>, Michal Hocko <mhocko@kernel.org>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Eric Dumazet <edumazet@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	netdev@vger.kernel.org
Subject: Re: [PATCH] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
Date: Fri, 21 Jun 2019 13:33:47 +0200	[thread overview]
Message-ID: <9945a66f-4434-b2a6-63ac-3240ef1d52c9@suse.cz> (raw)
In-Reply-To: <d185f277-85ed-4dc1-8ff2-2984b54a0d64@linux.alibaba.com>

On 6/20/19 6:08 PM, Yang Shi wrote:
> 
> 
> On 6/20/19 12:18 AM, Vlastimil Babka wrote:
>> On 6/19/19 8:19 PM, Yang Shi wrote:
>>>>>> This is getting even more muddy TBH. Is there any reason that we
>>>>>> have to
>>>>>> handle this problem during the isolation phase rather the migration?
>>>>> I think it was already said that if pages can't be isolated, then
>>>>> migration phase won't process them, so they're just ignored.
>>>> Yes，exactly.
>>>>
>>>>> However I think the patch is wrong to abort immediately when
>>>>> encountering such page that cannot be isolated (AFAICS). IMHO it should
>>>>> still try to migrate everything it can, and only then return -EIO.
>>>> It is fine too. I don't see mbind semantics define how to handle such
>>>> case other than returning -EIO.
>> I think it does. There's:
>> If MPOL_MF_MOVE is specified in flags, then the kernel *will attempt to
>> move all the existing pages* ... If MPOL_MF_STRICT is also specified,
>> then the call fails with the error *EIO if some pages could not be moved*
>>
>> Aborting immediately would be against the attempt to move all.
>>
>>> By looking into the code, it looks not that easy as what I thought.
>>> do_mbind() would check the return value of queue_pages_range(), it just
>>> applies the policy and manipulates vmas as long as the return value is 0
>>> (success), then migrate pages on the list. We could put the movable
>>> pages on the list by not breaking immediately, but they will be ignored.
>>> If we migrate the pages regardless of the return value, it may break the
>>> policy since the policy will *not* be applied at all.
>> I think we just need to remember if there was at least one page that
>> failed isolation or migration, but keep working, and in the end return
>> EIO if there was such page(s). I don't think it breaks the policy. Once
>> pages are allocated in a mapping, changing the policy is a best effort
>> thing anyway.
> 
> The current behavior is:
> If queue_pages_range() return -EIO (vma is not migratable, ignore other 
> conditions since we just focus on page migration), the policy won't be 
> set and no page will be migrated.

Ah, I see. IIUC the current behavior is due to your recent commit
a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified") in order to fix commit 6f4576e3687b
("mempolicy: apply page table walker on queue_pages_range()"), which
caused -EIO to be not returned enough. But I think you went too far and
instead return -EIO too much. If I look at the code in parent commit of
6f4576e3687b, I can see in queue_pages_range():

if ((flags & MPOL_MF_STRICT) ||
        ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
        vma_migratable(vma))) {

        err = queue_pages_pgd_range(vma, start, endvma, nodes,
                                flags, private);
        if (err)
                break;
}

and in queue_pages_pte_range():

if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
        migrate_page_add(page, private, flags);
else
        break;

So originally, there was no returning of -EIO due to !vma_migratable() -
as long as MPOL_MF_STRICT and MPOL_MF_MOVE* was specified, the code
tried to queue for migration everything it could and didn't ever abort,
AFAICS. And I still think that's the best possible behavior.

> However, the problem here is the vma might look migratable, but some or 
> all the underlying pages are unmovable. So, my patch assumes the vma is 
> *not* migratable if at least one page is unmovable. I'm not sure if it 
> is possible to have both movable and unmovable pages for the same 
> mapping or not, I'm supposed the vma would be split much earlier.
> 
> If we don't abort immediately, then we record if there is unmovable 
> page, then we could do:
> #1. Still follows the current behavior (then why not abort immediately?)

See above how the current behavior differs from the original one.

> #2. Set mempolicy then migrate all the migratable pages. But, we may end 
> up with the pages on node A, but the policy says node B. Doesn't it 
> break the policy?

The policy can already be "broken" (violated is probably better word) by
migrate_pages() failing. If that happens, we don't rollback the migrated
pages and reset the policy back, right? I think the manpage is clear
that MPOL_MF_MOVE is a best-effort. Userspace will know that not
everything was successfully migrated (via -EIO), and can take whatever
steps it deems necessary - attempt rollback, determine which exact
page(s) are violating the policy, etc.

>>
>>>>
>