Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline

From: David Hildenbrand <david@redhat.com>
To: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
Date: Thu, 3 Sep 2020 20:31:04 +0200
Message-ID: <d89510b1-a6a2-a874-7ffc-ba7a37d4212d@redhat.com>
In-Reply-To: <CA+CK2bBTfmhTWNRrxnVKi=iknqq-iZxNZSnwNA9C9tWAJzRxmw@mail.gmail.com>

On 03.09.20 20:23, Pavel Tatashin wrote:
> On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.09.20 08:38, Michal Hocko wrote:
>>> On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
>>>> On 9/2/20 5:13 PM, Michal Hocko wrote:
>>>>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>>>>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>>>>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thread#1 - continue
>>>>>>>>>          free_unref_page_commit
>>>>>>>>>            migratetype = get_pcppage_migratetype(page);
>>>>>>>>>               // get old migration type
>>>>>>>>>            list_add(&page->lru, &pcp->lists[migratetype]);
>>>>>>>>>               // add new page to already drained pcp list
>>>>>>>>>
>>>>>>>>> Thread#2
>>>>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
>>>>>>>>>
>>>>>>>>> The fix is to try to drain per-cpu lists again after
>>>>>>>>> check_pages_isolated_cb() fails.
>>>>>>>>
>>>>>>>> But this means that the page is not isolated and so it could be reused
>>>>>>>> for something else. No?
>>>>>>>
>>>>>>> The page is in a movable zone, has zero references, and the section is
>>>>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
>>>>>>> set. The page should be offlinable, but it is lost in a pcp list as
>>>>>>> that list is never drained again after the first failure to migrate
>>>>>>> all pages in the range.
>>>>>>
>>>>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>>>>>> could allocate it from the pcplist before we do the extra drain. But then it
>>>>>> becomes "visible again" and the loop in __offline_pages() should catch it by
>>>>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>>>>>> already marked as isolated, so the page (freed by migration) won't end up on the
>>>>>> pcplist again.
>>>>>
>>>>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
>>>>> could be used for non migrateable objects. Or does anything prevent that
>>>>> from happening?
>>>>
>>>> In a movable zone, the allocation should not be used for non migrateable
>>>> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
>>>> regardless of this race (analogically for migrating away from CMA pageblocks).
>>>>
>>>>> We really do depend on isolation to not allow reuse when offlining.
>>>>
>>>> This is not really different than if the page on pcplist was allocated just a
>>>> moment before the offlining, thus isolation started. We ultimately rely on being
>>>> able to migrate any allocated pages away during the isolation. This "freeing to
>>>> pcplists" race doesn't fundamentally change anything in this regard. We just
>>>> have to guarantee that pages on pcplists will be eventually flushed, to make
>>>> forward progress, and there was a bug in this aspect.
>>>
>>> You are right. I managed to confuse myself yesterday. The race is
>>> impossible for !ZONE_MOVABLE because we do PageBuddy check there. And on
>>> the movable zone we are not losing the migrateability property.
>>>
>>> Pavel, I think this will be useful information to add to the changelog.
>>> We should also document this in the code to prevent further
>>> confusion. I would suggest something like the following:
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index 242c03121d73..56d4892bceb8 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>   * pageblocks we may have modified and return -EBUSY to caller. This
>>>   * prevents two threads from simultaneously working on overlapping ranges.
>>>   *
>>> + * Please note that there is no strong synchronization with the page allocator
>>> + * either. Pages might be freed while their page blocks are marked ISOLATED.
>>> + * In some cases pages might still end up on pcp lists and that would allow
>>> + * for their allocation even when they are in fact isolated already. Depending on
>>> + * how strong of a guarantee the caller needs drain_all_pages might be needed
>>> + * (e.g. __offline_pages will need to call it after check for isolated range for
>>> + * a next retry).
>>> + *
>>
>> As expressed in reply to v2, I dislike this hack. There is strong
>> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
>> just plain ugly.
>>
>> Can't we temporarily disable PCP (while some pageblock in the zone is
>> isolated, which we know e.g., due to the counter), so no new pages get
>> put into PCP lists after draining, and re-enable after no pageblocks are
>> isolated again? We keep draining the PCP, so it doesn't seem to be of a
>> lot of use during that period, no? It's a performance hit already.
>>
>> Then, we would only need exactly one drain. And we would only have to
>> check on the free path whether PCP is temporarily disabled.
> 
> Hm, we could use static branches to disable it, that would keep
> release code just as fast, but I am worried it will make code even
> uglier. Let's see what others in this thread think about this idea.

It would at least stop this "allocate from MIGRATE_ISOLATE" behavior
and the "drain when you feel like it, and drain more frequently to work
around broken PCP code" handling.
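
To make the idea concrete, here is a rough userspace model of it, not
kernel code and not tested against any tree. Every name in it
(zone_model, pcp_disabled, free_page_model, ...) is hypothetical, and it
uses a plain counter check on the free path rather than the static
branch Pavel mentions. It only illustrates the ordering: bump the
isolation counter first, drain exactly once, and have the free path
bypass PCP for as long as the counter is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define PCP_MAX 8

struct zone_model {
	int nr_isolate_pageblock;	/* isolated pageblocks in this zone */
	int pcp[PCP_MAX];		/* stand-in for one per-cpu free list */
	int pcp_count;			/* pages currently on the pcp list */
	int buddy_count;		/* pages handed back to the buddy lists */
};

/* PCP counts as disabled while any pageblock in the zone is isolated. */
bool pcp_disabled(const struct zone_model *z)
{
	return z->nr_isolate_pageblock > 0;
}

/* Move everything from the pcp list back to the buddy lists. */
void drain_pcp(struct zone_model *z)
{
	z->buddy_count += z->pcp_count;
	z->pcp_count = 0;
}

/* Disable PCP first, then drain once; nothing can refill it afterwards. */
void start_isolation(struct zone_model *z)
{
	z->nr_isolate_pageblock++;
	drain_pcp(z);
}

/* PCP becomes usable again once the last isolated pageblock is gone. */
void end_isolation(struct zone_model *z)
{
	assert(z->nr_isolate_pageblock > 0);
	z->nr_isolate_pageblock--;
}

/* Free path: the only extra work is the check whether PCP is disabled. */
void free_page_model(struct zone_model *z, int pfn)
{
	if (pcp_disabled(z) || z->pcp_count == PCP_MAX) {
		z->buddy_count++;	/* bypass PCP, free directly */
		return;
	}
	z->pcp[z->pcp_count++] = pfn;
}
```

In this model, a page freed between start_isolation() and
end_isolation() never lands on the pcp list, so the single drain in
start_isolation() is enough. A real implementation would additionally
have to deal with frees that are already in flight on other CPUs when
the counter is bumped, which this single-threaded sketch ignores.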

Anyhow, I'll be offline until next Tuesday, cheers!

-- 
Thanks,

David / dhildenb