Re: + mm-introduce-reported-pages.patch added to -mm tree

From: David Hildenbrand <david@redhat.com>
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Michal Hocko <mhocko@kernel.org>
Cc: akpm@linux-foundation.org, aarcange@redhat.com,
	dan.j.williams@intel.com, dave.hansen@intel.com,
	konrad.wilk@oracle.com, lcapitulino@redhat.com,
	mgorman@techsingularity.net, mm-commits@vger.kernel.org,
	mst@redhat.com, osalvador@suse.de, pagupta@redhat.com,
	pbonzini@redhat.com, riel@surriel.com, vbabka@suse.cz,
	wei.w.wang@intel.com, willy@infradead.org,
	yang.zhang.wz@gmail.com, linux-mm@kvack.org
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
Date: Wed, 13 Nov 2019 00:10:21 +0100	[thread overview]
Message-ID: <bfcabcdd-2824-f092-d546-8a9ce4325225@redhat.com> (raw)
In-Reply-To: <c2aa52d9265b125d1325a0a78a28362e8c37ff01.camel@linux.intel.com>

>>>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
>>>> along with a lockless check if the page is free.
>>>
>>> Okay, that part I think I get. However doesn't all that logic more or less
>>> ignore the watermarks? It seems like you could cause an OOM if you don't
>>> have the necessary checks in place for that.
>>
>> Any approach that temporarily blocks some free pages from getting
>> allocated will essentially have this issue, no? I think one main design
>> point to minimize false OOMs was to limit the number of pages we report
>> at a time. Or what do you propose here in addition to that?
> 
> If you take a look at __isolate_free_page it was performing a check to see
> if pulling the page would place us below the minimum watermark for pages.
> Odds are you should probably look at somehow incorporating that into the
> solution before you pull the page. I have updated my approach to check for

Ah, now I see what you mean. Makes sense!

> the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> start reporting, and then I am using __isolate_free_page which will check
> the minimum watermark to make sure I don't cross that.

Yeah, you probably want to check the watermark before doing any 
reporting - I assume.

> 
>>>> I think it should be something like this (ignoring different
>>>> migratetypes and such for now)
>>>>
>>>> 1. Test lockless if page is free: Not free? Done.
>>>
>>> So this should help to reduce the liklihood of races in the steps below.
>>> However it might also be useful if the code had some other check to see if
>>> it was done other than just making a pass through the bitmap.
>>
>> Yes.
>>
>>> One thing I had brought up with Nitesh was the idea of maybe doing some
>>> sort of RCU bitmap type approach. Basically while we hold the zone lock we
>>> could swap out the old bitmap for a new one. We could probably even keep a
>>> counter at the start of the structure so that we could track how many bits
>>> are actually set there. Then it becomes less likely of having a race where
>>> you free a page and set the bit and the hinting thread tests and clears
>>> the bit but doesn't see the freed page since it is not synchronized.
>>> Otherwise your notification setup and reporting thread may need a few smp
>>> barriers added where necessary.
>>
>> Yes, swapping out the bitmap via RCU is also be a way to make memory
>> hotplug work.
>>
>> I was also thinking about a different bitmap approach. Store for each
>> section a bitmap. Use a meta bitmap with a bit for each section that
>> contains pages to report. Sparse zones and memory hot(un)plug would not
>> be a real issue anymore.
> 
> I had thought about that too. The only problem is that the section has to
> be power of 2 sized and I don't know if we want to be increasing the size

... are there sections that are not a power of 2? x86_64: 128MB, s390x: 
256MB, ...

It does not really make sense to have sections that are not a power of 
two, thinking about page tables ... I would really be interested where 
something like that is possible.

> by 100% in the base case, although I guess there is an 8 byte pad on the
> structure if page extensions are enabled.
> 
>> One could go one step further and only have a bitmap with a bit for each
>> section. Only remember that some (large) page was not reported in that
>> section (e.g., after buddy merging). In the reporting thread, report all
>> free pages within that section. You could end up reporting the same page
>> a couple of times, but the question would be if this is relevant at all.
>> One would have to prototype and measure that.
>>
>> Long story short, I am not 100% a fan of the current "bitmap per zone"
>> approach but is is fairly simple to start with :)
> 
> Agreed. Although I worry that a bitmap per section may be even more
> complex.

Slightly, yes.

> 
>>>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
>>>
>>> Doesn't this have the side effect of draining all the percpu caches in
>>> order to make certain to flush the pages we isolated from there?
>>
>> While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
>> think isolation will. Where did you spot something like this in
>> mm/page_isolation.c?
> 
> On the end of set_migratetype_isolate(). The last thing it does is call
> drain_all_pages.

Ahh, missed that, thanks. Yeah, one could probably make the 
configurable, because for that use case, where we already expect a free 
page, we don't need that.

> 
>>>> or with an allocation). Done.
>>>> 3. test_pages_isolated()
>>>
>>> So I have reviewed the code and I don't see how this could conflict with
>>> other callers isolating the pages. If anything it seems like if another
>>> thread has already isolated the pages you would end up getting a false
>>> positive, reporting the pages, and pulling them back out of isolation.
>>
>> Isolated pages cannot be isolated. This is tracked via the migratetype.
> 
> Thanks. I see that now that you pointed it out up above.
> 
>>>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>>>
>>> I would hope it is rare. However for something like a max order page I
>>> could easily see a piece of it having been pulled out. I would think this
>>> case would be exceedingly expensive since you would have to put back any
>>> pages you had previous moved into isolation.
>>
>> I guess it is rare, there is a tiny slot between checking if the page is
>> free and isolating it. Would have to see that in action.
> 
> Yeah, probably depends on the number of cores in play as well since the
> liklihood of a collision is probably pretty low.
> 
>>>> 3b. yes? Report, then undo_isolate_page_range()
>>>>
>>>> If we would run into performance issues with the current page isolation
>>>> implementation (esp. locking), I think there are some nice
>>>> cleanups/reworks possible of which all current users could benefit
>>>> (especially accross pageblocks).
>>>
>>> To me this feels a lot like what you had for this solution near the start.
>>> Only now instead of placing the pages into an array you are tracking a
>>> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.
>>
>> Now we have a clean MM interface to do that :) And yes, which data
>> structure we're using becomes irrelevant.
>>
>>> This sounds far more complex to me then it probably needs to be since just
>>> holding the pages with the buddy type cleared should be enough to make
>>> them temporarily unusable for other threads, and even in your case you are
>>
>> If you have a page that is not PageBuddy() and not movable within
>> ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) ==
>> ZONE_MOVABLE). This can be triggered via memory offlining, when
>> isolating the page range.
>>
>> If your approach does exactly that (clear PageBuddy() on a
>> ZONE_MOVABLE), it would be a bug. The only safe way is to have the
>> pageblock(s) isolated.
> 
>  From what I can tell it looks like if the page is in ZONE_MOVABLE the
> buddy flag doesn't even matter since the only thing checked is
> PageReserved. There is a check early on in the main loop that will
> "continue" if zone_idx(zone) == ZONE_MOVABLE.

Very right, missed that :) reserved pages are a different story.

> 
> The refcount is 0 so that will cause us to "continue" and not be counted
> as an unmovable page. The downside is the scan cannot take advantage of
> the "PageBuddy" value to skip over us so it just has to skip over the
> section one page at a time.
> 
> The advantage here is that we can still offline a region that contains
> pages that are being reported. I would think that it would fail if the

Yes, you can isolate +offline, while the isolation approach would 
require to actually try again (right now manually).

> pages in the region are isolated since as you pointed out you get an EBUSY
> when you attempt to isolate a page that is already isolated and as such
> removal will fail won't it?

Right now, yes.

(we should rework that code either way to return -EAGAIN in that case 
and let memory offlining try again automatically. But we have to rework 
the -EAGAIN vs. -EBUSY handling in memory offlining code at one point 
either way, I discussed that partially with Michal recently. There is a 
lot of cleaning up to do.)

> 
>>> still having to use the scatterlist in order to hold the pages and track
>>> what you will need to undo the isolation later.
>>
>> I think it is very neat and not complex at all. Page isolation is a nice
>> feature we have in the kernel. :) It deserves some cleanups, though.
> 
> We can agree to disagree. At this point you are talking about adding bits
> for sections and pages, and in the meantime I am working with zones and
> pages. I believe finding free space in the section may be much more tricky
> than finding it in the zone or page has been. Now that I am rid of the
> list manipulators my approach may soon surpass the bitmap one in terms of
> being less intrusive/complex.. :-)

I am definitely interested to see that approach :) Good to see that the 
whole discussion in this big thread turned out to be productive.

-- 

Thanks,

David / dhildenb