Re: On guest free page hinting and OOM

From: David Hildenbrand <david@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	Rik van Riel <riel@surriel.com>,
	dodgen@google.com, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	dhildenb@redhat.com, Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: On guest free page hinting and OOM
Date: Tue, 2 Apr 2019 09:42:21 +0200	[thread overview]
Message-ID: <6a612adf-e9c3-6aff-3285-2e2d02c8b80d@redhat.com> (raw)
In-Reply-To: <CAKgT0UcJuD-t+MqeS9geiGE1zsUiYUgZzeRrOJOJbOzn2C-KOw@mail.gmail.com>

On 01.04.19 22:56, Alexander Duyck wrote:
> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>> The interesting thing is most probably: Will the hinting size usually be
>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>>> something in the range of 1GB. But this is an interesting question to
>>>> look into.
>>>>
>>>> Also, if the admin does not care about performance implications when
>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>>
>>> "close to OOM" is what I meant.
>>
>> Problem is, host admin is the one adding memory. Guest admin is
>> the one that knows about performance.
> 
> The thing we have to keep in mind with this is that we are not dealing
> with the same behavior as the balloon driver. We don't need to inflate
> a massive hint and hand that off. Instead we can focus on performing
> the hints on much smaller amounts and do it incrementally over time
> with the idea being as the system sits idle it frees up more and more
> of the inactive memory on the system.
> 
> With that said, I still don't like the idea of us even trying to
> target 1GB of RAM for hinting. I think it would be much better if we
> stuck to smaller sizes and kept things down to a single digit multiple
> of THP or higher order pages. Maybe something like 64MB of total
> memory out for hinting.

1GB was just a number I came up with. But please note, as VCPUs hint in
parallel, even though each request is only 64MB in size, things can sum up.

> 
> All we really would need to make it work would be to possibly look at
> seeing if we can combine PageType values. Specifically what I would be
> looking at is a transition that looks something like Buddy -> Offline
> -> (Buddy | Offline). We would have to hold the zone lock at each
> transition, but that shouldn't be too big of an issue. If we are okay
> with possibly combining the Offline and Buddy types we would have a
> way of tracking which pages have been hinted and which have not. Then
> we would just have to have a thread running in the background on the
> guest that is looking at the higher order pages and pulling 64MB at a
> time offline, and when the hinting is done put them back in the "Buddy
> | Offline" state.

That approach may have other issues to solve (1 thread vs. many VCPUs,
scanning all buddy pages over and over again) and other implications
that might be undesirable (hints performed even more delayed, additional
thread activity). I wouldn't call it the ultimate solution.

Your approach sounds very interesting to play with, however
at this point I would like to avoid throwing away Nitesh work once again
to follow some other approach that looks promising. If we keep going
like that, we'll spend another ~10 years working on free page hinting
without getting anything upstream. Especially if it involves more
core-MM changes. We've been there, we've done that. As long as the
guest-host interface is generic enough, we can play with such approaches
later in the guest. Important part is that the guest-host interface
allows for that.

> 
> I view this all as working not too dissimilar to how a standard Rx
> ring in a network device works. Only we would want to allocate from
> the pool of "Buddy" pages, flag the pages as "Offline", and then when
> the hint has been processed we would place them back in the "Buddy"
> list with the "Offline" value still set. The only real changes needed
> to the buddy allocator would be to add some logic for clearing/merging
> the "Offline" setting as necessary, and to provide an allocator that
> only works with non-"Offline" pages.

Sorry, I had to smile at the phrase "only" in combination with "provide
an allocator that only works with non-Offline pages" :) . I guess you
realize yourself that these are core-mm changes that might easily be
rejected upstream because "the virt guys try to teach core-MM yet
another special case". I agree that this is nice to play with,
eventually that approach could succeed and be accepted upstream. But I
consider this long term work.

-- 

Thanks,

David / dhildenb