Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

From: Nitesh Narayan Lal <nitesh@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com,
	wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com,
	dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com,
	aarcange@redhat.com, Alexander Duyck <alexander.duyck@gmail.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Date: Tue, 19 Feb 2019 09:17:43 -0500	[thread overview]
Message-ID: <0d7ff493-71f3-8707-8400-b51f1ce1a2bd@redhat.com> (raw)
In-Reply-To: <1abac6db-5e1a-2889-9831-707c2b78b0f3@redhat.com>

[-- Attachment #1.1: Type: text/plain, Size: 4334 bytes --]

On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>> There are two main ways to avoid allocation:
>>>>> 1. do not add extra data on top of each chunk passed
>>>> If I am not wrong then this is close to what we have right now.
>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>> allocation for the request. Once you're asynchronous via a notification
>>> mechanisnm, there is no real need for a thread anymore, hopefully.
>> Whether we should go with kthread or without it, I would like to do some
>> performance comparison before commenting on this.
>>>> One issue I see right now is that I am polling while host is freeing the
>>>> memory.
>>>> In the next version I could tie the logic which returns pages to the
>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>> (i.e.., it happens once we receive an response from the host)
>>> The question is, what happens when freeing pages and the array is not
>>> ready to be reused yet. In that case, you want to somehow continue
>>> freeing pages without busy waiting or eventually not reporting pages.
>> This is what happens right now.
>> Having kthread or not should not effect this behavior.
>> When the array is full the current approach simply skips collecting the
>> free pages.
> Well, it somehow does affect your implementation. If you have a kthread
> you always have to synchronize against the VCPU: "Is the pcpu array
> ready to be used again".
>
> Once you do it asynchronously from your VCPU without another thread
> being involved, such synchronization is not required. Simply prepare a
> request and send it off. Reuse the pcpu array instantly. At least that's
> the theory :)
>
> If you have a guest bulk freeing a lot of memory, I guess temporarily
> dropping free page hints could be counter-productive. It really depends
> on how fast the thread gets scheduled and how long the hinting process
> takes. Having another thread involved might add a lot to that latency to
> that formula. We'll have to measure, but my gut feeling is that once we
> do stuff asynchronously, there is no need for a thread anymore.
This is true.
>
>>> The callback should put the pages back to the buddy and free the request
>>> eventually to have a fully asynchronous mechanism.
>>>
>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>> I am not sure if this is an arbitrary number we came up with here. We
>>> should really play with different orders to find a hot spot. I wouldn't
>>> consider this high priority, though. Getting the whole concept right to
>>> be able to deal with any magic number we come up should be the ultimate
>>> goal. (stuff that only works with huge pages I consider not future
>>> proof, especially regarding fragmented guests which can happen easily)
>> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
>> a specific workload we don't get any memory back until we re-run the
>> program and buddy finally starts merging of pages of order MAX_ORDER -1.
>> This is why I think we may make this configurable from compile time and
>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
> Eventually pages will never get merged. Assume you have 1 page of a
> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
> wasted (depending on the arch). This stuff can sum up.

After the discussion, here are the changes on which I am planning to
work next:
1. Get rid of the kthread and dynamically allocate a per-cpu array to
hold the
isolated pages. As soon as the initial per-cpu array is completely
scanned, release it
so that we don't end up blocking anything.
2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
array size to 256
for now. As we are doing asynchronous reporting we should be fine with a
lower size array.
3. As soon as the host responds, release the pages back to the buddy
from the callback and free the request.

Benefits wrt current implementation:
1. We will not eat up performance due to kernel thread.
2. We will still be doing reporting asynchronously=> no blocking.
3. Hopefully, we will be able to free more memory.
-- 
Regards
Nitesh

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]