Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

From: Christian Borntraeger <borntraeger@de.ibm.com>
To: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
	kvm@vger.kernel.org, frankja@linux.ibm.com, thuth@redhat.com,
	pasic@linux.ibm.com, david@redhat.com,
	linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
Date: Tue, 18 May 2021 18:35:50 +0200	[thread overview]
Message-ID: <cb21db5f-f303-5b4b-6fab-5b28728e8ef2@de.ibm.com> (raw)
In-Reply-To: <20210518183442.6e078ea1@ibm-vm>

On 18.05.21 18:34, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:20:22 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 18.05.21 18:13, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:45:18 +0200
>>> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>>    
>>>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>>>      
>>>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>>>> down, its memory was made unprotected, and then the protected VM
>>>>>>> itself was destroyed. Looping over the whole address space can
>>>>>>> take some time, considering the overhead of the various
>>>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>>>> would take a potentially long amount of time, depending on the
>>>>>>> amount of used memory.
>>>>>>>
>>>>>>> This patchseries implements a deferred destroy mechanism for
>>>>>>> protected guests. When a protected guest is destroyed, its
>>>>>>> memory is cleared in background, allowing the guest to restart
>>>>>>> or terminate significantly faster than before.
>>>>>>>
>>>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>>>> * it still has an address space associated (reboot case)
>>>>>>> * it does not have an address space anymore (shutdown case)
>>>>>>>
>>>>>>> For the reboot case, the reference count of the mm is increased,
>>>>>>> and then a background thread is started to clean up. Once the
>>>>>>> thread went through the whole address space, the protected VM is
>>>>>>> actually destroyed.
>>>>>>>
>>>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>>>> when the mm is torn down. Instead of just unmapping the pages
>>>>>>> when the address space is being torn down, they are also set
>>>>>>> aside. Later when KVM cleans up the VM, a thread is started to
>>>>>>> clean up the pages from the list.
>>>>>>
>>>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>>>
>>>>> yes
>>>>>       
>>>>>>>
>>>>>>> This means that the same address space can have memory belonging
>>>>>>> to more than one protected guest, although only one will be
>>>>>>> running, the others will in fact not even have any CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>       
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>>       
>>>>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
>>>>>> memory exhaustion?
>>>>>
>>>>> in case of reboot, not much; the pages were in use are still in
>>>>> use after the reboot, and they can be swapped.
>>>>>
>>>>> in case of a shutdown, yes, because the pages are really taken
>>>>> aside and cleared/destroyed in background. they cannot be
>>>>> swapped. they are freed immediately as they are processed, to try
>>>>> to mitigate memory exhaustion scenarios.
>>>>>
>>>>> in the end, this patchseries is a tradeoff between speed and
>>>>> memory consumption. the memory needs to be cleared up at some
>>>>> point, and that requires time.
>>>>>
>>>>> in cases where this might be an issue, I introduced a new KVM flag
>>>>> to disable lazy destroy (patch 10)
>>>>
>>>> Maybe we could piggy-back on the OOM-kill notifier and then fall
>>>> back to synchronous freeing for some pages?
>>>
>>> I'm not sure I follow
>>>
>>> once the pages have been set aside, it's too late
>>>
>>> while the pages are being set aside, every now and then some memory
>>> needs to be allocated. the allocation is atomic, not allowed to use
>>> emergency reserves, and can fail without warning. if the allocation
>>> fails, we clean up one page and continue, without setting aside
>>> anything (patch 9)
>>>
>>> so if the system is low on memory, the lazy destroy should not make
>>> the situation too much worse.
>>>
>>> the only issue here is starting a normal process in the host (maybe
>>> a non secure guest) that uses a lot of memory very quickly, right
>>> after a large secure guest has terminated.
>>
>> I think page cache page allocations do not need to be atomic.
>> In that case the kernel might stil l decide to trigger the oom
>> killer. We can let it notify ourselves free 256 pages synchronously
>> and avoid the oom kill. Have a look at the virtio-balloon
>> virtio_balloon_oom_notify
> 
> the issue is that once the pages have been set aside, it's too late.
> the OOM notifier would only be useful if we get notified of the OOM
> situation _while_ setting aside the pages.
> 
> unless you mean that the notifier should simply wait until the thread
> has done (some of) its work?

Exactly. Let the notifier wait until you have freed 256pages and return
256 to the oom notifier.