Re: [RFC] kvm: reverse call order of kvm_arch_destroy_vm() and kvm_destroy_devices()

From: Anthony Krowiak <akrowiak@linux.ibm.com>
To: Halil Pasic <pasic@linux.ibm.com>
Cc: linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, jjherne@linux.ibm.com,
	borntraeger@de.ibm.com, cohuck@redhat.com,
	mjrosato@linux.ibm.com, pbonzini@redhat.com,
	frankja@linux.ibm.com, imbrenda@linux.ibm.com, david@redhat.com,
	scottwood@freescale.com, agraf@suse.de
Subject: Re: [RFC] kvm: reverse call order of kvm_arch_destroy_vm() and kvm_destroy_devices()
Date: Thu, 11 Aug 2022 10:39:50 -0400
Message-ID: <62595a25-61a9-cc83-3941-6001fee512af@linux.ibm.com>
In-Reply-To: <20220801135310.62c34c63.pasic@linux.ibm.com>

On 8/1/22 7:53 AM, Halil Pasic wrote:
> On Wed, 27 Jul 2022 15:00:02 -0400
> Anthony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>> Any Takers??????
>>
>> On 7/5/22 2:54 PM, Tony Krowiak wrote:
>>> There is a new requirement for s390 secure execution guests that the
>>> hypervisor ensures all AP queues are reset and disassociated from the
>>> KVM guest before the secure configuration is torn down. It is the
>>> responsibility of the vfio_ap device driver to handle this.
>>>
>>> Prior to commit ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"),
>>> the driver reset all AP queues passed through to a KVM guest when notified
>>> that the KVM pointer was being set to NULL. Subsequently, the AP queues
>>> are only reset when the fd for the mediated device used to pass the queues
>>> through to the guest is closed (the vfio_ap_mdev_close_device() callback).
>>> This is not a problem when userspace is well-behaved and uses the
>>> KVM_DEV_VFIO_GROUP_DEL attribute to remove the VFIO group; however, if
>>> userspace for some reason does not close the mdev fd, a secure execution
>>> guest will tear down its configuration before the AP queues are
>>> reset because the teardown is done in the kvm_arch_destroy_vm function
>>> which is invoked prior to kvm_destroy_devices.
> As Matt has pointed out: we did not have the guarantee we need prior
> to that commit. For the next version, please drop the digression about
> the old behavior.
>
>>> This patch proposes a simple solution: rather than introducing a new
>>> notifier into vfio or callback into KVM, what about reversing the order
>>> in which kvm_arch_destroy_vm and kvm_destroy_devices are called? In
>>> some very limited testing (i.e., the automated regression tests for
>>> the vfio_ap device driver) this did not seem to cause any problems.
>>>
>>> The question remains, is there a good technical reason why the VM
>>> is destroyed before the devices it is using? This is not intuitive, so
>>> this is a request for comments on this proposed patch. The assumption
>>> here is that the mdev fd will get closed when the devices are destroyed.
> I did some digging! The function and the corresponding mechanism was
> introduced by 07f0a7bdec5c ("kvm: destroy emulated devices on VM
> exit"). Before that patch we used to have ref-counting, and the refcount
> got decremented in kvmppc_mpic_disconnect_vcpu() which in turn was
> called by kvm_arch_vcpu_free(). So this was basically arch specific
> stuff. For power (the patch came from power) the refcount was decremented
> before calling kvmppc_core_vcpu_free(). So I conclude the old scheme
> would have worked for us.
>
> Since the patch does not state any technical reasons, my guess is, that
> the choice was made somewhat arbitrarily under the assumption, that
> there are no requirements or dependencies with regards to the destruction
> of devices or with regards towards severing the connection between
> the devices and the VM. Under these assumptions the placement of
> the invocation of kvm_destroy_devices after kvm_arch_destroy_vm()
> did make sense, because if something that is destroyed in destroy_vm()
> did hold a live reference to the device, this reference will be cleaned
> up before kvm_destroy_devices() is invoked. So basically unless the
> devices hold references to each other, things look good. If the
> positions of kvm_arch_destroy_vm() and kvm_destroy_devices() are
> changed, then we basically need to assume that nothing that is destroyed
> in kvm_arch_destroy_vm() may logically hold a live reference (remember
> the refcount is gone, but pointers may still exist) to a kvm device.
> Does that hold? @Anthony, maybe you can answer this question for us...

I do not have an answer for this without doing a deep dive into the
code. I am not very familiar with the VM lifecycle. My hope was that
someone who knows this area would respond to this RFC. I am copying the
Signed-off-by email addresses for the patch (07f0a7bdec) you mentioned
above; maybe they can provide some insight into their choice of
ordering for the kvm_arch_destroy_vm() and kvm_destroy_devices() functions.
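
To make the open question concrete, here is a toy user-space model of
the two constraints being discussed. It is only a sketch, not kernel
code: every name in it (toy_vm, toy_destroy_devices, toy_run, ...) is
hypothetical, and it assumes the arch state may or may not keep an
uncounted pointer to a device. It records (a) whether the device reset
happened while the secure configuration was still live, and (b) whether
arch teardown ran while still believing it had a pointer to a device
that had already been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real KVM structures. */
struct toy_device {
	bool queues_reset;
};

struct toy_vm {
	struct toy_device *dev;      /* owned by the VM's device list */
	bool arch_has_dev_ref;       /* arch state keeps an uncounted pointer */
	bool secure_config_live;     /* secure configuration not yet torn down */
	bool reset_before_teardown;  /* result: reset ran while config was live */
	bool arch_saw_stale_ref;     /* result: arch teardown ran after dev freed */
};

/* Models kvm_destroy_devices(): reset the queues, then free the device. */
static void toy_destroy_devices(struct toy_vm *vm)
{
	if (vm->dev == NULL)
		return;
	vm->dev->queues_reset = true;
	vm->reset_before_teardown = vm->secure_config_live;
	free(vm->dev);
	vm->dev = NULL;
}

/* Models kvm_arch_destroy_vm(): drop arch state and the secure config. */
static void toy_arch_destroy_vm(struct toy_vm *vm)
{
	/* A stale pointer is only noted here, never dereferenced. */
	if (vm->arch_has_dev_ref && vm->dev == NULL)
		vm->arch_saw_stale_ref = true;
	vm->arch_has_dev_ref = false;
	vm->secure_config_live = false;
}

/* Run the teardown in one of the two orders and return the outcome. */
struct toy_vm toy_run(bool devices_first, bool arch_holds_ref)
{
	struct toy_vm vm = {0};

	vm.dev = calloc(1, sizeof(*vm.dev));
	assert(vm.dev != NULL);
	vm.arch_has_dev_ref = arch_holds_ref;
	vm.secure_config_live = true;

	if (devices_first) {
		toy_destroy_devices(&vm);
		toy_arch_destroy_vm(&vm);
	} else {
		toy_arch_destroy_vm(&vm);
		toy_destroy_devices(&vm);
	}
	return vm;
}
```

In this model, toy_run(true, ...) satisfies the reset-before-teardown
requirement, but toy_run(true, true) also flags the stale reference,
which is exactly the case that would need to be ruled out in the real
code before the order can be swapped.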

> Otherwise I will continue the digging from here, eventually.
>
> Also I have concerns about the following comments:
>
> static void kvm_destroy_devices(struct kvm *kvm)
> {
>          struct kvm_device *dev, *tmp;
>
>          /*
>           * We do not need to take the kvm->lock here, because nobody else
>           * has a reference to the struct kvm at this point and therefore
>           * cannot access the devices list anyhow.
> [..]
>
> Would this still hold when the order is changed?
>
> struct kvm_device_ops {
> [..]
>          /*
>           * Destroy is responsible for freeing dev.
>           *
>           * Destroy may be called before or after destructors are called
>           * on emulated I/O regions, depending on whether a reference is
>           * held by a vcpu or other kvm component that gets destroyed
>           * after the emulated I/O.
>           */
>          void (*destroy)(struct kvm_device *dev);
>
> This seems to document the order of things as is.
>
> Btw I would like to understand more about the lifecycle of these
> emulated I/O regions....
>
> @Paolo: I believe this is ultimately your turf. I'm just digging
> through the code, and the history to try to help along with this. We
> definitely need a solution for our problem. We would very much appreciate
> having your opinion!
>
> Regards,
> Halil
>
>>> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
>>> ---
>>>    virt/kvm/kvm_main.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index a49df8988cd6..edaf2918be9b 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1248,8 +1248,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>>    #else
>>>    	kvm_flush_shadow_all(kvm);
>>>    #endif
>>> -	kvm_arch_destroy_vm(kvm);
>>>    	kvm_destroy_devices(kvm);
>>> +	kvm_arch_destroy_vm(kvm);
>>>    	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>>>    		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>>>    		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);