linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
@ 2020-09-18 17:02 Tony Krowiak
  2020-09-21  5:48 ` Christian Borntraeger
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Tony Krowiak @ 2020-09-18 17:02 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: pmorel, pasic, alex.williamson, cohuck, kwankhede, borntraeger,
	Tony Krowiak

Attempting to unregister Guest Interruption Subclass (GISC) when the
link between the matrix mdev and KVM has been removed results in the
following:

   "Kernel panic -not syncing: Fatal exception: panic_on_oops"

This patch fixes this bug by verifying the matrix mdev and KVM are still
linked prior to unregistering the GISC.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e0bde8518745..847a88642644 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
  */
 static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
 {
-	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)
-		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
-	if (q->saved_pfn && q->matrix_mdev)
-		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
-				 &q->saved_pfn, 1);
+	if (q->matrix_mdev) {
+		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
+			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
+						 q->saved_isc);
+		if (q->saved_pfn)
+			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
+					 &q->saved_pfn, 1);
+	}
+
 	q->saved_pfn = 0;
 	q->saved_isc = VFIO_AP_ISC_INVALID;
 }
-- 
2.21.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
@ 2020-09-21  5:48 ` Christian Borntraeger
  2020-09-21 11:56   ` Halil Pasic
  2020-09-21  8:24 ` Pierre Morel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Christian Borntraeger @ 2020-09-21  5:48 UTC (permalink / raw)
  To: Tony Krowiak, linux-s390, linux-kernel, kvm
  Cc: pmorel, pasic, alex.williamson, cohuck, kwankhede



On 18.09.20 19:02, Tony Krowiak wrote:
> Attempting to unregister Guest Interruption Subclass (GISC) when the
> link between the matrix mdev and KVM has been removed results in the
> following:
> 
>    "Kernel panic -not syncing: Fatal exception: panic_on_oops"

I think the full backtrace would be better in case someone runs into this
and needs to compare this patch to its oops. This also makes it easier to
understand the fix. 
> 
> This patch fixes this bug by verifying the matrix mdev and KVM are still
> linked prior to unregistering the GISC.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>

Do we need a Fixes tag and cc stable?


> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e0bde8518745..847a88642644 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>   */
>  static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>  {
> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)

So we already check for q->matrix_mdev here

> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> -	if (q->saved_pfn && q->matrix_mdev)

and here
> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> -				 &q->saved_pfn, 1);
> +	if (q->matrix_mdev) {
> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
                                                           ^^^^ and this is the only
		new check? Cant we just add this condition to the first if?

> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> +						 q->saved_isc);
> +		if (q->saved_pfn)
> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> +					 &q->saved_pfn, 1);
> +	}
> +
>  	q->saved_pfn = 0;
>  	q->saved_isc = VFIO_AP_ISC_INVALID;
>  }
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
  2020-09-21  5:48 ` Christian Borntraeger
@ 2020-09-21  8:24 ` Pierre Morel
  2020-09-21  9:23 ` Cornelia Huck
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Pierre Morel @ 2020-09-21  8:24 UTC (permalink / raw)
  To: Tony Krowiak, linux-s390, linux-kernel, kvm
  Cc: pasic, alex.williamson, cohuck, kwankhede, borntraeger



On 2020-09-18 19:02, Tony Krowiak wrote:
> Attempting to unregister Guest Interruption Subclass (GISC) when the
> link between the matrix mdev and KVM has been removed results in the
> following:
> 
>     "Kernel panic -not syncing: Fatal exception: panic_on_oops"
> 
> This patch fixes this bug by verifying the matrix mdev and KVM are still
> linked prior to unregistering the GISC.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e0bde8518745..847a88642644 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>    */
>   static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>   {
> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)

I think you could have make the patch smaller by adding the missing test 
for kvm here.


> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> -	if (q->saved_pfn && q->matrix_mdev)
> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> -				 &q->saved_pfn, 1);
> +	if (q->matrix_mdev) {
> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> +						 q->saved_isc);
> +		if (q->saved_pfn)
> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> +					 &q->saved_pfn, 1);
> +	}
> +
>   	q->saved_pfn = 0;
>   	q->saved_isc = VFIO_AP_ISC_INVALID;
>   }
> 

-- 
Pierre Morel
IBM Lab Boeblingen

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
  2020-09-21  5:48 ` Christian Borntraeger
  2020-09-21  8:24 ` Pierre Morel
@ 2020-09-21  9:23 ` Cornelia Huck
  2020-09-21 15:45 ` Halil Pasic
  2020-10-21 15:46 ` Tony Krowiak
  4 siblings, 0 replies; 9+ messages in thread
From: Cornelia Huck @ 2020-09-21  9:23 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, kvm, pmorel, pasic, alex.williamson,
	kwankhede, borntraeger

On Fri, 18 Sep 2020 13:02:34 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> Attempting to unregister Guest Interruption Subclass (GISC) when the
> link between the matrix mdev and KVM has been removed results in the
> following:
> 
>    "Kernel panic -not syncing: Fatal exception: panic_on_oops"

I'm wondering how we get there (why are we unregistering the gisc if
the mdev and kvm are not yet linked or are already unlinked?), so I
agree that the actual backchain would be helpful here.

> 
> This patch fixes this bug by verifying the matrix mdev and KVM are still
> linked prior to unregistering the GISC.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e0bde8518745..847a88642644 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>   */
>  static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>  {
> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)

If checking for ->kvm is the right thing to do, I agree that moving the
check here would be easier to read.

> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> -	if (q->saved_pfn && q->matrix_mdev)
> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> -				 &q->saved_pfn, 1);
> +	if (q->matrix_mdev) {
> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> +						 q->saved_isc);
> +		if (q->saved_pfn)
> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> +					 &q->saved_pfn, 1);
> +	}
> +
>  	q->saved_pfn = 0;
>  	q->saved_isc = VFIO_AP_ISC_INVALID;
>  }


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-21  5:48 ` Christian Borntraeger
@ 2020-09-21 11:56   ` Halil Pasic
  0 siblings, 0 replies; 9+ messages in thread
From: Halil Pasic @ 2020-09-21 11:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Tony Krowiak, linux-s390, linux-kernel, kvm, pmorel,
	alex.williamson, cohuck, kwankhede

On Mon, 21 Sep 2020 07:48:58 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> 
> 
> On 18.09.20 19:02, Tony Krowiak wrote:
> > Attempting to unregister Guest Interruption Subclass (GISC) when the
> > link between the matrix mdev and KVM has been removed results in the
> > following:
> > 
> >    "Kernel panic -not syncing: Fatal exception: panic_on_oops"
> 
> I think the full backtrace would be better in case someone runs into this
> and needs to compare this patch to its oops. This also makes it easier to
> understand the fix. 
> > 
> > This patch fixes this bug by verifying the matrix mdev and KVM are still
> > linked prior to unregistering the GISC.
> > 
> > Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> 
> Do we need a Fixes tag and cc stable?
> 

I believe we do!

> 
> > ---
> >  drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> > index e0bde8518745..847a88642644 100644
> > --- a/drivers/s390/crypto/vfio_ap_ops.c
> > +++ b/drivers/s390/crypto/vfio_ap_ops.c
> > @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
> >   */
> >  static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
> >  {
> > -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)
> 
> So we already check for q->matrix_mdev here
> 
> > -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> > -	if (q->saved_pfn && q->matrix_mdev)
> 
> and here
> > -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> > -				 &q->saved_pfn, 1);
> > +	if (q->matrix_mdev) {
> > +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
>                                                            ^^^^ and this is the only
> 		new check? Cant we just add this condition to the first if?

You are technically right, but I'm not comfortable with my level of
understanding of this logic regardless of the coding style. Will ask
some questions directly.

Regards,
Halil

> 
> > +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> > +						 q->saved_isc);
> > +		if (q->saved_pfn)
> > +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> > +					 &q->saved_pfn, 1);
> > +	}
> > +
> >  	q->saved_pfn = 0;
> >  	q->saved_isc = VFIO_AP_ISC_INVALID;
> >  }
> > 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
                   ` (2 preceding siblings ...)
  2020-09-21  9:23 ` Cornelia Huck
@ 2020-09-21 15:45 ` Halil Pasic
  2020-09-25 22:29   ` Tony Krowiak
  2020-10-21 15:46 ` Tony Krowiak
  4 siblings, 1 reply; 9+ messages in thread
From: Halil Pasic @ 2020-09-21 15:45 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, kvm, pmorel, alex.williamson, cohuck,
	kwankhede, borntraeger

On Fri, 18 Sep 2020 13:02:34 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> Attempting to unregister Guest Interruption Subclass (GISC) when the
> link between the matrix mdev and KVM has been removed results in the
> following:
> 
>    "Kernel panic -not syncing: Fatal exception: panic_on_oops"
> 
> This patch fixes this bug by verifying the matrix mdev and KVM are still
> linked prior to unregistering the GISC.


I read from your commit message that this happens when the link between
the KVM and the matrix mdev was established and then got severed. 

I assume the interrupts were previously enabled, and were not been
disabled or cleaned up because q->saved_isc != VFIO_AP_ISC_INVALID.

That means the guest enabled  interrupts and then for whatever
reason got destroyed, and this happens on mdev cleanup.

Does it happen all the time or is it some sort of a race?

> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e0bde8518745..847a88642644 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>   */
>  static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>  {
> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)
> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> -	if (q->saved_pfn && q->matrix_mdev)
> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> -				 &q->saved_pfn, 1);
> +	if (q->matrix_mdev) {
> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> +						 q->saved_isc);

I don't quite understand the logic here. I suppose we need to ensure
that the struct kvm is 'alive' at least until kvm_s390_gisc_unregister()
is done. That is supposed be ensured by kvm_get_kvm() in
vfio_ap_mdev_set_kvm() and kvm_put_kvm() in vfio_ap_mdev_release().

If the critical section in vfio_ap_mdev_release() is done and
matrix_mdev->kvm was set to NULL there then I would expect that the
queues are already reset and q->saved_isc == VFIO_AP_ISC_INVALID. So
this should not blow up.

Now if this happens before the critical section in
vfio_ap_mdev_release() is done, I ask myself how are we going to do the
kvm_put_kvm()?

Another question. Do we hold the matrix_dev->lock here?

> +		if (q->saved_pfn)
> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> +					 &q->saved_pfn, 1);
> +	}
> +
>  	q->saved_pfn = 0;
>  	q->saved_isc = VFIO_AP_ISC_INVALID;
>  }


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-21 15:45 ` Halil Pasic
@ 2020-09-25 22:29   ` Tony Krowiak
  2020-09-26  0:56     ` Halil Pasic
  0 siblings, 1 reply; 9+ messages in thread
From: Tony Krowiak @ 2020-09-25 22:29 UTC (permalink / raw)
  To: Halil Pasic
  Cc: linux-s390, linux-kernel, kvm, pmorel, alex.williamson, cohuck,
	kwankhede, borntraeger



On 9/21/20 11:45 AM, Halil Pasic wrote:
> On Fri, 18 Sep 2020 13:02:34 -0400
> Tony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>> Attempting to unregister Guest Interruption Subclass (GISC) when the
>> link between the matrix mdev and KVM has been removed results in the
>> following:
>>
>>     "Kernel panic -not syncing: Fatal exception: panic_on_oops"
>>
>> This patch fixes this bug by verifying the matrix mdev and KVM are still
>> linked prior to unregistering the GISC.
>
> I read from your commit message that this happens when the link between
> the KVM and the matrix mdev was established and then got severed.
>
> I assume the interrupts were previously enabled, and were not been
> disabled or cleaned up because q->saved_isc != VFIO_AP_ISC_INVALID.
>
> That means the guest enabled  interrupts and then for whatever
> reason got destroyed, and this happens on mdev cleanup.
>
> Does it happen all the time or is it some sort of a race?

This is a race condition that happens when a guest is terminated and the 
mdev is
removed in rapid succession. I came across it with one of my hades test 
cases
on cleanup of the resources after the test case completes. There is a 
bug in the problem appears
the vfio_ap_mdev_release function because it tries to reset the APQNs 
after the bits are
cleared from the matrix_mdev.matrix, so the resets never happen.

Fixing that, however, does not resolve the issue, so I'm in the process 
of doing a bunch of
tracing to see the flow of the resets etc. during the lifecycle of the 
mdev during this
hades test. I should have a better answer next week.

>
>> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
>> ---
>>   drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
>> index e0bde8518745..847a88642644 100644
>> --- a/drivers/s390/crypto/vfio_ap_ops.c
>> +++ b/drivers/s390/crypto/vfio_ap_ops.c
>> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>>    */
>>   static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>>   {
>> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)
>> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
>> -	if (q->saved_pfn && q->matrix_mdev)
>> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
>> -				 &q->saved_pfn, 1);
>> +	if (q->matrix_mdev) {
>> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
>> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
>> +						 q->saved_isc);
> I don't quite understand the logic here. I suppose we need to ensure
> that the struct kvm is 'alive' at least until kvm_s390_gisc_unregister()
> is done. That is supposed be ensured by kvm_get_kvm() in
> vfio_ap_mdev_set_kvm() and kvm_put_kvm() in vfio_ap_mdev_release().
>
> If the critical section in vfio_ap_mdev_release() is done and
> matrix_mdev->kvm was set to NULL there then I would expect that the
> queues are already reset and q->saved_isc == VFIO_AP_ISC_INVALID. So
> this should not blow up.
>
> Now if this happens before the critical section in
> vfio_ap_mdev_release() is done, I ask myself how are we going to do the
> kvm_put_kvm()?
>
> Another question. Do we hold the matrix_dev->lock here?
>
>> +		if (q->saved_pfn)
>> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
>> +					 &q->saved_pfn, 1);
>> +	}
>> +
>>   	q->saved_pfn = 0;
>>   	q->saved_isc = VFIO_AP_ISC_INVALID;
>>   }


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-25 22:29   ` Tony Krowiak
@ 2020-09-26  0:56     ` Halil Pasic
  0 siblings, 0 replies; 9+ messages in thread
From: Halil Pasic @ 2020-09-26  0:56 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, kvm, pmorel, alex.williamson, cohuck,
	kwankhede, borntraeger

On Fri, 25 Sep 2020 18:29:16 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> 
> 
> On 9/21/20 11:45 AM, Halil Pasic wrote:
> > On Fri, 18 Sep 2020 13:02:34 -0400
> > Tony Krowiak <akrowiak@linux.ibm.com> wrote:
> >
> >> Attempting to unregister Guest Interruption Subclass (GISC) when the
> >> link between the matrix mdev and KVM has been removed results in the
> >> following:
> >>
> >>     "Kernel panic -not syncing: Fatal exception: panic_on_oops"
> >>
> >> This patch fixes this bug by verifying the matrix mdev and KVM are still
> >> linked prior to unregistering the GISC.
> >
> > I read from your commit message that this happens when the link between
> > the KVM and the matrix mdev was established and then got severed.
> >
> > I assume the interrupts were previously enabled, and were not been
> > disabled or cleaned up because q->saved_isc != VFIO_AP_ISC_INVALID.
> >
> > That means the guest enabled  interrupts and then for whatever
> > reason got destroyed, and this happens on mdev cleanup.
> >
> > Does it happen all the time or is it some sort of a race?
> 
> This is a race condition that happens when a guest is terminated and the 
> mdev is
> removed in rapid succession. I came across it with one of my hades test 
> cases
> on cleanup of the resources after the test case completes. There is a 
> bug in the problem appears
> the vfio_ap_mdev_releasefunction because it tries to reset the APQNs 
> after the bits are
> cleared from the matrix_mdev.matrix, so the resets never happen.
> 

That sounds very strange. I couldn't find the place where we clear the
bits in matrix_mdev.matrix except for unassign. Currently the unassign
is supposed to be enabled only after we have no guest and we have
cleaned up the queues (which should restore VFIO_AP_ISC_INVALID). Does
your test do any unassign operations? (I'm not sure the we always do
like we are supposed to.)

Now if we did not clear the bits from matrix_mdev.matrix then this
could be an use after free scenario (where we interpret already
re-purposed memory as matrix_mdev.matrix).

> Fixing that, however, does not resolve the issue, so I'm in the process 
> of doing a bunch of
> tracing to see the flow of the resets etc. during the lifecycle of the 
> mdev during this
> hades test. I should have a better answer next week.
>

My take away is that we don't understand what exactly is going wrong, and
so this patch is at best a mitigation (not a real fix). Does that sound
about correct?

Regards,
Halil

[..]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS
  2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
                   ` (3 preceding siblings ...)
  2020-09-21 15:45 ` Halil Pasic
@ 2020-10-21 15:46 ` Tony Krowiak
  4 siblings, 0 replies; 9+ messages in thread
From: Tony Krowiak @ 2020-10-21 15:46 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: pmorel, pasic, alex.williamson, cohuck, kwankhede, borntraeger

In trying to recreate this problem in order to get a stack trace, I 
discovered that
it only occurs within a local repository that has several new patches 
applied,
so the problem is not part of the base code and will be fixed via this 
new set
of patches forthcoming.

On 9/18/20 1:02 PM, Tony Krowiak wrote:
> Attempting to unregister Guest Interruption Subclass (GISC) when the
> link between the matrix mdev and KVM has been removed results in the
> following:
>
>     "Kernel panic -not syncing: Fatal exception: panic_on_oops"
>
> This patch fixes this bug by verifying the matrix mdev and KVM are still
> linked prior to unregistering the GISC.
>
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e0bde8518745..847a88642644 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -119,11 +119,15 @@ static void vfio_ap_wait_for_irqclear(int apqn)
>    */
>   static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>   {
> -	if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev)
> -		kvm_s390_gisc_unregister(q->matrix_mdev->kvm, q->saved_isc);
> -	if (q->saved_pfn && q->matrix_mdev)
> -		vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> -				 &q->saved_pfn, 1);
> +	if (q->matrix_mdev) {
> +		if (q->saved_isc != VFIO_AP_ISC_INVALID && q->matrix_mdev->kvm)
> +			kvm_s390_gisc_unregister(q->matrix_mdev->kvm,
> +						 q->saved_isc);
> +		if (q->saved_pfn)
> +			vfio_unpin_pages(mdev_dev(q->matrix_mdev->mdev),
> +					 &q->saved_pfn, 1);
> +	}
> +
>   	q->saved_pfn = 0;
>   	q->saved_isc = VFIO_AP_ISC_INVALID;
>   }


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-10-21 15:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-18 17:02 [PATCH] s390/vfio-ap: fix unregister GISC when KVM is already gone results in OOPS Tony Krowiak
2020-09-21  5:48 ` Christian Borntraeger
2020-09-21 11:56   ` Halil Pasic
2020-09-21  8:24 ` Pierre Morel
2020-09-21  9:23 ` Cornelia Huck
2020-09-21 15:45 ` Halil Pasic
2020-09-25 22:29   ` Tony Krowiak
2020-09-26  0:56     ` Halil Pasic
2020-10-21 15:46 ` Tony Krowiak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).