[PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
@ 2019-08-19 17:48 Tony Krowiak
  2019-08-19 18:56 ` Cornelia Huck
  2019-08-30 16:02 ` Halil Pasic
  0 siblings, 2 replies; 10+ messages in thread
From: Tony Krowiak @ 2019-08-19 17:48 UTC (permalink / raw)
  To: linux-s390, linux-kernel
  Cc: freude, borntraeger, cohuck, pasic, frankja, jjherne, Tony Krowiak

When an AP queue is reset (zeroized), interrupts are disabled. The queue
reset function currently tries to disable interrupts unnecessarily. This patch
removes the unnecessary calls to disable interrupts after queue reset.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 0604b49a4d32..e3bcb430e214 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1114,18 +1114,19 @@ static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-static void vfio_ap_irq_disable_apqn(int apqn)
+static struct vfio_ap_queue *vfio_ap_find_qdev(int apqn)
 {
 	struct device *dev;
-	struct vfio_ap_queue *q;
+	struct vfio_ap_queue *q = NULL;
 
 	dev = driver_find_device(&matrix_dev->vfio_ap_drv->driver, NULL,
 				 &apqn, match_apqn);
 	if (dev) {
 		q = dev_get_drvdata(dev);
-		vfio_ap_irq_disable(q);
 		put_device(dev);
 	}
+
+	return q;
 }
 
 int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
@@ -1164,6 +1165,7 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
 	int rc = 0;
 	unsigned long apid, apqi;
 	struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev);
+	struct vfio_ap_queue *q;
 
 	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm,
 			     matrix_mdev->matrix.apm_max + 1) {
@@ -1177,7 +1179,18 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
 			 */
 			if (ret)
 				rc = ret;
-			vfio_ap_irq_disable_apqn(AP_MKQID(apid, apqi));
+
+			/*
+			 * Resetting a queue disables interrupts as a side
+			 * effect, so there is no need to disable interrupts
+			 * here. Note that an error on reset indicates the
+			 * queue is inaccessible, so an attempt to disable
+			 * interrupts would fail and is therefore unnecessary.
+			 * Just free up the resources used by IRQ processing.
+			 */
+			q = vfio_ap_find_qdev(AP_MKQID(apid, apqi));
+			if (q)
+				vfio_ap_free_aqic_resources(q);
 		}
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-08-19 17:48 [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts Tony Krowiak
@ 2019-08-19 18:56 ` Cornelia Huck
  2019-08-30 16:02 ` Halil Pasic
  1 sibling, 0 replies; 10+ messages in thread
From: Cornelia Huck @ 2019-08-19 18:56 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, freude, borntraeger, pasic, frankja, jjherne

On Mon, 19 Aug 2019 13:48:49 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> When an AP queue is reset (zeroized), interrupts are disabled. The queue
> reset function currently tries to disable interrupts unnecessarily. This patch
> removes the unnecessary calls to disable interrupts after queue reset.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-08-19 17:48 [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts Tony Krowiak
  2019-08-19 18:56 ` Cornelia Huck
@ 2019-08-30 16:02 ` Halil Pasic
  2019-09-03  7:37   ` Christian Borntraeger
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Halil Pasic @ 2019-08-30 16:02 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, freude, borntraeger, cohuck, pasic,
	frankja, jjherne

On Mon, 19 Aug 2019 13:48:49 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> When an AP queue is reset (zeroized), interrupts are disabled. The queue
> reset function currently tries to disable interrupts unnecessarily. This patch
> removes the unnecessary calls to disable interrupts after queue reset.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index 0604b49a4d32..e3bcb430e214 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1114,18 +1114,19 @@ static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> -static void vfio_ap_irq_disable_apqn(int apqn)
> +static struct vfio_ap_queue *vfio_ap_find_qdev(int apqn)
>  {
>  	struct device *dev;
> -	struct vfio_ap_queue *q;
> +	struct vfio_ap_queue *q = NULL;
>  
>  	dev = driver_find_device(&matrix_dev->vfio_ap_drv->driver, NULL,
>  				 &apqn, match_apqn);
>  	if (dev) {
>  		q = dev_get_drvdata(dev);
> -		vfio_ap_irq_disable(q);
>  		put_device(dev);
>  	}
> +
> +	return q;
>  }
>  
>  int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
> @@ -1164,6 +1165,7 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
>  	int rc = 0;
>  	unsigned long apid, apqi;
>  	struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev);
> +	struct vfio_ap_queue *q;
>  
>  	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm,
>  			     matrix_mdev->matrix.apm_max + 1) {
> @@ -1177,7 +1179,18 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
>  			 */
>  			if (ret)
>  				rc = ret;
> -			vfio_ap_irq_disable_apqn(AP_MKQID(apid, apqi));
> +
> +			/*
> +			 * Resetting a queue disables interrupts as a side
> +			 * effect, so there is no need to disable interrupts
> +			 * here. Note that an error on reset indicates the
> +			 * queue is inaccessible, so an attempt to disable
> +			 * interrupts would fail and is therefore unnecessary.
> +			 * Just free up the resources used by IRQ processing.
> +			 */

I have some concerns about this patch. One thing we must ensure is that
the machine does not poke freed memory (NIB, GISA). With the current
design that means we must ensure that there won't be any interruption
conditions indicated (and interrupts made pending) after this point.

I'm not entirely convinced this is ensured after your change. The
relevant bits of the  documentation are particularly hard to figure out.
I could use some clarifications for sure.

This hunk removes the wait implemented by vfio_ap_wait_for_irqclear()
along with the hopefully overkill AQIC.

Let me name some of the scenarios I'm concerned about, along with the
code that is currently supposed to handle these.


int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,              
                             unsigned int retry)                                
{                                                                               
        struct ap_queue_status status;                                          
        int retry2 = 2;                                                         
        int apqn = AP_MKQID(apid, apqi);                                        
                                                                                
        do {                                                                    
                status = ap_zapq(apqn);                                         
                switch (status.response_code) {                                 
                case AP_RESPONSE_NORMAL:                                        
                        while (!status.queue_empty && retry2--) {               
                                msleep(20);                                     
                                status = ap_tapq(apqn, NULL);                   
                        }                                                       
                        WARN_ON_ONCE(retry <= 0);                              
                        return 0;                                               
                case AP_RESPONSE_RESET_IN_PROGRESS:                             
                case AP_RESPONSE_BUSY:                                          
                        msleep(20);                                             
                        break;                                                  
                default:                                                        
                        /* things are really broken, give up */                 
                        return -EIO;                                            
                }                                                               
        } while (retry--);                                                      
                                                                                
        return -EBUSY; 

Scenario 1) 

ap_zapq() returns status.response_code == AP_RESPONSE_RESET_IN_PROGRESS,
because for example G2 did a ZAPQ before us.

The current logic retries ap_zapq() once after 20 msec. I have no idea
if that is sufficient under all circumstances. If we get a
AP_RESPONSE_RESET_IN_PROGRESS again, we return with -EBUSY and do nothing
about it if the whole process was triggered by vfio_ap_mdev_release().
Not even a warning.

Please notice that this was almost fine before IRQ support, because the
queue will get reset ASAP, and ...

Scenario 2)

It takes longer than 40 msec for the reset to complete. I don't know if
this is a possibility, but before your patch we used to wait of 1 sec.

I guess the we need the timeouts because do this with matrix_dev->lock
held. I'm not sure it's a good idea long term.

Scenario 3)

ap_zapq() returns status.response_code == AP_RESPONSE_DECONFIGURED. In
this case we would give up right away. Again so that we don't even know
we hit this case. There ain't much I can think about we could do here.

I hope we are guaranteed to  not get any bits set at this point, but I could
use some help.

                                  *

The good thing is I'm pretty sure that we are safe (i.e. no bits will be
poked by the hardware) after seeing the queue empty after we issued a
reset request.

So my concerns basically boil down to are the cumulative timeouts big enough
so we never time out (including are timeouts really the way to go)?

We should probably take any discussion about from which point on is it safe
to assume that NIB and GISA won't be poked by HW offline as the guys
without documentation can't contribute.


Two issues not directly related to your patch. I formulated those as two
follow up patches on top of yours .please take a look at them.

-------------------------------8<------------------------------------------
From: Halil Pasic <pasic@linux.ibm.com>
Date: Fri, 30 Aug 2019 16:03:42 +0200
Subject: [PATCH 1/2] s390: vfio-ap: fix warning reset not completed

The intention seems to be to warn once when we don't wait enough for the
reset to complete. Let's use the right retry counter to accomplish that
semantic.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e3bcb43..dd07ebf 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1144,7 +1144,7 @@ int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
 				msleep(20);
 				status = ap_tapq(apqn, NULL);
 			}
-			WARN_ON_ONCE(retry <= 0);
+			WARN_ON_ONCE(retry2 <= 0);
 			return 0;
 		case AP_RESPONSE_RESET_IN_PROGRESS:
 		case AP_RESPONSE_BUSY:
-- 
2.5.5

-------------------------------8<----------------------------------------
From: Halil Pasic <pasic@linux.ibm.com>
Date: Fri, 30 Aug 2019 17:39:47 +0200
Subject: [PATCH 2/2] s390: vfio-ap: don't wait after AQIC interpretation

Waiting for the asynchronous part of AQIC to complete as a part
AQIC implementation is unnecessary and silly.

Let's get rid of vfio_ap_wait_for_irqclear().

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c | 50 ++-------------------------------------
 1 file changed, 2 insertions(+), 48 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index dd07ebf..8d098f0 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -68,47 +68,6 @@ static struct vfio_ap_queue *vfio_ap_get_queue(
 }
 
 /**
- * vfio_ap_wait_for_irqclear
- * @apqn: The AP Queue number
- *
- * Checks the IRQ bit for the status of this APQN using ap_tapq.
- * Returns if the ap_tapq function succeeded and the bit is clear.
- * Returns if ap_tapq function failed with invalid, deconfigured or
- * checkstopped AP.
- * Otherwise retries up to 5 times after waiting 20ms.
- *
- */
-static void vfio_ap_wait_for_irqclear(int apqn)
-{
-	struct ap_queue_status status;
-	int retry = 5;
-
-	do {
-		status = ap_tapq(apqn, NULL);
-		switch (status.response_code) {
-		case AP_RESPONSE_NORMAL:
-		case AP_RESPONSE_RESET_IN_PROGRESS:
-			if (!status.irq_enabled)
-				return;
-			/* Fall through */
-		case AP_RESPONSE_BUSY:
-			msleep(20);
-			break;
-		case AP_RESPONSE_Q_NOT_AVAIL:
-		case AP_RESPONSE_DECONFIGURED:
-		case AP_RESPONSE_CHECKSTOPPED:
-		default:
-			WARN_ONCE(1, "%s: tapq rc %02x: %04x\n", __func__,
-				  status.response_code, apqn);
-			return;
-		}
-	} while (--retry);
-
-	WARN_ONCE(1, "%s: tapq rc %02x: %04x could not clear IR bit\n",
-		  __func__, status.response_code, apqn);
-}
-
-/**
  * vfio_ap_free_aqic_resources
  * @q: The vfio_ap_queue
  *
@@ -133,14 +92,10 @@ static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
  * @q: The vfio_ap_queue
  *
  * Uses ap_aqic to disable the interruption and in case of success, reset
- * in progress or IRQ disable command already proceeded: calls
- * vfio_ap_wait_for_irqclear() to check for the IRQ bit to be clear
- * and calls vfio_ap_free_aqic_resources() to free the resources associated
+ * in progress or IRQ disable command already proceeded :calls
+ * vfio_ap_free_aqic_resources() to free the resources associated
  * with the AP interrupt handling.
  *
- * In the case the AP is busy, or a reset is in progress,
- * retries after 20ms, up to 5 times.
- *
  * Returns if ap_aqic function failed with invalid, deconfigured or
  * checkstopped AP.
  */
@@ -155,7 +110,6 @@ struct ap_queue_status vfio_ap_irq_disable(struct vfio_ap_queue *q)
 		switch (status.response_code) {
 		case AP_RESPONSE_OTHERWISE_CHANGED:
 		case AP_RESPONSE_NORMAL:
-			vfio_ap_wait_for_irqclear(q->apqn);
 			goto end_free;
 		case AP_RESPONSE_RESET_IN_PROGRESS:
 		case AP_RESPONSE_BUSY:
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-08-30 16:02 ` Halil Pasic
@ 2019-09-03  7:37   ` Christian Borntraeger
  2019-09-04  7:35   ` Christian Borntraeger
  2019-09-05 16:26   ` Tony Krowiak
  2 siblings, 0 replies; 10+ messages in thread
From: Christian Borntraeger @ 2019-09-03  7:37 UTC (permalink / raw)
  To: Halil Pasic, Tony Krowiak
  Cc: linux-s390, linux-kernel, freude, cohuck, pasic, frankja, jjherne



On 30.08.19 18:02, Halil Pasic wrote:

> From: Halil Pasic <pasic@linux.ibm.com>
> Date: Fri, 30 Aug 2019 16:03:42 +0200
> Subject: [PATCH 1/2] s390: vfio-ap: fix warning reset not completed
> 
> The intention seems to be to warn once when we don't wait enough for the
> reset to complete. Let's use the right retry counter to accomplish that
> semantic.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e3bcb43..dd07ebf 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1144,7 +1144,7 @@ int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
>  				msleep(20);
>  				status = ap_tapq(apqn, NULL);
>  			}
> -			WARN_ON_ONCE(retry <= 0);
> +			WARN_ON_ONCE(retry2 <= 0);
>  			return 0;
>  		case AP_RESPONSE_RESET_IN_PROGRESS:
>  		case AP_RESPONSE_BUSY:

I think this patch alone makes certainly sense. Can you send that separately?
Or even better remove the retry parameter of that function. All users seem
to always pass in 1 as retry.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-08-30 16:02 ` Halil Pasic
  2019-09-03  7:37   ` Christian Borntraeger
@ 2019-09-04  7:35   ` Christian Borntraeger
  2019-09-04 15:05     ` Tony Krowiak
  2019-09-05 10:24     ` Halil Pasic
  2019-09-05 16:26   ` Tony Krowiak
  2 siblings, 2 replies; 10+ messages in thread
From: Christian Borntraeger @ 2019-09-04  7:35 UTC (permalink / raw)
  To: Halil Pasic, Tony Krowiak
  Cc: linux-s390, linux-kernel, freude, cohuck, pasic, frankja, jjherne

Halil,

can you also send this patch as a separate mail. This also requires a much better
patch description about the why and it certainly should also have an agreement from
Anthony.

On 30.08.19 18:02, Halil Pasic wrote:
> From: Halil Pasic <pasic@linux.ibm.com>
> Date: Fri, 30 Aug 2019 17:39:47 +0200
> Subject: [PATCH 2/2] s390: vfio-ap: don't wait after AQIC interpretation
> 
> Waiting for the asynchronous part of AQIC to complete as a part
> AQIC implementation is unnecessary and silly.
> 
> Let's get rid of vfio_ap_wait_for_irqclear().
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 50 ++-------------------------------------
>  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index dd07ebf..8d098f0 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -68,47 +68,6 @@ static struct vfio_ap_queue *vfio_ap_get_queue(
>  }
>  
>  /**
> - * vfio_ap_wait_for_irqclear
> - * @apqn: The AP Queue number
> - *
> - * Checks the IRQ bit for the status of this APQN using ap_tapq.
> - * Returns if the ap_tapq function succeeded and the bit is clear.
> - * Returns if ap_tapq function failed with invalid, deconfigured or
> - * checkstopped AP.
> - * Otherwise retries up to 5 times after waiting 20ms.
> - *
> - */
> -static void vfio_ap_wait_for_irqclear(int apqn)
> -{
> -	struct ap_queue_status status;
> -	int retry = 5;
> -
> -	do {
> -		status = ap_tapq(apqn, NULL);
> -		switch (status.response_code) {
> -		case AP_RESPONSE_NORMAL:
> -		case AP_RESPONSE_RESET_IN_PROGRESS:
> -			if (!status.irq_enabled)
> -				return;
> -			/* Fall through */
> -		case AP_RESPONSE_BUSY:
> -			msleep(20);
> -			break;
> -		case AP_RESPONSE_Q_NOT_AVAIL:
> -		case AP_RESPONSE_DECONFIGURED:
> -		case AP_RESPONSE_CHECKSTOPPED:
> -		default:
> -			WARN_ONCE(1, "%s: tapq rc %02x: %04x\n", __func__,
> -				  status.response_code, apqn);
> -			return;
> -		}
> -	} while (--retry);
> -
> -	WARN_ONCE(1, "%s: tapq rc %02x: %04x could not clear IR bit\n",
> -		  __func__, status.response_code, apqn);
> -}
> -
> -/**
>   * vfio_ap_free_aqic_resources
>   * @q: The vfio_ap_queue
>   *
> @@ -133,14 +92,10 @@ static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>   * @q: The vfio_ap_queue
>   *
>   * Uses ap_aqic to disable the interruption and in case of success, reset
> - * in progress or IRQ disable command already proceeded: calls
> - * vfio_ap_wait_for_irqclear() to check for the IRQ bit to be clear
> - * and calls vfio_ap_free_aqic_resources() to free the resources associated
> + * in progress or IRQ disable command already proceeded :calls
> + * vfio_ap_free_aqic_resources() to free the resources associated
>   * with the AP interrupt handling.
>   *
> - * In the case the AP is busy, or a reset is in progress,
> - * retries after 20ms, up to 5 times.
> - *
>   * Returns if ap_aqic function failed with invalid, deconfigured or
>   * checkstopped AP.
>   */
> @@ -155,7 +110,6 @@ struct ap_queue_status vfio_ap_irq_disable(struct vfio_ap_queue *q)
>  		switch (status.response_code) {
>  		case AP_RESPONSE_OTHERWISE_CHANGED:
>  		case AP_RESPONSE_NORMAL:
> -			vfio_ap_wait_for_irqclear(q->apqn);
>  			goto end_free;
>  		case AP_RESPONSE_RESET_IN_PROGRESS:
>  		case AP_RESPONSE_BUSY:
> -- 2.5.5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-09-04  7:35   ` Christian Borntraeger
@ 2019-09-04 15:05     ` Tony Krowiak
  2019-09-05 11:03       ` Halil Pasic
  2019-09-05 10:24     ` Halil Pasic
  1 sibling, 1 reply; 10+ messages in thread
From: Tony Krowiak @ 2019-09-04 15:05 UTC (permalink / raw)
  To: Christian Borntraeger, Halil Pasic
  Cc: linux-s390, linux-kernel, freude, cohuck, pasic, frankja, jjherne

On 9/4/19 3:35 AM, Christian Borntraeger wrote:
> Halil,
> 
> can you also send this patch as a separate mail. This also requires a much better
> patch description about the why and it certainly should also have an agreement from
> Anthony.
> 
> On 30.08.19 18:02, Halil Pasic wrote:
>> From: Halil Pasic <pasic@linux.ibm.com>
>> Date: Fri, 30 Aug 2019 17:39:47 +0200
>> Subject: [PATCH 2/2] s390: vfio-ap: don't wait after AQIC interpretation
>>
>> Waiting for the asynchronous part of AQIC to complete as a part
>> AQIC implementation is unnecessary and silly.
>>
>> Let's get rid of vfio_ap_wait_for_irqclear().
>>
>> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>> ---
>>   drivers/s390/crypto/vfio_ap_ops.c | 50 ++-------------------------------------
>>   1 file changed, 2 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
>> index dd07ebf..8d098f0 100644
>> --- a/drivers/s390/crypto/vfio_ap_ops.c
>> +++ b/drivers/s390/crypto/vfio_ap_ops.c
>> @@ -68,47 +68,6 @@ static struct vfio_ap_queue *vfio_ap_get_queue(
>>   }
>>   
>>   /**
>> - * vfio_ap_wait_for_irqclear
>> - * @apqn: The AP Queue number
>> - *
>> - * Checks the IRQ bit for the status of this APQN using ap_tapq.
>> - * Returns if the ap_tapq function succeeded and the bit is clear.
>> - * Returns if ap_tapq function failed with invalid, deconfigured or
>> - * checkstopped AP.
>> - * Otherwise retries up to 5 times after waiting 20ms.
>> - *
>> - */
>> -static void vfio_ap_wait_for_irqclear(int apqn)
>> -{
>> -	struct ap_queue_status status;
>> -	int retry = 5;
>> -
>> -	do {
>> -		status = ap_tapq(apqn, NULL);
>> -		switch (status.response_code) {
>> -		case AP_RESPONSE_NORMAL:
>> -		case AP_RESPONSE_RESET_IN_PROGRESS:
>> -			if (!status.irq_enabled)
>> -				return;
>> -			/* Fall through */
>> -		case AP_RESPONSE_BUSY:
>> -			msleep(20);
>> -			break;
>> -		case AP_RESPONSE_Q_NOT_AVAIL:
>> -		case AP_RESPONSE_DECONFIGURED:
>> -		case AP_RESPONSE_CHECKSTOPPED:
>> -		default:
>> -			WARN_ONCE(1, "%s: tapq rc %02x: %04x\n", __func__,
>> -				  status.response_code, apqn);
>> -			return;
>> -		}
>> -	} while (--retry);
>> -
>> -	WARN_ONCE(1, "%s: tapq rc %02x: %04x could not clear IR bit\n",
>> -		  __func__, status.response_code, apqn);
>> -}
>> -
>> -/**
>>    * vfio_ap_free_aqic_resources
>>    * @q: The vfio_ap_queue
>>    *
>> @@ -133,14 +92,10 @@ static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
>>    * @q: The vfio_ap_queue
>>    *
>>    * Uses ap_aqic to disable the interruption and in case of success, reset
>> - * in progress or IRQ disable command already proceeded: calls
>> - * vfio_ap_wait_for_irqclear() to check for the IRQ bit to be clear
>> - * and calls vfio_ap_free_aqic_resources() to free the resources associated
>> + * in progress or IRQ disable command already proceeded :calls
>> + * vfio_ap_free_aqic_resources() to free the resources associated
>>    * with the AP interrupt handling.
>>    *
>> - * In the case the AP is busy, or a reset is in progress,
>> - * retries after 20ms, up to 5 times.
>> - *
>>    * Returns if ap_aqic function failed with invalid, deconfigured or
>>    * checkstopped AP.
>>    */
>> @@ -155,7 +110,6 @@ struct ap_queue_status vfio_ap_irq_disable(struct vfio_ap_queue *q)
>>   		switch (status.response_code) {
>>   		case AP_RESPONSE_OTHERWISE_CHANGED:
>>   		case AP_RESPONSE_NORMAL:
>> -			vfio_ap_wait_for_irqclear(q->apqn);

I am not sure why you consider the wait unnecessary and silly. Notice
the response code AP_RESPONSE_OTHERWISE_CHANGED above which means that
the AP queue is already disabled for interrupts or the enablement
process has not yet completed. Shouldn't we wait for the IRQ to clear
in this case? I do agree that there is no need to wait if the
response code is 0.

>>   			goto end_free;
>>   		case AP_RESPONSE_RESET_IN_PROGRESS:
>>   		case AP_RESPONSE_BUSY:
>> -- 2.5.5
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-09-04  7:35   ` Christian Borntraeger
  2019-09-04 15:05     ` Tony Krowiak
@ 2019-09-05 10:24     ` Halil Pasic
  1 sibling, 0 replies; 10+ messages in thread
From: Halil Pasic @ 2019-09-05 10:24 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Tony Krowiak, linux-s390, linux-kernel, freude, cohuck, pasic,
	frankja, jjherne

On Wed, 4 Sep 2019 09:35:47 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Halil,
> 
> can you also send this patch as a separate mail. This also requires a much better
> patch description about the why and it certainly should also have an agreement from
> Anthony.
> 


IMHO this patch only makes sense when the objective of Tony's patch is
accomplished. We do have to wait when we need to give up resources or do
stuff like subsystem reset. But I don't think we should introduce
artificial waits when interpreting AQIC: it should be up to the guest to
handle this stuff, as it is in LPAR.

Sure I can send a separate patch and state in the cover lette that it
builds on top of 'vfio-ap: remove unnecessary calls to disable queue
interrupts'.

And yes the commit message is sloppy. Was not meant to end up in the
changelog. Just a quick and dirty explanation.

> On 30.08.19 18:02, Halil Pasic wrote:
> > From: Halil Pasic <pasic@linux.ibm.com>
> > Date: Fri, 30 Aug 2019 17:39:47 +0200
> > Subject: [PATCH 2/2] s390: vfio-ap: don't wait after AQIC interpretation
> > 
> > Waiting for the asynchronous part of AQIC to complete as a part
> > AQIC implementation is unnecessary and silly.

s/implementation/interpretation/

Regards,
Halil

> > 
> > Let's get rid of vfio_ap_wait_for_irqclear().
> > 
> > Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> > ---
> >  drivers/s390/crypto/vfio_ap_ops.c | 50 ++-------------------------------------
> >  1 file changed, 2 insertions(+), 48 deletions(-)
> > 
> > diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> > index dd07ebf..8d098f0 100644
> > --- a/drivers/s390/crypto/vfio_ap_ops.c
> > +++ b/drivers/s390/crypto/vfio_ap_ops.c
> > @@ -68,47 +68,6 @@ static struct vfio_ap_queue *vfio_ap_get_queue(
> >  }
> >  
> >  /**
> > - * vfio_ap_wait_for_irqclear
> > - * @apqn: The AP Queue number
> > - *
> > - * Checks the IRQ bit for the status of this APQN using ap_tapq.
> > - * Returns if the ap_tapq function succeeded and the bit is clear.
> > - * Returns if ap_tapq function failed with invalid, deconfigured or
> > - * checkstopped AP.
> > - * Otherwise retries up to 5 times after waiting 20ms.
> > - *
> > - */
> > -static void vfio_ap_wait_for_irqclear(int apqn)
> > -{
> > -	struct ap_queue_status status;
> > -	int retry = 5;
> > -
> > -	do {
> > -		status = ap_tapq(apqn, NULL);
> > -		switch (status.response_code) {
> > -		case AP_RESPONSE_NORMAL:
> > -		case AP_RESPONSE_RESET_IN_PROGRESS:
> > -			if (!status.irq_enabled)
> > -				return;
> > -			/* Fall through */
> > -		case AP_RESPONSE_BUSY:
> > -			msleep(20);
> > -			break;
> > -		case AP_RESPONSE_Q_NOT_AVAIL:
> > -		case AP_RESPONSE_DECONFIGURED:
> > -		case AP_RESPONSE_CHECKSTOPPED:
> > -		default:
> > -			WARN_ONCE(1, "%s: tapq rc %02x: %04x\n", __func__,
> > -				  status.response_code, apqn);
> > -			return;
> > -		}
> > -	} while (--retry);
> > -
> > -	WARN_ONCE(1, "%s: tapq rc %02x: %04x could not clear IR bit\n",
> > -		  __func__, status.response_code, apqn);
> > -}
> > -
> > -/**
> >   * vfio_ap_free_aqic_resources
> >   * @q: The vfio_ap_queue
> >   *
> > @@ -133,14 +92,10 @@ static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
> >   * @q: The vfio_ap_queue
> >   *
> >   * Uses ap_aqic to disable the interruption and in case of success, reset
> > - * in progress or IRQ disable command already proceeded: calls
> > - * vfio_ap_wait_for_irqclear() to check for the IRQ bit to be clear
> > - * and calls vfio_ap_free_aqic_resources() to free the resources associated
> > + * in progress or IRQ disable command already proceeded :calls
> > + * vfio_ap_free_aqic_resources() to free the resources associated
> >   * with the AP interrupt handling.
> >   *
> > - * In the case the AP is busy, or a reset is in progress,
> > - * retries after 20ms, up to 5 times.
> > - *
> >   * Returns if ap_aqic function failed with invalid, deconfigured or
> >   * checkstopped AP.
> >   */
> > @@ -155,7 +110,6 @@ struct ap_queue_status vfio_ap_irq_disable(struct vfio_ap_queue *q)
> >  		switch (status.response_code) {
> >  		case AP_RESPONSE_OTHERWISE_CHANGED:
> >  		case AP_RESPONSE_NORMAL:
> > -			vfio_ap_wait_for_irqclear(q->apqn);
> >  			goto end_free;
> >  		case AP_RESPONSE_RESET_IN_PROGRESS:
> >  		case AP_RESPONSE_BUSY:
> > -- 2.5.5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-09-04 15:05     ` Tony Krowiak
@ 2019-09-05 11:03       ` Halil Pasic
  0 siblings, 0 replies; 10+ messages in thread
From: Halil Pasic @ 2019-09-05 11:03 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: Christian Borntraeger, linux-s390, linux-kernel, freude, cohuck,
	pasic, frankja, jjherne

On Wed, 4 Sep 2019 11:05:24 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> On 9/4/19 3:35 AM, Christian Borntraeger wrote:
> > Halil,
> > 
> > can you also send this patch as a separate mail. This also requires a much better
> > patch description about the why and it certainly should also have an agreement from
> > Anthony.
> > 
> > On 30.08.19 18:02, Halil Pasic wrote:
> >> From: Halil Pasic <pasic@linux.ibm.com>
> >> Date: Fri, 30 Aug 2019 17:39:47 +0200
> >> Subject: [PATCH 2/2] s390: vfio-ap: don't wait after AQIC interpretation
> >>
> >> Waiting for the asynchronous part of AQIC to complete as a part
> >> AQIC implementation is unnecessary and silly.
> >>
> >> Let's get rid of vfio_ap_wait_for_irqclear().
> >>
> >> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> >> ---
> >>   drivers/s390/crypto/vfio_ap_ops.c | 50 ++-------------------------------------
> >>   1 file changed, 2 insertions(+), 48 deletions(-)
> >>
> >> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> >> index dd07ebf..8d098f0 100644
> >> --- a/drivers/s390/crypto/vfio_ap_ops.c
> >> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> >> @@ -68,47 +68,6 @@ static struct vfio_ap_queue *vfio_ap_get_queue(
> >>   }
> >>   
> >>   /**
> >> - * vfio_ap_wait_for_irqclear
> >> - * @apqn: The AP Queue number
> >> - *
> >> - * Checks the IRQ bit for the status of this APQN using ap_tapq.
> >> - * Returns if the ap_tapq function succeeded and the bit is clear.
> >> - * Returns if ap_tapq function failed with invalid, deconfigured or
> >> - * checkstopped AP.
> >> - * Otherwise retries up to 5 times after waiting 20ms.
> >> - *
> >> - */
> >> -static void vfio_ap_wait_for_irqclear(int apqn)
> >> -{
> >> -	struct ap_queue_status status;
> >> -	int retry = 5;
> >> -
> >> -	do {
> >> -		status = ap_tapq(apqn, NULL);
> >> -		switch (status.response_code) {
> >> -		case AP_RESPONSE_NORMAL:
> >> -		case AP_RESPONSE_RESET_IN_PROGRESS:
> >> -			if (!status.irq_enabled)
> >> -				return;
> >> -			/* Fall through */
> >> -		case AP_RESPONSE_BUSY:
> >> -			msleep(20);
> >> -			break;
> >> -		case AP_RESPONSE_Q_NOT_AVAIL:
> >> -		case AP_RESPONSE_DECONFIGURED:
> >> -		case AP_RESPONSE_CHECKSTOPPED:
> >> -		default:
> >> -			WARN_ONCE(1, "%s: tapq rc %02x: %04x\n", __func__,
> >> -				  status.response_code, apqn);
> >> -			return;
> >> -		}
> >> -	} while (--retry);
> >> -
> >> -	WARN_ONCE(1, "%s: tapq rc %02x: %04x could not clear IR bit\n",
> >> -		  __func__, status.response_code, apqn);
> >> -}
> >> -
> >> -/**
> >>    * vfio_ap_free_aqic_resources
> >>    * @q: The vfio_ap_queue
> >>    *
> >> @@ -133,14 +92,10 @@ static void vfio_ap_free_aqic_resources(struct vfio_ap_queue *q)
> >>    * @q: The vfio_ap_queue
> >>    *
> >>    * Uses ap_aqic to disable the interruption and in case of success, reset
> >> - * in progress or IRQ disable command already proceeded: calls
> >> - * vfio_ap_wait_for_irqclear() to check for the IRQ bit to be clear
> >> - * and calls vfio_ap_free_aqic_resources() to free the resources associated
> >> + * in progress or IRQ disable command already proceeded :calls
> >> + * vfio_ap_free_aqic_resources() to free the resources associated
> >>    * with the AP interrupt handling.
> >>    *
> >> - * In the case the AP is busy, or a reset is in progress,
> >> - * retries after 20ms, up to 5 times.
> >> - *
> >>    * Returns if ap_aqic function failed with invalid, deconfigured or
> >>    * checkstopped AP.
> >>    */
> >> @@ -155,7 +110,6 @@ struct ap_queue_status vfio_ap_irq_disable(struct vfio_ap_queue *q)
> >>   		switch (status.response_code) {
> >>   		case AP_RESPONSE_OTHERWISE_CHANGED:
> >>   		case AP_RESPONSE_NORMAL:
> >> -			vfio_ap_wait_for_irqclear(q->apqn);
> 
> I am not sure why you consider the wait unnecessary and silly. 

Because the async function associated with AQIC is not supposed/required
to finish during the execution of AQIC. But yes, there is a problem with
this patch.

> Notice
> the response code AP_RESPONSE_OTHERWISE_CHANGED above which means that
> the AP queue is already disabled for interrupts or the enablement
> process has not yet completed.

IMHO we should finish the interpretation of AQIC with response code
AP_RESPONSE_OTHERWISE_CHANGED without any wait. It's up to the guest
to respond to this condition in whatever way it likes, and not up to
us to stall the vcpu.

> Shouldn't we wait for the IRQ to clear
> in this case? I do agree that there is no need to wait if the
> response code is 0.

And the problem with this patch of mine is that we may not call 
vfio_ap_free_aqic_resources(q) before the interrupts are really disabled. The
nib needs to remain pinned until the interrupts are really disabled for
the queue. Please notice that this is the case for response code 0 as
well.

So if we don't want to do error handling and retry and wait
for the guest, we would need to do the cleanup async -- or don't do
any cleanup on AQIC with disable.

Honestly I'm not sure any more what is the smallest evil. Opinions?

Regards,
Halil

> 
> >>   			goto end_free;
> >>   		case AP_RESPONSE_RESET_IN_PROGRESS:
> >>   		case AP_RESPONSE_BUSY:
> >> -- 2.5.5
> > 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-08-30 16:02 ` Halil Pasic
  2019-09-03  7:37   ` Christian Borntraeger
  2019-09-04  7:35   ` Christian Borntraeger
@ 2019-09-05 16:26   ` Tony Krowiak
  2019-09-05 23:10     ` Halil Pasic
  2 siblings, 1 reply; 10+ messages in thread
From: Tony Krowiak @ 2019-09-05 16:26 UTC (permalink / raw)
  To: Halil Pasic
  Cc: linux-s390, linux-kernel, freude, borntraeger, cohuck, pasic,
	frankja, jjherne

On 8/30/19 12:02 PM, Halil Pasic wrote:
> On Mon, 19 Aug 2019 13:48:49 -0400
> Tony Krowiak <akrowiak@linux.ibm.com> wrote:
> 
>> When an AP queue is reset (zeroized), interrupts are disabled. The queue
>> reset function currently tries to disable interrupts unnecessarily. This patch
>> removes the unnecessary calls to disable interrupts after queue reset.
>>
>> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
>> ---
>>   drivers/s390/crypto/vfio_ap_ops.c | 21 +++++++++++++++++----
>>   1 file changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
>> index 0604b49a4d32..e3bcb430e214 100644
>> --- a/drivers/s390/crypto/vfio_ap_ops.c
>> +++ b/drivers/s390/crypto/vfio_ap_ops.c
>> @@ -1114,18 +1114,19 @@ static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
>>   	return NOTIFY_OK;
>>   }
>>   
>> -static void vfio_ap_irq_disable_apqn(int apqn)
>> +static struct vfio_ap_queue *vfio_ap_find_qdev(int apqn)
>>   {
>>   	struct device *dev;
>> -	struct vfio_ap_queue *q;
>> +	struct vfio_ap_queue *q = NULL;
>>   
>>   	dev = driver_find_device(&matrix_dev->vfio_ap_drv->driver, NULL,
>>   				 &apqn, match_apqn);
>>   	if (dev) {
>>   		q = dev_get_drvdata(dev);
>> -		vfio_ap_irq_disable(q);
>>   		put_device(dev);
>>   	}
>> +
>> +	return q;
>>   }
>>   
>>   int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
>> @@ -1164,6 +1165,7 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
>>   	int rc = 0;
>>   	unsigned long apid, apqi;
>>   	struct ap_matrix_mdev *matrix_mdev = mdev_get_drvdata(mdev);
>> +	struct vfio_ap_queue *q;
>>   
>>   	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm,
>>   			     matrix_mdev->matrix.apm_max + 1) {
>> @@ -1177,7 +1179,18 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
>>   			 */
>>   			if (ret)
>>   				rc = ret;
>> -			vfio_ap_irq_disable_apqn(AP_MKQID(apid, apqi));
>> +
>> +			/*
>> +			 * Resetting a queue disables interrupts as a side
>> +			 * effect, so there is no need to disable interrupts
>> +			 * here. Note that an error on reset indicates the
>> +			 * queue is inaccessible, so an attempt to disable
>> +			 * interrupts would fail and is therefore unnecessary.
>> +			 * Just free up the resources used by IRQ processing.
>> +			 */
> 
> I have some concerns about this patch. One thing we must ensure is that
> the machine does not poke freed memory (NIB, GISA). With the current
> design that means we must ensure that there won't be any interruption
> conditions indicated (and interrupts made pending) after this point.

If reset disables interrupts, why would we need to turn around and
disable interrupts after completion of the reset, unless reset disables
interrupts only for the duration of the reset processing? If interrupts
are disabled only for the duration of the reset, then I question whether
we need to be at all concerned about interrupts in the context of
resetting a queue.

> 
> I'm not entirely convinced this is ensured after your change. The
> relevant bits of the  documentation are particularly hard to figure out.
> I could use some clarifications for sure.
> 
> This hunk removes the wait implemented by vfio_ap_wait_for_irqclear()
> along with the hopefully overkill AQIC.
> 
> Let me name some of the scenarios I'm concerned about, along with the
> code that is currently supposed to handle these.
> 
> 
> int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
>                               unsigned int retry)
> {
>          struct ap_queue_status status;
>          int retry2 = 2;
>          int apqn = AP_MKQID(apid, apqi);
>                                                                                  
>          do {
>                  status = ap_zapq(apqn);
>                  switch (status.response_code) {
>                  case AP_RESPONSE_NORMAL:
>                          while (!status.queue_empty && retry2--) {
>                                  msleep(20);
>                                  status = ap_tapq(apqn, NULL);
>                          }
>                          WARN_ON_ONCE(retry <= 0);
>                          return 0;
>                  case AP_RESPONSE_RESET_IN_PROGRESS:
>                  case AP_RESPONSE_BUSY:
>                          msleep(20);
>                          break;
>                  default:
>                          /* things are really broken, give up */
>                          return -EIO;
>                  }
>          } while (retry--);
>                                                                                  
>          return -EBUSY;
> 
> Scenario 1)
> 
> ap_zapq() returns status.response_code == AP_RESPONSE_RESET_IN_PROGRESS,
> because for example G2 did a ZAPQ before us.
> 
> The current logic retries ap_zapq() once after 20 msec. I have no idea
> if that is sufficient under all circumstances. If we get a
> AP_RESPONSE_RESET_IN_PROGRESS again, we return with -EBUSY and do nothing
> about it if the whole process was triggered by vfio_ap_mdev_release().
> Not even a warning.

I'm not sure this is a major concern, If the response code is
AP_RESPONSE_RESET_IN_PROGRESS due to a ZAPQ issued by G2 before us,
then once that completes the queue will be reset which accomplishes
the goal anyway. Maybe what should be done in this case is to wait
for the queue to empty?

> 
> Please notice that this was almost fine before IRQ support, because the
> queue will get reset ASAP, and ...

I'm not sure what IRQ support has to do with how soon the queue is
reset.

> 
> Scenario 2)
> 
> It takes longer than 40 msec for the reset to complete. I don't know if
> this is a possibility, but before your patch we used to wait of 1 sec.

Where are you coming up with 1 sec? The only thing my patch did was
remove the disable IRQ. Even if you include the the wait for IRQ disable
to complete, I don't see 1 second of wait time. By my calculations:

5 x 20ms = 100ms  Max wait for ZAPQ response code 0
2 x 20ms =  40ms  Max wait time for qempty
5 x 20ms = 100ms  Max wait for AQIC response code 0
5 x 20ms = 100ms  Max wait time for IRQ clear
----------------
            340ms  Max total wait time

> 
> I guess the we need the timeouts because do this with matrix_dev->lock
> held. I'm not sure it's a good idea long term.

It looks like your concern here is making sure we wait a
sufficient amount of time for the reset to complete. That may be
a question for the firmware folks.

> 
> Scenario 3)
> 
> ap_zapq() returns status.response_code == AP_RESPONSE_DECONFIGURED. In
> this case we would give up right away. Again so that we don't even know
> we hit this case. There ain't much I can think about we could do here.

The only thing I can think of is to log an error. I am introducing
logging in the dynamic configuration series if that helps.

> 
> I hope we are guaranteed to  not get any bits set at this point, but I could
> use some help.

I don't know what your concern is here. If the card is not in the
configuration, then no activity can take place until it is reconfigured.

> 
>                                    *
> 
> The good thing is I'm pretty sure that we are safe (i.e. no bits will be
> poked by the hardware) after seeing the queue empty after we issued a
> reset request.

You expressed several concerns above, yet now you say you think we're
safe; I don't understand.

> 
> So my concerns basically boil down to are the cumulative timeouts big enough
> so we never time out (including are timeouts really the way to go)?
> 
> We should probably take any discussion about from which point on is it safe
> to assume that NIB and GISA won't be poked by HW offline as the guys
> without documentation can't contribute.

Just a couple of thoughts. The only reason the vfio_ap driver is
concerned with interrupts is because it is intercepting the PQAP(AQIC)
instruction. Shouldn't enabling and disabling interrupts, therefore, be 
solely under the control of the guest? In other words, maybe the vfio_ap
driver should only allocate/deallocate the interrupt controls (i.e., the
NIB and GISA) in response to PQAP(AQIC) interception. When the guest
shuts down, I assume that the PQAP(AQIC) will be executed to disable
interrupts. The driver can do a final cleanup if the NIB and GISA when
the mdev fd is released by the guest if they are still hanging around.

> 
> 
> Two issues not directly related to your patch. I formulated those as two
> follow up patches on top of yours .please take a look at them.
> 
> -------------------------------8<------------------------------------------
> From: Halil Pasic <pasic@linux.ibm.com>
> Date: Fri, 30 Aug 2019 16:03:42 +0200
> Subject: [PATCH 1/2] s390: vfio-ap: fix warning reset not completed
> 
> The intention seems to be to warn once when we don't wait enough for the
> reset to complete. Let's use the right retry counter to accomplish that
> semantic.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e3bcb43..dd07ebf 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1144,7 +1144,7 @@ int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
>   				msleep(20);
>   				status = ap_tapq(apqn, NULL);
>   			}
> -			WARN_ON_ONCE(retry <= 0);
> +			WARN_ON_ONCE(retry2 <= 0);
>   			return 0;
>   		case AP_RESPONSE_RESET_IN_PROGRESS:
>   		case AP_RESPONSE_BUSY:
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts
  2019-09-05 16:26   ` Tony Krowiak
@ 2019-09-05 23:10     ` Halil Pasic
  0 siblings, 0 replies; 10+ messages in thread
From: Halil Pasic @ 2019-09-05 23:10 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, freude, borntraeger, cohuck, pasic,
	frankja, jjherne

On Thu, 5 Sep 2019 12:26:15 -0400
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> On 8/30/19 12:02 PM, Halil Pasic wrote:
[..]
> >> @@ -1177,7 +1179,18 @@ static int vfio_ap_mdev_reset_queues(struct mdev_device *mdev)
> >>   			 */
> >>   			if (ret)
> >>   				rc = ret;
> >> -			vfio_ap_irq_disable_apqn(AP_MKQID(apid, apqi));
> >> +
> >> +			/*
> >> +			 * Resetting a queue disables interrupts as a side
> >> +			 * effect, so there is no need to disable interrupts
> >> +			 * here. Note that an error on reset indicates the
> >> +			 * queue is inaccessible, so an attempt to disable
> >> +			 * interrupts would fail and is therefore unnecessary.
> >> +			 * Just free up the resources used by IRQ processing.
> >> +			 */
> > 
> > I have some concerns about this patch. One thing we must ensure is that
> > the machine does not poke freed memory (NIB, GISA). With the current
> > design that means we must ensure that there won't be any interruption
> > conditions indicated (and interrupts made pending) after this point.
> 
> If reset disables interrupts, why would we need to turn around and
> disable interrupts after completion of the reset, unless reset disables
> interrupts only for the duration of the reset processing?

That a successful queue reset disables interrupts on the queue is beyond
any doubt. My concern is about the fact that a part of the reset itself
is done asynchronously with respect to the PQAP ZAPQ that asks for the
reset.

> If interrupts
> are disabled only for the duration of the reset, then I question whether
> we need to be at all concerned about interrupts in the context of
> resetting a queue.
> 

We need to clean up the NIB. We must avoid use after free like stuff, so
my concern is about the waits.

> > 
> > I'm not entirely convinced this is ensured after your change. The
> > relevant bits of the  documentation are particularly hard to figure out.
> > I could use some clarifications for sure.
> > 
> > This hunk removes the wait implemented by vfio_ap_wait_for_irqclear()
> > along with the hopefully overkill AQIC.
> > 
> > Let me name some of the scenarios I'm concerned about, along with the
> > code that is currently supposed to handle these.
> > 
> > 
> > int vfio_ap_mdev_reset_queue(unsigned int apid, unsigned int apqi,
> >                               unsigned int retry)
> > {
> >          struct ap_queue_status status;
> >          int retry2 = 2;
> >          int apqn = AP_MKQID(apid, apqi);
> >                                                                                  
> >          do {
> >                  status = ap_zapq(apqn);
> >                  switch (status.response_code) {
> >                  case AP_RESPONSE_NORMAL:
> >                          while (!status.queue_empty && retry2--) {
> >                                  msleep(20);
> >                                  status = ap_tapq(apqn, NULL);
> >                          }
> >                          WARN_ON_ONCE(retry <= 0);
> >                          return 0;
> >                  case AP_RESPONSE_RESET_IN_PROGRESS:
> >                  case AP_RESPONSE_BUSY:
> >                          msleep(20);
> >                          break;
> >                  default:
> >                          /* things are really broken, give up */
> >                          return -EIO;
> >                  }
> >          } while (retry--);
> >                                                                                  
> >          return -EBUSY;
> > 
> > Scenario 1)
> > 
> > ap_zapq() returns status.response_code == AP_RESPONSE_RESET_IN_PROGRESS,
> > because for example G2 did a ZAPQ before us.
> > 
> > The current logic retries ap_zapq() once after 20 msec. I have no idea
> > if that is sufficient under all circumstances. If we get a
> > AP_RESPONSE_RESET_IN_PROGRESS again, we return with -EBUSY and do nothing
> > about it if the whole process was triggered by vfio_ap_mdev_release().
> > Not even a warning.
> 
> I'm not sure this is a major concern, If the response code is
> AP_RESPONSE_RESET_IN_PROGRESS due to a ZAPQ issued by G2 before us,
> then once that completes the queue will be reset which accomplishes
> the goal anyway. Maybe what should be done in this case is to wait
> for the queue to empty?
> 

Yes I'm concerned about do we wait long enough because we need to avoid
HW poking memory that could be used for something else at that point
(NIB, GISA.IPM).

> > 
> > Please notice that this was almost fine before IRQ support, because the
> > queue will get reset ASAP, and ...
> 
> I'm not sure what IRQ support has to do with how soon the queue is
> reset.
> 

If there is no IRQs AP HW won't just poke guest memory like it does with
the NIB. And since the AP instructions are bound to get reset in progress
response code even there is no danger of queue state leaking because the
reset is not fully completed yet.

> > 
> > Scenario 2)
> > 
> > It takes longer than 40 msec for the reset to complete. I don't know if
> > this is a possibility, but before your patch we used to wait of 1 sec.
> 
> Where are you coming up with 1 sec?

My bad, it was supposed to be 0.1 sec == 5 X 20ms (wait after AQIC that
got kicked by this patch). What I want to say is this patch
substantially reduces the wait time.

> The only thing my patch did was
> remove the disable IRQ. Even if you include the the wait for IRQ disable
> to complete, I don't see 1 second of wait time. By my calculations:
> 
> 5 x 20ms = 100ms  Max wait for ZAPQ response code 0
> 2 x 20ms =  40ms  Max wait time for qempty
> 5 x 20ms = 100ms  Max wait for AQIC response code 0
> 5 x 20ms = 100ms  Max wait time for IRQ clear
> ----------------
>             340ms  Max total wait time
> 
> > 
> > I guess the we need the timeouts because do this with matrix_dev->lock
> > held. I'm not sure it's a good idea long term.
> 
> It looks like your concern here is making sure we wait a
> sufficient amount of time for the reset to complete. That may be
> a question for the firmware folks.
> 

Yes, my concern is making sure we don't clean up/give up stuff
prematurely. The current code seems to do the waits for this purpose.
I'm not sure what is actually necessary. Clarifying stuff with firmware
or somebody else that knows better than I do is probably a good idea.

> > 
> > Scenario 3)
> > 
> > ap_zapq() returns status.response_code == AP_RESPONSE_DECONFIGURED. In
> > this case we would give up right away. Again so that we don't even know
> > we hit this case. There ain't much I can think about we could do here.
> 
> The only thing I can think of is to log an error. I am introducing
> logging in the dynamic configuration series if that helps.
> 

I don't think it is an error.

> > 
> > I hope we are guaranteed to  not get any bits set at this point, but I could
> > use some help.
> 
> I don't know what your concern is here. If the card is not in the
> configuration, then no activity can take place until it is reconfigured.
> 

Yes, I also think that is the only way this make sense. But I usually
don't assume complex things happen atomically unless I have a guarantee
they do. I will explain what I mean by this offline.

> > 
> >                                    *
> > 
> > The good thing is I'm pretty sure that we are safe (i.e. no bits will be
> > poked by the hardware) after seeing the queue empty after we issued a
> > reset request.
> 
> You expressed several concerns above, yet now you say you think we're
> safe; I don't understand.

We are safe after we see the E bit set to one (queue empty) after the
reset. Above I describe the scenarios where we don't wait for that to
happen (in scenario 3 because it won't happen because the card is gone,
and in the other two because of the implemented timeouts).

> 
> > 
> > So my concerns basically boil down to are the cumulative timeouts big enough
> > so we never time out (including are timeouts really the way to go)?
> > 
> > We should probably take any discussion about from which point on is it safe
> > to assume that NIB and GISA won't be poked by HW offline as the guys
> > without documentation can't contribute.
> 
> Just a couple of thoughts. The only reason the vfio_ap driver is
> concerned with interrupts is because it is intercepting the PQAP(AQIC)
> instruction. Shouldn't enabling and disabling interrupts, therefore, be 
> solely under the control of the guest? In other words, maybe the vfio_ap
> driver should only allocate/deallocate the interrupt controls (i.e., the
> NIB and GISA) in response to PQAP(AQIC) interception.

GISA is a 1 per guest entity in the hypervisor. The NIB is a byte of the
guest memory.

> When the guest
> shuts down, I assume that the PQAP(AQIC) will be executed to disable
> interrupts. 

I would not assume that.

> The driver can do a final cleanup if the NIB and GISA when
> the mdev fd is released by the guest if they are still hanging around.
> 

You have to also consider stuff like subsystem resets. Yes the last line
of defense should be release(). This is what we do today AFAIU.

Regards,
Halil


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-09-05 23:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-19 17:48 [PATCH v2] s390: vfio-ap: remove unnecessary calls to disable queue interrupts Tony Krowiak
2019-08-19 18:56 ` Cornelia Huck
2019-08-30 16:02 ` Halil Pasic
2019-09-03  7:37   ` Christian Borntraeger
2019-09-04  7:35   ` Christian Borntraeger
2019-09-04 15:05     ` Tony Krowiak
2019-09-05 11:03       ` Halil Pasic
2019-09-05 10:24     ` Halil Pasic
2019-09-05 16:26   ` Tony Krowiak
2019-09-05 23:10     ` Halil Pasic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).