linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] crypto: qat - resolve race condition during AER recovery
@ 2024-02-09 12:43 Damian Muszynski
  2024-02-17  1:15 ` Herbert Xu
  2024-05-08  8:39 ` [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak Herbert Xu
  0 siblings, 2 replies; 5+ messages in thread
From: Damian Muszynski @ 2024-02-09 12:43 UTC (permalink / raw)
  To: herbert
  Cc: linux-crypto, qat-linux, Damian Muszynski, stable, Giovanni Cabiddu

During the PCI AER system's error recovery process, the kernel driver
may encounter a race condition with freeing the reset_data structure's
memory. If the device restart will take more than 10 seconds the function
scheduling that restart will exit due to a timeout, and the reset_data
structure will be freed. However, this data structure is used for
completion notification after the restart is completed, which leads
to a UAF bug.

This results in a KFENCE bug notice.

  BUG: KFENCE: use-after-free read in adf_device_reset_worker+0x38/0xa0 [intel_qat]
  Use-after-free read at 0x00000000bc56fddf (in kfence-#142):
  adf_device_reset_worker+0x38/0xa0 [intel_qat]
  process_one_work+0x173/0x340

To resolve this race condition, the memory associated to the container
of the work_struct is freed on the worker if the timeout expired,
otherwise on the function that schedules the worker.
The timeout detection can be done by checking if the caller is
still waiting for completion or not by using completion_done() function.

Fixes: d8cba25d2c68 ("crypto: qat - Intel(R) QAT driver framework")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damian Muszynski <damian.muszynski@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 drivers/crypto/intel/qat/qat_common/adf_aer.c | 22 ++++++++++++++-----
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index 3597e7605a14..9da2278bd5b7 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -130,7 +130,8 @@ static void adf_device_reset_worker(struct work_struct *work)
 	if (adf_dev_restart(accel_dev)) {
 		/* The device hanged and we can't restart it so stop here */
 		dev_err(&GET_DEV(accel_dev), "Restart device failed\n");
-		if (reset_data->mode == ADF_DEV_RESET_ASYNC)
+		if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
+		    completion_done(&reset_data->compl))
 			kfree(reset_data);
 		WARN(1, "QAT: device restart failed. Device is unusable\n");
 		return;
@@ -146,11 +147,19 @@ static void adf_device_reset_worker(struct work_struct *work)
 	adf_dev_restarted_notify(accel_dev);
 	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
 
-	/* The dev is back alive. Notify the caller if in sync mode */
-	if (reset_data->mode == ADF_DEV_RESET_SYNC)
-		complete(&reset_data->compl);
-	else
+	/*
+	 * The dev is back alive. Notify the caller if in sync mode
+	 *
+	 * If device restart will take a more time than expected,
+	 * the schedule_reset() function can timeout and exit. This can be
+	 * detected by calling the completion_done() function. In this case
+	 * the reset_data structure needs to be freed here.
+	 */
+	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
+	    completion_done(&reset_data->compl))
 		kfree(reset_data);
+	else
+		complete(&reset_data->compl);
 }
 
 static int adf_dev_aer_schedule_reset(struct adf_accel_dev *accel_dev,
@@ -183,8 +192,9 @@ static int adf_dev_aer_schedule_reset(struct adf_accel_dev *accel_dev,
 			dev_err(&GET_DEV(accel_dev),
 				"Reset device timeout expired\n");
 			ret = -EFAULT;
+		} else {
+			kfree(reset_data);
 		}
-		kfree(reset_data);
 		return ret;
 	}
 	return 0;

base-commit: 86f2ff2d4ec09a7eea931a56fbed2105037ba2ee
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: qat - resolve race condition during AER recovery
  2024-02-09 12:43 [PATCH] crypto: qat - resolve race condition during AER recovery Damian Muszynski
@ 2024-02-17  1:15 ` Herbert Xu
  2024-05-08  8:39 ` [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak Herbert Xu
  1 sibling, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2024-02-17  1:15 UTC (permalink / raw)
  To: Damian Muszynski; +Cc: linux-crypto, qat-linux, stable, Giovanni Cabiddu

On Fri, Feb 09, 2024 at 01:43:42PM +0100, Damian Muszynski wrote:
> During the PCI AER system's error recovery process, the kernel driver
> may encounter a race condition with freeing the reset_data structure's
> memory. If the device restart will take more than 10 seconds the function
> scheduling that restart will exit due to a timeout, and the reset_data
> structure will be freed. However, this data structure is used for
> completion notification after the restart is completed, which leads
> to a UAF bug.
> 
> This results in a KFENCE bug notice.
> 
>   BUG: KFENCE: use-after-free read in adf_device_reset_worker+0x38/0xa0 [intel_qat]
>   Use-after-free read at 0x00000000bc56fddf (in kfence-#142):
>   adf_device_reset_worker+0x38/0xa0 [intel_qat]
>   process_one_work+0x173/0x340
> 
> To resolve this race condition, the memory associated to the container
> of the work_struct is freed on the worker if the timeout expired,
> otherwise on the function that schedules the worker.
> The timeout detection can be done by checking if the caller is
> still waiting for completion or not by using completion_done() function.
> 
> Fixes: d8cba25d2c68 ("crypto: qat - Intel(R) QAT driver framework")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Damian Muszynski <damian.muszynski@intel.com>
> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> ---
>  drivers/crypto/intel/qat/qat_common/adf_aer.c | 22 ++++++++++++++-----
>  1 file changed, 16 insertions(+), 6 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak
  2024-02-09 12:43 [PATCH] crypto: qat - resolve race condition during AER recovery Damian Muszynski
  2024-02-17  1:15 ` Herbert Xu
@ 2024-05-08  8:39 ` Herbert Xu
  2024-05-08 10:23   ` Damian Muszynski
  2024-05-16 21:16   ` Cabiddu, Giovanni
  1 sibling, 2 replies; 5+ messages in thread
From: Herbert Xu @ 2024-05-08  8:39 UTC (permalink / raw)
  To: Damian Muszynski; +Cc: linux-crypto, qat-linux, Giovanni Cabiddu

On Fri, Feb 09, 2024 at 01:43:42PM +0100, Damian Muszynski wrote:
>
> @@ -146,11 +147,19 @@ static void adf_device_reset_worker(struct work_struct *work)
>  	adf_dev_restarted_notify(accel_dev);
>  	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
>  
> -	/* The dev is back alive. Notify the caller if in sync mode */
> -	if (reset_data->mode == ADF_DEV_RESET_SYNC)
> -		complete(&reset_data->compl);
> -	else
> +	/*
> +	 * The dev is back alive. Notify the caller if in sync mode
> +	 *
> +	 * If device restart will take a more time than expected,
> +	 * the schedule_reset() function can timeout and exit. This can be
> +	 * detected by calling the completion_done() function. In this case
> +	 * the reset_data structure needs to be freed here.
> +	 */
> +	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
> +	    completion_done(&reset_data->compl))
>  		kfree(reset_data);
> +	else
> +		complete(&reset_data->compl);

This doesn't work because until you call complete, completion_done
will always return false.  IOW we now have a memory leak instead of
a UAF.

---8<---
Using completion_done to determine whether the caller has gone
away only works after a complete call.  Furthermore it's still
possible that the caller has not yet called wait_for_completion,
resulting in another potential UAF.

Fix this by making the caller use cancel_work_sync and then freeing
the memory safely.

Fixes: 7d42e097607c ("crypto: qat - resolve race condition during AER recovery")
Cc: <stable@vger.kernel.org> #6.8+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
index 9da2278bd5b7..04260f61d042 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
@@ -130,8 +130,7 @@ static void adf_device_reset_worker(struct work_struct *work)
 	if (adf_dev_restart(accel_dev)) {
 		/* The device hanged and we can't restart it so stop here */
 		dev_err(&GET_DEV(accel_dev), "Restart device failed\n");
-		if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
-		    completion_done(&reset_data->compl))
+		if (reset_data->mode == ADF_DEV_RESET_ASYNC)
 			kfree(reset_data);
 		WARN(1, "QAT: device restart failed. Device is unusable\n");
 		return;
@@ -147,16 +146,8 @@ static void adf_device_reset_worker(struct work_struct *work)
 	adf_dev_restarted_notify(accel_dev);
 	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
 
-	/*
-	 * The dev is back alive. Notify the caller if in sync mode
-	 *
-	 * If device restart will take a more time than expected,
-	 * the schedule_reset() function can timeout and exit. This can be
-	 * detected by calling the completion_done() function. In this case
-	 * the reset_data structure needs to be freed here.
-	 */
-	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
-	    completion_done(&reset_data->compl))
+	/* The dev is back alive. Notify the caller if in sync mode */
+	if (reset_data->mode == ADF_DEV_RESET_ASYNC)
 		kfree(reset_data);
 	else
 		complete(&reset_data->compl);
@@ -191,10 +182,10 @@ static int adf_dev_aer_schedule_reset(struct adf_accel_dev *accel_dev,
 		if (!timeout) {
 			dev_err(&GET_DEV(accel_dev),
 				"Reset device timeout expired\n");
+			cancel_work_sync(&reset_data->reset_work);
 			ret = -EFAULT;
-		} else {
-			kfree(reset_data);
 		}
+		kfree(reset_data);
 		return ret;
 	}
 	return 0;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak
  2024-05-08  8:39 ` [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak Herbert Xu
@ 2024-05-08 10:23   ` Damian Muszynski
  2024-05-16 21:16   ` Cabiddu, Giovanni
  1 sibling, 0 replies; 5+ messages in thread
From: Damian Muszynski @ 2024-05-08 10:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-crypto, qat-linux, Giovanni Cabiddu

Hi Herbert,

Thanks for your vigilance. I based this fix on the description of
completion_done() which can be misunderstood as can be seen.

--
Damian

On 2024-05-08 at 16:39:51 +0800, Herbert Xu wrote:
> On Fri, Feb 09, 2024 at 01:43:42PM +0100, Damian Muszynski wrote:
> >
> > @@ -146,11 +147,19 @@ static void adf_device_reset_worker(struct work_struct *work)
> >  	adf_dev_restarted_notify(accel_dev);
> >  	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
> >
> > -	/* The dev is back alive. Notify the caller if in sync mode */
> > -	if (reset_data->mode == ADF_DEV_RESET_SYNC)
> > -		complete(&reset_data->compl);
> > -	else
> > +	/*
> > +	 * The dev is back alive. Notify the caller if in sync mode
> > +	 *
> > +	 * If device restart will take a more time than expected,
> > +	 * the schedule_reset() function can timeout and exit. This can be
> > +	 * detected by calling the completion_done() function. In this case
> > +	 * the reset_data structure needs to be freed here.
> > +	 */
> > +	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
> > +	    completion_done(&reset_data->compl))
> >  		kfree(reset_data);
> > +	else
> > +		complete(&reset_data->compl);
>
> This doesn't work because until you call complete, completion_done
> will always return false.  IOW we now have a memory leak instead of
> a UAF.
>
> ---8<---
> Using completion_done to determine whether the caller has gone
> away only works after a complete call.  Furthermore it's still
> possible that the caller has not yet called wait_for_completion,
> resulting in another potential UAF.
>
> Fix this by making the caller use cancel_work_sync and then freeing
> the memory safely.
>
> Fixes: 7d42e097607c ("crypto: qat - resolve race condition during AER recovery")
> Cc: <stable@vger.kernel.org> #6.8+
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/drivers/crypto/intel/qat/qat_common/adf_aer.c b/drivers/crypto/intel/qat/qat_common/adf_aer.c
> index 9da2278bd5b7..04260f61d042 100644
> --- a/drivers/crypto/intel/qat/qat_common/adf_aer.c
> +++ b/drivers/crypto/intel/qat/qat_common/adf_aer.c
> @@ -130,8 +130,7 @@ static void adf_device_reset_worker(struct work_struct *work)
>  	if (adf_dev_restart(accel_dev)) {
>  		/* The device hanged and we can't restart it so stop here */
>  		dev_err(&GET_DEV(accel_dev), "Restart device failed\n");
> -		if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
> -		    completion_done(&reset_data->compl))
> +		if (reset_data->mode == ADF_DEV_RESET_ASYNC)
>  			kfree(reset_data);
>  		WARN(1, "QAT: device restart failed. Device is unusable\n");
>  		return;
> @@ -147,16 +146,8 @@ static void adf_device_reset_worker(struct work_struct *work)
>  	adf_dev_restarted_notify(accel_dev);
>  	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
>
> -	/*
> -	 * The dev is back alive. Notify the caller if in sync mode
> -	 *
> -	 * If device restart will take a more time than expected,
> -	 * the schedule_reset() function can timeout and exit. This can be
> -	 * detected by calling the completion_done() function. In this case
> -	 * the reset_data structure needs to be freed here.
> -	 */
> -	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
> -	    completion_done(&reset_data->compl))
> +	/* The dev is back alive. Notify the caller if in sync mode */
> +	if (reset_data->mode == ADF_DEV_RESET_ASYNC)
>  		kfree(reset_data);
>  	else
>  		complete(&reset_data->compl);
> @@ -191,10 +182,10 @@ static int adf_dev_aer_schedule_reset(struct adf_accel_dev *accel_dev,
>  		if (!timeout) {
>  			dev_err(&GET_DEV(accel_dev),
>  				"Reset device timeout expired\n");
> +			cancel_work_sync(&reset_data->reset_work);
>  			ret = -EFAULT;
> -		} else {
> -			kfree(reset_data);
>  		}
> +		kfree(reset_data);
>  		return ret;
>  	}
>  	return 0;
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak
  2024-05-08  8:39 ` [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak Herbert Xu
  2024-05-08 10:23   ` Damian Muszynski
@ 2024-05-16 21:16   ` Cabiddu, Giovanni
  1 sibling, 0 replies; 5+ messages in thread
From: Cabiddu, Giovanni @ 2024-05-16 21:16 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Damian Muszynski, linux-crypto, qat-linux

On Wed, May 08, 2024 at 04:39:51PM +0800, Herbert Xu wrote:
> On Fri, Feb 09, 2024 at 01:43:42PM +0100, Damian Muszynski wrote:
> >
> > @@ -146,11 +147,19 @@ static void adf_device_reset_worker(struct work_struct *work)
> >  	adf_dev_restarted_notify(accel_dev);
> >  	clear_bit(ADF_STATUS_RESTARTING, &accel_dev->status);
> >  
> > -	/* The dev is back alive. Notify the caller if in sync mode */
> > -	if (reset_data->mode == ADF_DEV_RESET_SYNC)
> > -		complete(&reset_data->compl);
> > -	else
> > +	/*
> > +	 * The dev is back alive. Notify the caller if in sync mode
> > +	 *
> > +	 * If device restart will take a more time than expected,
> > +	 * the schedule_reset() function can timeout and exit. This can be
> > +	 * detected by calling the completion_done() function. In this case
> > +	 * the reset_data structure needs to be freed here.
> > +	 */
> > +	if (reset_data->mode == ADF_DEV_RESET_ASYNC ||
> > +	    completion_done(&reset_data->compl))
> >  		kfree(reset_data);
> > +	else
> > +		complete(&reset_data->compl);
> 
> This doesn't work because until you call complete, completion_done
> will always return false.  IOW we now have a memory leak instead of
> a UAF.
> 
> ---8<---
> Using completion_done to determine whether the caller has gone
> away only works after a complete call.  Furthermore it's still
> possible that the caller has not yet called wait_for_completion,
> resulting in another potential UAF.
> 
> Fix this by making the caller use cancel_work_sync and then freeing
> the memory safely.
> 
> Fixes: 7d42e097607c ("crypto: qat - resolve race condition during AER recovery")
> Cc: <stable@vger.kernel.org> #6.8+
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>

This is also present in 6.6+ and 6.7+.

-- 
Giovanni

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-05-16 21:16 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-09 12:43 [PATCH] crypto: qat - resolve race condition during AER recovery Damian Muszynski
2024-02-17  1:15 ` Herbert Xu
2024-05-08  8:39 ` [PATCH] crypto: qat - Fix ADF_DEV_RESET_SYNC memory leak Herbert Xu
2024-05-08 10:23   ` Damian Muszynski
2024-05-16 21:16   ` Cabiddu, Giovanni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).