dri-devel.lists.freedesktop.org archive mirror
* [PATCH 1/3] accel/habanalabs: remove pdev check on idle check
@ 2023-06-12 12:07 Oded Gabbay
  2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
  To: dri-devel

Our simulator now supports the idle check, so there is no longer any need
to check whether pdev exists.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 0d02f1f7b994..5e61761b8c11 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -424,7 +424,7 @@ static void hpriv_release(struct kref *ref)
 	/* Check the device idle status and reset if not idle.
 	 * Skip it if already in reset, or if device is going to be reset in any case.
 	 */
-	if (!hdev->reset_info.in_reset && !reset_device && hdev->pdev && !hdev->pldm)
+	if (!hdev->reset_info.in_reset && !reset_device && !hdev->pldm)
 		device_is_idle = hdev->asic_funcs->is_device_idle(hdev, idle_mask,
 							HL_BUSY_ENGINES_MASK_EXT_SIZE, NULL);
 	if (!device_is_idle) {
-- 
2.40.1
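
For readers following along outside the kernel tree, the gating condition
after this patch can be sketched in plain C. This is only an illustration:
the struct and function names are hypothetical, and the fields merely mirror
the flags that appear in the diff above. The point it shows is that a device
without a pdev (for example the simulator) is no longer excluded from the
idle check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state consulted by the condition. */
struct ex_dev_state {
	bool in_reset;	/* mirrors hdev->reset_info.in_reset */
	bool has_pdev;	/* mirrors hdev->pdev != NULL; no longer consulted */
	bool pldm;	/* mirrors hdev->pldm */
};

/*
 * Returns true when the idle check should run: the device is not already
 * in reset, no reset is pending anyway, and the pldm flag is clear.
 * Whether a pdev exists does not matter any more.
 */
static bool ex_should_check_idle(const struct ex_dev_state *s, bool reset_device)
{
	return !s->in_reset && !reset_device && !s->pldm;
}
```

With this sketch, a state with has_pdev set to false and the other flags
clear yields true, which the pre-patch condition would have rejected.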


* [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed
  2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
@ 2023-06-12 12:07 ` Oded Gabbay
  2023-06-13  7:54   ` Ofir Bitton
  2023-06-12 12:07 ` [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error Oded Gabbay
  2023-06-13  7:54 ` [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Ofir Bitton
  2 siblings, 1 reply; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
  To: dri-devel

If scrubbing memory after the user released the device fails, the device
is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
 		/* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
 		int rc = hdev->asic_funcs->scrub_device_mem(hdev);
 
-		if (rc)
+		if (rc) {
 			dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+			hl_device_reset(hdev, HL_DRV_RESET_HARD);
+		}
 	}
 
 	/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different
-- 
2.40.1
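
The error-handling pattern in this patch (log the failure, then escalate to
a hard reset) can be tried in user space with a small sketch. Everything
below is hypothetical scaffolding: ex_device, ex_scrub_device_mem() and
ex_hard_reset() are stubs standing in for the driver's scrub_device_mem
callback and hl_device_reset(), and are not the driver's API.

```c
#include <stdio.h>

/* Hypothetical stub device used only for this illustration. */
struct ex_device {
	int scrub_rc;		/* value the stubbed scrub step will return */
	int hard_resets;	/* how many hard resets were requested */
};

/* Stub for the scrub callback: returns 0 on success, non-zero on failure. */
static int ex_scrub_device_mem(struct ex_device *dev)
{
	return dev->scrub_rc;
}

/* Stub for the hard reset: only counts the request. */
static void ex_hard_reset(struct ex_device *dev)
{
	dev->hard_resets++;
}

/* On release, scrub memory; if that fails, report it and reset the device. */
static void ex_release(struct ex_device *dev)
{
	int rc = ex_scrub_device_mem(dev);

	if (rc) {
		fprintf(stderr, "failed to scrub memory from release (%d)\n", rc);
		ex_hard_reset(dev);
	}
}
```

Calling ex_release() on a device whose scrub_rc is 0 leaves hard_resets
untouched, while a non-zero scrub_rc requests exactly one reset.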


* [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error
  2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
  2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
@ 2023-06-12 12:07 ` Oded Gabbay
  2023-06-13  7:54 ` [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Ofir Bitton
  2 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
  To: dri-devel; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

Dump an error that the f/w reports at boot time.
This error indicates a failure to set the temperature threshold.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/firmware_if.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c
index 370508e98854..c7da69dbfa0a 100644
--- a/drivers/accel/habanalabs/common/firmware_if.c
+++ b/drivers/accel/habanalabs/common/firmware_if.c
@@ -724,6 +724,11 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val,
 		err_exists = true;
 	}
 
+	if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) {
+		dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n");
+		err_exists = true;
+	}
+
 	if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) {
 		/* Ignore this bit, don't prevent driver loading */
 		dev_dbg(hdev->dev, "device unusable status is set\n");
-- 
2.40.1
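
The hunk above follows a common bitmask-decoding pattern: test each error
bit, print a message, and record whether any fatal error was seen. A minimal
stand-alone sketch of that pattern is shown here. The bit values and the
ex_ names are assumptions made for illustration; the real
CPU_BOOT_ERR0_* definitions live in the driver's f/w interface headers and
are not shown in this thread. Treating the "device unusable" bit as
non-fatal is inferred from the context comment in the diff.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit positions, chosen only for this example. */
#define EX_BOOT_ERR_TMP_THRESH_INIT_FAIL	(1u << 0)
#define EX_BOOT_ERR_DEVICE_UNUSABLE_FAIL	(1u << 1)

/* Returns true if err_val contains at least one error treated as fatal. */
static bool ex_report_boot_errors(uint32_t err_val)
{
	bool err_exists = false;

	if (err_val & EX_BOOT_ERR_TMP_THRESH_INIT_FAIL) {
		fprintf(stderr,
			"Device boot error - Failed to set threshold for temperature sensor\n");
		err_exists = true;
	}

	if (err_val & EX_BOOT_ERR_DEVICE_UNUSABLE_FAIL) {
		/* Reported for debugging only; does not mark the boot as failed. */
		fprintf(stderr, "device unusable status is set\n");
	}

	return err_exists;
}
```

Because each bit is tested independently, several errors set in the same
register value are all reported in one pass.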


* Re: [PATCH 1/3] accel/habanalabs: remove pdev check on idle check
  2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
  2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
  2023-06-12 12:07 ` [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error Oded Gabbay
@ 2023-06-13  7:54 ` Ofir Bitton
  2 siblings, 0 replies; 5+ messages in thread
From: Ofir Bitton @ 2023-06-13  7:54 UTC
  To: Oded Gabbay, dri-devel

On 12/06/2023 15:07, Oded Gabbay wrote:

> Our simulator now supports the idle check, so there is no longer any need
> to check whether pdev exists.
>
> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
> ---
>  drivers/accel/habanalabs/common/device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
> index 0d02f1f7b994..5e61761b8c11 100644
> --- a/drivers/accel/habanalabs/common/device.c
> +++ b/drivers/accel/habanalabs/common/device.c
> @@ -424,7 +424,7 @@ static void hpriv_release(struct kref *ref)
>         /* Check the device idle status and reset if not idle.
>          * Skip it if already in reset, or if device is going to be reset in any case.
>          */
> -       if (!hdev->reset_info.in_reset && !reset_device && hdev->pdev && !hdev->pldm)
> +       if (!hdev->reset_info.in_reset && !reset_device && !hdev->pldm)
>                 device_is_idle = hdev->asic_funcs->is_device_idle(hdev, idle_mask,
>                                                         HL_BUSY_ENGINES_MASK_EXT_SIZE, NULL);
>         if (!device_is_idle) {


Reviewed-by: Ofir Bitton <obitton@habana.ai>

* Re: [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed
  2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
@ 2023-06-13  7:54   ` Ofir Bitton
  0 siblings, 0 replies; 5+ messages in thread
From: Ofir Bitton @ 2023-06-13  7:54 UTC
  To: Oded Gabbay, dri-devel

On 12/06/2023 15:07, Oded Gabbay wrote:

> If scrubbing memory after the user released the device fails, the device
> is in a bad state and should be reset.
>
> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
> ---
>  drivers/accel/habanalabs/common/device.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
> index 5e61761b8c11..d7d9198b2103 100644
> --- a/drivers/accel/habanalabs/common/device.c
> +++ b/drivers/accel/habanalabs/common/device.c
> @@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
>                 /* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
>                 int rc = hdev->asic_funcs->scrub_device_mem(hdev);
>
> -               if (rc)
> +               if (rc) {
>                         dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
> +                       hl_device_reset(hdev, HL_DRV_RESET_HARD);
> +               }
>         }
>
>         /* Now we can mark the compute_ctx as not active. Even if a reset is running in a different


Reviewed-by: Ofir Bitton <obitton@habana.ai>
