* [PATCH 1/3] accel/habanalabs: remove pdev check on idle check
@ 2023-06-12 12:07 Oded Gabbay
2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
To: dri-devel
Our simulator now supports the idle check, so there is no longer any
need to check whether pdev exists.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 0d02f1f7b994..5e61761b8c11 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -424,7 +424,7 @@ static void hpriv_release(struct kref *ref)
/* Check the device idle status and reset if not idle.
* Skip it if already in reset, or if device is going to be reset in any case.
*/
- if (!hdev->reset_info.in_reset && !reset_device && hdev->pdev && !hdev->pldm)
+ if (!hdev->reset_info.in_reset && !reset_device && !hdev->pldm)
device_is_idle = hdev->asic_funcs->is_device_idle(hdev, idle_mask,
HL_BUSY_ENGINES_MASK_EXT_SIZE, NULL);
if (!device_is_idle) {
--
2.40.1
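
For readers who want to see the effect of the condition change in isolation, here is a minimal user-space sketch. It is not driver code: struct idle_model_dev and both helper functions are hypothetical stand-ins that only mirror the boolean terms visible in the hunk above, on the assumption that a simulator run is one where hdev->pdev is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few hl_device fields the condition reads. */
struct idle_model_dev {
	bool in_reset;  /* models hdev->reset_info.in_reset */
	bool has_pdev;  /* models hdev->pdev != NULL (assumed false on a simulator) */
	bool pldm;      /* models hdev->pldm */
};

/* Condition before the patch: the idle check is skipped when there is no pdev. */
bool runs_idle_check_old(const struct idle_model_dev *dev, bool reset_device)
{
	assert(dev != NULL);
	return !dev->in_reset && !reset_device && dev->has_pdev && !dev->pldm;
}

/* Condition after the patch: the pdev term is gone. */
bool runs_idle_check_new(const struct idle_model_dev *dev, bool reset_device)
{
	assert(dev != NULL);
	return !dev->in_reset && !reset_device && !dev->pldm;
}
```

In this model the two conditions agree whenever a pdev is present and differ only when it is absent, which is the case the commit message describes.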
* [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed
2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
@ 2023-06-12 12:07 ` Oded Gabbay
2023-06-13 7:54 ` Ofir Bitton
2023-06-12 12:07 ` [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error Oded Gabbay
2023-06-13 7:54 ` [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Ofir Bitton
2 siblings, 1 reply; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
To: dri-devel
If scrubbing memory after the user has released the device fails, the
device is in a bad state and should be reset.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
/* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
int rc = hdev->asic_funcs->scrub_device_mem(hdev);
- if (rc)
+ if (rc) {
dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+ hl_device_reset(hdev, HL_DRV_RESET_HARD);
+ }
}
/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different
--
2.40.1
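
The control flow this patch introduces (log the scrub failure, then request a hard reset) can be sketched outside the kernel as follows. Everything here is an assumption made for illustration: struct scrub_model_dev, MODEL_RESET_HARD, model_device_reset() and the two fake scrub callbacks are hypothetical stand-ins, and the reset helper only counts requests instead of touching any hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_RESET_HARD 0x1u /* hypothetical stand-in for HL_DRV_RESET_HARD */

struct scrub_model_dev {
	/* models hdev->asic_funcs->scrub_device_mem */
	int (*scrub_device_mem)(struct scrub_model_dev *dev);
	/* counts how many hard resets were requested */
	unsigned int hard_resets;
};

/* Stand-in for hl_device_reset(): only records that a hard reset was requested. */
void model_device_reset(struct scrub_model_dev *dev, unsigned int flags)
{
	if (flags & MODEL_RESET_HARD)
		dev->hard_resets++;
}

/* Mirrors the patched branch: on scrub failure, log it and reset the device. */
int model_scrub_on_release(struct scrub_model_dev *dev)
{
	int rc;

	assert(dev != NULL && dev->scrub_device_mem != NULL);
	rc = dev->scrub_device_mem(dev);
	if (rc) {
		fprintf(stderr, "failed to scrub memory from hpriv release (%d)\n", rc);
		model_device_reset(dev, MODEL_RESET_HARD);
	}
	return rc;
}

/* Two fake scrub callbacks for exercising both paths. */
int scrub_ok(struct scrub_model_dev *dev)
{
	(void)dev;
	return 0;
}

int scrub_fail(struct scrub_model_dev *dev)
{
	(void)dev;
	return -5; /* arbitrary non-zero error code */
}
```

A device whose callback is scrub_ok ends with hard_resets still at zero, while one using scrub_fail ends with exactly one recorded reset request.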
* [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error
2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
@ 2023-06-12 12:07 ` Oded Gabbay
2023-06-13 7:54 ` [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Ofir Bitton
2 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2023-06-12 12:07 UTC
To: dri-devel; +Cc: Ofir Bitton
From: Ofir Bitton <obitton@habana.ai>
Add a dump of an error reported by the f/w at boot time.
This error indicates a failure to set the temperature threshold.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/firmware_if.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c
index 370508e98854..c7da69dbfa0a 100644
--- a/drivers/accel/habanalabs/common/firmware_if.c
+++ b/drivers/accel/habanalabs/common/firmware_if.c
@@ -724,6 +724,11 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val,
err_exists = true;
}
+ if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) {
+ dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n");
+ err_exists = true;
+ }
+
if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) {
/* Ignore this bit, don't prevent driver loading */
dev_dbg(hdev->dev, "device unusable status is set\n");
--
2.40.1
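
The bit-test pattern used in this hunk can be illustrated with a small stand-alone sketch. The two mask values below are hypothetical placeholders, not the real CPU_BOOT_ERR0_* definitions from the driver's f/w interface headers, and model_report_boot_errors() is an invented name that covers only the two bits visible in the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit positions chosen for this sketch only. */
#define MODEL_ERR_TMP_THRESH_INIT_FAIL (1u << 0)
#define MODEL_ERR_DEVICE_UNUSABLE_FAIL (1u << 1)

/* Returns true when err_val carries a bit that this model counts as an error. */
bool model_report_boot_errors(uint32_t err_val)
{
	bool err_exists = false;

	if (err_val & MODEL_ERR_TMP_THRESH_INIT_FAIL) {
		fprintf(stderr,
			"Device boot error - Failed to set threshold for temperature sensor\n");
		err_exists = true;
	}

	if (err_val & MODEL_ERR_DEVICE_UNUSABLE_FAIL) {
		/* Reported for information only, so err_exists is left untouched. */
		fprintf(stderr, "device unusable status is set\n");
	}

	return err_exists;
}
```

The point of the sketch is the asymmetry between the two branches: the temperature threshold bit sets err_exists, whereas the device-unusable bit is only reported.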
* Re: [PATCH 1/3] accel/habanalabs: remove pdev check on idle check
2023-06-12 12:07 [PATCH 1/3] accel/habanalabs: remove pdev check on idle check Oded Gabbay
2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
2023-06-12 12:07 ` [PATCH 3/3] accel/habanalabs: dump temperature threshold boot error Oded Gabbay
@ 2023-06-13 7:54 ` Ofir Bitton
2 siblings, 0 replies; 5+ messages in thread
From: Ofir Bitton @ 2023-06-13 7:54 UTC
To: Oded Gabbay, dri-devel
On 12/06/2023 15:07, Oded Gabbay wrote:
Our simulator now supports the idle check, so there is no longer any
need to check whether pdev exists.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 0d02f1f7b994..5e61761b8c11 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -424,7 +424,7 @@ static void hpriv_release(struct kref *ref)
/* Check the device idle status and reset if not idle.
* Skip it if already in reset, or if device is going to be reset in any case.
*/
- if (!hdev->reset_info.in_reset && !reset_device && hdev->pdev && !hdev->pldm)
+ if (!hdev->reset_info.in_reset && !reset_device && !hdev->pldm)
device_is_idle = hdev->asic_funcs->is_device_idle(hdev, idle_mask,
HL_BUSY_ENGINES_MASK_EXT_SIZE, NULL);
if (!device_is_idle) {
Reviewed-by: Ofir Bitton <obitton@habana.ai>
* Re: [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed
2023-06-12 12:07 ` [PATCH 2/3] accel/habanalabs: reset device if scrubbing failed Oded Gabbay
@ 2023-06-13 7:54 ` Ofir Bitton
0 siblings, 0 replies; 5+ messages in thread
From: Ofir Bitton @ 2023-06-13 7:54 UTC
To: Oded Gabbay, dri-devel
On 12/06/2023 15:07, Oded Gabbay wrote:
If scrubbing memory after the user has released the device fails, the
device is in a bad state and should be reset.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
drivers/accel/habanalabs/common/device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 5e61761b8c11..d7d9198b2103 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -454,8 +454,10 @@ static void hpriv_release(struct kref *ref)
/* Scrubbing is handled within hl_device_reset(), so here need to do it directly */
int rc = hdev->asic_funcs->scrub_device_mem(hdev);
- if (rc)
+ if (rc) {
dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+ hl_device_reset(hdev, HL_DRV_RESET_HARD);
+ }
}
/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different
Reviewed-by: Ofir Bitton <obitton@habana.ai>