dri-devel.lists.freedesktop.org archive mirror
* [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain
@ 2023-11-15 16:39 Oded Gabbay
  2023-11-15 16:39 ` [PATCH 02/10] accel/habanalabs: add log when eq event is not received Oded Gabbay
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Tomer Tayar

From: Tomer Tayar <ttayar@habana.ai>

When a PCIe AXI drain event happens, it is possible that the driver
cannot access the device through PCIe, and therefore cannot send a
hard-reset request to FW.
Starting from FW version 1.13, FW will initiate a hard-reset in such
a case without waiting for a reset request from the driver.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/habanalabs.h | 8 ++++++++
 drivers/accel/habanalabs/gaudi2/gaudi2.c     | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index 1655c101c705..5c69a482b8de 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -3594,6 +3594,14 @@ static inline bool hl_is_fw_sw_ver_below(struct hl_device *hdev, u32 fw_sw_major
 	return false;
 }
 
+static inline bool hl_is_fw_sw_ver_equal_or_greater(struct hl_device *hdev, u32 fw_sw_major,
+							u32 fw_sw_minor)
+{
+	return (hdev->fw_sw_major_ver > fw_sw_major ||
+			(hdev->fw_sw_major_ver == fw_sw_major &&
+					hdev->fw_sw_minor_ver >= fw_sw_minor));
+}
+
 /*
  * Kernel module functions that can be accessed by entire module
  */
diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 819660c684cf..b739078c2d87 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -10007,6 +10007,8 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent
 		error_count = gaudi2_handle_pcie_drain(hdev, &eq_entry->pcie_drain_ind_data);
 		reset_flags |= HL_DRV_RESET_FW_FATAL_ERR;
 		event_mask |= HL_NOTIFIER_EVENT_GENERAL_HW_ERR;
+		if (hl_is_fw_sw_ver_equal_or_greater(hdev, 1, 13))
+			is_critical = true;
 		break;
 
 	case GAUDI2_EVENT_PSOC59_RPM_ERROR_OR_DRAIN:
-- 
2.34.1
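
The version gate added above can be tried out in user space. What follows is a minimal sketch, not the driver code: `struct fw_ver` and `fw_ver_at_least()` are hypothetical stand-ins for the `fw_sw_major_ver`/`fw_sw_minor_ver` fields of `struct hl_device` and for `hl_is_fw_sw_ver_equal_or_greater()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two version fields kept in struct hl_device. */
struct fw_ver {
	uint32_t major;
	uint32_t minor;
};

/* True when cur >= major.minor: compare major first, minor only on a tie. */
static bool fw_ver_at_least(const struct fw_ver *cur, uint32_t major, uint32_t minor)
{
	return cur->major > major ||
	       (cur->major == major && cur->minor >= minor);
}
```

With this shape, a 2.0 firmware passes a 1.13 gate even though its minor number is smaller, which is why the minor comparison applies only when the major numbers are equal.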



* [PATCH 02/10] accel/habanalabs: add log when eq event is not received
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 03/10] accel/habanalabs: add support for Gaudi2C device Oded Gabbay
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Farah Kassabri

From: Farah Kassabri <fkassabri@habana.ai>

Add an error log when no EQ event is received from FW,
to cover a scenario in which FW is stuck for some reason.
In such a case the driver will receive neither the EQ error interrupt
nor the EQ heartbeat event, and will just initiate a reset without
any indication in dmesg of the reason.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 9711e8fc979d..d95a981b2906 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1049,10 +1049,12 @@ static void hl_device_eq_heartbeat(struct hl_device *hdev)
 	if (!prop->cpucp_info.eq_health_check_supported)
 		return;
 
-	if (hdev->eq_heartbeat_received)
+	if (hdev->eq_heartbeat_received) {
 		hdev->eq_heartbeat_received = false;
-	else
+	} else {
+		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
 		hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask);
+	}
 }
 
 static void hl_device_heartbeat(struct work_struct *work)
-- 
2.34.1



* [PATCH 03/10] accel/habanalabs: add support for Gaudi2C device
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
  2023-11-15 16:39 ` [PATCH 02/10] accel/habanalabs: add log when eq event is not received Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 04/10] accel/habanalabs: fix EQ heartbeat mechanism Oded Gabbay
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel

A Gaudi2 with a PCI revision ID of '3' is a Gaudi2C device and
should be detected and initialized as Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c                 | 3 +++
 drivers/accel/habanalabs/common/habanalabs.h             | 2 ++
 drivers/accel/habanalabs/common/habanalabs_drv.c         | 3 +++
 drivers/accel/habanalabs/common/mmu/mmu.c                | 1 +
 drivers/accel/habanalabs/common/sysfs.c                  | 3 +++
 drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h | 1 +
 6 files changed, 13 insertions(+)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index d95a981b2906..d9447aeb3937 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -853,6 +853,9 @@ static int device_early_init(struct hl_device *hdev)
 		gaudi2_set_asic_funcs(hdev);
 		strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name));
 		break;
+	case ASIC_GAUDI2C:
+		gaudi2_set_asic_funcs(hdev);
+		strscpy(hdev->asic_name, "GAUDI2C", sizeof(hdev->asic_name));
 		break;
 	default:
 		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index 5c69a482b8de..7b0209e5bad6 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -1262,6 +1262,7 @@ struct hl_dec {
  * @ASIC_GAUDI_SEC: Gaudi secured device (HL-2000).
  * @ASIC_GAUDI2: Gaudi2 device.
  * @ASIC_GAUDI2B: Gaudi2B device.
+ * @ASIC_GAUDI2C: Gaudi2C device.
  */
 enum hl_asic_type {
 	ASIC_INVALID,
@@ -1270,6 +1271,7 @@ enum hl_asic_type {
 	ASIC_GAUDI_SEC,
 	ASIC_GAUDI2,
 	ASIC_GAUDI2B,
+	ASIC_GAUDI2C,
 };
 
 struct hl_cs_parser;
diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c
index 35ae0ff347f5..e542fd40e16c 100644
--- a/drivers/accel/habanalabs/common/habanalabs_drv.c
+++ b/drivers/accel/habanalabs/common/habanalabs_drv.c
@@ -141,6 +141,9 @@ static enum hl_asic_type get_asic_type(struct hl_device *hdev)
 		case REV_ID_B:
 			asic_type = ASIC_GAUDI2B;
 			break;
+		case REV_ID_C:
+			asic_type = ASIC_GAUDI2C;
+			break;
 		default:
 			break;
 		}
diff --git a/drivers/accel/habanalabs/common/mmu/mmu.c b/drivers/accel/habanalabs/common/mmu/mmu.c
index b2145716c605..b654302a68fc 100644
--- a/drivers/accel/habanalabs/common/mmu/mmu.c
+++ b/drivers/accel/habanalabs/common/mmu/mmu.c
@@ -596,6 +596,7 @@ int hl_mmu_if_set_funcs(struct hl_device *hdev)
 		break;
 	case ASIC_GAUDI2:
 	case ASIC_GAUDI2B:
+	case ASIC_GAUDI2C:
 		/* MMUs in Gaudi2 are always host resident */
 		hl_mmu_v2_hr_set_funcs(hdev, &hdev->mmu_func[MMU_HR_PGT]);
 		break;
diff --git a/drivers/accel/habanalabs/common/sysfs.c b/drivers/accel/habanalabs/common/sysfs.c
index 01f89f029355..278606373055 100644
--- a/drivers/accel/habanalabs/common/sysfs.c
+++ b/drivers/accel/habanalabs/common/sysfs.c
@@ -251,6 +251,9 @@ static ssize_t device_type_show(struct device *dev,
 	case ASIC_GAUDI2B:
 		str = "GAUDI2B";
 		break;
+	case ASIC_GAUDI2C:
+		str = "GAUDI2C";
+		break;
 	default:
 		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
 				hdev->asic_type);
diff --git a/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h b/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h
index f5d497dc9bdc..4f951cada077 100644
--- a/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h
+++ b/drivers/accel/habanalabs/include/hw_ip/pci/pci_general.h
@@ -25,6 +25,7 @@ enum hl_revision_id {
 	REV_ID_INVALID				= 0x00,
 	REV_ID_A				= 0x01,
 	REV_ID_B				= 0x02,
+	REV_ID_C				= 0x03
 };
 
 #endif /* INCLUDE_PCI_GENERAL_H_ */
-- 
2.34.1



* [PATCH 04/10] accel/habanalabs: fix EQ heartbeat mechanism
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
  2023-11-15 16:39 ` [PATCH 02/10] accel/habanalabs: add log when eq event is not received Oded Gabbay
  2023-11-15 16:39 ` [PATCH 03/10] accel/habanalabs: add support for Gaudi2C device Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 05/10] accel/habanalabs/gaudi2: fix undef opcode reporting Oded Gabbay
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Farah Kassabri

From: Farah Kassabri <fkassabri@habana.ai>

Stop rescheduling another heartbeat check when the EQ heartbeat check
fails, as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index d9447aeb3937..6bf5f1d0d005 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1044,20 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
 	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
 }
 
-static void hl_device_eq_heartbeat(struct hl_device *hdev)
+static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
 {
-	u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE;
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
 
 	if (!prop->cpucp_info.eq_health_check_supported)
-		return;
+		return 0;
 
 	if (hdev->eq_heartbeat_received) {
 		hdev->eq_heartbeat_received = false;
 	} else {
 		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
-		hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask);
+		return -EIO;
 	}
+
+	return 0;
 }
 
 static void hl_device_heartbeat(struct work_struct *work)
@@ -1074,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work)
 	/*
 	 * For EQ health check need to check if driver received the heartbeat eq event
 	 * in order to validate the eq is working.
+	 * Only if both the EQ is healthy and we managed to send the next heartbeat reschedule.
 	 */
-	hl_device_eq_heartbeat(hdev);
-
-	if (!hdev->asic_funcs->send_heartbeat(hdev))
+	if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev)))
 		goto reschedule;
 
 	if (hl_device_operational(hdev, NULL))
-- 
2.34.1
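
The control flow after this change can be sketched in user space as follows. This is an illustration under stated assumptions, not the driver code: `struct dev_state`, `send_heartbeat()` and `heartbeat_should_reschedule()` are hypothetical, and the `send_heartbeat_ok` field merely stands in for the result of the `asic_funcs->send_heartbeat()` callback.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced device state; the driver keeps this in struct hl_device. */
struct dev_state {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	bool send_heartbeat_ok;	/* stands in for asic_funcs->send_heartbeat() */
};

/* Returns 0 when the EQ looks healthy (or the check is unsupported), -EIO otherwise. */
static int eq_heartbeat_check(struct dev_state *d)
{
	if (!d->eq_health_check_supported)
		return 0;

	if (!d->eq_heartbeat_received)
		return -EIO;

	/* Consume the flag so that the next period needs a fresh event. */
	d->eq_heartbeat_received = false;
	return 0;
}

/* Stub returning 0 on success, like the driver's send_heartbeat callback. */
static int send_heartbeat(const struct dev_state *d)
{
	return d->send_heartbeat_ok ? 0 : -EIO;
}

/* True when the periodic work should be rescheduled, false when a reset is due. */
static bool heartbeat_should_reschedule(struct dev_state *d)
{
	return !eq_heartbeat_check(d) && !send_heartbeat(d);
}
```

Because `&&` short-circuits, no new heartbeat is sent once the EQ check has failed, so a single failure produces a single error path rather than two.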



* [PATCH 05/10] accel/habanalabs/gaudi2: fix undef opcode reporting
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (2 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 04/10] accel/habanalabs: fix EQ heartbeat mechanism Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 06/10] accel/habanalabs: remove 'get temperature' debug print Oded Gabbay
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

Currently the undefined opcode event bit is set only for the lower CP
and only if 'write_enable' is true. It should be set unconditionally and
for all streams in order to report that event to userspace.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index b739078c2d87..5075f92d15cc 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -7929,21 +7929,19 @@ static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type
 				error_count++;
 			}
 
-		if (i == QMAN_STREAMS && error_count) {
-			/* check for undefined opcode */
-			if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK &&
-					hdev->captured_err_info.undef_opcode.write_enable) {
+		/* check for undefined opcode */
+		if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK) {
+			*event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
+			if (hdev->captured_err_info.undef_opcode.write_enable) {
 				memset(&hdev->captured_err_info.undef_opcode, 0,
 						sizeof(hdev->captured_err_info.undef_opcode));
-
-				hdev->captured_err_info.undef_opcode.write_enable = false;
 				hdev->captured_err_info.undef_opcode.timestamp = ktime_get();
 				hdev->captured_err_info.undef_opcode.engine_id =
 							gaudi2_queue_id_to_engine_id[qid_base];
-				*event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
 			}
 
-			handle_lower_qman_data_on_err(hdev, qman_base, *event_mask);
+			if (i == QMAN_STREAMS)
+				handle_lower_qman_data_on_err(hdev, qman_base, *event_mask);
 		}
 	}
 
-- 
2.34.1
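
The reporting rule introduced here (always raise the event bit, capture details only while capture is armed) can be sketched in user space. The bit values and the `struct undef_opcode_info` layout below are assumptions made for illustration; only the ordering of the steps mirrors the hunk above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ERR_STS_UNDEF_CMD	(1u << 0)	/* assumed bit, for illustration only */
#define EVENT_UNDEFINED_OPCODE	(1ull << 0)	/* assumed bit, for illustration only */

/* Hypothetical, reduced version of the captured undefined-opcode record. */
struct undef_opcode_info {
	bool write_enable;
	uint32_t engine_id;
};

static void report_undef_opcode(uint32_t glbl_sts, uint32_t engine_id,
				struct undef_opcode_info *info, uint64_t *event_mask)
{
	if (!(glbl_sts & ERR_STS_UNDEF_CMD))
		return;

	/* Always notify user space, even if details were captured earlier. */
	*event_mask |= EVENT_UNDEFINED_OPCODE;

	if (info->write_enable) {
		/* Zeroing the record also clears write_enable. */
		memset(info, 0, sizeof(*info));
		info->engine_id = engine_id;
	}
}
```

In this sketch the `memset()` leaves `write_enable` false, so only the first event overwrites the record, while every event still sets the notifier bit.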



* [PATCH 06/10] accel/habanalabs: remove 'get temperature' debug print
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (3 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 05/10] accel/habanalabs/gaudi2: fix undef opcode reporting Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 07/10] accel/habanalabs: set hard reset flag if graceful reset is skipped Oded Gabbay
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

The print was added long ago for a specific debug session and can
now be removed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/hwmon.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/accel/habanalabs/common/hwmon.c b/drivers/accel/habanalabs/common/hwmon.c
index 8598056216e7..1ee2ee07e9ed 100644
--- a/drivers/accel/habanalabs/common/hwmon.c
+++ b/drivers/accel/habanalabs/common/hwmon.c
@@ -578,10 +578,6 @@ int hl_get_temperature(struct hl_device *hdev,
 				CPUCP_PKT_CTL_OPCODE_SHIFT);
 	pkt.sensor_index = __cpu_to_le16(sensor_index);
 	pkt.type = __cpu_to_le16(attr);
-
-	dev_dbg(hdev->dev, "get temp, ctl 0x%x, sensor %d, type %d\n",
-		pkt.ctl, pkt.sensor_index, pkt.type);
-
 	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
 						0, &result);
 
-- 
2.34.1



* [PATCH 07/10] accel/habanalabs: set hard reset flag if graceful reset is skipped
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (4 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 06/10] accel/habanalabs: remove 'get temperature' debug print Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 08/10] accel/habanalabs/gaudi2: get the correct QM CQ info upon an error Oded Gabbay
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Tomer Tayar

From: Tomer Tayar <ttayar@habana.ai>

hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a graceful reset
is valid.
If the conditions for graceful reset are not met, hl_device_reset() will
be called for an immediate reset. In this case a compute reset is not
valid, so it will be replaced with a hard reset together with a debug
message about it.
This message might be confusing, as it implies that a compute reset was
requested when it shouldn't have been. To prevent this confusion, set the
hard reset flag in hl_device_cond_reset() if going to an immediate reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 6bf5f1d0d005..a365791a9f5c 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -2040,7 +2040,7 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask)
 	if (ctx)
 		hl_ctx_put(ctx);
 
-	return hl_device_reset(hdev, flags);
+	return hl_device_reset(hdev, flags | HL_DRV_RESET_HARD);
 }
 
 static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask)
-- 
2.34.1



* [PATCH 08/10] accel/habanalabs/gaudi2: get the correct QM CQ info upon an error
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (5 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 07/10] accel/habanalabs: set hard reset flag if graceful reset is skipped Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 09/10] accel/habanalabs: print error code when mapping fails Oded Gabbay
  2023-11-15 16:39 ` [PATCH 10/10] accel/habanalabs: expose module id through sysfs Oded Gabbay
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Tomer Tayar

From: Tomer Tayar <ttayar@habana.ai>

Upon a QM error, the address/size from both the CQ and the ARC_CQ are
printed, although the instruction that led to the error was received
from only one of them.

Moreover, in case of a QM undefined opcode, only one of these
address/size sets will be captured based on the value of ARC_CQ_PTR.
However, this value can be non-zero even if currently the CQ is used, in
case the CQ/ARC_CQ are alternately used.

Under the assumption of having a stop-on-error configuration, modify to
use CP_STS.CUR_CQ field to get the relevant CQ for the QM error.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c      | 44 +++++++++----------
 .../include/gaudi2/asic_reg/gaudi2_regs.h     |  1 +
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 5075f92d15cc..77c480725a84 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -7860,36 +7860,36 @@ static bool gaudi2_handle_ecc_event(struct hl_device *hdev, u16 event_type,
 
 static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u64 event_mask)
 {
-	u32 lo, hi, cq_ptr_size, arc_cq_ptr_size;
-	u64 cq_ptr, arc_cq_ptr, cp_current_inst;
-
-	lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET);
-	hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET);
-	cq_ptr = ((u64) hi) << 32 | lo;
-	cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
-
-	lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET);
-	hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET);
-	arc_cq_ptr = ((u64) hi) << 32 | lo;
-	arc_cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET);
+	u32 lo, hi, cq_ptr_size, cp_sts;
+	u64 cq_ptr, cp_current_inst;
+	bool is_arc_cq;
+
+	cp_sts = RREG32(qman_base + QM_CP_STS_4_OFFSET);
+	is_arc_cq = FIELD_GET(PDMA0_QM_CP_STS_CUR_CQ_MASK, cp_sts); /* 0 - legacy CQ, 1 - ARC_CQ */
+
+	if (is_arc_cq) {
+		lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET);
+		hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET);
+	} else {
+		lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET);
+		hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET);
+		cq_ptr = ((u64) hi) << 32 | lo;
+		cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
+	}
 
 	lo = RREG32(qman_base + QM_CP_CURRENT_INST_LO_4_OFFSET);
 	hi = RREG32(qman_base + QM_CP_CURRENT_INST_HI_4_OFFSET);
 	cp_current_inst = ((u64) hi) << 32 | lo;
 
 	dev_info(hdev->dev,
-		"LowerQM. CQ: {ptr %#llx, size %u}, ARC_CQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n",
-		cq_ptr, cq_ptr_size, arc_cq_ptr, arc_cq_ptr_size, cp_current_inst);
+		"LowerQM. %sCQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n",
+		is_arc_cq ? "ARC_" : "", cq_ptr, cq_ptr_size, cp_current_inst);
 
 	if (event_mask & HL_NOTIFIER_EVENT_UNDEFINED_OPCODE) {
-		if (arc_cq_ptr) {
-			hdev->captured_err_info.undef_opcode.cq_addr = arc_cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = arc_cq_ptr_size;
-		} else {
-			hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr;
-			hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size;
-		}
-
+		hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr;
+		hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size;
 		hdev->captured_err_info.undef_opcode.stream_id = QMAN_STREAMS;
 	}
 }
diff --git a/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h b/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
index a08378d0802b..8018214a7b59 100644
--- a/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
+++ b/drivers/accel/habanalabs/include/gaudi2/asic_reg/gaudi2_regs.h
@@ -250,6 +250,7 @@
 #define QM_ARC_CQ_PTR_HI_OFFSET		(mmPDMA0_QM_ARC_CQ_PTR_HI - mmPDMA0_QM_BASE)
 #define QM_ARC_CQ_TSIZE_OFFSET		(mmPDMA0_QM_ARC_CQ_TSIZE - mmPDMA0_QM_BASE)
 
+#define QM_CP_STS_4_OFFSET		(mmPDMA0_QM_CP_STS_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_LO_4_OFFSET	(mmPDMA0_QM_CP_CURRENT_INST_LO_4 - mmPDMA0_QM_BASE)
 #define QM_CP_CURRENT_INST_HI_4_OFFSET	(mmPDMA0_QM_CP_CURRENT_INST_HI_4 - mmPDMA0_QM_BASE)
 
-- 
2.34.1
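
The selection logic in this patch can be sketched in user space as shown below. The bit position of `CUR_CQ_MASK` is an assumption made for illustration (the driver uses `PDMA0_QM_CP_STS_CUR_CQ_MASK` with `FIELD_GET()`), and `struct cq_regs`, `struct cq_info` and `pick_active_cq()` are hypothetical names standing in for the register reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define CUR_CQ_MASK	(1u << 20)

/* Hypothetical snapshot of one queue's pointer and size registers. */
struct cq_regs {
	uint32_t ptr_lo;
	uint32_t ptr_hi;
	uint32_t tsize;
};

struct cq_info {
	uint64_t ptr;
	uint32_t size;
	bool is_arc_cq;
};

/* Combine two 32-bit register halves into one 64-bit address. */
static uint64_t make_u64(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Report only the queue that CP_STS says is currently in use. */
static struct cq_info pick_active_cq(uint32_t cp_sts, const struct cq_regs *legacy,
				     const struct cq_regs *arc)
{
	struct cq_info info;
	const struct cq_regs *src;

	info.is_arc_cq = (cp_sts & CUR_CQ_MASK) != 0;	/* 0: legacy CQ, 1: ARC_CQ */
	src = info.is_arc_cq ? arc : legacy;
	info.ptr = make_u64(src->ptr_hi, src->ptr_lo);
	info.size = src->tsize;
	return info;
}
```

Keying the choice on the status bit rather than on a non-zero ARC_CQ pointer avoids reporting a stale ARC_CQ address when the two queues are used alternately.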



* [PATCH 09/10] accel/habanalabs: print error code when mapping fails
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (6 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 08/10] accel/habanalabs/gaudi2: get the correct QM CQ info upon an error Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  2023-11-15 16:39 ` [PATCH 10/10] accel/habanalabs: expose module id through sysfs Oded Gabbay
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Dani Liberman

From: Dani Liberman <dliberman@habana.ai>

Failure to map is considered a non-trivial error and we need to notify
the user about it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/memory.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/habanalabs/common/memory.c b/drivers/accel/habanalabs/common/memory.c
index 0b8689fe0b64..3348ad12c237 100644
--- a/drivers/accel/habanalabs/common/memory.c
+++ b/drivers/accel/habanalabs/common/memory.c
@@ -955,8 +955,8 @@ static int map_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr,
 				(i + 1) == phys_pg_pack->npages);
 		if (rc) {
 			dev_err(hdev->dev,
-				"map failed for handle %u, npages: %llu, mapped: %llu",
-				phys_pg_pack->handle, phys_pg_pack->npages,
+				"map failed (%d) for handle %u, npages: %llu, mapped: %llu\n",
+				rc, phys_pg_pack->handle, phys_pg_pack->npages,
 				mapped_pg_cnt);
 			goto err;
 		}
@@ -1186,7 +1186,8 @@ static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args, u64 *device
 
 	rc = map_phys_pg_pack(ctx, ret_vaddr, phys_pg_pack);
 	if (rc) {
-		dev_err(hdev->dev, "mapping page pack failed for handle %u\n", handle);
+		dev_err(hdev->dev, "mapping page pack failed (%d) for handle %u\n",
+			rc, handle);
 		mutex_unlock(&hdev->mmu_lock);
 		goto map_err;
 	}
-- 
2.34.1



* [PATCH 10/10] accel/habanalabs: expose module id through sysfs
  2023-11-15 16:39 [PATCH 01/10] accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain Oded Gabbay
                   ` (7 preceding siblings ...)
  2023-11-15 16:39 ` [PATCH 09/10] accel/habanalabs: print error code when mapping fails Oded Gabbay
@ 2023-11-15 16:39 ` Oded Gabbay
  8 siblings, 0 replies; 10+ messages in thread
From: Oded Gabbay @ 2023-11-15 16:39 UTC
  To: dri-devel, linux-kernel; +Cc: Dani Liberman

From: Dani Liberman <dliberman@habana.ai>

Module ID exposes the physical location of the device in the server,
from the point of view of the devices with regard to how they are
connected by the internal fabric.

This information is already exposed in our INFO ioctl, but there are
utilities and scripts running in data centers which are already
accessing sysfs for topology information, and it is easier for them
to continue getting that information from sysfs instead of opening
a file descriptor.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 Documentation/ABI/testing/sysfs-driver-habanalabs |  6 ++++++
 drivers/accel/habanalabs/common/sysfs.c           | 10 ++++++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
index c63ca1ad500d..89fe3b09d4ad 100644
--- a/Documentation/ABI/testing/sysfs-driver-habanalabs
+++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
@@ -149,6 +149,12 @@ Contact:        ogabbay@kernel.org
 Description:    Displays the current clock frequency, in Hz, of the MME compute
                 engine. This property is valid only for the Goya ASIC family
 
+What:           /sys/class/accel/accel<n>/device/module_id
+Date:           Nov 2023
+KernelVersion:  not yet upstreamed
+Contact:        ogabbay@kernel.org
+Description:    Displays the device's module id
+
 What:           /sys/class/accel/accel<n>/device/pci_addr
 Date:           Jan 2019
 KernelVersion:  5.1
diff --git a/drivers/accel/habanalabs/common/sysfs.c b/drivers/accel/habanalabs/common/sysfs.c
index 278606373055..8d2164691d81 100644
--- a/drivers/accel/habanalabs/common/sysfs.c
+++ b/drivers/accel/habanalabs/common/sysfs.c
@@ -386,6 +386,14 @@ static ssize_t security_enabled_show(struct device *dev,
 	return sprintf(buf, "%d\n", hdev->asic_prop.fw_security_enabled);
 }
 
+static ssize_t module_id_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%u\n", le32_to_cpu(hdev->asic_prop.cpucp_info.card_location));
+}
+
 static DEVICE_ATTR_RO(armcp_kernel_ver);
 static DEVICE_ATTR_RO(armcp_ver);
 static DEVICE_ATTR_RO(cpld_ver);
@@ -405,6 +413,7 @@ static DEVICE_ATTR_RO(thermal_ver);
 static DEVICE_ATTR_RO(uboot_ver);
 static DEVICE_ATTR_RO(fw_os_ver);
 static DEVICE_ATTR_RO(security_enabled);
+static DEVICE_ATTR_RO(module_id);
 
 static struct bin_attribute bin_attr_eeprom = {
 	.attr = {.name = "eeprom", .mode = (0444)},
@@ -430,6 +439,7 @@ static struct attribute *hl_dev_attrs[] = {
 	&dev_attr_uboot_ver.attr,
 	&dev_attr_fw_os_ver.attr,
 	&dev_attr_security_enabled.attr,
+	&dev_attr_module_id.attr,
 	NULL,
 };
 
-- 
2.34.1
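
A user-space consumer of the new attribute might look like the sketch below. It is a minimal illustration, not part of the patch: `read_module_id()` is a hypothetical helper, and it relies only on the attribute printing an unsigned decimal followed by a newline, as `module_id_show()` does.

```c
#include <stdio.h>

/*
 * Read an unsigned decimal value from a sysfs-style text file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int read_module_id(const char *path, unsigned int *module_id)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%u", module_id) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass a path such as `/sys/class/accel/accel0/device/module_id`, where the `accel0` index is an assumption about the local system, and should treat a -1 return as "attribute not available" on kernels without this patch.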


