* [PATCH 1/4] habanalabs: halt device CPU only upon certain reset
@ 2020-07-10 17:36 Oded Gabbay
  2020-07-10 17:36 ` [PATCH 2/4] habanalabs: Assign each CQ with its own work queue Oded Gabbay
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-07-10 17:36 UTC
  To: linux-kernel, SW_Drivers

Currently the driver halts the device CPU in the halt engines function,
which halts all the engines of the ASIC. The problem is that if later on we
stop the reset process (due to inability to clean memory mappings in time),
the CPU will remain in halt mode. This creates many issues, such as
thermal/power control and FLR handling.

Therefore, move the halting of the device CPU to the very end of the reset
process, just before writing to the registers to initiate the reset. In
addition, the driver now needs to send a message to the device F/W to
stop it from sending interrupts to the host machine, because the driver
disables the MSI/MSI-X interrupts in the halt engines function.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/device.c              | 16 ++++++++
 drivers/misc/habanalabs/gaudi/gaudi.c         | 40 +++++++++++--------
 drivers/misc/habanalabs/goya/goya.c           | 38 +++++++++---------
 .../include/gaudi/asic_reg/gaudi_regs.h       |  1 +
 .../habanalabs/include/gaudi/gaudi_masks.h    |  3 ++
 5 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 65a5a5c52a48..df709767c7ea 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -838,6 +838,22 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 		if (rc)
 			return 0;
 
+		if (hard_reset) {
+			/* Disable PCI access from device F/W so it won't send
+			 * us additional interrupts. We disable MSI/MSI-X at
+			 * the halt_engines function and we can't have the F/W
+			 * sending us interrupts after that. We need to disable
+			 * the access here because if the device is marked
+			 * disabled, the message won't be sent. Also, in case
+			 * of heartbeat, the device CPU is marked as disabled
+			 * so this message won't be sent
+			 */
+			if (hl_fw_send_pci_access_msg(hdev,
+					ARMCP_PACKET_DISABLE_PCI_ACCESS))
+				dev_warn(hdev->dev,
+					"Failed to disable PCI access by F/W\n");
+		}
+
 		/* This also blocks future CS/VM/JOB completion operations */
 		hdev->disabled = true;
 
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 7eee4a10154b..a9fd3d352ef0 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -2578,27 +2578,16 @@ static void gaudi_disable_timestamp(struct hl_device *hdev)
 
 static void gaudi_halt_engines(struct hl_device *hdev, bool hard_reset)
 {
-	u32 wait_timeout_ms, cpu_timeout_ms;
+	u32 wait_timeout_ms;
 
 	dev_info(hdev->dev,
 		"Halting compute engines and disabling interrupts\n");
 
-	if (hdev->pldm) {
+	if (hdev->pldm)
 		wait_timeout_ms = GAUDI_PLDM_RESET_WAIT_MSEC;
-		cpu_timeout_ms = GAUDI_PLDM_RESET_WAIT_MSEC;
-	} else {
+	else
 		wait_timeout_ms = GAUDI_RESET_WAIT_MSEC;
-		cpu_timeout_ms = GAUDI_CPU_RESET_WAIT_MSEC;
-	}
 
-	/*
-	 * I don't know what is the state of the CPU so make sure it is
-	 * stopped in any means necessary
-	 */
-	WREG32(mmPSOC_GLOBAL_CONF_KMD_MSG_TO_CPU, KMD_MSG_GOTO_WFE);
-	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
-			GAUDI_EVENT_HALT_MACHINE);
-	msleep(cpu_timeout_ms);
 
 	gaudi_stop_mme_qmans(hdev);
 	gaudi_stop_tpc_qmans(hdev);
@@ -2966,17 +2955,34 @@ static int gaudi_hw_init(struct hl_device *hdev)
 static void gaudi_hw_fini(struct hl_device *hdev, bool hard_reset)
 {
 	struct gaudi_device *gaudi = hdev->asic_specific;
-	u32 status, reset_timeout_ms, boot_strap = 0;
+	u32 status, reset_timeout_ms, cpu_timeout_ms, boot_strap = 0;
 
 	if (!hard_reset) {
 		dev_err(hdev->dev, "GAUDI doesn't support soft-reset\n");
 		return;
 	}
 
-	if (hdev->pldm)
+	if (hdev->pldm) {
 		reset_timeout_ms = GAUDI_PLDM_HRESET_TIMEOUT_MSEC;
-	else
+		cpu_timeout_ms = GAUDI_PLDM_RESET_WAIT_MSEC;
+	} else {
 		reset_timeout_ms = GAUDI_RESET_TIMEOUT_MSEC;
+		cpu_timeout_ms = GAUDI_CPU_RESET_WAIT_MSEC;
+	}
+
+	/* Set device to handle FLR by H/W as we will put the device CPU to
+	 * halt mode
+	 */
+	WREG32(mmPCIE_AUX_FLR_CTRL, (PCIE_AUX_FLR_CTRL_HW_CTRL_MASK |
+					PCIE_AUX_FLR_CTRL_INT_MASK_MASK));
+
+	/* I don't know what is the state of the CPU so make sure it is
+	 * stopped by any means necessary
+	 */
+	WREG32(mmPSOC_GLOBAL_CONF_KMD_MSG_TO_CPU, KMD_MSG_GOTO_WFE);
+	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR, GAUDI_EVENT_HALT_MACHINE);
+
+	msleep(cpu_timeout_ms);
 
 	/* Tell ASIC not to re-initialize PCIe */
 	WREG32(mmPREBOOT_PCIE_EN, LKD_HARD_RESET_MAGIC);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 36db771f391c..2b0937d950c1 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -2240,29 +2240,15 @@ static void goya_disable_timestamp(struct hl_device *hdev)
 
 static void goya_halt_engines(struct hl_device *hdev, bool hard_reset)
 {
-	u32 wait_timeout_ms, cpu_timeout_ms;
+	u32 wait_timeout_ms;
 
 	dev_info(hdev->dev,
 		"Halting compute engines and disabling interrupts\n");
 
-	if (hdev->pldm) {
+	if (hdev->pldm)
 		wait_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
-		cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
-	} else {
+	else
 		wait_timeout_ms = GOYA_RESET_WAIT_MSEC;
-		cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
-	}
-
-	if (hard_reset) {
-		/*
-		 * I don't know what is the state of the CPU so make sure it is
-		 * stopped in any means necessary
-		 */
-		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
-		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
-			GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
-		msleep(cpu_timeout_ms);
-	}
 
 	goya_stop_external_queues(hdev);
 	goya_stop_internal_queues(hdev);
@@ -2567,14 +2553,26 @@ static int goya_hw_init(struct hl_device *hdev)
 static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 {
 	struct goya_device *goya = hdev->asic_specific;
-	u32 reset_timeout_ms, status;
+	u32 reset_timeout_ms, cpu_timeout_ms, status;
 
-	if (hdev->pldm)
+	if (hdev->pldm) {
 		reset_timeout_ms = GOYA_PLDM_RESET_TIMEOUT_MSEC;
-	else
+		cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
+	} else {
 		reset_timeout_ms = GOYA_RESET_TIMEOUT_MSEC;
+		cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
+	}
 
 	if (hard_reset) {
+		/* I don't know what is the state of the CPU so make sure it is
+		 * stopped by any means necessary
+		 */
+		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
+
+		msleep(cpu_timeout_ms);
+
 		goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
 		goya_disable_clk_rlx(hdev);
 		goya_set_pll_refclk(hdev);
diff --git a/drivers/misc/habanalabs/include/gaudi/asic_reg/gaudi_regs.h b/drivers/misc/habanalabs/include/gaudi/asic_reg/gaudi_regs.h
index 0c75d43532bd..f92dc53af074 100644
--- a/drivers/misc/habanalabs/include/gaudi/asic_reg/gaudi_regs.h
+++ b/drivers/misc/habanalabs/include/gaudi/asic_reg/gaudi_regs.h
@@ -292,6 +292,7 @@
 
 #define mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG                           0xC02000
 
+#define mmPCIE_AUX_FLR_CTRL                                          0xC07394
 #define mmPCIE_AUX_DBI                                               0xC07490
 
 #endif /* ASIC_REG_GAUDI_REGS_H_ */
diff --git a/drivers/misc/habanalabs/include/gaudi/gaudi_masks.h b/drivers/misc/habanalabs/include/gaudi/gaudi_masks.h
index 96f08050ef0f..13ef6b2887fd 100644
--- a/drivers/misc/habanalabs/include/gaudi/gaudi_masks.h
+++ b/drivers/misc/habanalabs/include/gaudi/gaudi_masks.h
@@ -455,4 +455,7 @@ enum axi_id {
 					QM_ARB_ERR_MSG_EN_CHOISE_WDT_MASK |\
 					QM_ARB_ERR_MSG_EN_AXI_LBW_ERR_MASK)
 
+#define PCIE_AUX_FLR_CTRL_HW_CTRL_MASK                               0x1
+#define PCIE_AUX_FLR_CTRL_INT_MASK_MASK                              0x2
+
 #endif /* GAUDI_MASKS_H_ */
-- 
2.17.1



* [PATCH 2/4] habanalabs: Assign each CQ with its own work queue
  2020-07-10 17:36 [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Oded Gabbay
@ 2020-07-10 17:36 ` Oded Gabbay
  2020-07-10 17:36 ` [PATCH 3/4] habanalabs: verify queue can contain all cs jobs Oded Gabbay
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-07-10 17:36 UTC
  To: linux-kernel, SW_Drivers; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

We identified a possible race during job completion when working
with a single multi-threaded work queue. To avoid this race, use a
single-threaded work queue per completion queue, which guarantees
that jobs complete in order.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/command_submission.c |  4 +-
 drivers/misc/habanalabs/device.c             | 39 ++++++++++++++++----
 drivers/misc/habanalabs/habanalabs.h         |  7 +++-
 drivers/misc/habanalabs/irq.c                |  2 +-
 4 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
index 1ba937b9a22e..54f2f5afdd2a 100644
--- a/drivers/misc/habanalabs/command_submission.c
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -487,10 +487,12 @@ static void cs_rollback(struct hl_device *hdev, struct hl_cs *cs)
 
 void hl_cs_rollback_all(struct hl_device *hdev)
 {
+	int i;
 	struct hl_cs *cs, *tmp;
 
 	/* flush all completions */
-	flush_workqueue(hdev->cq_wq);
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		flush_workqueue(hdev->cq_wq[i]);
 
 	/* Make sure we don't have leftovers in the H/W queues mirror list */
 	list_for_each_entry_safe(cs, tmp, &hdev->hw_queues_mirror_list,
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index df709767c7ea..84800efec10d 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -249,7 +249,8 @@ static void device_cdev_sysfs_del(struct hl_device *hdev)
  */
 static int device_early_init(struct hl_device *hdev)
 {
-	int rc;
+	int i, rc;
+	char workq_name[32];
 
 	switch (hdev->asic_type) {
 	case ASIC_GOYA:
@@ -274,11 +275,24 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		goto early_fini;
 
-	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
-	if (hdev->cq_wq == NULL) {
-		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
-		rc = -ENOMEM;
-		goto asid_fini;
+	if (hdev->asic_prop.completion_queues_count) {
+		hdev->cq_wq = kcalloc(hdev->asic_prop.completion_queues_count,
+				sizeof(*hdev->cq_wq),
+				GFP_ATOMIC);
+		if (!hdev->cq_wq) {
+			rc = -ENOMEM;
+			goto asid_fini;
+		}
+	}
+
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++) {
+		snprintf(workq_name, 32, "hl-free-jobs-%u", i);
+		hdev->cq_wq[i] = create_singlethread_workqueue(workq_name);
+		if (hdev->cq_wq[i] == NULL) {
+			dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
+			rc = -ENOMEM;
+			goto free_cq_wq;
+		}
 	}
 
 	hdev->eq_wq = alloc_workqueue("hl-events", WQ_UNBOUND, 0);
@@ -321,7 +335,10 @@ static int device_early_init(struct hl_device *hdev)
 free_eq_wq:
 	destroy_workqueue(hdev->eq_wq);
 free_cq_wq:
-	destroy_workqueue(hdev->cq_wq);
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		if (hdev->cq_wq[i])
+			destroy_workqueue(hdev->cq_wq[i]);
+	kfree(hdev->cq_wq);
 asid_fini:
 	hl_asid_fini(hdev);
 early_fini:
@@ -339,6 +356,8 @@ static int device_early_init(struct hl_device *hdev)
  */
 static void device_early_fini(struct hl_device *hdev)
 {
+	int i;
+
 	mutex_destroy(&hdev->mmu_cache_lock);
 	mutex_destroy(&hdev->debug_lock);
 	mutex_destroy(&hdev->send_cpu_message_lock);
@@ -351,7 +370,10 @@ static void device_early_fini(struct hl_device *hdev)
 	kfree(hdev->hl_chip_info);
 
 	destroy_workqueue(hdev->eq_wq);
-	destroy_workqueue(hdev->cq_wq);
+
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		destroy_workqueue(hdev->cq_wq[i]);
+	kfree(hdev->cq_wq);
 
 	hl_asid_fini(hdev);
 
@@ -1181,6 +1203,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 				"failed to initialize completion queue\n");
 			goto cq_fini;
 		}
+		hdev->completion_queue[i].cq_idx = i;
 	}
 
 	/*
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index ea0fd178accb..01fb45887a5a 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -479,6 +479,7 @@ struct hl_hw_queue {
  * @hdev: pointer to the device structure
  * @kernel_address: holds the queue's kernel virtual address
  * @bus_address: holds the queue's DMA address
+ * @cq_idx: completion queue index in array
  * @hw_queue_id: the id of the matching H/W queue
  * @ci: ci inside the queue
  * @pi: pi inside the queue
@@ -488,6 +489,7 @@ struct hl_cq {
 	struct hl_device	*hdev;
 	u64			kernel_address;
 	dma_addr_t		bus_address;
+	u32			cq_idx;
 	u32			hw_queue_id;
 	u32			ci;
 	u32			pi;
@@ -1396,7 +1398,8 @@ struct hl_device_idle_busy_ts {
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @completion_queue: array of hl_cq.
- * @cq_wq: work queue of completion queues for executing work in process context
+ * @cq_wq: work queues of completion queues for executing work in process
+ *         context.
  * @eq_wq: work queue of event queue for executing work in process context.
  * @kernel_ctx: Kernel driver context structure.
  * @kernel_queues: array of hl_hw_queue.
@@ -1492,7 +1495,7 @@ struct hl_device {
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
-	struct workqueue_struct		*cq_wq;
+	struct workqueue_struct		**cq_wq;
 	struct workqueue_struct		*eq_wq;
 	struct hl_ctx			*kernel_ctx;
 	struct hl_hw_queue		*kernel_queues;
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
index 195a5ecba0e8..c8db717023f5 100644
--- a/drivers/misc/habanalabs/irq.c
+++ b/drivers/misc/habanalabs/irq.c
@@ -119,7 +119,7 @@ irqreturn_t hl_irq_handler_cq(int irq, void *arg)
 
 		if ((shadow_index_valid) && (!hdev->disabled)) {
 			job = queue->shadow_queue[hl_pi_2_offset(shadow_index)];
-			queue_work(hdev->cq_wq, &job->finish_work);
+			queue_work(hdev->cq_wq[cq->cq_idx], &job->finish_work);
 		}
 
 		atomic_inc(&queue->ci);
-- 
2.17.1
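
The allocate-then-unwind pattern used for the per-CQ work queues can be
sketched in plain userspace C. This is a minimal model under stated
assumptions: struct fake_wq, fake_wq_create(), fake_wq_destroy() and the
fail_at parameter are hypothetical stand-ins for the kernel workqueue API,
used only so the error path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a workqueue handle. */
struct fake_wq {
	int id;
};

/* Number of queues currently allocated; lets a caller check for leaks. */
static int live_queues;

/* Hypothetical creator: fails when id == fail_at (use -1 for no failure). */
static struct fake_wq *fake_wq_create(int id, int fail_at)
{
	struct fake_wq *wq;

	if (id == fail_at)
		return NULL;

	wq = malloc(sizeof(*wq));
	if (!wq)
		return NULL;

	wq->id = id;
	live_queues++;
	return wq;
}

static void fake_wq_destroy(struct fake_wq *wq)
{
	free(wq);
	live_queues--;
}

/* Destroy every queue that was created, then free the array itself. */
static void cq_wq_free_all(struct fake_wq **wqs, int count)
{
	if (!wqs)
		return;

	for (int i = 0; i < count; i++)
		if (wqs[i])
			fake_wq_destroy(wqs[i]);
	free(wqs);
}

/*
 * Allocate one queue per completion queue. On failure, everything created
 * so far is released and NULL is returned.
 */
static struct fake_wq **cq_wq_alloc_all(int count, int fail_at)
{
	struct fake_wq **wqs;

	assert(count > 0);

	/* Zeroed array, so unused slots are NULL during unwinding. */
	wqs = calloc(count, sizeof(*wqs));
	if (!wqs)
		return NULL;

	for (int i = 0; i < count; i++) {
		wqs[i] = fake_wq_create(i, fail_at);
		/* Check the slot just filled, not the array pointer. */
		if (!wqs[i]) {
			cq_wq_free_all(wqs, count);
			return NULL;
		}
	}

	return wqs;
}
```

Because the array is zero-initialized, the unwind loop can safely walk all
slots and skip the ones that were never created.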



* [PATCH 3/4] habanalabs: verify queue can contain all cs jobs
  2020-07-10 17:36 [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Oded Gabbay
  2020-07-10 17:36 ` [PATCH 2/4] habanalabs: Assign each CQ with its own work queue Oded Gabbay
@ 2020-07-10 17:36 ` Oded Gabbay
  2020-07-10 17:36 ` [PATCH 4/4] habanalabs: check for DMA errors when clearing memory Oded Gabbay
  2020-07-13  7:24 ` [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Tomer Tayar
  3 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-07-10 17:36 UTC
  To: linux-kernel, SW_Drivers; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

To make the user aware of invalid input, return an error when the
number of jobs per CS exceeds the size of the corresponding queue.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/habanalabs.h | 4 ++++
 drivers/misc/habanalabs/hw_queue.c   | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 01fb45887a5a..14def0d26d2d 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -421,6 +421,10 @@ struct hl_cs_job;
 #define HL_QUEUE_LENGTH			4096
 #define HL_QUEUE_SIZE_IN_BYTES		(HL_QUEUE_LENGTH * HL_BD_SIZE)
 
+#if (HL_MAX_JOBS_PER_CS > HL_QUEUE_LENGTH)
+#error "HL_QUEUE_LENGTH must be greater than or equal to HL_MAX_JOBS_PER_CS"
+#endif
+
 /* HL_CQ_LENGTH is in units of struct hl_cq_entry */
 #define HL_CQ_LENGTH			HL_QUEUE_LENGTH
 #define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
index 474a0e8a7797..287681646071 100644
--- a/drivers/misc/habanalabs/hw_queue.c
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -158,6 +158,13 @@ static int int_queue_sanity_checks(struct hl_device *hdev,
 {
 	int free_slots_cnt;
 
+	if (num_of_entries > q->int_queue_len) {
+		dev_err(hdev->dev,
+			"Cannot populate queue %u with %u jobs\n",
+			q->hw_queue_id, num_of_entries);
+		return -ENOMEM;
+	}
+
 	/* Check we have enough space in the queue */
 	free_slots_cnt = queue_free_slots(q, q->int_queue_len);
 
-- 
2.17.1
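
The difference between "can never fit" and "does not fit right now" can be
sketched as follows. This is a simplified userspace model, not the driver's
code: struct model_queue and model_queue_check() are hypothetical, the queue
tracks a plain "used" counter instead of producer/consumer indices, and the
-EAGAIN return for a temporarily full queue is an assumption made for the
illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical queue model: total length and number of occupied slots. */
struct model_queue {
	unsigned int len;
	unsigned int used;
};

/*
 * Returns 0 if num_of_entries can be queued now, -ENOMEM if the request is
 * larger than the whole queue (it can never succeed), or -EAGAIN if the
 * queue is merely too full at the moment (assumed transient condition).
 */
static int model_queue_check(const struct model_queue *q,
			     unsigned int num_of_entries)
{
	assert(q->used <= q->len);

	/* Request exceeds the queue size itself: invalid input. */
	if (num_of_entries > q->len)
		return -ENOMEM;

	/* Request is valid but there are not enough free slots right now. */
	if (num_of_entries > q->len - q->used)
		return -EAGAIN;

	return 0;
}
```

Checking the total length first means an oversized request is reported as an
input error instead of looking like a busy queue.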



* [PATCH 4/4] habanalabs: check for DMA errors when clearing memory
  2020-07-10 17:36 [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Oded Gabbay
  2020-07-10 17:36 ` [PATCH 2/4] habanalabs: Assign each CQ with its own work queue Oded Gabbay
  2020-07-10 17:36 ` [PATCH 3/4] habanalabs: verify queue can contain all cs jobs Oded Gabbay
@ 2020-07-10 17:36 ` Oded Gabbay
  2020-07-13  7:24 ` [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Tomer Tayar
  3 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-07-10 17:36 UTC
  To: linux-kernel, SW_Drivers; +Cc: Moti Haimovski

From: Moti Haimovski <mhaimovski@habana.ai>

In GAUDI we use QMAN0 DMA for clearing the MMU memory region
at initialization. If this operation fails, it places the DMA in an error
state, and then when trying to initialize QMAN0 we fail and erroneously
assume it's the QMAN that failed.

This commit adds a check and clear of such DMA errors at initialization so
we will have a better understanding of what went wrong.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/gaudi/gaudi.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index a9fd3d352ef0..57b2b9392cb2 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -4253,7 +4253,7 @@ static int gaudi_memset_device_memory(struct hl_device *hdev, u64 addr,
 {
 	struct packet_lin_dma *lin_dma_pkt;
 	struct hl_cs_job *job;
-	u32 cb_size, ctl;
+	u32 cb_size, ctl, err_cause;
 	struct hl_cb *cb;
 	int rc;
 
@@ -4282,6 +4282,15 @@ static int gaudi_memset_device_memory(struct hl_device *hdev, u64 addr,
 		goto release_cb;
 	}
 
+	/* Verify DMA is OK */
+	err_cause = RREG32(mmDMA0_CORE_ERR_CAUSE);
+	if (err_cause && !hdev->init_done) {
+		dev_dbg(hdev->dev,
+			"Clearing DMA0 engine from errors (cause 0x%x)\n",
+			err_cause);
+		WREG32(mmDMA0_CORE_ERR_CAUSE, err_cause);
+	}
+
 	job->id = 0;
 	job->user_cb = cb;
 	job->user_cb->cs_cnt++;
@@ -4293,11 +4302,23 @@ static int gaudi_memset_device_memory(struct hl_device *hdev, u64 addr,
 	hl_debugfs_add_job(hdev, job);
 
 	rc = gaudi_send_job_on_qman0(hdev, job);
-
 	hl_debugfs_remove_job(hdev, job);
 	kfree(job);
 	cb->cs_cnt--;
 
+	/* Verify DMA is OK */
+	err_cause = RREG32(mmDMA0_CORE_ERR_CAUSE);
+	if (err_cause) {
+		dev_err(hdev->dev, "DMA Failed, cause 0x%x\n", err_cause);
+		rc = -EIO;
+		if (!hdev->init_done) {
+			dev_dbg(hdev->dev,
+				"Clearing DMA0 engine from errors (cause 0x%x)\n",
+				err_cause);
+			WREG32(mmDMA0_CORE_ERR_CAUSE, err_cause);
+		}
+	}
+
 release_cb:
 	hl_cb_put(cb);
 	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
-- 
2.17.1
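
The check-and-clear flow around the DMA job can be sketched with a simulated
register. This is a minimal userspace model under stated assumptions: the
register is a plain variable with write-1-to-clear behavior (an assumption
suggested by the patch writing the read value back), and reg_read(),
reg_write(), run_job() and memset_with_dma_check() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated error-cause register (assumed write-1-to-clear). */
static uint32_t fake_err_cause;

static uint32_t reg_read(void)
{
	return fake_err_cause;
}

static void reg_write(uint32_t val)
{
	/* Writing a 1 to a bit clears it. */
	fake_err_cause &= ~val;
}

/*
 * Hypothetical job: job_fault is OR-ed into the register to simulate a DMA
 * error raised while the job runs. The job itself always reports success.
 */
static int run_job(uint32_t job_fault)
{
	fake_err_cause |= job_fault;
	return 0;
}

/* Returns 0 on success or -EIO if the DMA reported an error. */
static int memset_with_dma_check(uint32_t job_fault, bool init_done)
{
	uint32_t err_cause;
	int rc;

	/* Clear stale errors before the job, during initialization only. */
	err_cause = reg_read();
	if (err_cause && !init_done)
		reg_write(err_cause);

	rc = run_job(job_fault);

	/* Any error present after the job fails the operation. */
	err_cause = reg_read();
	if (err_cause) {
		rc = -EIO;
		if (!init_done)
			reg_write(err_cause);
	}

	return rc;
}
```

In this model a leftover error from before the job no longer masks the
result during initialization, while an error raised by the job itself is
reported as -EIO.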



* RE: [PATCH 1/4] habanalabs: halt device CPU only upon certain reset
  2020-07-10 17:36 [PATCH 1/4] habanalabs: halt device CPU only upon certain reset Oded Gabbay
                   ` (2 preceding siblings ...)
  2020-07-10 17:36 ` [PATCH 4/4] habanalabs: check for DMA errors when clearing memory Oded Gabbay
@ 2020-07-13  7:24 ` Tomer Tayar
  3 siblings, 0 replies; 5+ messages in thread
From: Tomer Tayar @ 2020-07-13  7:24 UTC
  To: Oded Gabbay, linux-kernel, SW_Drivers

On Fri, Jul 10, 2020 at 20:37 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> Currently the driver halts the device CPU in the halt engines function,
> which halts all the engines of the ASIC. The problem is that if later on we
> stop the reset process (due to inability to clean memory mappings in time),
> the CPU will remain in halt mode. This creates many issues, such as
> thermal/power control and FLR handling.
> 
> Therefore, move the halting of the device CPU to the very end of the reset
> process, just before writing to the registers to initiate the reset. In
> addition, the driver now needs to send a message to the device F/W to
> stop it from sending interrupts to the host machine, because the driver
> disables the MSI/MSI-X interrupts in the halt engines function.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

Reviewed-by: Tomer Tayar <ttayar@habana.ai>

