* [PATCH 1/4] habanalabs: improve MMU cache invalidation code
@ 2020-05-21  7:02 Oded Gabbay
  2020-05-21  7:02 ` [PATCH 2/4] habanalabs: add print for soft reset due to event Oded Gabbay
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-05-21  7:02 UTC
  To: linux-kernel, SW_Drivers; +Cc: gregkh, Omer Shpigelman

From: Omer Shpigelman <oshpigelman@habana.ai>

A new sequence is introduced to invalidate the MMU cache in order to avoid
timeouts.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/gaudi/gaudi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 4cb1f71dd4f1..093384731f0d 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -5982,16 +5982,18 @@ static void gaudi_mmu_invalidate_cache(struct hl_device *hdev, bool is_hard,
 		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
 
 	/* L0 & L1 invalidation */
-	WREG32(mmSTLB_INV_ALL_START, 1);
+	WREG32(mmSTLB_INV_PS, 2);
 
 	rc = hl_poll_timeout(
 		hdev,
-		mmSTLB_INV_ALL_START,
+		mmSTLB_INV_PS,
 		status,
 		!status,
 		1000,
 		timeout_usec);
 
+	WREG32(mmSTLB_INV_SET, 0);
+
 	if (rc)
 		dev_notice_ratelimited(hdev->dev,
 			"Timeout when waiting for MMU cache invalidation\n");
-- 
2.17.1
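
For readers following the thread, the sequence in this patch (write a trigger register, poll it until it reads zero or a timeout expires, then write a second register) can be sketched in userspace as below. This is only an illustration under stated assumptions: `struct fake_dev`, `fake_reg_read()`, `fake_reg_write()`, the `REG_*` indices and the "clears after N reads" behavior are hypothetical stand-ins, not the driver's API or the real hardware semantics. The sketch also bounds the wait by a poll count rather than by elapsed time as `hl_poll_timeout()` does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical register file standing in for device MMIO space. */
enum { REG_INV_PS, REG_INV_SET, REG_COUNT };

struct fake_dev {
	uint32_t regs[REG_COUNT];
	unsigned int busy_reads; /* non-zero reads returned before INV_PS clears (assumed) */
};

static void fake_reg_write(struct fake_dev *dev, int reg, uint32_t val)
{
	assert(reg >= 0 && reg < REG_COUNT);
	dev->regs[reg] = val;
}

static uint32_t fake_reg_read(struct fake_dev *dev, int reg)
{
	assert(reg >= 0 && reg < REG_COUNT);
	/* Simulate hardware clearing the trigger once the work is done. */
	if (reg == REG_INV_PS && dev->regs[reg] != 0) {
		if (dev->busy_reads == 0)
			dev->regs[reg] = 0;
		else
			dev->busy_reads--;
	}
	return dev->regs[reg];
}

/* Poll reg until it reads zero, giving up after max_polls reads. */
static int poll_until_clear(struct fake_dev *dev, int reg,
			    unsigned int max_polls)
{
	unsigned int i;

	for (i = 0; i < max_polls; i++)
		if (fake_reg_read(dev, reg) == 0)
			return 0;
	return -ETIMEDOUT;
}

static int invalidate_cache(struct fake_dev *dev, unsigned int max_polls)
{
	int rc;

	fake_reg_write(dev, REG_INV_PS, 2);
	rc = poll_until_clear(dev, REG_INV_PS, max_polls);
	/* Written whether or not the poll timed out, as in the patch. */
	fake_reg_write(dev, REG_INV_SET, 0);
	return rc;
}
```

A caller would treat a non-zero return from `invalidate_cache()` the way the patch treats `rc`: report the timeout and carry on.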



* [PATCH 2/4] habanalabs: add print for soft reset due to event
  2020-05-21  7:02 [PATCH 1/4] habanalabs: improve MMU cache invalidation code Oded Gabbay
@ 2020-05-21  7:02 ` Oded Gabbay
  2020-05-21  7:02 ` [PATCH 3/4] habanalabs: GAUDI does not support soft-reset Oded Gabbay
  2020-05-21  7:02 ` [PATCH 4/4] habanalabs: don't allow hard reset with open processes Oded Gabbay
  2 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-05-21  7:02 UTC
  To: linux-kernel, SW_Drivers; +Cc: gregkh, Omer Shpigelman

From: Omer Shpigelman <oshpigelman@habana.ai>

Print the event name that caused the soft reset.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/gaudi/gaudi.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 093384731f0d..3d4a569914d3 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -5843,8 +5843,12 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 		soft_reset_required = gaudi_tpc_read_interrupts(hdev,
 					tpc_dec_event_to_tpc_id(event_type),
 					"AXI_SLV_DEC_Error");
-		if (soft_reset_required)
+		if (soft_reset_required) {
+			dev_err_ratelimited(hdev->dev,
+					"soft reset required due to %s\n",
+					gaudi_irq_map_table[event_type].name);
 			hl_device_reset(hdev, false, false);
+		}
 		hl_fw_unmask_irq(hdev, event_type);
 		break;
 
@@ -5860,8 +5864,12 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 		soft_reset_required = gaudi_tpc_read_interrupts(hdev,
 					tpc_krn_event_to_tpc_id(event_type),
 					"KRN_ERR");
-		if (soft_reset_required)
+		if (soft_reset_required) {
+			dev_err_ratelimited(hdev->dev,
+					"soft reset required due to %s\n",
+					gaudi_irq_map_table[event_type].name);
 			hl_device_reset(hdev, false, false);
+		}
 		hl_fw_unmask_irq(hdev, event_type);
 		break;
 
-- 
2.17.1
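
As a rough illustration of the lookup this patch relies on, the sketch below maps an event id to a name through a table indexed by the event enum and formats the same message. All identifiers (`demo_event`, `demo_irq_map_table`, `demo_event_name()`, `demo_format_reset_msg()`) and the two event names are hypothetical; the bounds check is an addition for the standalone example and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event ids; the real driver defines many more. */
enum demo_event {
	DEMO_EVENT_TPC0_DEC,
	DEMO_EVENT_TPC0_KRN_ERR,
	DEMO_EVENT_COUNT
};

struct demo_irq_info {
	const char *name;
};

/* Table indexed directly by event id, one entry per event. */
static const struct demo_irq_info demo_irq_map_table[DEMO_EVENT_COUNT] = {
	[DEMO_EVENT_TPC0_DEC]     = { .name = "TPC0_DEC" },
	[DEMO_EVENT_TPC0_KRN_ERR] = { .name = "TPC0_KRN_ERR" },
};

static const char *demo_event_name(unsigned int event_type)
{
	/* Guard against ids outside the table (not done in the patch). */
	if (event_type >= DEMO_EVENT_COUNT)
		return "UNKNOWN";
	return demo_irq_map_table[event_type].name;
}

/* Format the reset message into buf; returns snprintf()'s result. */
static int demo_format_reset_msg(char *buf, size_t len,
				 unsigned int event_type)
{
	assert(buf != NULL);
	return snprintf(buf, len, "soft reset required due to %s\n",
			demo_event_name(event_type));
}
```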



* [PATCH 3/4] habanalabs: GAUDI does not support soft-reset
  2020-05-21  7:02 [PATCH 1/4] habanalabs: improve MMU cache invalidation code Oded Gabbay
  2020-05-21  7:02 ` [PATCH 2/4] habanalabs: add print for soft reset due to event Oded Gabbay
@ 2020-05-21  7:02 ` Oded Gabbay
  2020-05-21  7:41   ` Tomer Tayar
  2020-05-21  7:02 ` [PATCH 4/4] habanalabs: don't allow hard reset with open processes Oded Gabbay
  2 siblings, 1 reply; 5+ messages in thread
From: Oded Gabbay @ 2020-05-21  7:02 UTC
  To: linux-kernel, SW_Drivers; +Cc: gregkh

GAUDI does not support soft-reset as it leaves the NIC ports in an awkward
state, where their QMANs were reset but the NIC itself is still working.

In addition, there is not much sense in doing soft-reset when training is
done on multiple GAUDIs.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/device.c      |  6 +++++
 drivers/misc/habanalabs/gaudi/gaudi.c | 38 +++++++++++++++------------
 drivers/misc/habanalabs/goya/goya.c   |  1 +
 drivers/misc/habanalabs/habanalabs.h  |  2 ++
 drivers/misc/habanalabs/sysfs.c       |  5 ++++
 5 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 4b6c8de46dd8..4a4a446f479e 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -801,6 +801,7 @@ static void device_hard_reset_pending(struct work_struct *work)
  * @hdev: pointer to habanalabs device structure
  * @hard_reset: should we do hard reset to all engines or just reset the
  *              compute/dma engines
+ * @from_hard_reset_thread: is the caller the hard-reset thread
  *
  * Block future CS and wait for pending CS to be enqueued
  * Call ASIC H/W fini
@@ -823,6 +824,11 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 		return 0;
 	}
 
+	if ((!hard_reset) && (!hdev->supports_soft_reset)) {
+		dev_dbg(hdev->dev, "Doing hard-reset instead of soft-reset\n");
+		hard_reset = true;
+	}
+
 	/*
 	 * Prevent concurrency in this function - only one reset should be
 	 * done at any given time. Only need to perform this if we didn't
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 3d4a569914d3..92a5130f06fb 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -5774,7 +5774,7 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 	u16 event_type = ((ctl & EQ_CTL_EVENT_TYPE_MASK)
 			>> EQ_CTL_EVENT_TYPE_SHIFT);
 	u8 cause;
-	bool soft_reset_required;
+	bool reset_required;
 
 	gaudi->events_stat[event_type]++;
 	gaudi->events_stat_aggregate[event_type]++;
@@ -5840,16 +5840,18 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 	case GAUDI_EVENT_TPC6_DEC:
 	case GAUDI_EVENT_TPC7_DEC:
 		gaudi_print_irq_info(hdev, event_type, true);
-		soft_reset_required = gaudi_tpc_read_interrupts(hdev,
+		reset_required = gaudi_tpc_read_interrupts(hdev,
 					tpc_dec_event_to_tpc_id(event_type),
 					"AXI_SLV_DEC_Error");
-		if (soft_reset_required) {
-			dev_err_ratelimited(hdev->dev,
-					"soft reset required due to %s\n",
-					gaudi_irq_map_table[event_type].name);
-			hl_device_reset(hdev, false, false);
+		if (reset_required) {
+			dev_err(hdev->dev, "hard reset required due to %s\n",
+				gaudi_irq_map_table[event_type].name);
+
+			if (hdev->hard_reset_on_fw_events)
+				hl_device_reset(hdev, true, false);
+		} else {
+			hl_fw_unmask_irq(hdev, event_type);
 		}
-		hl_fw_unmask_irq(hdev, event_type);
 		break;
 
 	case GAUDI_EVENT_TPC0_KRN_ERR:
@@ -5861,16 +5863,18 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 	case GAUDI_EVENT_TPC6_KRN_ERR:
 	case GAUDI_EVENT_TPC7_KRN_ERR:
 		gaudi_print_irq_info(hdev, event_type, true);
-		soft_reset_required = gaudi_tpc_read_interrupts(hdev,
+		reset_required = gaudi_tpc_read_interrupts(hdev,
 					tpc_krn_event_to_tpc_id(event_type),
 					"KRN_ERR");
-		if (soft_reset_required) {
-			dev_err_ratelimited(hdev->dev,
-					"soft reset required due to %s\n",
-					gaudi_irq_map_table[event_type].name);
-			hl_device_reset(hdev, false, false);
+		if (reset_required) {
+			dev_err(hdev->dev, "hard reset required due to %s\n",
+				gaudi_irq_map_table[event_type].name);
+
+			if (hdev->hard_reset_on_fw_events)
+				hl_device_reset(hdev, true, false);
+		} else {
+			hl_fw_unmask_irq(hdev, event_type);
 		}
-		hl_fw_unmask_irq(hdev, event_type);
 		break;
 
 	case GAUDI_EVENT_PCIE_CORE_SERR:
@@ -5921,8 +5925,8 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 
 	case GAUDI_EVENT_RAZWI_OR_ADC_SW:
 		gaudi_print_irq_info(hdev, event_type, true);
-		hl_device_reset(hdev, false, false);
-		hl_fw_unmask_irq(hdev, event_type);
+		if (hdev->hard_reset_on_fw_events)
+			hl_device_reset(hdev, true, false);
 		break;
 
 	case GAUDI_EVENT_TPC0_BMON_SPMU:
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 15b6c3228e37..152418dfe20c 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -752,6 +752,7 @@ static int goya_sw_init(struct hl_device *hdev)
 
 	spin_lock_init(&goya->hw_queues_lock);
 	hdev->supports_coresight = true;
+	hdev->supports_soft_reset = true;
 
 	return 0;
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 5a855b7edf43..0f0691875298 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -1436,6 +1436,7 @@ struct hl_device_idle_busy_ts {
  * @stop_on_err: true if engines should stop on error.
  * @supports_sync_stream: is sync stream supported.
  * @supports_coresight: is CoreSight supported.
+ * @supports_soft_reset: is soft reset supported.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -1522,6 +1523,7 @@ struct hl_device {
 	u8				stop_on_err;
 	u8				supports_sync_stream;
 	u8				supports_coresight;
+	u8				supports_soft_reset;
 
 	/* Parameters for bring-up */
 	u8				mmu_enable;
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index e4454414d0e1..5d78d5e1c782 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -183,6 +183,11 @@ static ssize_t soft_reset_store(struct device *dev,
 		goto out;
 	}
 
+	if (!hdev->supports_soft_reset) {
+		dev_err(hdev->dev, "Device does not support soft-reset\n");
+		goto out;
+	}
+
 	dev_warn(hdev->dev, "Soft-Reset requested through sysfs\n");
 
 	hl_device_reset(hdev, false, false);
-- 
2.17.1
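
The two gates added by this patch can be sketched in isolation: the reset entry point silently promotes a soft-reset request to a hard reset when the device lacks soft-reset support, while the user-facing request path refuses outright. This is a simplified model, not the driver code: `struct rst_device`, `rst_device_reset()` and `rst_soft_reset_request()` are hypothetical names, the counters exist only so the behavior can be observed, and the real sysfs handler reports the refusal through `dev_err()` rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rst_device {
	bool supports_soft_reset;
	int hard_resets; /* bookkeeping for the example only */
	int soft_resets;
};

/* Performs a reset and returns true when it ended up being a hard reset. */
static bool rst_device_reset(struct rst_device *dev, bool hard_reset)
{
	assert(dev != NULL);

	/* Promote soft to hard when the ASIC cannot soft-reset. */
	if (!hard_reset && !dev->supports_soft_reset)
		hard_reset = true;

	if (hard_reset)
		dev->hard_resets++;
	else
		dev->soft_resets++;
	return hard_reset;
}

/* User-initiated soft reset: refused, not promoted. Returns true if done. */
static bool rst_soft_reset_request(struct rst_device *dev)
{
	assert(dev != NULL);

	if (!dev->supports_soft_reset)
		return false;
	rst_device_reset(dev, false);
	return true;
}
```

The split mirrors the patch: internal callers that ask for a soft reset still get the device reset, whereas an explicit user request for a soft reset on an unsupported device does nothing.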



* [PATCH 4/4] habanalabs: don't allow hard reset with open processes
  2020-05-21  7:02 [PATCH 1/4] habanalabs: improve MMU cache invalidation code Oded Gabbay
  2020-05-21  7:02 ` [PATCH 2/4] habanalabs: add print for soft reset due to event Oded Gabbay
  2020-05-21  7:02 ` [PATCH 3/4] habanalabs: GAUDI does not support soft-reset Oded Gabbay
@ 2020-05-21  7:02 ` Oded Gabbay
  2 siblings, 0 replies; 5+ messages in thread
From: Oded Gabbay @ 2020-05-21  7:02 UTC
  To: linux-kernel, SW_Drivers; +Cc: gregkh, Omer Shpigelman

From: Omer Shpigelman <oshpigelman@habana.ai>

When the MMU is heavily used by the engines, unmapping might take a long
time due to the full MMU cache invalidation done as part of the unmap flow.
Hence we might not be able to kill all open processes before hard
resetting the device, as killing them involves unmapping all user memory.
If we fail to kill all open processes, we should stop the hard reset flow,
as continuing might lead to a kernel crash: one thread (killing a process)
is updating MMU structures that another thread (hard reset) is freeing.
Stopping a hard reset flow leaves the device nonoperational; the user can
then initiate a hard reset via sysfs to reinitialize the device.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/device.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 4a4a446f479e..2b38a119704c 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -726,7 +726,7 @@ int hl_device_resume(struct hl_device *hdev)
 	return rc;
 }
 
-static void device_kill_open_processes(struct hl_device *hdev)
+static int device_kill_open_processes(struct hl_device *hdev)
 {
 	u16 pending_total, pending_cnt;
 	struct hl_fpriv	*hpriv;
@@ -779,9 +779,7 @@ static void device_kill_open_processes(struct hl_device *hdev)
 		ssleep(1);
 	}
 
-	if (!list_empty(&hdev->fpriv_list))
-		dev_crit(hdev->dev,
-			"Going to hard reset with open user contexts\n");
+	return list_empty(&hdev->fpriv_list) ? 0 : -EBUSY;
 }
 
 static void device_hard_reset_pending(struct work_struct *work)
@@ -908,7 +906,12 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 		 * process can't really exit until all its CSs are done, which
 		 * is what we do in cs rollback
 		 */
-		device_kill_open_processes(hdev);
+		rc = device_kill_open_processes(hdev);
+		if (rc) {
+			dev_crit(hdev->dev,
+				"Failed to kill all open processes, stopping hard reset\n");
+			goto out_err;
+		}
 
 		/* Flush the Event queue workers to make sure no other thread is
 		 * reading or writing to registers during the reset
@@ -1391,7 +1394,9 @@ void hl_device_fini(struct hl_device *hdev)
 	 * can't really exit until all its CSs are done, which is what we
 	 * do in cs rollback
 	 */
-	device_kill_open_processes(hdev);
+	rc = device_kill_open_processes(hdev);
+	if (rc)
+		dev_crit(hdev->dev, "Failed to kill all open processes\n");
 
 	hl_cb_pool_fini(hdev);
 
-- 
2.17.1
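
The control-flow change in this patch, a bounded wait that reports -EBUSY and a caller that aborts instead of tearing down shared state, can be sketched as follows. The model is deliberately crude and every name in it is hypothetical: `struct kp_device` replaces the driver's file-private list with a simple counter, and `exits_per_wait` is an assumed rate at which contexts disappear per wait round.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct kp_device {
	int open_contexts;  /* stand-in for the length of the open-context list */
	int exits_per_wait; /* contexts released per wait round (assumed) */
	bool reset_done;
};

/* Wait up to max_waits rounds; 0 if no contexts remain, -EBUSY otherwise. */
static int kp_kill_open_processes(struct kp_device *dev, int max_waits)
{
	int i;

	assert(dev != NULL);

	for (i = 0; i < max_waits && dev->open_contexts > 0; i++) {
		dev->open_contexts -= dev->exits_per_wait;
		if (dev->open_contexts < 0)
			dev->open_contexts = 0;
	}
	return dev->open_contexts == 0 ? 0 : -EBUSY;
}

static int kp_hard_reset(struct kp_device *dev, int max_waits)
{
	int rc = kp_kill_open_processes(dev, max_waits);

	/* Stop here: teardown would race with processes still exiting. */
	if (rc)
		return rc;

	dev->reset_done = true;
	return 0;
}
```

In the device-removal path the patch only logs the failure and continues; the sketch covers the reset path, where the failure stops the flow.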



* RE: [PATCH 3/4] habanalabs: GAUDI does not support soft-reset
  2020-05-21  7:02 ` [PATCH 3/4] habanalabs: GAUDI does not support soft-reset Oded Gabbay
@ 2020-05-21  7:41   ` Tomer Tayar
  0 siblings, 0 replies; 5+ messages in thread
From: Tomer Tayar @ 2020-05-21  7:41 UTC
  To: Oded Gabbay, linux-kernel, SW_Drivers; +Cc: gregkh

On Thu, May 21, 2020 at 10:02, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> GAUDI does not support soft-reset as it leaves the NIC ports in an awkward
> state, where their QMANs were reset but the NIC itself is still working.
> 
> In addition, there is not much sense in doing soft-reset when training is
> done on multiple GAUDIs.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

Reviewed-by: Tomer Tayar <ttayar@habana.ai>


