* [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct
@ 2022-06-23 20:42 Oded Gabbay
  2022-06-23 20:42 ` [PATCH 2/9] habanalabs/gaudi: fix warning: var might be used uninitialized Oded Gabbay
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

Move the field memory_scrub_val from struct hl_dbg_device_entry
to struct hl_device, because we want to use this field even when
debugfs is off.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/debugfs.c    | 4 ++--
 drivers/misc/habanalabs/common/habanalabs.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/habanalabs/common/debugfs.c b/drivers/misc/habanalabs/common/debugfs.c
index c6744bfc6da4..0f07c2de986b 100644
--- a/drivers/misc/habanalabs/common/debugfs.c
+++ b/drivers/misc/habanalabs/common/debugfs.c
@@ -543,7 +543,7 @@ static ssize_t hl_memory_scrub(struct file *f, const char __user *buf,
 {
 	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
 	struct hl_device *hdev = entry->hdev;
-	u64 val = entry->memory_scrub_val;
+	u64 val = hdev->memory_scrub_val;
 	int rc;
 
 	if (!hl_device_operational(hdev, NULL)) {
@@ -1516,7 +1516,7 @@ void hl_debugfs_add_device(struct hl_device *hdev)
 	debugfs_create_x64("memory_scrub_val",
 				0644,
 				dev_entry->root,
-				&dev_entry->memory_scrub_val);
+				&hdev->memory_scrub_val);
 
 	debugfs_create_file("memory_scrub",
 				0200,
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 94893305b928..931fa7092646 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -2065,7 +2065,6 @@ struct hl_debugfs_entry {
  * @addr: next address to read/write from/to in read/write32.
  * @mmu_addr: next virtual address to translate to physical address in mmu_show.
  * @userptr_lookup: the target user ptr to look up for on demand.
- * @memory_scrub_val: the value to which the dram will be scrubbed to using cb scrub_device_dram
  * @mmu_asid: ASID to use while translating in mmu_show.
  * @state_dump_head: index of the latest state dump
  * @i2c_bus: generic u8 debugfs file for bus value to use in i2c_data_read.
@@ -2096,7 +2095,6 @@ struct hl_dbg_device_entry {
 	u64				addr;
 	u64				mmu_addr;
 	u64				userptr_lookup;
-	u64				memory_scrub_val;
 	u32				mmu_asid;
 	u32				state_dump_head;
 	u8				i2c_bus;
@@ -2797,6 +2795,7 @@ struct hl_reset_info {
  * @stream_master_qid_arr: pointer to array with QIDs of master streams.
  * @fw_major_version: major version of current loaded preboot
  * @dram_used_mem: current DRAM memory consumption.
+ * @memory_scrub_val: the value to which the dram will be scrubbed to using cb scrub_device_dram
  * @timeout_jiffies: device CS timeout value.
  * @max_power: the max power of the device, as configured by the sysadmin. This
  *             value is saved so in case of hard-reset, the driver will restore
@@ -2942,6 +2941,7 @@ struct hl_device {
 	u32				*stream_master_qid_arr;
 	u32				fw_major_version;
 	atomic64_t			dram_used_mem;
+	u64				memory_scrub_val;
 	u64				timeout_jiffies;
 	u64				max_power;
 	u64				boot_error_status_mask;
-- 
2.25.1



* [PATCH 2/9] habanalabs/gaudi: fix warning: var might be used uninitialized
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 3/9] habanalabs/gaudi: fix a race condition causing DMAR error Oded Gabbay
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Koby Elbaz, kernel test robot

From: Koby Elbaz <kelbaz@habana.ai>

kernel test robot:
"warning: variable 'index' is used uninitialized whenever 'if' condition
is false"

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/gaudi/gaudi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 939d2636b9ed..172c21f262e7 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -7388,7 +7388,7 @@ static void gaudi_handle_qman_err(struct hl_device *hdev, u16 event_type, u64 *e
 		if (event_type == GAUDI_EVENT_MME0_QM) {
 			index = 0;
 			qid_base = GAUDI_QUEUE_ID_MME_0_0;
-		} else if (event_type == GAUDI_EVENT_MME2_QM) {
+		} else { /* event_type == GAUDI_EVENT_MME2_QM */
 			index = 2;
 			qid_base = GAUDI_QUEUE_ID_MME_1_0;
 		}
-- 
2.25.1
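
The warning fixed above can be illustrated with a minimal standalone sketch.
The enum and function names below are hypothetical stand-ins, not driver
code. With an "if / else if" chain and no final "else", a third event value
would leave the variable unset; turning the last branch into a plain "else"
gives every path an assignment, at the cost of relying on the caller to
pass only the two expected events (modelled here with an assert).

```c
#include <assert.h>

/* Hypothetical event IDs standing in for GAUDI_EVENT_MME0_QM / GAUDI_EVENT_MME2_QM. */
enum demo_event {
	DEMO_EVENT_MME0_QM,
	DEMO_EVENT_MME2_QM,
};

/* Returns the MME index for an event; every path assigns index. */
static int demo_mme_index(enum demo_event event_type)
{
	int index;

	/* Assumed caller invariant: only these two events reach this function. */
	assert(event_type == DEMO_EVENT_MME0_QM ||
	       event_type == DEMO_EVENT_MME2_QM);

	if (event_type == DEMO_EVENT_MME0_QM) {
		index = 0;
	} else { /* event_type == DEMO_EVENT_MME2_QM */
		index = 2;
	}

	return index;
}
```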



* [PATCH 3/9] habanalabs/gaudi: fix a race condition causing DMAR error
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
  2022-06-23 20:42 ` [PATCH 2/9] habanalabs/gaudi: fix warning: var might be used uninitialized Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 4/9] habanalabs: print if firmware is secured during load Oded Gabbay
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Yuri Nudelman

From: Yuri Nudelman <ynudelman@habana.ai>

There is a rare race condition in the CB completion mechanism that can
occur under very high pressure of command submissions.
The preconditions for this to happen are:

 1. There should be enough command submissions for the pre-allocated
    patched CB pool to run out of commands. At this stage we start
    allocating new patched CBs as they arrive.
 2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below
    a cache line end.

The flow:

 1. Two command buffers are being completed on different streams at the
    same time. Denote them CB1 and CB2.
 2. Each command buffer is injected with two messages, 16B each: one
    for a HBW update of the completion queue, another to raise an
    interrupt.
 3. Assume CB1 updated the completion queue and raised the interrupt.
 4. Assume CB2 updated the completion queue but did not raise the
    interrupt yet.
 5. The host receives the interrupt. It goes over the completion queue,
    sees two completions, CB1 and CB2, and releases them both.
 6. CB2 performs the last command. The problem is that the last command
    is split between 2 cache lines, so to read the last 8B of the last
    command, it has to access the host again. But CB2 is already
    released, which causes a DMAR error.

The solution to this problem is simply to make sure the last two
commands in the CB are always in the same cache line, using NOP padding.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/habanalabs.h |  1 +
 drivers/misc/habanalabs/common/hw_queue.c   |  1 +
 drivers/misc/habanalabs/gaudi/gaudi.c       | 46 +++++++++++++++------
 drivers/misc/habanalabs/goya/goya.c         |  4 +-
 drivers/misc/habanalabs/goya/goyaP.h        |  4 +-
 5 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 931fa7092646..44752e5954ca 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -1380,6 +1380,7 @@ struct hl_asic_funcs {
 				enum dma_data_direction dir);
 	void (*add_end_of_cb_packets)(struct hl_device *hdev,
 					void *kernel_address, u32 len,
+					u32 original_len,
 					u64 cq_addr, u32 cq_val, u32 msix_num,
 					bool eb);
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
diff --git a/drivers/misc/habanalabs/common/hw_queue.c b/drivers/misc/habanalabs/common/hw_queue.c
index 6103e479e855..32408887dd7c 100644
--- a/drivers/misc/habanalabs/common/hw_queue.c
+++ b/drivers/misc/habanalabs/common/hw_queue.c
@@ -308,6 +308,7 @@ static void ext_queue_schedule_job(struct hl_cs_job *job)
 	cq_addr = cq->bus_address + cq->pi * sizeof(struct hl_cq_entry);
 
 	hdev->asic_funcs->add_end_of_cb_packets(hdev, cb->kernel_address, len,
+						job->user_cb_size,
 						cq_addr,
 						le32_to_cpu(cq_pkt.data),
 						q->msi_vec,
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 172c21f262e7..453de3d27d0c 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -1423,6 +1423,19 @@ static int gaudi_collective_wait_init_cs(struct hl_cs *cs)
 	return 0;
 }
 
+static u32 gaudi_get_patched_cb_extra_size(u32 user_cb_size)
+{
+	u32 cacheline_end, additional_commands;
+
+	cacheline_end = round_up(user_cb_size, DEVICE_CACHE_LINE_SIZE);
+	additional_commands = sizeof(struct packet_msg_prot) * 2;
+
+	if (user_cb_size + additional_commands > cacheline_end)
+		return cacheline_end - user_cb_size + additional_commands;
+	else
+		return additional_commands;
+}
+
 static int gaudi_collective_wait_create_job(struct hl_device *hdev,
 		struct hl_ctx *ctx, struct hl_cs *cs,
 		enum hl_collective_mode mode, u32 queue_id, u32 wait_queue_id,
@@ -1443,7 +1456,7 @@ static int gaudi_collective_wait_create_job(struct hl_device *hdev,
 		 * 1 fence packet
 		 * 4 msg short packets for monitor 2 configuration
 		 * 1 fence packet
-		 * 2 msg prot packets for completion and MSI-X
+		 * 2 msg prot packets for completion and MSI
 		 */
 		cb_size = sizeof(struct packet_msg_short) * 8 +
 				sizeof(struct packet_fence) * 2 +
@@ -5337,11 +5350,13 @@ static int gaudi_validate_cb(struct hl_device *hdev,
 
 	/*
 	 * The new CB should have space at the end for two MSG_PROT packets:
-	 * 1. A packet that will act as a completion packet
-	 * 2. A packet that will generate MSI-X interrupt
+	 * 1. Optional NOP padding for cacheline alignment
+	 * 2. A packet that will act as a completion packet
+	 * 3. A packet that will generate MSI interrupt
 	 */
 	if (parser->completion)
-		parser->patched_cb_size += sizeof(struct packet_msg_prot) * 2;
+		parser->patched_cb_size += gaudi_get_patched_cb_extra_size(
+			parser->patched_cb_size);
 
 	return rc;
 }
@@ -5564,13 +5579,14 @@ static int gaudi_parse_cb_mmu(struct hl_device *hdev,
 	int rc;
 
 	/*
-	 * The new CB should have space at the end for two MSG_PROT pkt:
-	 * 1. A packet that will act as a completion packet
-	 * 2. A packet that will generate MSI interrupt
+	 * The new CB should have space at the end for two MSG_PROT packets:
+	 * 1. Optional NOP padding for cacheline alignment
+	 * 2. A packet that will act as a completion packet
+	 * 3. A packet that will generate MSI interrupt
 	 */
 	if (parser->completion)
 		parser->patched_cb_size = parser->user_cb_size +
-				sizeof(struct packet_msg_prot) * 2;
+				gaudi_get_patched_cb_extra_size(parser->user_cb_size);
 	else
 		parser->patched_cb_size = parser->user_cb_size;
 
@@ -5745,18 +5761,24 @@ static int gaudi_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser)
 		return gaudi_parse_cb_no_mmu(hdev, parser);
 }
 
-static void gaudi_add_end_of_cb_packets(struct hl_device *hdev,
-					void *kernel_address, u32 len,
-					u64 cq_addr, u32 cq_val, u32 msi_vec,
-					bool eb)
+static void gaudi_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address,
+				u32 len, u32 original_len, u64 cq_addr, u32 cq_val,
+				u32 msi_vec, bool eb)
 {
 	struct gaudi_device *gaudi = hdev->asic_specific;
 	struct packet_msg_prot *cq_pkt;
+	struct packet_nop *cq_padding;
 	u64 msi_addr;
 	u32 tmp;
 
+	cq_padding = kernel_address + original_len;
 	cq_pkt = kernel_address + len - (sizeof(struct packet_msg_prot) * 2);
 
+	while ((void *)cq_padding < (void *)cq_pkt) {
+		cq_padding->ctl = FIELD_PREP(GAUDI_PKT_CTL_OPCODE_MASK, PACKET_NOP);
+		cq_padding++;
+	}
+
 	tmp = FIELD_PREP(GAUDI_PKT_CTL_OPCODE_MASK, PACKET_MSG_PROT);
 	tmp |= FIELD_PREP(GAUDI_PKT_CTL_MB_MASK, 1);
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 64590fc55dc9..40c082cafbd7 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -4166,8 +4166,8 @@ int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser)
 }
 
 void goya_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address,
-				u32 len, u64 cq_addr, u32 cq_val, u32 msix_vec,
-				bool eb)
+				u32 len, u32 original_len, u64 cq_addr, u32 cq_val,
+				u32 msix_vec, bool eb)
 {
 	struct packet_msg_prot *cq_pkt;
 	u32 tmp;
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 647f57402616..54b5b6125df5 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -230,8 +230,8 @@ void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry);
 void *goya_get_events_stat(struct hl_device *hdev, bool aggregate, u32 *size);
 
 void goya_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address,
-				u32 len, u64 cq_addr, u32 cq_val, u32 msix_vec,
-				bool eb);
+				u32 len, u32 original_len, u64 cq_addr, u32 cq_val,
+				u32 msix_vec, bool eb);
 int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser);
 int goya_scrub_device_mem(struct hl_device *hdev, u64 addr, u64 size);
 void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
-- 
2.25.1
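
The padding arithmetic in gaudi_get_patched_cb_extra_size() can be checked
in user space with the sketch below. It is an illustration, not driver
code: the demo_* names are hypothetical, the 128B cache line is assumed
from the "128*n + 104" condition in the commit message, and the 16B packet
size is taken from the "16B each" statement there.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CACHE_LINE_SIZE	128u	/* assumed, from "128*n + 104" in the commit message */
#define DEMO_MSG_PROT_SIZE	16u	/* "16B each" in the commit message */

/* Round value up to the next multiple of align (align must be a power of two). */
static uint32_t demo_round_up(uint32_t value, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value + align - 1) & ~(align - 1);
}

/*
 * Bytes to reserve after the user CB: the two trailing packets, plus
 * padding up to the next cache line when the two packets would otherwise
 * straddle a cache line boundary.
 */
static uint32_t demo_patched_cb_extra_size(uint32_t user_cb_size)
{
	uint32_t cacheline_end = demo_round_up(user_cb_size, DEMO_CACHE_LINE_SIZE);
	uint32_t additional_commands = DEMO_MSG_PROT_SIZE * 2;

	if (user_cb_size + additional_commands > cacheline_end)
		return cacheline_end - user_cb_size + additional_commands;

	return additional_commands;
}
```

For the problematic size from the commit message, a 104B user CB gets 24B
of padding plus the 32B of packets, so both packets start at offset 128. A
96B user CB needs no padding because the two packets end exactly at the
cache line boundary.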



* [PATCH 4/9] habanalabs: print if firmware is secured during load
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
  2022-06-23 20:42 ` [PATCH 2/9] habanalabs/gaudi: fix warning: var might be used uninitialized Oded Gabbay
  2022-06-23 20:42 ` [PATCH 3/9] habanalabs/gaudi: fix a race condition causing DMAR error Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 5/9] habanalabs: don't do memory scrubbing when unmapping Oded Gabbay
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

For easier debugging, it is desirable to have a simple way
to know whether the device is secured or not, hence we print this
indication during firmware load.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/firmware_if.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/common/firmware_if.c b/drivers/misc/habanalabs/common/firmware_if.c
index bd66e4f84156..42dfbfff92fd 100644
--- a/drivers/misc/habanalabs/common/firmware_if.c
+++ b/drivers/misc/habanalabs/common/firmware_if.c
@@ -2425,7 +2425,8 @@ static int hl_fw_dynamic_init_cpu(struct hl_device *hdev,
 	int rc;
 
 	dev_info(hdev->dev,
-		"Loading firmware to device, may take some time...\n");
+		"Loading %sfirmware to device, may take some time...\n",
+		hdev->asic_prop.fw_security_enabled ? "secured " : "");
 
 	/* initialize FW descriptor as invalid */
 	fw_loader->dynamic_loader.fw_desc_valid = false;
-- 
2.25.1
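
The optional-word idiom used in the dev_info() call above can be tried in
user space with snprintf(). This is only a sketch: demo_format_load_msg()
is a hypothetical helper, and the trailing newline of the kernel message
is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the load message into buf. The "%s" slot receives either
 * "secured " (with its trailing space) or an empty string, so the
 * sentence reads naturally in both cases.
 */
static int demo_format_load_msg(char *buf, size_t len, bool fw_security_enabled)
{
	assert(buf != NULL && len > 0);

	return snprintf(buf, len,
			"Loading %sfirmware to device, may take some time...",
			fw_security_enabled ? "secured " : "");
}
```

Keeping the space inside the optional string avoids a double space in the
non-secured message.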



* [PATCH 5/9] habanalabs: don't do memory scrubbing when unmapping
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
                   ` (2 preceding siblings ...)
  2022-06-23 20:42 ` [PATCH 4/9] habanalabs: print if firmware is secured during load Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 6/9] habanalabs: don't send addr and size to scrub_device_mem cb Oded Gabbay
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

There is no longer any need to scrub memory when unmapping, since it is
pure overhead as long as we have a single user at any given time.

Remove that code and change the return value of free_phys_pg_pack to void.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/memory.c | 36 +++++--------------------
 1 file changed, 6 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index d5e6500f8a1f..039fd87021ab 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -305,33 +305,20 @@ static void dram_pg_pool_do_release(struct kref *ref)
  *
  * This function does the following:
  * - For DRAM memory only
- *   - iterate over the pack, scrub and free each physical block structure by
+ *   - iterate over the pack, free each physical block structure by
  *     returning it to the general pool.
- *     In case of error during scrubbing, initiate hard reset.
- *     Once hard reset is triggered, scrubbing is bypassed while freeing the
- *     memory continues.
  * - Free the hl_vm_phys_pg_pack structure.
  */
-static int free_phys_pg_pack(struct hl_device *hdev,
+static void free_phys_pg_pack(struct hl_device *hdev,
 				struct hl_vm_phys_pg_pack *phys_pg_pack)
 {
 	struct hl_vm *vm = &hdev->vm;
 	u64 i;
-	int rc = 0;
 
 	if (phys_pg_pack->created_from_userptr)
 		goto end;
 
 	if (phys_pg_pack->contiguous) {
-		if (hdev->memory_scrub && !hdev->disabled) {
-			rc = hdev->asic_funcs->scrub_device_mem(hdev,
-					phys_pg_pack->pages[0],
-					phys_pg_pack->total_size);
-			if (rc)
-				dev_err(hdev->dev,
-					"Failed to scrub contiguous device memory\n");
-		}
-
 		gen_pool_free(vm->dram_pg_pool, phys_pg_pack->pages[0],
 			phys_pg_pack->total_size);
 
@@ -340,15 +327,6 @@ static int free_phys_pg_pack(struct hl_device *hdev,
 				dram_pg_pool_do_release);
 	} else {
 		for (i = 0 ; i < phys_pg_pack->npages ; i++) {
-			if (hdev->memory_scrub && !hdev->disabled && rc == 0) {
-				rc = hdev->asic_funcs->scrub_device_mem(
-						hdev,
-						phys_pg_pack->pages[i],
-						phys_pg_pack->page_size);
-				if (rc)
-					dev_err(hdev->dev,
-						"Failed to scrub device memory\n");
-			}
 			gen_pool_free(vm->dram_pg_pool,
 				phys_pg_pack->pages[i],
 				phys_pg_pack->page_size);
@@ -357,14 +335,11 @@ static int free_phys_pg_pack(struct hl_device *hdev,
 		}
 	}
 
-	if (rc && !hdev->disabled)
-		hl_device_reset(hdev, HL_DRV_RESET_HARD);
-
 end:
 	kvfree(phys_pg_pack->pages);
 	kfree(phys_pg_pack);
 
-	return rc;
+	return;
 }
 
 /**
@@ -409,7 +384,8 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
 		atomic64_sub(phys_pg_pack->total_size, &ctx->dram_phys_mem);
 		atomic64_sub(phys_pg_pack->total_size, &hdev->dram_used_mem);
 
-		return free_phys_pg_pack(hdev, phys_pg_pack);
+		free_phys_pg_pack(hdev, phys_pg_pack);
+		return 0;
 	} else {
 		spin_unlock(&vm->idr_lock);
 		dev_err(hdev->dev,
@@ -1278,7 +1254,7 @@ static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args, u64 *device
 	*device_addr = ret_vaddr;
 
 	if (is_userptr)
-		rc = free_phys_pg_pack(hdev, phys_pg_pack);
+		free_phys_pg_pack(hdev, phys_pg_pack);
 
 	return rc;
 
-- 
2.25.1



* [PATCH 6/9] habanalabs: don't send addr and size to scrub_device_mem cb
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
                   ` (3 preceding siblings ...)
  2022-06-23 20:42 ` [PATCH 5/9] habanalabs: don't do memory scrubbing when unmapping Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 7/9] habanalabs/gaudi: use memory_scrub_val from debugfs Oded Gabbay
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

We use scrub_device_mem only to scrub the entire SRAM and entire
DRAM. Therefore there is no need to send addr and size
args to the callback.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/context.c    |  2 +-
 drivers/misc/habanalabs/common/habanalabs.h |  4 +-
 drivers/misc/habanalabs/gaudi/gaudi.c       | 64 ++++++++++-----------
 drivers/misc/habanalabs/goya/goya.c         |  2 +-
 drivers/misc/habanalabs/goya/goyaP.h        |  2 +-
 5 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/drivers/misc/habanalabs/common/context.c b/drivers/misc/habanalabs/common/context.c
index 64ac65d9268b..60e3e3125fbc 100644
--- a/drivers/misc/habanalabs/common/context.c
+++ b/drivers/misc/habanalabs/common/context.c
@@ -108,7 +108,7 @@ static void hl_ctx_fini(struct hl_ctx *ctx)
 		hl_encaps_sig_mgr_fini(hdev, &ctx->sig_mgr);
 
 		/* Scrub both SRAM and DRAM */
-		hdev->asic_funcs->scrub_device_mem(hdev, 0, 0);
+		hdev->asic_funcs->scrub_device_mem(hdev);
 	} else {
 		dev_dbg(hdev->dev, "closing kernel context\n");
 		hdev->asic_funcs->ctx_fini(ctx);
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 44752e5954ca..4d2f69fb4b9d 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -1248,7 +1248,7 @@ struct fw_load_mgr {
  *                           dma_free_coherent(). This is ASIC function because
  *                           its implementation is not trivial when the driver
  *                           is loaded in simulation mode (not upstreamed).
- * @scrub_device_mem: Scrub device memory given an address and size
+ * @scrub_device_mem: Scrub the entire SRAM and DRAM.
  * @scrub_device_dram: Scrub the dram memory of the device.
  * @get_int_queue_base: get the internal queue base address.
  * @test_queues: run simple test on all queues for sanity check.
@@ -1359,7 +1359,7 @@ struct hl_asic_funcs {
 					dma_addr_t *dma_handle, gfp_t flag);
 	void (*asic_dma_free_coherent)(struct hl_device *hdev, size_t size,
 					void *cpu_addr, dma_addr_t dma_handle);
-	int (*scrub_device_mem)(struct hl_device *hdev, u64 addr, u64 size);
+	int (*scrub_device_mem)(struct hl_device *hdev);
 	int (*scrub_device_dram)(struct hl_device *hdev, u64 val);
 	void* (*get_int_queue_base)(struct hl_device *hdev, u32 queue_id,
 				dma_addr_t *dma_handle, u16 *queue_len);
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 453de3d27d0c..bc5e74505d03 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -1657,7 +1657,7 @@ static int gaudi_late_init(struct hl_device *hdev)
 	}
 
 	/* Scrub both SRAM and DRAM */
-	rc = hdev->asic_funcs->scrub_device_mem(hdev, 0, 0);
+	rc = hdev->asic_funcs->scrub_device_mem(hdev);
 	if (rc)
 		goto disable_pci_access;
 
@@ -4846,51 +4846,49 @@ static int gaudi_scrub_device_dram(struct hl_device *hdev, u64 val)
 	return 0;
 }
 
-static int gaudi_scrub_device_mem(struct hl_device *hdev, u64 addr, u64 size)
+static int gaudi_scrub_device_mem(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u64 addr, size, dummy_val;
 	int rc = 0;
 	u64 val = 0;
 
 	if (!hdev->memory_scrub)
 		return 0;
 
-	if (!addr && !size) {
-		/* Wait till device is idle */
-		rc = hl_poll_timeout(
-				hdev,
-				mmDMA0_CORE_STS0/* dummy */,
-				val/* dummy */,
-				(hdev->asic_funcs->is_device_idle(hdev, NULL,
-						0, NULL)),
-						1000,
-						HBM_SCRUBBING_TIMEOUT_US);
-		if (rc) {
-			dev_err(hdev->dev, "waiting for idle timeout\n");
-			return -EIO;
-		}
+	/* Wait till device is idle */
+	rc = hl_poll_timeout(hdev,
+			mmDMA0_CORE_STS0 /* dummy */,
+			dummy_val /* dummy */,
+			(hdev->asic_funcs->is_device_idle(hdev, NULL, 0, NULL)),
+			1000,
+			HBM_SCRUBBING_TIMEOUT_US);
+	if (rc) {
+		dev_err(hdev->dev, "waiting for idle timeout\n");
+		return -EIO;
+	}
 
-		/* Scrub SRAM */
-		addr = prop->sram_user_base_address;
-		size = hdev->pldm ? 0x10000 :
-				(prop->sram_size - SRAM_USER_BASE_OFFSET);
-		val = 0x7777777777777777ull;
+	/* Scrub SRAM */
+	addr = prop->sram_user_base_address;
+	size = hdev->pldm ? 0x10000 : prop->sram_size - SRAM_USER_BASE_OFFSET;
+	val = 0x7777777777777777ull;
 
-		rc = gaudi_memset_device_memory(hdev, addr, size, val);
-		if (rc) {
-			dev_err(hdev->dev,
-				"Failed to clear SRAM in mem scrub all\n");
-			return rc;
-		}
+	dev_dbg(hdev->dev, "Scrubing SRAM: 0x%09llx - 0x%09llx val: 0x%llx\n",
+			addr, addr + size, val);
+	rc = gaudi_memset_device_memory(hdev, addr, size, val);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to clear SRAM (%d)\n", rc);
+		return rc;
+	}
 
-		/* Scrub HBM using all DMA channels in parallel */
-		rc = gaudi_scrub_device_dram(hdev, 0xdeadbeaf);
-		if (rc)
-			dev_err(hdev->dev,
-				"Failed to clear HBM in mem scrub all\n");
+	/* Scrub HBM using all DMA channels in parallel */
+	rc = gaudi_scrub_device_dram(hdev, 0xdeadbeaf);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to clear HBM (%d)\n", rc);
+		return rc;
 	}
 
-	return rc;
+	return 0;
 }
 
 static void *gaudi_get_int_queue_base(struct hl_device *hdev,
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 40c082cafbd7..25b1e3e139e8 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -3019,7 +3019,7 @@ static void goya_dma_free_coherent(struct hl_device *hdev, size_t size,
 	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, fixed_dma_handle);
 }
 
-int goya_scrub_device_mem(struct hl_device *hdev, u64 addr, u64 size)
+int goya_scrub_device_mem(struct hl_device *hdev)
 {
 	return 0;
 }
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 54b5b6125df5..d6ec43d6f6b0 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -233,7 +233,7 @@ void goya_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address,
 				u32 len, u32 original_len, u64 cq_addr, u32 cq_val,
 				u32 msix_vec, bool eb);
 int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser);
-int goya_scrub_device_mem(struct hl_device *hdev, u64 addr, u64 size);
+int goya_scrub_device_mem(struct hl_device *hdev);
 void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
 				dma_addr_t *dma_handle,	u16 *queue_len);
 u32 goya_get_dma_desc_list_size(struct hl_device *hdev, struct sg_table *sgt);
-- 
2.25.1



* [PATCH 7/9] habanalabs/gaudi: use memory_scrub_val from debugfs
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
                   ` (4 preceding siblings ...)
  2022-06-23 20:42 ` [PATCH 6/9] habanalabs: don't send addr and size to scrub_device_mem cb Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 8/9] habanalabs: move call to scrub_device_mem after ctx_fini Oded Gabbay
  2022-06-23 20:42 ` [PATCH 9/9] habanalabs: set default value for memory_scrub Oded Gabbay
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

In the callback scrub_device_mem, use 'memory_scrub_val'
from debugfs for the scrubbing value.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 Documentation/ABI/testing/debugfs-driver-habanalabs | 3 ++-
 drivers/misc/habanalabs/gaudi/gaudi.c               | 5 ++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
index 0f8d20fe343f..deb66944cd0c 100644
--- a/Documentation/ABI/testing/debugfs-driver-habanalabs
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -182,7 +182,8 @@ Date:           May 2022
 KernelVersion:  5.19
 Contact:        dhirschfeld@habana.ai
 Description:    The value to which the dram will be set to when the user
-                scrubs the dram using 'memory_scrub' debugfs file
+                scrubs the dram using 'memory_scrub' debugfs file and
+                the scrubbing value when using module param 'memory_scrub'
 
 What:           /sys/kernel/debug/habanalabs/hl<n>/mmu
 Date:           Jan 2019
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index bc5e74505d03..8cf3382fa039 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -4851,7 +4851,7 @@ static int gaudi_scrub_device_mem(struct hl_device *hdev)
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
 	u64 addr, size, dummy_val;
 	int rc = 0;
-	u64 val = 0;
+	u64 val = hdev->memory_scrub_val;
 
 	if (!hdev->memory_scrub)
 		return 0;
@@ -4871,7 +4871,6 @@ static int gaudi_scrub_device_mem(struct hl_device *hdev)
 	/* Scrub SRAM */
 	addr = prop->sram_user_base_address;
 	size = hdev->pldm ? 0x10000 : prop->sram_size - SRAM_USER_BASE_OFFSET;
-	val = 0x7777777777777777ull;
 
 	dev_dbg(hdev->dev, "Scrubing SRAM: 0x%09llx - 0x%09llx val: 0x%llx\n",
 			addr, addr + size, val);
@@ -4882,7 +4881,7 @@ static int gaudi_scrub_device_mem(struct hl_device *hdev)
 	}
 
 	/* Scrub HBM using all DMA channels in parallel */
-	rc = gaudi_scrub_device_dram(hdev, 0xdeadbeaf);
+	rc = gaudi_scrub_device_dram(hdev, val);
 	if (rc) {
 		dev_err(hdev->dev, "Failed to clear HBM (%d)\n", rc);
 		return rc;
-- 
2.25.1
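
The combined effect of moving memory_scrub_val into the device structure,
dropping the addr/size arguments, and reading the value in the callback
can be modelled with the small sketch below. Everything here is
hypothetical: struct demo_device and the tiny in-memory "SRAM" and "DRAM"
arrays only stand in for the real device, and the idle wait and DMA-based
memset of the driver are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SRAM_WORDS	4
#define DEMO_DRAM_WORDS	8

/* Hypothetical stand-in for struct hl_device. */
struct demo_device {
	bool memory_scrub;		/* scrubbing enabled? */
	uint64_t memory_scrub_val;	/* value written by the scrub */
	uint64_t sram[DEMO_SRAM_WORDS];
	uint64_t dram[DEMO_DRAM_WORDS];
};

static void demo_memset64(uint64_t *mem, size_t words, uint64_t val)
{
	size_t i;

	for (i = 0; i < words; i++)
		mem[i] = val;
}

/* Scrub all of SRAM and DRAM with the per-device value; no addr/size needed. */
static int demo_scrub_device_mem(struct demo_device *dev)
{
	uint64_t val;

	assert(dev != NULL);

	if (!dev->memory_scrub)
		return 0;

	val = dev->memory_scrub_val;
	demo_memset64(dev->sram, DEMO_SRAM_WORDS, val);
	demo_memset64(dev->dram, DEMO_DRAM_WORDS, val);

	return 0;
}
```

Because the value lives in the device structure, the same setting is used
by the scrub callback whether or not a debugfs entry exists to change it.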



* [PATCH 8/9] habanalabs: move call to scrub_device_mem after ctx_fini
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
                   ` (5 preceding siblings ...)
  2022-06-23 20:42 ` [PATCH 7/9] habanalabs/gaudi: use memory_scrub_val from debugfs Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  2022-06-23 20:42 ` [PATCH 9/9] habanalabs: set default value for memory_scrub Oded Gabbay
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

In future ASICs, it will be possible to have a non-idle
device when the context is released. We thus need to postpone the
scrubbing. Postpone it to hpriv release if a reset is not executed,
or to device late init if a reset is executed.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/context.c |  3 ---
 drivers/misc/habanalabs/common/device.c  | 16 ++++++++++++++--
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/habanalabs/common/context.c b/drivers/misc/habanalabs/common/context.c
index 60e3e3125fbc..a69c14405f41 100644
--- a/drivers/misc/habanalabs/common/context.c
+++ b/drivers/misc/habanalabs/common/context.c
@@ -106,9 +106,6 @@ static void hl_ctx_fini(struct hl_ctx *ctx)
 		hl_vm_ctx_fini(ctx);
 		hl_asid_free(hdev, ctx->asid);
 		hl_encaps_sig_mgr_fini(hdev, &ctx->sig_mgr);
-
-		/* Scrub both SRAM and DRAM */
-		hdev->asic_funcs->scrub_device_mem(hdev);
 	} else {
 		dev_dbg(hdev->dev, "closing kernel context\n");
 		hdev->asic_funcs->ctx_fini(ctx);
diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index 0f804ecb6caa..1a4f3eb941a9 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -272,9 +272,15 @@ static void hpriv_release(struct kref *ref)
 	list_del(&hpriv->dev_node);
 	mutex_unlock(&hdev->fpriv_list_lock);
 
-	if ((hdev->reset_if_device_not_idle && !device_is_idle)
-			|| hdev->reset_upon_device_release)
+	if ((hdev->reset_if_device_not_idle && !device_is_idle) ||
+		hdev->reset_upon_device_release) {
 		hl_device_reset(hdev, HL_DRV_RESET_DEV_RELEASE);
+	} else {
+		int rc = hdev->asic_funcs->scrub_device_mem(hdev);
+
+		if (rc)
+			dev_err(hdev->dev, "failed to scrub memory from hpriv release (%d)\n", rc);
+	}
 
 	/* Now we can mark the compute_ctx as not active. Even if a reset is running in a different
 	 * thread, we don't care because the in_reset is marked so if a user will try to open
@@ -1459,6 +1465,12 @@ int hl_device_reset(struct hl_device *hdev, u32 flags)
 		}
 	}
 
+	rc = hdev->asic_funcs->scrub_device_mem(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "scrub mem failed from device reset (%d)\n", rc);
+		return rc;
+	}
+
 	spin_lock(&hdev->reset_info.lock);
 	hdev->reset_info.is_in_soft_reset = false;
 
-- 
2.25.1
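
The control flow introduced by this patch can be sketched in userspace C.
This is only an illustrative model, not driver code: struct mock_dev and
the mock_* functions are hypothetical stand-ins for struct hl_device,
asic_funcs->scrub_device_mem(), hl_device_reset() and hpriv_release(), and
the reset itself is reduced to a counter. It shows that scrubbing runs
once on either path: directly from the release path when no reset is
triggered, or from the reset path otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct hl_device; only the fields used here. */
struct mock_dev {
	bool reset_if_device_not_idle;
	bool reset_upon_device_release;
	int scrub_rc;		/* value the mock scrub callback returns */
	int scrub_calls;	/* how many times scrubbing ran */
	int reset_calls;	/* how many times a reset ran */
};

/* Stand-in for hdev->asic_funcs->scrub_device_mem(hdev). */
static int mock_scrub_device_mem(struct mock_dev *dev)
{
	dev->scrub_calls++;
	return dev->scrub_rc;
}

/* Stand-in for hl_device_reset(): scrub once the reset work is done. */
static int mock_device_reset(struct mock_dev *dev)
{
	int rc;

	dev->reset_calls++;
	/* The real reset sequence is omitted in this model. */

	rc = mock_scrub_device_mem(dev);
	if (rc) {
		fprintf(stderr, "scrub mem failed from device reset (%d)\n", rc);
		return rc;
	}

	return 0;
}

/* Stand-in for hpriv_release(): reset, or scrub directly if no reset. */
static void mock_hpriv_release(struct mock_dev *dev, bool device_is_idle)
{
	assert(dev != NULL);

	if ((dev->reset_if_device_not_idle && !device_is_idle) ||
	    dev->reset_upon_device_release) {
		mock_device_reset(dev);
	} else {
		int rc = mock_scrub_device_mem(dev);

		if (rc)
			fprintf(stderr, "failed to scrub memory from hpriv release (%d)\n", rc);
	}
}
```

Calling mock_hpriv_release() on a zero-initialized struct mock_dev takes
the else branch and scrubs without resetting; setting
reset_upon_device_release first routes the scrub through
mock_device_reset() instead.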



* [PATCH 9/9] habanalabs: set default value for memory_scrub
  2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
                   ` (6 preceding siblings ...)
  2022-06-23 20:42 ` [PATCH 8/9] habanalabs: move call to scrub_device_mem after ctx_fini Oded Gabbay
@ 2022-06-23 20:42 ` Oded Gabbay
  7 siblings, 0 replies; 9+ messages in thread
From: Oded Gabbay @ 2022-06-23 20:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

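Set a default value for memory scrubbing: initialize
hdev->memory_scrub_val to MEM_SCRUB_DEFAULT_VAL (0x1122334455667788)
in hl_device_init(), before the debugfs nodes are created.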

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index 1a4f3eb941a9..34ba521e2d1a 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -15,6 +15,8 @@
 
 #define HL_RESET_DELAY_USEC		10000	/* 10ms */
 
+#define MEM_SCRUB_DEFAULT_VAL 0x1122334455667788
+
 /*
  * hl_set_dram_bar- sets the bar to allow later access to address
  *
@@ -1727,6 +1729,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 
 	hdev->asic_funcs->state_dump_init(hdev);
 
+	hdev->memory_scrub_val = MEM_SCRUB_DEFAULT_VAL;
 	hl_debugfs_add_device(hdev);
 
 	/* debugfs nodes are created in hl_ctx_init so it must be called after
-- 
2.25.1
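
A minimal userspace sketch of the effect of this patch, under stated
assumptions: struct mock_dev, mock_device_init() and mock_scrub() are
hypothetical names that are not part of the driver, and a plain array
stands in for device memory. The ULL suffix on the constant is added here
for portability of the standalone example; the patch itself defines the
macro without a suffix. The point is that the scrub value is valid from
init onward, and that a later write to the field (as debugfs allows in
the driver) changes what the next scrub writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SCRUB_DEFAULT_VAL 0x1122334455667788ULL

/* Hypothetical stand-in for struct hl_device. */
struct mock_dev {
	uint64_t memory_scrub_val;
};

/* Mirrors the assignment added to hl_device_init(). */
static void mock_device_init(struct mock_dev *dev)
{
	assert(dev != NULL);
	dev->memory_scrub_val = MEM_SCRUB_DEFAULT_VAL;
}

/* Fill a buffer of 64-bit words with the current scrub value. */
static void mock_scrub(const struct mock_dev *dev, uint64_t *mem, size_t words)
{
	size_t i;

	for (i = 0; i < words; i++)
		mem[i] = dev->memory_scrub_val;
}
```

After mock_device_init(), mock_scrub() fills every word with
0x1122334455667788; assigning a different value to memory_scrub_val
before the next call changes the fill pattern accordingly.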



Thread overview: 9+ messages
2022-06-23 20:42 [PATCH 1/9] habanalabs: move memory_scrub_val to hdev struct Oded Gabbay
2022-06-23 20:42 ` [PATCH 2/9] habanalabs/gaudi: fix warning: var might be used uninitialized Oded Gabbay
2022-06-23 20:42 ` [PATCH 3/9] habanalabs/gaudi: fix a race condition causing DMAR error Oded Gabbay
2022-06-23 20:42 ` [PATCH 4/9] habanalabs: print if firmware is secured during load Oded Gabbay
2022-06-23 20:42 ` [PATCH 5/9] habanalabs: don't do memory scrubbing when unmapping Oded Gabbay
2022-06-23 20:42 ` [PATCH 6/9] habanalabs: don't send addr and size to scrub_device_mem cb Oded Gabbay
2022-06-23 20:42 ` [PATCH 7/9] habanalabs/gaudi: use memory_scrub_val from debugfs Oded Gabbay
2022-06-23 20:42 ` [PATCH 8/9] habanalabs: move call to scrub_device_mem after ctx_fini Oded Gabbay
2022-06-23 20:42 ` [PATCH 9/9] habanalabs: set default value for memory_scrub Oded Gabbay
