* [PATCH 00/22] *** bug fix series ***
@ 2018-02-26  5:17 Monk Liu
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:17 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

*** These patches are mainly bug fixes for SR-IOV plus a couple of bare-metal cases ***

Emily Deng (2):
  drm/amdgpu: Remove the memory leak after unload amdgpu driver
  drm/amdgpu: Correct sdma_v4 get_wptr

Monk Liu (20):
  drm/amdgpu: fix&cleanups for wb_clear
  drm/amdgpu: remove duplicated job_free_resources
  drm/amdgpu: skip ECC for SRIOV in gmc late_init
  drm/amdgpu: only flush hotplug work without DC
  drm/amdgpu: cond_exec only for schedule with a job
  drm/amdgpu: don't use MM idle_work for SRIOV
  drm/amdgpu: fix fence_emit NULL pointer ref risk
  drm/amdgpu: cleanups VCE/UVD ib test part
  drm/amdgpu: fix kmd reload bug on bare-metal
  drm/amdgpu: no need to count emitted for SRIOV
  drm/amdgpu: send rel event first after init failed
  drm/amdgpu: fix vce_ring test memleak
  drm/amdgpu: change gfx9 ib test to use WB
  drm/amdgpu: move WB_FREE to correct place
  drm/amdgpu: cleanup SA init and fini
  drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini
  drm/amdgpu: adjust timeout for ib_ring_tests
  drm/amdgpu: increase gart size to 512MB
  dma-buf/reservation: shouldn't kfree staged when slot available
  drm/amdgpu: fix reservation obj shared count bug

 drivers/dma-buf/reservation.c              |   7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  46 +++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |  38 +++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c     |  61 ++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    |  19 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    |  59 +++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h    |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |   5 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 110 ++++++++++++++++-------------
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |   4 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     |  10 +--
 15 files changed, 190 insertions(+), 183 deletions(-)

-- 
2.7.4

* [PATCH 01/22] drm/amdgpu: fix&cleanups for wb_clear
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  5:17   ` Monk Liu
       [not found]     ` <1519622300-8990-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 02/22] drm/amdgpu: remove duplicated job_free_resources Monk Liu
                     ` (18 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:17 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

fix:
do the right shift on wb before the bounds check and the bit clear

cleanups:
1) memset the whole wb buffer (all slots, 8 dwords each), not just part of it
2) reduce the max wb number to 128 (4KB in total), which is big enough
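
For reference, a simplified sketch of the slot <-> dword-offset convention
the fix relies on (illustrative only, not the exact driver code; each
bitmap slot covers 8 dwords of write-back memory):

#include <linux/bitops.h>
#include <linux/types.h>

#define WB_DWORDS_PER_SLOT 8

static int wb_get(unsigned long *used, unsigned int num_wb, u32 *wb)
{
	unsigned long slot = find_first_zero_bit(used, num_wb);

	if (slot >= num_wb)
		return -EINVAL;
	__set_bit(slot, used);
	*wb = slot * WB_DWORDS_PER_SLOT;	/* caller gets a dword offset */
	return 0;
}

static void wb_free(unsigned long *used, unsigned int num_wb, u32 wb)
{
	u32 slot = wb / WB_DWORDS_PER_SLOT;	/* convert back before the check */

	if (slot < num_wb)
		__clear_bit(slot, used);
}

With 128 slots of 8 dwords each, the backing buffer is 128 * 8 * 4 = 4096
bytes, which is why the memset below covers AMDGPU_MAX_WB * sizeof(uint32_t) * 8.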

Change-Id: I43832245c875ce039e7709dc049828e21c50c81f
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 10e16d6..5bddfc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1072,7 +1072,7 @@ static inline void amdgpu_set_ib_value(struct amdgpu_cs_parser *p,
 /*
  * Writeback
  */
-#define AMDGPU_MAX_WB 512	/* Reserve at most 512 WB slots for amdgpu-owned rings. */
+#define AMDGPU_MAX_WB 128	/* Reserve at most 128 WB slots for amdgpu-owned rings. */
 
 struct amdgpu_wb {
 	struct amdgpu_bo	*wb_obj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 59b69a2..54145ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -492,7 +492,7 @@ static int amdgpu_device_wb_init(struct amdgpu_device *adev)
 		memset(&adev->wb.used, 0, sizeof(adev->wb.used));
 
 		/* clear wb memory */
-		memset((char *)adev->wb.wb, 0, AMDGPU_MAX_WB * sizeof(uint32_t));
+		memset((char *)adev->wb.wb, 0, AMDGPU_MAX_WB * sizeof(uint32_t) * 8);
 	}
 
 	return 0;
@@ -530,8 +530,9 @@ int amdgpu_device_wb_get(struct amdgpu_device *adev, u32 *wb)
  */
 void amdgpu_device_wb_free(struct amdgpu_device *adev, u32 wb)
 {
+	wb >>= 3;
 	if (wb < adev->wb.num_wb)
-		__clear_bit(wb >> 3, adev->wb.used);
+		__clear_bit(wb, adev->wb.used);
 }
 
 /**
-- 
2.7.4

* [PATCH 02/22] drm/amdgpu: remove duplicated job_free_resources
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:17   ` [PATCH 01/22] drm/amdgpu: fix&cleanups for wb_clear Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 03/22] drm/amdgpu: skip ECC for SRIOV in gmc late_init Monk Liu
                     ` (17 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

if a job goes through the scheduler, it is forbidden to call
job_free_resources right after ib_schedule, because that would
assign a wild fence pointer to sa_bo->fence and lead to
hard-to-debug problems; the scheduler path already frees the
job's resources later, so this call is a duplicate anyway

Change-Id: Iad7ee011c641cb7357569cbce36fdc10f0ed911d
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 2bd5676..7cb3437 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -207,7 +207,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	dma_fence_put(job->fence);
 	job->fence = dma_fence_get(fence);
 
-	amdgpu_job_free_resources(job);
 	return fence;
 }
 
-- 
2.7.4

* [PATCH 03/22] drm/amdgpu: skip ECC for SRIOV in gmc late_init
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:17   ` [PATCH 01/22] drm/amdgpu: fix&cleanups for wb_clear Monk Liu
  2018-02-26  5:18   ` [PATCH 02/22] drm/amdgpu: remove duplicated job_free_resources Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 04/22] drm/amdgpu: only flush hotplug work without DC Monk Liu
                     ` (16 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: I2865a11d1dded214de289787d334ec4a22b5db19
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index bc4bd5e..6839d93 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -673,7 +673,7 @@ static int gmc_v9_0_late_init(void *handle)
 	for(i = 0; i < AMDGPU_MAX_VMHUBS; ++i)
 		BUG_ON(vm_inv_eng[i] > 16);
 
-	if (adev->asic_type == CHIP_VEGA10) {
+	if (adev->asic_type == CHIP_VEGA10 && !amdgpu_sriov_vf(adev)) {
 		r = gmc_v9_0_ecc_available(adev);
 		if (r == 1) {
 			DRM_INFO("ECC is active.\n");
-- 
2.7.4

* [PATCH 04/22] drm/amdgpu: only flush hotplug work without DC
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 03/22] drm/amdgpu: skip ECC for SRIOV in gmc late_init Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 05/22] drm/amdgpu: cond_exec only for schedule with a job Monk Liu
                     ` (15 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

hotplug_work is only initialized when there is no DC support,
so only flush it in that case

Change-Id: I0d417a5b9f8dfb1863eafc95b6802be9e5c74720
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index f6f2a66..11dfe57 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -208,7 +208,8 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
 	r = drm_irq_install(adev->ddev, adev->ddev->pdev->irq);
 	if (r) {
 		adev->irq.installed = false;
-		flush_work(&adev->hotplug_work);
+		if (!amdgpu_device_has_dc_support(adev))
+			flush_work(&adev->hotplug_work);
 		cancel_work_sync(&adev->reset_work);
 		return r;
 	}
@@ -234,7 +235,8 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
 		adev->irq.installed = false;
 		if (adev->irq.msi_enabled)
 			pci_disable_msi(adev->pdev);
-		flush_work(&adev->hotplug_work);
+		if (!amdgpu_device_has_dc_support(adev))
+			flush_work(&adev->hotplug_work);
 		cancel_work_sync(&adev->reset_work);
 	}
 
-- 
2.7.4

* [PATCH 05/22] drm/amdgpu: cond_exec only for schedule with a job
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (3 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 04/22] drm/amdgpu: only flush hotplug work without DC Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 06/22] drm/amdgpu: don't use MM idle_work for SRIOV Monk Liu
                     ` (14 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

issue:
under SR-IOV the IB test sometimes fails on the GFX ring

fix:
with COND_EXEC inserted in the ring buffer, the GFX engine skips
part of the packets if RLCV issues a PREEMPT while the engine has
not yet reached the COND_EXEC packet. That is fine for regular
commands from the UMD, but the IB test's DMA frame does not
support PREEMPT at all, so the COND_EXEC must be removed from it.
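
For context, the direct IB-test path submits without a job, so the check
added here naturally leaves COND_EXEC out for it; an illustrative call,
matching how the ring IB tests invoke the scheduling helper:

	/* direct submission from an IB test: no job, hence no COND_EXEC */
	r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);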

Change-Id: I73bd0479bf1307139ff3726a00c312526054aa6d
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 8ea342d..7f2c314 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -181,7 +181,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
 		}
 	}
 
-	if (ring->funcs->init_cond_exec)
+	if (job && ring->funcs->init_cond_exec)
 		patch_offset = amdgpu_ring_init_cond_exec(ring);
 
 #ifdef CONFIG_X86_64
-- 
2.7.4

* [PATCH 06/22] drm/amdgpu: don't use MM idle_work for SRIOV
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (4 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 05/22] drm/amdgpu: cond_exec only for schedule with a job Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 07/22] drm/amdgpu: fix fence_emit NULL pointer ref risk Monk Liu
                     ` (13 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

two reasons for this patch:
1) SRIOV doesn't expose the CG/PG features to the VF, so the
idle_work is useless there

2) VCE's idle work makes the "KMD reload" test fail and triggers
a kernel panic: the idle work fires after the VCE IB test, and if
the KMD is removed right after loading, the timer callback can
hit a page fault on already-freed driver structures.
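
The change boils down to guarding the idle-work machinery with an SR-IOV
check; for example, simplified from the diff below:

	void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
	{
		if (!amdgpu_sriov_vf(ring->adev))
			schedule_delayed_work(&ring->adev->vce.idle_work,
					      VCE_IDLE_TIMEOUT);
	}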

ISSUE resolved:

SWDEV-143244:

[  275.021338] RAX: 6f150001c3808df1 RBX: ffff95667fd91e70 RCX: ffff9566586878b8
[  275.023000] RDX: ffff95667fd83ee8 RSI: ffff95667fd91e28 RDI: ffff95667fd83ef0
[  275.024660] RBP: ffff95667fd83f58 R08: ffff95667fd92238 R09: ffff95667fd83ef0
[  275.026303] R10: 0000000000000000 R11: 0000000000000070 R12: ffff95667fd91e00
[  275.027951] R13: ffff95667fd83ee8 R14: 0000000000000000 R15: ffff9566586878b8
[  275.029600] FS:  0000000000000000(0000) GS:ffff95667fd80000(0000) knlGS:0000000000000000
[  275.031896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  275.033367] CR2: 0000560972e27300 CR3: 000000011fe09000 CR4: 00000000001406e0
[  275.035095] Call Trace:
[  275.036032]  <IRQ>
[  275.036909]  ? ktime_get+0x3b/0xa0
[  275.037997]  ? native_apic_msr_write+0x2b/0x30
[  275.039259]  ? lapic_next_event+0x1d/0x30
[  275.040461]  ? clockevents_program_event+0x7d/0xf0
[  275.041723]  __do_softirq+0xed/0x278
[  275.042767]  irq_exit+0xf3/0x100
[  275.043740]  smp_apic_timer_interrupt+0x3d/0x50
[  275.044917]  apic_timer_interrupt+0x93/0xa0
[  275.046041] RIP: 0010:native_safe_halt+0x6/0x10
[  275.047212] RSP: 0018:ffffaeac8069bea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  275.049330] RAX: 0000000000080000 RBX: ffff95667a6b1e00 RCX: 0000000000000000
[  275.052920] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  275.054560] RBP: ffffaeac8069bea0 R08: 0000000000000006 R09: 0000000000000001
[  275.056148] R10: 000000000001fee5 R11: 0000000000000000 R12: 0000000000000003
[  275.057751] R13: ffff95667a6b1e00 R14: 0000000000000000 R15: 0000000000000000
[  275.059340]  </IRQ>
[  275.060138]  default_idle+0x1e/0x100
[  275.061183]  arch_cpu_idle+0xf/0x20
[  275.062184]  default_idle_call+0x23/0x30
[  275.063256]  do_idle+0x16a/0x1d0
[  275.064223]  cpu_startup_entry+0x1d/0x20
[  275.065282]  start_secondary+0xfd/0x120
[  275.066276]  secondary_startup_64+0x9f/0x9f
[  275.067300] Code: 4c 8d 6c c5 90 49 8b 45 00 48 85 c0 74 e1 4d 8b 7d 00 4d 89 7c 24 08 0f 1f 44 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 20 48 b8 00 02 00 00 00 00
[  275.071284] RIP: run_timer_softirq+0x138/0x440 RSP: ffff95667fd83ee0
[  275.072572] ---[ end trace a5789e583cb451a7 ]---
[  275.073626] Kernel panic - not syncing: Fatal exception in interrupt
[  275.075001] Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  275.077381] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[  275.078753] sched: Unexpected reschedule of offline CPU#0!
[  275.079918] ------------[ cut here ]------------
[  275.082724] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x3c/0x40
[  275.084856] Modules linked in: amdgpu(OE-) chash ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hw]
[  275.094230] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D W  OE   4.13.0-sriov-17.50 #1
[  275.096303] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  275.098575] task: ffff95667a6b1e00 task.stack: ffffaeac80698000
[  275.099883] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
[  275.101168] RSP: 0018:ffff95667fd83b08 EFLAGS: 00010096
[  275.102381] RAX: 000000000000002e RBX: ffff9566791c9e00 RCX: 000000000000083f
[  275.103846] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f

Change-Id: I6dd7ea48d23b0fee74ecb9e93b53bfe36b0e8164
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 14 +++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 11 +++++++----
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 9cd5517..caba610 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -123,7 +123,8 @@ int amdgpu_uvd_sw_init(struct amdgpu_device *adev)
 	unsigned version_major, version_minor, family_id;
 	int i, r;
 
-	INIT_DELAYED_WORK(&adev->uvd.idle_work, amdgpu_uvd_idle_work_handler);
+	if (!amdgpu_sriov_vf(adev))
+		INIT_DELAYED_WORK(&adev->uvd.idle_work, amdgpu_uvd_idle_work_handler);
 
 	switch (adev->asic_type) {
 #ifdef CONFIG_DRM_AMDGPU_CIK
@@ -297,7 +298,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 	if (adev->uvd.vcpu_bo == NULL)
 		return 0;
 
-	cancel_delayed_work_sync(&adev->uvd.idle_work);
+	if (!amdgpu_sriov_vf(adev))
+		cancel_delayed_work_sync(&adev->uvd.idle_work);
 
 	for (i = 0; i < adev->uvd.max_handles; ++i)
 		if (atomic_read(&adev->uvd.handles[i]))
@@ -1117,7 +1119,7 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
 	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
 
 	if (amdgpu_sriov_vf(adev))
-		return;
+		BUG();
 
 	if (fences == 0) {
 		if (adev->pm.dpm_enabled) {
@@ -1138,11 +1140,12 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
 void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
 {
 	struct amdgpu_device *adev = ring->adev;
-	bool set_clocks = !cancel_delayed_work_sync(&adev->uvd.idle_work);
+	bool set_clocks;
 
 	if (amdgpu_sriov_vf(adev))
 		return;
 
+	set_clocks = !cancel_delayed_work_sync(&adev->uvd.idle_work);
 	if (set_clocks) {
 		if (adev->pm.dpm_enabled) {
 			amdgpu_dpm_enable_uvd(adev, true);
@@ -1158,7 +1161,8 @@ void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
 
 void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
 {
-	schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
+	if (!amdgpu_sriov_vf(ring->adev))
+		schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index d274ae5..0877fc39 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -187,7 +187,8 @@ int amdgpu_vce_sw_init(struct amdgpu_device *adev, unsigned long size)
 		adev->vce.filp[i] = NULL;
 	}
 
-	INIT_DELAYED_WORK(&adev->vce.idle_work, amdgpu_vce_idle_work_handler);
+	if (!amdgpu_sriov_vf(adev))
+		INIT_DELAYED_WORK(&adev->vce.idle_work, amdgpu_vce_idle_work_handler);
 	mutex_init(&adev->vce.idle_mutex);
 
 	return 0;
@@ -241,7 +242,8 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
 	if (i == AMDGPU_MAX_VCE_HANDLES)
 		return 0;
 
-	cancel_delayed_work_sync(&adev->vce.idle_work);
+	if (!amdgpu_sriov_vf(adev))
+		cancel_delayed_work_sync(&adev->vce.idle_work);
 	/* TODO: suspending running encoding sessions isn't supported */
 	return -EINVAL;
 }
@@ -301,7 +303,7 @@ static void amdgpu_vce_idle_work_handler(struct work_struct *work)
 	unsigned i, count = 0;
 
 	if (amdgpu_sriov_vf(adev))
-		return;
+		BUG();
 
 	for (i = 0; i < adev->vce.num_rings; i++)
 		count += amdgpu_fence_count_emitted(&adev->vce.ring[i]);
@@ -362,7 +364,8 @@ void amdgpu_vce_ring_begin_use(struct amdgpu_ring *ring)
  */
 void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
 {
-	schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
+	if (!amdgpu_sriov_vf(ring->adev))
+		schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
 }
 
 /**
-- 
2.7.4

* [PATCH 07/22] drm/amdgpu: fix fence_emit NULL pointer ref risk
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (5 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 06/22] drm/amdgpu: don't use MM idle_work for SRIOV Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 08/22] drm/amdgpu: cleanups VCE/UVD ib test part Monk Liu
                     ` (12 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: Ic39de73338d515d4e33a828f445d8a3c3ba544bb
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 008e198..401d710 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -163,7 +163,8 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f)
 
 	rcu_assign_pointer(*ptr, dma_fence_get(&fence->base));
 
-	*f = &fence->base;
+	if (f)
+		*f = &fence->base;
 
 	return 0;
 }
-- 
2.7.4

* [PATCH 08/22] drm/amdgpu: cleanups VCE/UVD ib test part
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (6 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 07/22] drm/amdgpu: fix fence_emit NULL pointer ref risk Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal Monk Liu
                     ` (11 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1) amdgpu_vce_get_create_msg is only used by the IB test, so
there is no reason not to make it a static routine; also add a
timeout parameter to it.

2) the fence handling in the MM IB test code is a little messy;
clean it up to make it easier to read.

Change-Id: Ic9bfd9971457600266096e114210e84ce9b4347d
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 43 ++++++++++++++++++---------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h |  2 --
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 0877fc39..a829350 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -405,8 +405,8 @@ void amdgpu_vce_free_handles(struct amdgpu_device *adev, struct drm_file *filp)
  *
  * Open up a stream for HW test
  */
-int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
-			      struct dma_fence **fence)
+static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
+			      struct dma_fence **fence, long timeout)
 {
 	const unsigned ib_size_dw = 1024;
 	struct amdgpu_job *job;
@@ -461,19 +461,25 @@ int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
 		ib->ptr[i] = 0x0;
 
 	r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
-	job->fence = dma_fence_get(f);
 	if (r)
+		return r;
+	r = dma_fence_wait_timeout(f, false, timeout);
+	if (r == 0) {
+		DRM_ERROR("amdgpu: VCE IB test get_create_msg timed out.\n");
+		r = -ETIMEDOUT;
 		goto err;
-
-	amdgpu_job_free(job);
-	if (fence)
-		*fence = dma_fence_get(f);
-	dma_fence_put(f);
-	return 0;
+	} else if (r < 0) {
+		DRM_ERROR("amdgpu: VCE IB test get_create_msg fence wait failed (%ld).\n", r);
+		goto err;
+	} else {
+		r = 0;
+	}
 
 err:
-	amdgpu_job_free(job);
+	amdgpu_ib_free(ring->adev, ib, NULL);
+	dma_fence_put(f);
 	return r;
+
 }
 
 /**
@@ -523,7 +529,7 @@ int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
 		ib->ptr[i] = 0x0;
 
 	if (direct) {
-		r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
+		r = amdgpu_ib_schedule(ring, 1, ib, NULL, fence);
 		job->fence = dma_fence_get(f);
 		if (r)
 			goto err;
@@ -531,14 +537,11 @@ int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
 		amdgpu_job_free(job);
 	} else {
 		r = amdgpu_job_submit(job, ring, &ring->adev->vce.entity,
-				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
+				      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
 		if (r)
 			goto err;
 	}
 
-	if (fence)
-		*fence = dma_fence_get(f);
-	dma_fence_put(f);
 	return 0;
 
 err:
@@ -1075,13 +1078,13 @@ int amdgpu_vce_ring_test_ring(struct amdgpu_ring *ring)
 int amdgpu_vce_ring_test_ib(struct amdgpu_ring *ring, long timeout)
 {
 	struct dma_fence *fence = NULL;
-	long r;
+	long r = 0;
 
 	/* skip vce ring1/2 ib test for now, since it's not reliable */
 	if (ring != &ring->adev->vce.ring[0])
 		return 0;
 
-	r = amdgpu_vce_get_create_msg(ring, 1, NULL);
+	r = amdgpu_vce_get_create_msg(ring, 1, NULL, timeout);
 	if (r) {
 		DRM_ERROR("amdgpu: failed to get create msg (%ld).\n", r);
 		goto error;
@@ -1095,12 +1098,12 @@ int amdgpu_vce_ring_test_ib(struct amdgpu_ring *ring, long timeout)
 
 	r = dma_fence_wait_timeout(fence, false, timeout);
 	if (r == 0) {
-		DRM_ERROR("amdgpu: IB test timed out.\n");
+		DRM_ERROR("amdgpu: IB test get_destory_msg timed out.\n");
 		r = -ETIMEDOUT;
 	} else if (r < 0) {
-		DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
+		DRM_ERROR("amdgpu: IB test get_destory_msg fence wait failed (%ld).\n", r);
 	} else {
-		DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
+		DRM_DEBUG("ib test get_destory_msg on ring %d succeeded\n", ring->idx);
 		r = 0;
 	}
 error:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
index 7178126..60cb657 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
@@ -57,8 +57,6 @@ int amdgpu_vce_sw_init(struct amdgpu_device *adev, unsigned long size);
 int amdgpu_vce_sw_fini(struct amdgpu_device *adev);
 int amdgpu_vce_suspend(struct amdgpu_device *adev);
 int amdgpu_vce_resume(struct amdgpu_device *adev);
-int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
-			      struct dma_fence **fence);
 int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
 			       bool direct, struct dma_fence **fence);
 void amdgpu_vce_free_handles(struct amdgpu_device *adev, struct drm_file *filp);
-- 
2.7.4

* [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (7 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 08/22] drm/amdgpu: cleanups VCE/UVD ib test part Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV Monk Liu
                     ` (10 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

issue:
on bare-metal, when running the KMD reload test, there is a
chance the kernel hits a fatal error after the driver is
unloaded/reloaded

fix:
the cause is that the "idle work" items are not really stopped;
if the KMD is unloaded too quickly, an "idle work" instance can
still run after the driver structures have already been released,
which introduces this issue. Keep cancelling until the work is
really gone.
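
A minimal sketch of the drain pattern used here, for a delayed work item
that may re-arm itself (names are illustrative, not the driver's):

#include <linux/workqueue.h>
#include <linux/sched.h>

static void idle_handler(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);

	/* ... clock/power gating decisions ... */
	schedule_delayed_work(dwork, HZ);	/* the handler may re-queue itself */
}

static void drain_idle_work(struct delayed_work *dwork)
{
	/*
	 * cancel_delayed_work_sync() returns true as long as it found a
	 * pending instance; keep looping until nothing is queued anymore,
	 * so the handler cannot fire after driver data is freed.
	 */
	while (cancel_delayed_work_sync(dwork))
		schedule();
}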

Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 54145ec..69fb5e50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
 		}
 	}
 
-	mod_delayed_work(system_wq, &adev->late_init_work,
+	if (!amdgpu_sriov_vf(adev))
+		mod_delayed_work(system_wq, &adev->late_init_work,
 			msecs_to_jiffies(AMDGPU_RESUME_MS));
 
 	amdgpu_device_fill_reset_magic(adev);
@@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 		adev->firmware.gpu_info_fw = NULL;
 	}
 	adev->accel_working = false;
-	cancel_delayed_work_sync(&adev->late_init_work);
+
+	if (!amdgpu_sriov_vf(adev))
+		while (cancel_delayed_work_sync(&adev->late_init_work))
+			schedule(); /* to make sure late_init_work really stopped */
+
 	/* free i2c buses */
 	if (!amdgpu_device_has_dc_support(adev))
 		amdgpu_i2c_fini(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index caba610..337db57 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 		return 0;
 
 	if (!amdgpu_sriov_vf(adev))
-		cancel_delayed_work_sync(&adev->uvd.idle_work);
+		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
+			schedule(); /* to make sure idle work really stopped */
 
 	for (i = 0; i < adev->uvd.max_handles; ++i)
 		if (atomic_read(&adev->uvd.handles[i]))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index a829350..2874fda 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
 		return 0;
 
 	if (!amdgpu_sriov_vf(adev))
-		cancel_delayed_work_sync(&adev->vce.idle_work);
+		while (cancel_delayed_work_sync(&adev->vce.idle_work))
+			schedule(); /* to make sure the idle_work really stopped */
+
 	/* TODO: suspending running encoding sessions isn't supported */
 	return -EINVAL;
 }
-- 
2.7.4

* [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (8 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver Monk Liu
                     ` (9 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: I370966acd0f1925a99dfde888678e6e0fd093b15
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 337db57..5fb4357 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1117,11 +1117,13 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
 {
 	struct amdgpu_device *adev =
 		container_of(work, struct amdgpu_device, uvd.idle_work.work);
-	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
+	unsigned fences;
 
 	if (amdgpu_sriov_vf(adev))
 		BUG();
 
+	fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
+
 	if (fences == 0) {
 		if (adev->pm.dpm_enabled) {
 			amdgpu_dpm_enable_uvd(adev, false);
-- 
2.7.4

* [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (9 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 12/22] drm/amdgpu: send rel event first after init failed Monk Liu
                     ` (8 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Emily Deng

From: Emily Deng <Emily.Deng@amd.com>

Need to call amdgpu_ucode_fini_bo() to release the ucode BO when
the PSP firmware load type is used.

Change-Id: I1c7be8135993e11076c9d46b3cd87615514a9abb
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 69fb5e50..61696c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1490,6 +1490,7 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 
 	/* disable all interrupts */
 	amdgpu_irq_disable_all(adev);
+	amdgpu_ucode_fini_bo(adev);
 
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.sw)
-- 
2.7.4

* [PATCH 12/22] drm/amdgpu: send rel event first after init failed
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (10 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 13/22] drm/amdgpu: fix vce_ring test memleak Monk Liu
                     ` (7 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

and only later send req_fini and rel_fini to the host as part of
the teardown routine

Change-Id: Ib0347a305ab5f7712d2d76b1a843bb2429acbf3d
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 61696c7..f6380ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1308,7 +1308,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 		if (r) {
 			DRM_ERROR("sw_init of IP block <%s> failed %d\n",
 				  adev->ip_blocks[i].version->funcs->name, r);
-			return r;
+			goto init_failed;
 		}
 		adev->ip_blocks[i].status.sw = true;
 
@@ -1317,17 +1317,17 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 			r = amdgpu_device_vram_scratch_init(adev);
 			if (r) {
 				DRM_ERROR("amdgpu_vram_scratch_init failed %d\n", r);
-				return r;
+				goto init_failed;
 			}
 			r = adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
 			if (r) {
 				DRM_ERROR("hw_init %d failed %d\n", i, r);
-				return r;
+				goto init_failed;
 			}
 			r = amdgpu_device_wb_init(adev);
 			if (r) {
 				DRM_ERROR("amdgpu_device_wb_init failed %d\n", r);
-				return r;
+				goto init_failed;
 			}
 			adev->ip_blocks[i].status.hw = true;
 
@@ -1336,7 +1336,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 				r = amdgpu_allocate_static_csa(adev);
 				if (r) {
 					DRM_ERROR("allocate CSA failed %d\n", r);
-					return r;
+					goto init_failed;
 				}
 			}
 		}
@@ -1351,17 +1351,17 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 		if (r) {
 			DRM_ERROR("hw_init of IP block <%s> failed %d\n",
 				  adev->ip_blocks[i].version->funcs->name, r);
-			return r;
+			goto init_failed;
 		}
 		adev->ip_blocks[i].status.hw = true;
 	}
 
 	amdgpu_amdkfd_device_init(adev);
-
+init_failed:
 	if (amdgpu_sriov_vf(adev))
 		amdgpu_virt_release_full_gpu(adev, true);
 
-	return 0;
+	return r;
 }
 
 static void amdgpu_device_fill_reset_magic(struct amdgpu_device *adev)
@@ -1978,7 +1978,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		}
 		dev_err(adev->dev, "amdgpu_device_ip_init failed\n");
 		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
-		amdgpu_device_ip_fini(adev);
+		if (!amdgpu_virt_request_full_gpu(adev, false))
+			amdgpu_device_ip_fini(adev);
 		goto failed;
 	}
 
-- 
2.7.4

* [PATCH 13/22] drm/amdgpu: fix vce_ring test memleak
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (11 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 12/22] drm/amdgpu: send rel event first after init failed Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-14-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 14/22] drm/amdgpu: change gfx9 ib test to use WB Monk Liu
                     ` (6 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: I35a343b21a007716fc7811781650264339c94273
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 2874fda..4ae7cb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -463,8 +463,10 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
 		ib->ptr[i] = 0x0;
 
 	r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
-	if (r)
-		return r;
+	if (r) {
+		DRM_ERROR("failed to ib_schedule VCE get_create_msg\n");
+		goto err;
+	}
 	r = dma_fence_wait_timeout(f, false, timeout);
 	if (r == 0) {
 		DRM_ERROR("amdgpu: VCE IB test get_create_msg timed out.\n");
@@ -478,6 +480,7 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
 	}
 
 err:
+	amdgpu_job_free(job);
 	amdgpu_ib_free(ring->adev, ib, NULL);
 	dma_fence_put(f);
 	return r;
-- 
2.7.4

* [PATCH 14/22] drm/amdgpu: change gfx9 ib test to use WB
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (12 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 13/22] drm/amdgpu: fix vce_ring test memleak Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-15-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 15/22] drm/amdgpu: move WB_FREE to correct place Monk Liu
                     ` (5 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

two reasons to switch the IB test from the SCRATCH register
method to the WB method:

1) during the IB test we don't want the result to depend on KIQ
health; since SCRATCH register access goes through KIQ, the GFX
IB test would fail whenever KIQ fails.

2) accessing a SCRATCH register costs much more time than the WB
method, because the access runs through KIQ and can only start
after the GPU world-switches back to the current guest VF.

Change-Id: Iac7ef394c1b3aef9f9eca0ea4cb0f889227801d5
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 107 ++++++++++++++++++----------------
 1 file changed, 57 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 848008e..e9cc03e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -271,58 +271,65 @@ static int gfx_v9_0_ring_test_ring(struct amdgpu_ring *ring)
 
 static int gfx_v9_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
 {
-        struct amdgpu_device *adev = ring->adev;
-        struct amdgpu_ib ib;
-        struct dma_fence *f = NULL;
-        uint32_t scratch;
-        uint32_t tmp = 0;
-        long r;
-
-        r = amdgpu_gfx_scratch_get(adev, &scratch);
-        if (r) {
-                DRM_ERROR("amdgpu: failed to get scratch reg (%ld).\n", r);
-                return r;
-        }
-        WREG32(scratch, 0xCAFEDEAD);
-        memset(&ib, 0, sizeof(ib));
-        r = amdgpu_ib_get(adev, NULL, 256, &ib);
-        if (r) {
-                DRM_ERROR("amdgpu: failed to get ib (%ld).\n", r);
-                goto err1;
-        }
-        ib.ptr[0] = PACKET3(PACKET3_SET_UCONFIG_REG, 1);
-        ib.ptr[1] = ((scratch - PACKET3_SET_UCONFIG_REG_START));
-        ib.ptr[2] = 0xDEADBEEF;
-        ib.length_dw = 3;
-
-        r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);
-        if (r)
-                goto err2;
-
-        r = dma_fence_wait_timeout(f, false, timeout);
-        if (r == 0) {
-                DRM_ERROR("amdgpu: IB test timed out.\n");
-                r = -ETIMEDOUT;
-                goto err2;
-        } else if (r < 0) {
-                DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
-                goto err2;
-        }
-        tmp = RREG32(scratch);
-        if (tmp == 0xDEADBEEF) {
-                DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
-                r = 0;
-        } else {
-                DRM_ERROR("amdgpu: ib test failed (scratch(0x%04X)=0x%08X)\n",
-                          scratch, tmp);
-                r = -EINVAL;
-        }
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_ib ib;
+	struct dma_fence *f = NULL;
+
+	unsigned index;
+	uint64_t gpu_addr;
+	uint32_t tmp;
+	long r;
+
+	r = amdgpu_device_wb_get(adev, &index);
+	if (r) {
+		dev_err(adev->dev, "(%ld) failed to allocate wb slot\n", r);
+		return r;
+	}
+
+	gpu_addr = adev->wb.gpu_addr + (index * 4);
+	adev->wb.wb[index] = cpu_to_le32(0xCAFEDEAD);
+	memset(&ib, 0, sizeof(ib));
+	r = amdgpu_ib_get(adev, NULL, 16, &ib);
+	if (r) {
+		DRM_ERROR("amdgpu: failed to get ib (%ld).\n", r);
+		goto err1;
+	}
+	ib.ptr[0] = PACKET3(PACKET3_WRITE_DATA, 3);
+	ib.ptr[1] = WRITE_DATA_DST_SEL(5) | WR_CONFIRM;
+	ib.ptr[2] = lower_32_bits(gpu_addr);
+	ib.ptr[3] = upper_32_bits(gpu_addr);
+	ib.ptr[4] = 0xDEADBEEF;
+	ib.length_dw = 5;
+
+	r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);
+	if (r)
+		goto err2;
+
+	r = dma_fence_wait_timeout(f, false, timeout);
+	if (r == 0) {
+			DRM_ERROR("amdgpu: IB test timed out.\n");
+			r = -ETIMEDOUT;
+			goto err2;
+	} else if (r < 0) {
+			DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
+			goto err2;
+	}
+
+	tmp = adev->wb.wb[index];
+	if (tmp == 0xDEADBEEF) {
+			DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
+			r = 0;
+	} else {
+			DRM_ERROR("ib test on ring %d failed\n", ring->idx);
+			r = -EINVAL;
+	}
+
 err2:
-        amdgpu_ib_free(adev, &ib, NULL);
-        dma_fence_put(f);
+	amdgpu_ib_free(adev, &ib, NULL);
+	dma_fence_put(f);
 err1:
-        amdgpu_gfx_scratch_free(adev, scratch);
-        return r;
+	amdgpu_device_wb_free(adev, index);
+	return r;
 }
 
 
-- 
2.7.4

* [PATCH 15/22] drm/amdgpu: move WB_FREE to correct place
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (13 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 14/22] drm/amdgpu: change gfx9 ib test to use WB Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-16-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 16/22] drm/amdgpu: cleanup SA init and fini Monk Liu
                     ` (4 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

WB_FREE should be done only after all engines' hw_fini has
completed, otherwise the now-invalid wptr/rptr addresses would
still be used by the engines, which triggers abnormal bugs.

This fixes a couple of DMAR read errors seen on the host side for
SRIOV after the guest KMD is unloaded.

Change-Id: If721cfd06d8c3113929306378793713fd05fc929
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f6380ed..730ff97 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1460,11 +1460,6 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.hw)
 			continue;
-		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
-			amdgpu_free_static_csa(adev);
-			amdgpu_device_wb_fini(adev);
-			amdgpu_device_vram_scratch_fini(adev);
-		}
 
 		if (adev->ip_blocks[i].version->type != AMD_IP_BLOCK_TYPE_UVD &&
 			adev->ip_blocks[i].version->type != AMD_IP_BLOCK_TYPE_VCE) {
@@ -1495,6 +1490,13 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.sw)
 			continue;
+
+		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
+			amdgpu_free_static_csa(adev);
+			amdgpu_device_wb_fini(adev);
+			amdgpu_device_vram_scratch_fini(adev);
+		}
+
 		r = adev->ip_blocks[i].version->funcs->sw_fini((void *)adev);
 		/* XXX handle errors */
 		if (r) {
-- 
2.7.4

* [PATCH 16/22] drm/amdgpu: cleanup SA init and fini
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (14 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 15/22] drm/amdgpu: move WB_FREE to correct place Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-17-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini Monk Liu
                     ` (3 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

use amdgpu_bo_create_kernel() instead of splitting the SA BO
setup into two separate functions that create and then pin the BO

issue:
before this patch there are DMAR read errors on the host side
when running the SRIOV test, and the faulting DMAR addresses fall
within the range of the SA BO.

fix:
after this cleanup of SA init and fini, the DMAR errors above are
gone.
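
The consolidated helper pairs naturally with its teardown counterpart; a
simplified illustration of the intended usage (error handling elided):

	/* create, pin and map the SA backing BO in one step */
	r = amdgpu_bo_create_kernel(adev, size, align, domain,
				    &sa_manager->bo,
				    &sa_manager->gpu_addr,
				    &sa_manager->cpu_ptr);

	/* ...use the suballocator... */

	/* unmap, unpin and free it again on teardown */
	amdgpu_bo_free_kernel(&sa_manager->bo,
			      &sa_manager->gpu_addr,
			      &sa_manager->cpu_ptr);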

Change-Id: I3f299a3342bd7263776bff69e4b31b0d3816749a
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c |  6 ----
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c | 61 +++-------------------------------
 2 files changed, 4 insertions(+), 63 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 7f2c314..4709d13 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -279,11 +279,6 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
 		return r;
 	}
 
-	r = amdgpu_sa_bo_manager_start(adev, &adev->ring_tmp_bo);
-	if (r) {
-		return r;
-	}
-
 	adev->ib_pool_ready = true;
 	if (amdgpu_debugfs_sa_init(adev)) {
 		dev_err(adev->dev, "failed to register debugfs file for SA\n");
@@ -303,7 +298,6 @@ void amdgpu_ib_pool_fini(struct amdgpu_device *adev)
 {
 	if (adev->ib_pool_ready) {
 		amdgpu_sa_bo_manager_suspend(adev, &adev->ring_tmp_bo);
-		amdgpu_sa_bo_manager_fini(adev, &adev->ring_tmp_bo);
 		adev->ib_pool_ready = false;
 	}
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
index 5ca75a4..695c639 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
@@ -63,80 +63,27 @@ int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
 	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
 		INIT_LIST_HEAD(&sa_manager->flist[i]);
 
-	r = amdgpu_bo_create(adev, size, align, true, domain,
-			     0, NULL, NULL, &sa_manager->bo);
+	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
+				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
 	if (r) {
 		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
-	return r;
-}
-
-void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
-			       struct amdgpu_sa_manager *sa_manager)
-{
-	struct amdgpu_sa_bo *sa_bo, *tmp;
-
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		amdgpu_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		amdgpu_sa_bo_remove_locked(sa_bo);
-	}
-	amdgpu_bo_unref(&sa_manager->bo);
-	sa_manager->size = 0;
-}
-
-int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
-			       struct amdgpu_sa_manager *sa_manager)
-{
-	int r;
-
-	if (sa_manager->bo == NULL) {
-		dev_err(adev->dev, "no bo for sa manager\n");
-		return -EINVAL;
-	}
-
-	/* map the buffer */
-	r = amdgpu_bo_reserve(sa_manager->bo, false);
-	if (r) {
-		dev_err(adev->dev, "(%d) failed to reserve manager bo\n", r);
-		return r;
-	}
-	r = amdgpu_bo_pin(sa_manager->bo, sa_manager->domain, &sa_manager->gpu_addr);
-	if (r) {
-		amdgpu_bo_unreserve(sa_manager->bo);
-		dev_err(adev->dev, "(%d) failed to pin manager bo\n", r);
-		return r;
-	}
-	r = amdgpu_bo_kmap(sa_manager->bo, &sa_manager->cpu_ptr);
 	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
-	amdgpu_bo_unreserve(sa_manager->bo);
 	return r;
 }
 
 int amdgpu_sa_bo_manager_suspend(struct amdgpu_device *adev,
 				 struct amdgpu_sa_manager *sa_manager)
 {
-	int r;
-
 	if (sa_manager->bo == NULL) {
 		dev_err(adev->dev, "no bo for sa manager\n");
 		return -EINVAL;
 	}
 
-	r = amdgpu_bo_reserve(sa_manager->bo, true);
-	if (!r) {
-		amdgpu_bo_kunmap(sa_manager->bo);
-		amdgpu_bo_unpin(sa_manager->bo);
-		amdgpu_bo_unreserve(sa_manager->bo);
-	}
-	return r;
+	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
+	return 0;
 }
 
 static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)
-- 
2.7.4

* [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (15 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 16/22] drm/amdgpu: cleanup SA init and fini Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-18-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr Monk Liu
                     ` (2 subsequent siblings)
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

otherwise DMAR read errors come out of the CP, since GFX is still
alive and CPC's WPTR_POLL is still enabled.

fix:
we can halt CPG after hw_fini, but we cannot halt CPC because KIQ
still needs to stay alive so RLCV can invoke it; its WPTR_POLL can
be disabled, though.

Change-Id: Ia60ee54901531f737d09063bf2037630e7c94771
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index e9cc03e..793db9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -2961,7 +2961,8 @@ static int gfx_v9_0_hw_fini(void *handle)
 		gfx_v9_0_kcq_disable(&adev->gfx.kiq.ring, &adev->gfx.compute_ring[i]);
 
 	if (amdgpu_sriov_vf(adev)) {
-		pr_debug("For SRIOV client, shouldn't do anything.\n");
+		gfx_v9_0_cp_gfx_enable(adev, false);
+		WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
 		return 0;
 	}
 	gfx_v9_0_cp_enable(adev, false);
-- 
2.7.4

* [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (16 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-19-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 19/22] drm/amdgpu: adjust timeout for ib_ring_tests Monk Liu
  2018-02-26  5:18   ` [PATCH 20/22] drm/amdgpu: increase gart size to 512MB Monk Liu
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Emily Deng

From: Emily Deng <Emily.Deng@amd.com>

the original method will change the wptr value in wb.

Change-Id: I984fabca35d9dcf1f5fa8ef7779b2afb7f7d7370
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 3d5385d..8c6e408 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -240,13 +240,14 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
 	struct amdgpu_device *adev = ring->adev;
 	u64 *wptr = NULL;
 	uint64_t local_wptr = 0;
-
+	u64 tmp = 0;
 	if (ring->use_doorbell) {
 		/* XXX check if swapping is necessary on BE */
 		wptr = ((u64 *)&adev->wb.wb[ring->wptr_offs]);
 		DRM_DEBUG("wptr/doorbell before shift == 0x%016llx\n", *wptr);
-		*wptr = (*wptr) >> 2;
-		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", *wptr);
+		tmp = (*wptr) >> 2;
+		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", tmp);
+		return tmp;
 	} else {
 		u32 lowbit, highbit;
 		int me = (ring == &adev->sdma.instance[0].ring) ? 0 : 1;
@@ -260,9 +261,8 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
 		*wptr = highbit;
 		*wptr = (*wptr) << 32;
 		*wptr |= lowbit;
+		return *wptr;
 	}
-
-	return *wptr;
 }
 
 /**
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 19/22] drm/amdgpu: adjust timeout for ib_ring_tests
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (17 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-20-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:18   ` [PATCH 20/22] drm/amdgpu: increase gart size to 512MB Monk Liu
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

issue:
sometimes the GFX/MM IB tests hit a timeout under the SRIOV
environment; the root cause is that the engine doesn't come back
soon enough, so the current IB test is considered timed out.

fix:
for the SRIOV GFX IB test the wait time needs to be expanded a lot
in SRIOV runtime mode, since the test cannot really begin before the
GFX engine comes back.

the SRIOV MM IB test always needs more time, since MM scheduling
does not go together with the GFX engine: it is controlled by the h/w
MM scheduler, so whether in runtime or exclusive mode the MM IB test
always needs more time.

Change-Id: I0342371bc073656476ad850e1f5d9a021846dc8c
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 4709d13..d6776286 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -316,14 +316,42 @@ int amdgpu_ib_ring_tests(struct amdgpu_device *adev)
 {
 	unsigned i;
 	int r, ret = 0;
+	long tmo_gfx, tmo_mm;
+
+	tmo_mm = tmo_gfx = AMDGPU_IB_TEST_TIMEOUT;
+	if (amdgpu_sriov_vf(adev)) {
+		/* for MM engines in hypervisor side they are not scheduled together
+		 * with CP and SDMA engines, so even in exclusive mode MM engine could
+		 * still running on other VF thus the IB TEST TIMEOUT for MM engines
+		 * under SR-IOV should be set to a long time.
+		 */
+		tmo_mm = 8 * AMDGPU_IB_TEST_TIMEOUT; /* 8 sec should be enough for the MM comes back to this VF */
+	}
+
+	if (amdgpu_sriov_runtime(adev)) {
+		/* for CP & SDMA engines since they are scheduled together so
+		 * need to make the timeout width enough to cover the time
+		 * cost waiting for it coming back under RUNTIME only
+		*/
+		tmo_gfx = 8 * AMDGPU_IB_TEST_TIMEOUT;
+	}
+
+	adev->accel_working = true;
 
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
+		long tmo;
 
 		if (!ring || !ring->ready)
 			continue;
 
-		r = amdgpu_ring_test_ib(ring, AMDGPU_IB_TEST_TIMEOUT);
+		/* MM engine need more time */
+		if (ring->idx > 11)
+			tmo = tmo_mm;
+		else
+			tmo = tmo_gfx;
+
+		r = amdgpu_ring_test_ib(ring, tmo);
 		if (r) {
 			ring->ready = false;
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 20/22] drm/amdgpu: increase gart size to 512MB
       [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (18 preceding siblings ...)
  2018-02-26  5:18   ` [PATCH 19/22] drm/amdgpu: adjust timeout for ib_ring_tests Monk Liu
@ 2018-02-26  5:18   ` Monk Liu
       [not found]     ` <1519622300-8990-21-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  19 siblings, 1 reply; 58+ messages in thread
From: Monk Liu @ 2018-02-26  5:18 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

256MB is too small considering PTE/PDE shadow and TTM
eviction activity

Change-Id: Ifaa30dc730eec36af47fbdeb3cce30de9067b17f
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 6839d93..e925121 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -789,7 +789,7 @@ static int gmc_v9_0_mc_init(struct amdgpu_device *adev)
 		switch (adev->asic_type) {
 		case CHIP_VEGA10:  /* all engines support GPUVM */
 		default:
-			adev->gmc.gart_size = 256ULL << 20;
+			adev->gmc.gart_size = 512ULL << 20;
 			break;
 		case CHIP_RAVEN:   /* DCE SG support */
 			adev->gmc.gart_size = 1024ULL << 20;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/22] drm/amdgpu: remove duplicated job_free_resources
       [not found]     ` <1519622300-8990-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  9:51       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26  9:51 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

NAK, the job->fence is assigned a reference just the line before:
> job->fence = dma_fence_get(fence);

When the fence ends up as a wild pointer in the SA we have a problem 
with fence reference counting somewhere, not a problem with freeing 
resources.
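
As a rough illustration of the counting rule involved (hypothetical helper,
not the actual amdgpu code path): any slot that keeps a fence beyond the
current call must own its own reference.

    #include <linux/dma-fence.h>

    /* sketch only: a slot that stores a fence takes its own reference
     * and drops the one it previously held */
    static void example_keep_fence(struct dma_fence **slot,
                                   struct dma_fence *fence)
    {
            dma_fence_put(*slot);          /* release the old reference, if any */
            *slot = dma_fence_get(fence);  /* take a new one for this slot */
    }

If the SA fence slot only stores the raw pointer without such a get, any
later put elsewhere leaves it dangling, which is exactly the wild-pointer
symptom described above.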

Regards,
Christian.

On 26.02.2018 at 06:18, Monk Liu wrote:
> if a job goes through the scheduler, it is forbidden to
> call job_free_resources after ib_schedule, because that
> would assign a wild fence pointer to sa_bo->fence,
> which could lead to weird bugs
>
> Change-Id: Iad7ee011c641cb7357569cbce36fdc10f0ed911d
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 2bd5676..7cb3437 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -207,7 +207,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   	dma_fence_put(job->fence);
>   	job->fence = dma_fence_get(fence);
>   
> -	amdgpu_job_free_resources(job);
>   	return fence;
>   }
>   

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/22] drm/amdgpu: cond_exec only for schedule with a job
       [not found]     ` <1519622300-8990-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  9:53       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26  9:53 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> issue:
> under SR-IOV the IB test sometimes fails on the
> gfx ring
>
> fix:
> with cond_exec inserted in the RB, the gfx engine will
> skip part of the packets if RLCV issues PREEMPT while the gfx
> engine is still prior to the COND_EXEC packet. this is
> okay for regular commands from UMD, but since the IB test's
> whole dma format doesn't support PREEMPT,
> the COND_EXEC must be removed from it.
>
> Change-Id: I73bd0479bf1307139ff3726a00c312526054aa6d
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index 8ea342d..7f2c314 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -181,7 +181,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>   		}
>   	}
>   
> -	if (ring->funcs->init_cond_exec)
> +	if (job && ring->funcs->init_cond_exec)
>   		patch_offset = amdgpu_ring_init_cond_exec(ring);
>   
>   #ifdef CONFIG_X86_64

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/22] drm/amdgpu: don't use MM idle_work for SRIOV
       [not found]     ` <1519622300-8990-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  9:57       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26  9:57 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> two reasons for this patch:
> 1) SRIOV doesn't give VF cg/pg feature so the idle_work
> is useless

That is a good reason.

> 2) VCE's idle work would cause the "KMD reload" test to fail and
> trigger a KERNEL PANIC; this is because the idle work is triggered
> after the VCE IB test while the KMD is removed right after being loaded,
> which can hit a page fault in the TIMER callback.

That isn't. In this case we need to fix the KMD reload order and not 
apply a band aid here.

In addition, the patch is way too intrusive, but see below.

>
> ISSUE resolved:
>
> SWDEV-143244:
>
> [  275.021338] RAX: 6f150001c3808df1 RBX: ffff95667fd91e70 RCX: ffff9566586878b8
> [  275.023000] RDX: ffff95667fd83ee8 RSI: ffff95667fd91e28 RDI: ffff95667fd83ef0
> [  275.024660] RBP: ffff95667fd83f58 R08: ffff95667fd92238 R09: ffff95667fd83ef0
> [  275.026303] R10: 0000000000000000 R11: 0000000000000070 R12: ffff95667fd91e00
> [  275.027951] R13: ffff95667fd83ee8 R14: 0000000000000000 R15: ffff9566586878b8
> [  275.029600] FS:  0000000000000000(0000) GS:ffff95667fd80000(0000) knlGS:0000000000000000
> [  275.031896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  275.033367] CR2: 0000560972e27300 CR3: 000000011fe09000 CR4: 00000000001406e0
> [  275.035095] Call Trace:
> [  275.036032]  <IRQ>
> [  275.036909]  ? ktime_get+0x3b/0xa0
> [  275.037997]  ? native_apic_msr_write+0x2b/0x30
> [  275.039259]  ? lapic_next_event+0x1d/0x30
> [  275.040461]  ? clockevents_program_event+0x7d/0xf0
> [  275.041723]  __do_softirq+0xed/0x278
> [  275.042767]  irq_exit+0xf3/0x100
> [  275.043740]  smp_apic_timer_interrupt+0x3d/0x50
> [  275.044917]  apic_timer_interrupt+0x93/0xa0
> [  275.046041] RIP: 0010:native_safe_halt+0x6/0x10
> [  275.047212] RSP: 0018:ffffaeac8069bea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> [  275.049330] RAX: 0000000000080000 RBX: ffff95667a6b1e00 RCX: 0000000000000000
> [  275.052920] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [  275.054560] RBP: ffffaeac8069bea0 R08: 0000000000000006 R09: 0000000000000001
> [  275.056148] R10: 000000000001fee5 R11: 0000000000000000 R12: 0000000000000003
> [  275.057751] R13: ffff95667a6b1e00 R14: 0000000000000000 R15: 0000000000000000
> [  275.059340]  </IRQ>
> [  275.060138]  default_idle+0x1e/0x100
> [  275.061183]  arch_cpu_idle+0xf/0x20
> [  275.062184]  default_idle_call+0x23/0x30
> [  275.063256]  do_idle+0x16a/0x1d0
> [  275.064223]  cpu_startup_entry+0x1d/0x20
> [  275.065282]  start_secondary+0xfd/0x120
> [  275.066276]  secondary_startup_64+0x9f/0x9f
> [  275.067300] Code: 4c 8d 6c c5 90 49 8b 45 00 48 85 c0 74 e1 4d 8b 7d 00 4d 89 7c 24 08 0f 1f 44 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 20 48 b8 00 02 00 00 00 00
> [  275.071284] RIP: run_timer_softirq+0x138/0x440 RSP: ffff95667fd83ee0
> [  275.072572] ---[ end trace a5789e583cb451a7 ]---
> [  275.073626] Kernel panic - not syncing: Fatal exception in interrupt
> [  275.075001] Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [  275.077381] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> [  275.078753] sched: Unexpected reschedule of offline CPU#0!
> [  275.079918] ------------[ cut here ]------------
> [  275.082724] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x3c/0x40
> [  275.084856] Modules linked in: amdgpu(OE-) chash ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hw]
> [  275.094230] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D W  OE   4.13.0-sriov-17.50 #1
> [  275.096303] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  275.098575] task: ffff95667a6b1e00 task.stack: ffffaeac80698000
> [  275.099883] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
> [  275.101168] RSP: 0018:ffff95667fd83b08 EFLAGS: 00010096
> [  275.102381] RAX: 000000000000002e RBX: ffff9566791c9e00 RCX: 000000000000083f
> [  275.103846] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
>
> Change-Id: I6dd7ea48d23b0fee74ecb9e93b53bfe36b0e8164
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 14 +++++++++-----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 11 +++++++----
>   2 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 9cd5517..caba610 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -123,7 +123,8 @@ int amdgpu_uvd_sw_init(struct amdgpu_device *adev)
>   	unsigned version_major, version_minor, family_id;
>   	int i, r;
>   
> -	INIT_DELAYED_WORK(&adev->uvd.idle_work, amdgpu_uvd_idle_work_handler);
> +	if (!amdgpu_sriov_vf(adev))
> +		INIT_DELAYED_WORK(&adev->uvd.idle_work, amdgpu_uvd_idle_work_handler);
>   
>   	switch (adev->asic_type) {
>   #ifdef CONFIG_DRM_AMDGPU_CIK
> @@ -297,7 +298,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>   	if (adev->uvd.vcpu_bo == NULL)
>   		return 0;
>   
> -	cancel_delayed_work_sync(&adev->uvd.idle_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work_sync(&adev->uvd.idle_work);
>   
>   	for (i = 0; i < adev->uvd.max_handles; ++i)
>   		if (atomic_read(&adev->uvd.handles[i]))
> @@ -1117,7 +1119,7 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
>   	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
>   
>   	if (amdgpu_sriov_vf(adev))
> -		return;
> +		BUG();
>   
>   	if (fences == 0) {
>   		if (adev->pm.dpm_enabled) {
> @@ -1138,11 +1140,12 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
>   void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
>   {
>   	struct amdgpu_device *adev = ring->adev;
> -	bool set_clocks = !cancel_delayed_work_sync(&adev->uvd.idle_work);
> +	bool set_clocks;
>   
>   	if (amdgpu_sriov_vf(adev))
>   		return;
>   
> +	set_clocks = !cancel_delayed_work_sync(&adev->uvd.idle_work);
>   	if (set_clocks) {
>   		if (adev->pm.dpm_enabled) {
>   			amdgpu_dpm_enable_uvd(adev, true);
> @@ -1158,7 +1161,8 @@ void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
>   
>   void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
>   {
> -	schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
> +	if (!amdgpu_sriov_vf(ring->adev))
> +		schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index d274ae5..0877fc39 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -187,7 +187,8 @@ int amdgpu_vce_sw_init(struct amdgpu_device *adev, unsigned long size)
>   		adev->vce.filp[i] = NULL;
>   	}
>   
> -	INIT_DELAYED_WORK(&adev->vce.idle_work, amdgpu_vce_idle_work_handler);
> +	if (!amdgpu_sriov_vf(adev))
> +		INIT_DELAYED_WORK(&adev->vce.idle_work, amdgpu_vce_idle_work_handler);
>   	mutex_init(&adev->vce.idle_mutex);
>   
>   	return 0;
> @@ -241,7 +242,8 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>   	if (i == AMDGPU_MAX_VCE_HANDLES)
>   		return 0;
>   
> -	cancel_delayed_work_sync(&adev->vce.idle_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work_sync(&adev->vce.idle_work);
>   	/* TODO: suspending running encoding sessions isn't supported */
>   	return -EINVAL;
>   }
> @@ -301,7 +303,7 @@ static void amdgpu_vce_idle_work_handler(struct work_struct *work)
>   	unsigned i, count = 0;
>   
>   	if (amdgpu_sriov_vf(adev))
> -		return;
> +		BUG();
>   
>   	for (i = 0; i < adev->vce.num_rings; i++)
>   		count += amdgpu_fence_count_emitted(&adev->vce.ring[i]);
> @@ -362,7 +364,8 @@ void amdgpu_vce_ring_begin_use(struct amdgpu_ring *ring)
>    */
>   void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
>   {
> -	schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
> +	if (!amdgpu_sriov_vf(ring->adev))
> +		schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);

This is the only change needed, everything else is superfluous and can 
be removed from the patch.

Regards,
Christian.

>   }
>   
>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 07/22] drm/amdgpu: fix fence_emit NULL pointer ref risk
       [not found]     ` <1519622300-8990-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  9:58       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26  9:58 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> Change-Id: Ic39de73338d515d4e33a828f445d8a3c3ba544bb
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 008e198..401d710 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -163,7 +163,8 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f)
>   
>   	rcu_assign_pointer(*ptr, dma_fence_get(&fence->base));
>   
> -	*f = &fence->base;
> +	if (f)
> +		*f = &fence->base;

NAK, returning the fence pointer isn't optional here.

Christian.

>   
>   	return 0;
>   }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/22] drm/amdgpu: cleanups VCE/UVD ib test part
       [not found]     ` <1519622300-8990-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:06       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26 10:06 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> 1) amdgpu_vce_get_create_msg is only used in the IB test, so
> there is no reason not to make it a static routine; also add a
> timeout parameter to it.
>
> 2) the fence handling of the MM IB test part is a little messy;
> clean it up to make it easier to read
>
> Change-Id: Ic9bfd9971457600266096e114210e84ce9b4347d
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 43 ++++++++++++++++++---------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h |  2 --
>   2 files changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index 0877fc39..a829350 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -405,8 +405,8 @@ void amdgpu_vce_free_handles(struct amdgpu_device *adev, struct drm_file *filp)
>    *
>    * Open up a stream for HW test
>    */
> -int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
> -			      struct dma_fence **fence)
> +static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
> +			      struct dma_fence **fence, long timeout)

Looks good to me, but you could go a step further and make 
amdgpu_vce_get_destroy_msg static as well.

In addition, you can also drop the fence parameter from 
amdgpu_vce_get_create_msg.
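
Sketched, the resulting declarations might look roughly like this (assuming
the timeout parameter from this patch is kept and the destroy path still
returns a fence):

    static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring,
                                         uint32_t handle, long timeout);
    static int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring,
                                          uint32_t handle, bool direct,
                                          struct dma_fence **fence);
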

>   {
>   	const unsigned ib_size_dw = 1024;
>   	struct amdgpu_job *job;
> @@ -461,19 +461,25 @@ int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
>   		ib->ptr[i] = 0x0;
>   
>   	r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
> -	job->fence = dma_fence_get(f);
>   	if (r)
> +		return r;
> +	r = dma_fence_wait_timeout(f, false, timeout);
> +	if (r == 0) {
> +		DRM_ERROR("amdgpu: VCE IB test get_create_msg timed out.\n");
> +		r = -ETIMEDOUT;

NAK, waiting for the create message alone can crash the VCE firmware on 
older firmware versions.

You need to send create plus destroy or encode message before you can wait.
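
Roughly the same ordering the IB test further down already uses; a sketch
only, with error handling omitted and the original signatures plus the
test's timeout assumed:

    struct dma_fence *fence = NULL;
    long r;

    r = amdgpu_vce_get_create_msg(ring, 1, NULL);          /* submit create, do not wait yet */
    r = amdgpu_vce_get_destroy_msg(ring, 1, true, &fence); /* submit destroy */
    r = dma_fence_wait_timeout(fence, false, timeout);     /* wait only after both are queued */
    dma_fence_put(fence);
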

>   		goto err;
> -
> -	amdgpu_job_free(job);
> -	if (fence)
> -		*fence = dma_fence_get(f);
> -	dma_fence_put(f);
> -	return 0;
> +	} else if (r < 0) {
> +		DRM_ERROR("amdgpu: VCE IB test get_create_msg fence wait failed (%ld).\n", r);
> +		goto err;
> +	} else {
> +		r = 0;
> +	}

That looks like you are leaking job here now.

Probably best if you split the patch and make the functions static first 
and then add the rest in another patch.

Regards,
Christian.

>   
>   err:
> -	amdgpu_job_free(job);
> +	amdgpu_ib_free(ring->adev, ib, NULL);
> +	dma_fence_put(f);
>   	return r;
> +
>   }
>   
>   /**
> @@ -523,7 +529,7 @@ int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
>   		ib->ptr[i] = 0x0;
>   
>   	if (direct) {
> -		r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
> +		r = amdgpu_ib_schedule(ring, 1, ib, NULL, fence);
>   		job->fence = dma_fence_get(f);
>   		if (r)
>   			goto err;
> @@ -531,14 +537,11 @@ int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
>   		amdgpu_job_free(job);
>   	} else {
>   		r = amdgpu_job_submit(job, ring, &ring->adev->vce.entity,
> -				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
> +				      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
>   		if (r)
>   			goto err;
>   	}
>   
> -	if (fence)
> -		*fence = dma_fence_get(f);
> -	dma_fence_put(f);
>   	return 0;
>   
>   err:
> @@ -1075,13 +1078,13 @@ int amdgpu_vce_ring_test_ring(struct amdgpu_ring *ring)
>   int amdgpu_vce_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>   {
>   	struct dma_fence *fence = NULL;
> -	long r;
> +	long r = 0;
>   
>   	/* skip vce ring1/2 ib test for now, since it's not reliable */
>   	if (ring != &ring->adev->vce.ring[0])
>   		return 0;
>   
> -	r = amdgpu_vce_get_create_msg(ring, 1, NULL);
> +	r = amdgpu_vce_get_create_msg(ring, 1, NULL, timeout);
>   	if (r) {
>   		DRM_ERROR("amdgpu: failed to get create msg (%ld).\n", r);
>   		goto error;
> @@ -1095,12 +1098,12 @@ int amdgpu_vce_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>   
>   	r = dma_fence_wait_timeout(fence, false, timeout);
>   	if (r == 0) {
> -		DRM_ERROR("amdgpu: IB test timed out.\n");
> +		DRM_ERROR("amdgpu: IB test get_destory_msg timed out.\n");
>   		r = -ETIMEDOUT;
>   	} else if (r < 0) {
> -		DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
> +		DRM_ERROR("amdgpu: IB test get_destory_msg fence wait failed (%ld).\n", r);
>   	} else {
> -		DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
> +		DRM_DEBUG("ib test get_destory_msg on ring %d succeeded\n", ring->idx);
>   		r = 0;
>   	}
>   error:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
> index 7178126..60cb657 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h
> @@ -57,8 +57,6 @@ int amdgpu_vce_sw_init(struct amdgpu_device *adev, unsigned long size);
>   int amdgpu_vce_sw_fini(struct amdgpu_device *adev);
>   int amdgpu_vce_suspend(struct amdgpu_device *adev);
>   int amdgpu_vce_resume(struct amdgpu_device *adev);
> -int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
> -			      struct dma_fence **fence);
>   int amdgpu_vce_get_destroy_msg(struct amdgpu_ring *ring, uint32_t handle,
>   			       bool direct, struct dma_fence **fence);
>   void amdgpu_vce_free_handles(struct amdgpu_device *adev, struct drm_file *filp);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]     ` <1519622300-8990-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:08       ` Christian König
       [not found]         ` <a89aeb3e-1687-e7e6-bef4-9371ecb13a70-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-26 10:08 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> issue:
> on bare-metal, when doing the kmd reload test, there is a chance
> that the kernel hits a fatal error after the driver is unloaded/reloaded
>
> fix:
> the cause is that those "idle work" items are not really stopped, and
> if the kmd is unloaded too quickly there is a chance that the
> "idle work" runs after the driver structures are already released, which
> introduces this issue.

In this case I think it would be much better to wait for the idle work 
before trying to unload the driver.
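
A minimal sketch of that direction (assuming the UVD and VCE idle_work
items are what is still pending at unload time):

    /* let any queued idle work run to completion before teardown */
    flush_delayed_work(&adev->uvd.idle_work);
    flush_delayed_work(&adev->vce.idle_work);
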

>
> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>   3 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 54145ec..69fb5e50 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>   		}
>   	}
>   
> -	mod_delayed_work(system_wq, &adev->late_init_work,
> +	if (!amdgpu_sriov_vf(adev))
> +		mod_delayed_work(system_wq, &adev->late_init_work,
>   			msecs_to_jiffies(AMDGPU_RESUME_MS));
>   
>   	amdgpu_device_fill_reset_magic(adev);
> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>   		adev->firmware.gpu_info_fw = NULL;
>   	}
>   	adev->accel_working = false;
> -	cancel_delayed_work_sync(&adev->late_init_work);
> +
> +	if (!amdgpu_sriov_vf(adev))
> +		while (cancel_delayed_work_sync(&adev->late_init_work))
> +			schedule(); /* to make sure late_init_work really stopped */

Never use schedule() like that.
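
If the idle/late-init handler never re-queues itself, which seems to be the
case here, a single synchronous cancel already waits for a running instance
to finish, so no loop is needed:

    /* cancel_delayed_work_sync() waits for an in-flight handler */
    cancel_delayed_work_sync(&adev->late_init_work);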

Regards,
Christian.

> +
>   	/* free i2c buses */
>   	if (!amdgpu_device_has_dc_support(adev))
>   		amdgpu_i2c_fini(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index caba610..337db57 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>   		return 0;
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
> +			schedule(); /* to make sure idle work really stopped */
>   
>   	for (i = 0; i < adev->uvd.max_handles; ++i)
>   		if (atomic_read(&adev->uvd.handles[i]))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index a829350..2874fda 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>   		return 0;
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		cancel_delayed_work_sync(&adev->vce.idle_work);
> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
> +			schedule(); /* to make sure the idle_work really stopped */
> +
>   	/* TODO: suspending running encoding sessions isn't supported */
>   	return -EINVAL;
>   }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV
       [not found]     ` <1519622300-8990-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:09       ` Christian König
       [not found]         ` <994af188-1283-44ea-f868-73f4002a7cb3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-26 10:09 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I would rather avoid calling the function in the first place.

Christian.

On 26.02.2018 at 06:18, Monk Liu wrote:
> Change-Id: I370966acd0f1925a99dfde888678e6e0fd093b15
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 337db57..5fb4357 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -1117,11 +1117,13 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
>   {
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, uvd.idle_work.work);
> -	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
> +	unsigned fences;
>   
>   	if (amdgpu_sriov_vf(adev))
>   		BUG();
>   
> +	fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
> +
>   	if (fences == 0) {
>   		if (adev->pm.dpm_enabled) {
>   			amdgpu_dpm_enable_uvd(adev, false);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 13/22] drm/amdgpu: fix vce_ring test memleak
       [not found]     ` <1519622300-8990-14-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:10       ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26 10:10 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Yeah, well that is the leak you introduced in patch #8.

So please instead fix the original patch which messed things up.

Christian.

On 26.02.2018 at 06:18, Monk Liu wrote:
> Change-Id: I35a343b21a007716fc7811781650264339c94273
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index 2874fda..4ae7cb0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -463,8 +463,10 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
>   		ib->ptr[i] = 0x0;
>   
>   	r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
> -	if (r)
> -		return r;
> +	if (r) {
> +		DRM_ERROR("failed to ib_schedule VCE get_create_msg\n");
> +		goto err;
> +	}
>   	r = dma_fence_wait_timeout(f, false, timeout);
>   	if (r == 0) {
>   		DRM_ERROR("amdgpu: VCE IB test get_create_msg timed out.\n");
> @@ -478,6 +480,7 @@ static int amdgpu_vce_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
>   	}
>   
>   err:
> +	amdgpu_job_free(job);
>   	amdgpu_ib_free(ring->adev, ib, NULL);
>   	dma_fence_put(f);
>   	return r;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
       [not found]     ` <1519622300-8990-17-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:12       ` Christian König
       [not found]         ` <51af9fc4-7827-9f40-da2b-0fd8753bc321-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-26 10:12 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.02.2018 at 06:18, Monk Liu wrote:
> should use bo_create_kernel instead of splitting into two
> functions that create and pin the SA bo
>
> issue:
> before this patch, there were DMAR read errors on the host
> side when running the SRIOV test; the DMAR address fell
> within the range of the SA bo.
>
> fix:
> after this cleanup of SA init and fini, the above DMAR
> errors are gone.

Well the change looks like a valid cleanup to me, but the explanation 
of why that should be a problem doesn't seem to make sense.

Please explain further what this should fix.

Regards,
Christian.

>
> Change-Id: I3f299a3342bd7263776bff69e4b31b0d3816749a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c |  6 ----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c | 61 +++-------------------------------
>   2 files changed, 4 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index 7f2c314..4709d13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -279,11 +279,6 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	r = amdgpu_sa_bo_manager_start(adev, &adev->ring_tmp_bo);
> -	if (r) {
> -		return r;
> -	}
> -
>   	adev->ib_pool_ready = true;
>   	if (amdgpu_debugfs_sa_init(adev)) {
>   		dev_err(adev->dev, "failed to register debugfs file for SA\n");
> @@ -303,7 +298,6 @@ void amdgpu_ib_pool_fini(struct amdgpu_device *adev)
>   {
>   	if (adev->ib_pool_ready) {
>   		amdgpu_sa_bo_manager_suspend(adev, &adev->ring_tmp_bo);
> -		amdgpu_sa_bo_manager_fini(adev, &adev->ring_tmp_bo);
>   		adev->ib_pool_ready = false;
>   	}
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> index 5ca75a4..695c639 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> @@ -63,80 +63,27 @@ int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
>   	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
>   		INIT_LIST_HEAD(&sa_manager->flist[i]);
>   
> -	r = amdgpu_bo_create(adev, size, align, true, domain,
> -			     0, NULL, NULL, &sa_manager->bo);
> +	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
> +				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
>   	if (r) {
>   		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
>   		return r;
>   	}
>   
> -	return r;
> -}
> -
> -void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	struct amdgpu_sa_bo *sa_bo, *tmp;
> -
> -	if (!list_empty(&sa_manager->olist)) {
> -		sa_manager->hole = &sa_manager->olist,
> -		amdgpu_sa_bo_try_free(sa_manager);
> -		if (!list_empty(&sa_manager->olist)) {
> -			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
> -		}
> -	}
> -	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
> -		amdgpu_sa_bo_remove_locked(sa_bo);
> -	}
> -	amdgpu_bo_unref(&sa_manager->bo);
> -	sa_manager->size = 0;
> -}
> -
> -int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	int r;
> -
> -	if (sa_manager->bo == NULL) {
> -		dev_err(adev->dev, "no bo for sa manager\n");
> -		return -EINVAL;
> -	}
> -
> -	/* map the buffer */
> -	r = amdgpu_bo_reserve(sa_manager->bo, false);
> -	if (r) {
> -		dev_err(adev->dev, "(%d) failed to reserve manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_pin(sa_manager->bo, sa_manager->domain, &sa_manager->gpu_addr);
> -	if (r) {
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -		dev_err(adev->dev, "(%d) failed to pin manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_kmap(sa_manager->bo, &sa_manager->cpu_ptr);
>   	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
> -	amdgpu_bo_unreserve(sa_manager->bo);
>   	return r;
>   }
>   
>   int amdgpu_sa_bo_manager_suspend(struct amdgpu_device *adev,
>   				 struct amdgpu_sa_manager *sa_manager)
>   {
> -	int r;
> -
>   	if (sa_manager->bo == NULL) {
>   		dev_err(adev->dev, "no bo for sa manager\n");
>   		return -EINVAL;
>   	}
>   
> -	r = amdgpu_bo_reserve(sa_manager->bo, true);
> -	if (!r) {
> -		amdgpu_bo_kunmap(sa_manager->bo);
> -		amdgpu_bo_unpin(sa_manager->bo);
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -	}
> -	return r;
> +	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
> +	return 0;
>   }
>   
>   static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr
       [not found]     ` <1519622300-8990-19-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:22       ` Christian König
       [not found]         ` <b9826bed-1b38-cbee-536a-1bd9d50dca6b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-26 10:22 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Emily Deng

On 26.02.2018 at 06:18, Monk Liu wrote:
> From: Emily Deng <Emily.Deng@amd.com>
>
> the original method will change the wptr value in wb.
>
> Change-Id: I984fabca35d9dcf1f5fa8ef7779b2afb7f7d7370
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index 3d5385d..8c6e408 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -240,13 +240,14 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
>   	struct amdgpu_device *adev = ring->adev;
>   	u64 *wptr = NULL;
>   	uint64_t local_wptr = 0;
> -
> +	u64 tmp = 0;

Please keep an empty line between deceleration and code.

>   	if (ring->use_doorbell) {
>   		/* XXX check if swapping is necessary on BE */
>   		wptr = ((u64 *)&adev->wb.wb[ring->wptr_offs]);
>   		DRM_DEBUG("wptr/doorbell before shift == 0x%016llx\n", *wptr);
> -		*wptr = (*wptr) >> 2;
> -		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", *wptr);
> +		tmp = (*wptr) >> 2;
> +		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", tmp);
> +		return tmp;

Well I completely agree that this code is absolute nonsense and could 
result in completely random hardware behavior.

But I think we need further cleanup than what you already did. How about 
the following instead?

> static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
> {
>         struct amdgpu_device *adev = ring->adev;
>         uint64_t wptr;
>
>         if (ring->use_doorbell) {
>                 /* XXX check if swapping is necessary on BE */
>                 wptr = READ_ONCE(*((u64 *)&adev->wb.wb[ring->wptr_offs]));
>                 DRM_DEBUG("wptr/doorbell == 0x%016llx\n", wptr);
>         } else {
>                 int me = (ring == &adev->sdma.instance[0].ring) ? 0 : 1;
>                 u32 lowbit, highbit;
>
>                 lowbit = RREG32(sdma_v4_0_get_reg_offset(adev, me, 
> mmSDMA0_GFX_RB_WPTR));
>                 highbit = RREG32(sdma_v4_0_get_reg_offset(adev, me, 
> mmSDMA0_GFX_RB_WPTR_HI));
>
>                 DRM_DEBUG("wptr [%i]high== 0x%08x low==0x%08x\n",
>                                 me, highbit, lowbit);
>                 wptr = highbit;
>                 wptr = wptr << 32;
>                 wptr |= lowbit;
>         }
>
>         return wptr >> 2;
> }

Good catch,
Christian.

>   	} else {
>   		u32 lowbit, highbit;
>   		int me = (ring == &adev->sdma.instance[0].ring) ? 0 : 1;
> @@ -260,9 +261,8 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
>   		*wptr = highbit;
>   		*wptr = (*wptr) << 32;
>   		*wptr |= lowbit;
> +		return *wptr;
>   	}
> -
> -	return *wptr;
>   }
>   
>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
       [not found]         ` <51af9fc4-7827-9f40-da2b-0fd8753bc321-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-26 10:41           ` Liu, Monk
       [not found]             ` <SN1PR12MB04625A0F831F186440F4A81684C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-26 10:41 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

In SRIOV, I found the hypervisor continuously reporting DMAR read errors, and finally one of those errors
is related to the SA (I judged this by comparing the DMAR error page address with the SA MC address, and they share the same PAGE!!)

With this patch applied, one of the DMAR read errors is gone.

Root cause is simple: many engines keep accessing some wptr polling address before
they are disabled, so the SA must always be valid from the moment it is created; we must therefore use bo_create_kernel to make sure it is already pinned before someone uses it.
Otherwise, during the kmd reload test there will be a lot of DMAR read errors.

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: February 26, 2018 18:13
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini

On 26.02.2018 at 06:18, Monk Liu wrote:
> should use bo_create_kernel instead of split to two function that 
> create and pin the SA bo
>
> issue:
> before this patch, there are DMAR read error in host side when running 
> SRIOV test, the DMAR address dropped in the range of SA bo.
>
> fix:
> after this cleanups of SA init and fini, above DMAR eror gone.

Well the change looks like a valid cleanup to me, but the explanation of why that should be a problem doesn't seem to make sense.

Please explain further what this should fix.

Regards,
Christian.

>
> Change-Id: I3f299a3342bd7263776bff69e4b31b0d3816749a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c |  6 ----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c | 61 +++-------------------------------
>   2 files changed, 4 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index 7f2c314..4709d13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -279,11 +279,6 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	r = amdgpu_sa_bo_manager_start(adev, &adev->ring_tmp_bo);
> -	if (r) {
> -		return r;
> -	}
> -
>   	adev->ib_pool_ready = true;
>   	if (amdgpu_debugfs_sa_init(adev)) {
>   		dev_err(adev->dev, "failed to register debugfs file for SA\n"); @@ 
> -303,7 +298,6 @@ void amdgpu_ib_pool_fini(struct amdgpu_device *adev)
>   {
>   	if (adev->ib_pool_ready) {
>   		amdgpu_sa_bo_manager_suspend(adev, &adev->ring_tmp_bo);
> -		amdgpu_sa_bo_manager_fini(adev, &adev->ring_tmp_bo);
>   		adev->ib_pool_ready = false;
>   	}
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> index 5ca75a4..695c639 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> @@ -63,80 +63,27 @@ int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
>   	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
>   		INIT_LIST_HEAD(&sa_manager->flist[i]);
>   
> -	r = amdgpu_bo_create(adev, size, align, true, domain,
> -			     0, NULL, NULL, &sa_manager->bo);
> +	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
> +				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
>   	if (r) {
>   		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
>   		return r;
>   	}
>   
> -	return r;
> -}
> -
> -void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	struct amdgpu_sa_bo *sa_bo, *tmp;
> -
> -	if (!list_empty(&sa_manager->olist)) {
> -		sa_manager->hole = &sa_manager->olist,
> -		amdgpu_sa_bo_try_free(sa_manager);
> -		if (!list_empty(&sa_manager->olist)) {
> -			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
> -		}
> -	}
> -	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
> -		amdgpu_sa_bo_remove_locked(sa_bo);
> -	}
> -	amdgpu_bo_unref(&sa_manager->bo);
> -	sa_manager->size = 0;
> -}
> -
> -int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	int r;
> -
> -	if (sa_manager->bo == NULL) {
> -		dev_err(adev->dev, "no bo for sa manager\n");
> -		return -EINVAL;
> -	}
> -
> -	/* map the buffer */
> -	r = amdgpu_bo_reserve(sa_manager->bo, false);
> -	if (r) {
> -		dev_err(adev->dev, "(%d) failed to reserve manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_pin(sa_manager->bo, sa_manager->domain, &sa_manager->gpu_addr);
> -	if (r) {
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -		dev_err(adev->dev, "(%d) failed to pin manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_kmap(sa_manager->bo, &sa_manager->cpu_ptr);
>   	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
> -	amdgpu_bo_unreserve(sa_manager->bo);
>   	return r;
>   }
>   
>   int amdgpu_sa_bo_manager_suspend(struct amdgpu_device *adev,
>   				 struct amdgpu_sa_manager *sa_manager)
>   {
> -	int r;
> -
>   	if (sa_manager->bo == NULL) {
>   		dev_err(adev->dev, "no bo for sa manager\n");
>   		return -EINVAL;
>   	}
>   
> -	r = amdgpu_bo_reserve(sa_manager->bo, true);
> -	if (!r) {
> -		amdgpu_bo_kunmap(sa_manager->bo);
> -		amdgpu_bo_unpin(sa_manager->bo);
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -	}
> -	return r;
> +	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
> +	return 0;
>   }
>   
>   static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
       [not found]             ` <SN1PR12MB04625A0F831F186440F4A81684C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-26 10:48               ` Liu, Monk
       [not found]                 ` <SN1PR12MB04620B6D208218E5EB23824284C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-26 10:48 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Root cause is simple: many engine always accessing some wptr polling address before They are disabled, so SA must always be valid since it is created, so must use bo_create_kernel to make sure it is pinned already before someone use it, Ortherwise during kmd reloading test there will be lot of DMAR reading error

Sorry, the above description is wrong; I have too many patches and I mixed up their root causes.

And this sentence is correct: " I judge by comparing the DMAR error page address with SA MC address, and they shared the same PAGE"

I cannot recall the details right now ... 

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: February 26, 2018 18:41
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini

In SRIOV, I found hypervisor continues report DMAR read error, and finally one of that error Related with SA, (I judge by comparing the DMAR error page address with SA MC address, and they shared the same PAGE!!)

With this patch applied, one of the DMAR reading error gone, 

Root cause is simple: many engine always accessing some wptr polling address before They are disabled, so SA must always be valid since it is created, so must use bo_create_kernel to make sure it is pinned already before someone use it, Ortherwise during kmd reloading test there will be lot of DMAR reading error

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
Sent: February 26, 2018 18:13
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini

On 26.02.2018 at 06:18, Monk Liu wrote:
> should use bo_create_kernel instead of split to two function that 
> create and pin the SA bo
>
> issue:
> before this patch, there are DMAR read error in host side when running 
> SRIOV test, the DMAR address dropped in the range of SA bo.
>
> fix:
> after this cleanups of SA init and fini, above DMAR eror gone.

Well the change looks like a valid cleanup to me, but the explanation of why that should be a problem doesn't seem to make sense.

Please explain further what this should fix.

Regards,
Christian.

>
> Change-Id: I3f299a3342bd7263776bff69e4b31b0d3816749a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c |  6 ----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c | 61 +++-------------------------------
>   2 files changed, 4 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index 7f2c314..4709d13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -279,11 +279,6 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	r = amdgpu_sa_bo_manager_start(adev, &adev->ring_tmp_bo);
> -	if (r) {
> -		return r;
> -	}
> -
>   	adev->ib_pool_ready = true;
>   	if (amdgpu_debugfs_sa_init(adev)) {
>   		dev_err(adev->dev, "failed to register debugfs file for SA\n"); @@
> -303,7 +298,6 @@ void amdgpu_ib_pool_fini(struct amdgpu_device *adev)
>   {
>   	if (adev->ib_pool_ready) {
>   		amdgpu_sa_bo_manager_suspend(adev, &adev->ring_tmp_bo);
> -		amdgpu_sa_bo_manager_fini(adev, &adev->ring_tmp_bo);
>   		adev->ib_pool_ready = false;
>   	}
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> index 5ca75a4..695c639 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
> @@ -63,80 +63,27 @@ int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
>   	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
>   		INIT_LIST_HEAD(&sa_manager->flist[i]);
>   
> -	r = amdgpu_bo_create(adev, size, align, true, domain,
> -			     0, NULL, NULL, &sa_manager->bo);
> +	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
> +				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
>   	if (r) {
>   		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
>   		return r;
>   	}
>   
> -	return r;
> -}
> -
> -void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	struct amdgpu_sa_bo *sa_bo, *tmp;
> -
> -	if (!list_empty(&sa_manager->olist)) {
> -		sa_manager->hole = &sa_manager->olist,
> -		amdgpu_sa_bo_try_free(sa_manager);
> -		if (!list_empty(&sa_manager->olist)) {
> -			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
> -		}
> -	}
> -	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
> -		amdgpu_sa_bo_remove_locked(sa_bo);
> -	}
> -	amdgpu_bo_unref(&sa_manager->bo);
> -	sa_manager->size = 0;
> -}
> -
> -int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
> -			       struct amdgpu_sa_manager *sa_manager)
> -{
> -	int r;
> -
> -	if (sa_manager->bo == NULL) {
> -		dev_err(adev->dev, "no bo for sa manager\n");
> -		return -EINVAL;
> -	}
> -
> -	/* map the buffer */
> -	r = amdgpu_bo_reserve(sa_manager->bo, false);
> -	if (r) {
> -		dev_err(adev->dev, "(%d) failed to reserve manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_pin(sa_manager->bo, sa_manager->domain, &sa_manager->gpu_addr);
> -	if (r) {
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -		dev_err(adev->dev, "(%d) failed to pin manager bo\n", r);
> -		return r;
> -	}
> -	r = amdgpu_bo_kmap(sa_manager->bo, &sa_manager->cpu_ptr);
>   	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
> -	amdgpu_bo_unreserve(sa_manager->bo);
>   	return r;
>   }
>   
>   int amdgpu_sa_bo_manager_suspend(struct amdgpu_device *adev,
>   				 struct amdgpu_sa_manager *sa_manager)
>   {
> -	int r;
> -
>   	if (sa_manager->bo == NULL) {
>   		dev_err(adev->dev, "no bo for sa manager\n");
>   		return -EINVAL;
>   	}
>   
> -	r = amdgpu_bo_reserve(sa_manager->bo, true);
> -	if (!r) {
> -		amdgpu_bo_kunmap(sa_manager->bo);
> -		amdgpu_bo_unpin(sa_manager->bo);
> -		amdgpu_bo_unreserve(sa_manager->bo);
> -	}
> -	return r;
> +	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
> +	return 0;
>   }
>   
>   static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
       [not found]                 ` <SN1PR12MB04620B6D208218E5EB23824284C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-26 11:29                   ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-26 11:29 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> I cannot recall the details right now ...

Ok, well it is still a rather nice-to-have cleanup.

Just one comment, please keep the amdgpu_sa_bo_manager_fini function 
instead of the amdgpu_sa_bo_manager_suspend function.

We can just return void and we still want the error message in case 
somebody forgot to free something.
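
For illustration only, a minimal sketch (editorial, not code from this series) of
what keeping fini could look like: return void, keep the warning, and free through
amdgpu_bo_free_kernel() as the patch already does:

	void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
				       struct amdgpu_sa_manager *sa_manager)
	{
		struct amdgpu_sa_bo *sa_bo, *tmp;

		if (sa_manager->bo == NULL) {
			dev_err(adev->dev, "no bo for sa manager\n");
			return;
		}

		/* complain, but still clean up, if somebody forgot to free */
		if (!list_empty(&sa_manager->olist)) {
			sa_manager->hole = &sa_manager->olist;
			amdgpu_sa_bo_try_free(sa_manager);
			if (!list_empty(&sa_manager->olist))
				dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
		}
		list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist)
			amdgpu_sa_bo_remove_locked(sa_bo);

		amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr,
				      &sa_manager->cpu_ptr);
		sa_manager->size = 0;
	}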

With that fixed the patch looks good to me,
Christian.

On 26.02.2018 at 11:48, Liu, Monk wrote:
>> Root cause is simple: many engines keep accessing their wptr polling address until they are disabled, so the SA must stay valid from the moment it is created. That means we must use bo_create_kernel to make sure it is already pinned before anyone uses it; otherwise the kmd reload test produces a lot of DMAR read errors.
> Sorry, the description above is wrong; I have too many patches in flight and mixed up their root causes.
>
> This sentence is the correct one: "I judged by comparing the DMAR error page address with the SA MC address, and they share the same PAGE"
>
> I cannot recall the details right now ...
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: 26 February 2018 18:41
> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
>
> In SRIOV I found that the hypervisor keeps reporting DMAR read errors, and finally one of those errors is related to the SA (I judged by comparing the DMAR error page address with the SA MC address, and they share the same PAGE!!)
>
> With this patch applied, one of the DMAR read errors is gone.
>
> Root cause is simple: many engines keep accessing their wptr polling address until they are disabled, so the SA must stay valid from the moment it is created. That means we must use bo_create_kernel to make sure it is already pinned before anyone uses it; otherwise the kmd reload test produces a lot of DMAR read errors.
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 26 February 2018 18:13
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini
>
> On 26.02.2018 at 06:18, Monk Liu wrote:
>> should use bo_create_kernel instead of splitting the creation and
>> pinning of the SA bo into two functions
>>
>> issue:
>> before this patch there are DMAR read errors on the host side when running
>> the SRIOV test, and the DMAR fault address falls within the range of the SA bo.
>>
>> fix:
>> after this cleanup of SA init and fini, the above DMAR error is gone.
> Well, the change looks like a valid cleanup to me, but the explanation of why this should be a problem doesn't look like it makes sense.
>
> Please explain further what this should fix?
>
> Regards,
> Christian.
>
>> Change-Id: I3f299a3342bd7263776bff69e4b31b0d3816749a
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c |  6 ----
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c | 61 +++-------------------------------
>>    2 files changed, 4 insertions(+), 63 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> index 7f2c314..4709d13 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>> @@ -279,11 +279,6 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
>>    		return r;
>>    	}
>>    
>> -	r = amdgpu_sa_bo_manager_start(adev, &adev->ring_tmp_bo);
>> -	if (r) {
>> -		return r;
>> -	}
>> -
>>    	adev->ib_pool_ready = true;
>>    	if (amdgpu_debugfs_sa_init(adev)) {
>>    		dev_err(adev->dev, "failed to register debugfs file for SA\n"); @@
>> -303,7 +298,6 @@ void amdgpu_ib_pool_fini(struct amdgpu_device *adev)
>>    {
>>    	if (adev->ib_pool_ready) {
>>    		amdgpu_sa_bo_manager_suspend(adev, &adev->ring_tmp_bo);
>> -		amdgpu_sa_bo_manager_fini(adev, &adev->ring_tmp_bo);
>>    		adev->ib_pool_ready = false;
>>    	}
>>    }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
>> index 5ca75a4..695c639 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
>> @@ -63,80 +63,27 @@ int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
>>    	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
>>    		INIT_LIST_HEAD(&sa_manager->flist[i]);
>>    
>> -	r = amdgpu_bo_create(adev, size, align, true, domain,
>> -			     0, NULL, NULL, &sa_manager->bo);
>> +	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
>> +				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
>>    	if (r) {
>>    		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
>>    		return r;
>>    	}
>>    
>> -	return r;
>> -}
>> -
>> -void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
>> -			       struct amdgpu_sa_manager *sa_manager)
>> -{
>> -	struct amdgpu_sa_bo *sa_bo, *tmp;
>> -
>> -	if (!list_empty(&sa_manager->olist)) {
>> -		sa_manager->hole = &sa_manager->olist,
>> -		amdgpu_sa_bo_try_free(sa_manager);
>> -		if (!list_empty(&sa_manager->olist)) {
>> -			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
>> -		}
>> -	}
>> -	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
>> -		amdgpu_sa_bo_remove_locked(sa_bo);
>> -	}
>> -	amdgpu_bo_unref(&sa_manager->bo);
>> -	sa_manager->size = 0;
>> -}
>> -
>> -int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
>> -			       struct amdgpu_sa_manager *sa_manager)
>> -{
>> -	int r;
>> -
>> -	if (sa_manager->bo == NULL) {
>> -		dev_err(adev->dev, "no bo for sa manager\n");
>> -		return -EINVAL;
>> -	}
>> -
>> -	/* map the buffer */
>> -	r = amdgpu_bo_reserve(sa_manager->bo, false);
>> -	if (r) {
>> -		dev_err(adev->dev, "(%d) failed to reserve manager bo\n", r);
>> -		return r;
>> -	}
>> -	r = amdgpu_bo_pin(sa_manager->bo, sa_manager->domain, &sa_manager->gpu_addr);
>> -	if (r) {
>> -		amdgpu_bo_unreserve(sa_manager->bo);
>> -		dev_err(adev->dev, "(%d) failed to pin manager bo\n", r);
>> -		return r;
>> -	}
>> -	r = amdgpu_bo_kmap(sa_manager->bo, &sa_manager->cpu_ptr);
>>    	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
>> -	amdgpu_bo_unreserve(sa_manager->bo);
>>    	return r;
>>    }
>>    
>>    int amdgpu_sa_bo_manager_suspend(struct amdgpu_device *adev,
>>    				 struct amdgpu_sa_manager *sa_manager)
>>    {
>> -	int r;
>> -
>>    	if (sa_manager->bo == NULL) {
>>    		dev_err(adev->dev, "no bo for sa manager\n");
>>    		return -EINVAL;
>>    	}
>>    
>> -	r = amdgpu_bo_reserve(sa_manager->bo, true);
>> -	if (!r) {
>> -		amdgpu_bo_kunmap(sa_manager->bo);
>> -		amdgpu_bo_unpin(sa_manager->bo);
>> -		amdgpu_bo_unreserve(sa_manager->bo);
>> -	}
>> -	return r;
>> +	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
>> +	return 0;
>>    }
>>    
>>    static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 19/22] drm/amdgpu: adjust timeout for ib_ring_tests
       [not found]     ` <1519622300-8990-20-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 15:55       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 15:55 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> issue:
> sometimes the GFX/MM IB test hits a timeout under the SRIOV env; the root cause
> is that the engine doesn't come back soon enough, so the current
> IB test is considered timed out.
>
> fix:
> for the SRIOV GFX IB test, the wait time needs to be expanded a lot in
> SRIOV runtime mode, since the test cannot really begin before the GFX engine
> comes back.
>
> the SRIOV MM IB test always needs more time, since MM scheduling does
> not go together with the GFX engine; it is controlled by the h/w MM
> scheduler, so no matter whether in runtime or exclusive mode the MM IB test
> always needs more time.
>
> Change-Id: I0342371bc073656476ad850e1f5d9a021846dc8c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index 4709d13..d6776286 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -316,14 +316,42 @@ int amdgpu_ib_ring_tests(struct amdgpu_device *adev)
>  {
>         unsigned i;
>         int r, ret = 0;
> +       long tmo_gfx, tmo_mm;
> +
> +       tmo_mm = tmo_gfx = AMDGPU_IB_TEST_TIMEOUT;
> +       if (amdgpu_sriov_vf(adev)) {
> +               /* for MM engines in hypervisor side they are not scheduled together
> +                * with CP and SDMA engines, so even in exclusive mode MM engine could
> +                * still running on other VF thus the IB TEST TIMEOUT for MM engines
> +                * under SR-IOV should be set to a long time.
> +                */
> +               tmo_mm = 8 * AMDGPU_IB_TEST_TIMEOUT; /* 8 sec should be enough for the MM comes back to this VF */
> +       }
> +
> +       if (amdgpu_sriov_runtime(adev)) {
> +               /* for CP & SDMA engines since they are scheduled together so
> +                * need to make the timeout width enough to cover the time
> +                * cost waiting for it coming back under RUNTIME only
> +               */
> +               tmo_gfx = 8 * AMDGPU_IB_TEST_TIMEOUT;
> +       }
> +
> +       adev->accel_working = true;

This change seems unrelated.

>
>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                 struct amdgpu_ring *ring = adev->rings[i];
> +               long tmo;
>
>                 if (!ring || !ring->ready)
>                         continue;
>
> -               r = amdgpu_ring_test_ib(ring, AMDGPU_IB_TEST_TIMEOUT);
> +               /* MM engine need more time */
> +               if (ring->idx > 11)

Please check ring type here rather than the idx since the idx may vary
based on the number of IPs on the SOC.

Alex
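
For illustration, a type-based check could look roughly like this (an editorial
sketch; the exact set of ring types that should get the longer MM timeout is an
assumption, not something settled in this thread):

		/* sketch: pick the longer MM timeout by ring type,
		 * not by the ring's position in adev->rings[]
		 */
		if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
		    ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
		    ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
		    ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
		    ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC)
			tmo = tmo_mm;
		else
			tmo = tmo_gfx;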

> +                       tmo = tmo_mm;
> +               else
> +                       tmo = tmo_gfx;
> +
> +               r = amdgpu_ring_test_ib(ring, tmo);
>                 if (r) {
>                         ring->ready = false;
>
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/22] drm/amdgpu: fix&cleanups for wb_clear
       [not found]     ` <1519622300-8990-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:18       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:18 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:17 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> fix:
> should do right shift on wb before clearing
>
> cleanups:
> 1,should memset all wb buffer
> 2,set max wb number to 128 (total 4KB) is big enough
>
> Change-Id: I43832245c875ce039e7709dc049828e21c50c81f
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Acked-by: Alex Deucher <alexander.deucher@amd.com>
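
As an aside for readers following the fix, an annotated sketch of the corrected
free path is below; the claim that amdgpu_device_wb_get() hands out the offset as
slot * 8 (i.e. a dword offset) is an assumption taken from the matching allocation
side, not something stated in this mail:

	void amdgpu_device_wb_free(struct amdgpu_device *adev, u32 wb)
	{
		wb >>= 3;			/* dword offset -> slot index */
		if (wb < adev->wb.num_wb)	/* bounds-check the slot, ... */
			__clear_bit(wb, adev->wb.used);	/* ... then clear it */
	}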

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 10e16d6..5bddfc1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1072,7 +1072,7 @@ static inline void amdgpu_set_ib_value(struct amdgpu_cs_parser *p,
>  /*
>   * Writeback
>   */
> -#define AMDGPU_MAX_WB 512      /* Reserve at most 512 WB slots for amdgpu-owned rings. */
> +#define AMDGPU_MAX_WB 128      /* Reserve at most 128 WB slots for amdgpu-owned rings. */
>
>  struct amdgpu_wb {
>         struct amdgpu_bo        *wb_obj;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 59b69a2..54145ec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -492,7 +492,7 @@ static int amdgpu_device_wb_init(struct amdgpu_device *adev)
>                 memset(&adev->wb.used, 0, sizeof(adev->wb.used));
>
>                 /* clear wb memory */
> -               memset((char *)adev->wb.wb, 0, AMDGPU_MAX_WB * sizeof(uint32_t));
> +               memset((char *)adev->wb.wb, 0, AMDGPU_MAX_WB * sizeof(uint32_t) * 8);
>         }
>
>         return 0;
> @@ -530,8 +530,9 @@ int amdgpu_device_wb_get(struct amdgpu_device *adev, u32 *wb)
>   */
>  void amdgpu_device_wb_free(struct amdgpu_device *adev, u32 wb)
>  {
> +       wb >>= 3;
>         if (wb < adev->wb.num_wb)
> -               __clear_bit(wb >> 3, adev->wb.used);
> +               __clear_bit(wb, adev->wb.used);
>  }
>
>  /**
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 03/22] drm/amdgpu: skip ECC for SRIOV in gmc late_init
       [not found]     ` <1519622300-8990-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:19       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:19 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> Change-Id: I2865a11d1dded214de289787d334ec4a22b5db19
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Acked-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index bc4bd5e..6839d93 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -673,7 +673,7 @@ static int gmc_v9_0_late_init(void *handle)
>         for(i = 0; i < AMDGPU_MAX_VMHUBS; ++i)
>                 BUG_ON(vm_inv_eng[i] > 16);
>
> -       if (adev->asic_type == CHIP_VEGA10) {
> +       if (adev->asic_type == CHIP_VEGA10 && !amdgpu_sriov_vf(adev)) {
>                 r = gmc_v9_0_ecc_available(adev);
>                 if (r == 1) {
>                         DRM_INFO("ECC is active.\n");
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/22] drm/amdgpu: only flush hotplug work without DC
       [not found]     ` <1519622300-8990-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:19       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:19 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> since hotplug_work is only initialized in the case of
> no DC support
>
> Change-Id: I0d417a5b9f8dfb1863eafc95b6802be9e5c74720
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index f6f2a66..11dfe57 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -208,7 +208,8 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
>         r = drm_irq_install(adev->ddev, adev->ddev->pdev->irq);
>         if (r) {
>                 adev->irq.installed = false;
> -               flush_work(&adev->hotplug_work);
> +               if (!amdgpu_device_has_dc_support(adev))
> +                       flush_work(&adev->hotplug_work);
>                 cancel_work_sync(&adev->reset_work);
>                 return r;
>         }
> @@ -234,7 +235,8 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
>                 adev->irq.installed = false;
>                 if (adev->irq.msi_enabled)
>                         pci_disable_msi(adev->pdev);
> -               flush_work(&adev->hotplug_work);
> +               if (!amdgpu_device_has_dc_support(adev))
> +                       flush_work(&adev->hotplug_work);
>                 cancel_work_sync(&adev->reset_work);
>         }
>
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/22] drm/amdgpu: send rel event first after init failed
       [not found]     ` <1519622300-8990-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:24       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:24 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> and later send req_fini and rel_fini to the host as part of the
> finish routine

This looks like it should be two patches:
1. Properly release the gpu in sr-iov in device_ip_init() when it fails
2. Make sure to request gpu in sr-iov in device_init() before calling
device_ip_fini()

Also, please provide better commit messages.

Alex

>
> Change-Id: Ib0347a305ab5f7712d2d76b1a843bb2429acbf3d
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 61696c7..f6380ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1308,7 +1308,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>                 if (r) {
>                         DRM_ERROR("sw_init of IP block <%s> failed %d\n",
>                                   adev->ip_blocks[i].version->funcs->name, r);
> -                       return r;
> +                       goto init_failed;
>                 }
>                 adev->ip_blocks[i].status.sw = true;
>
> @@ -1317,17 +1317,17 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>                         r = amdgpu_device_vram_scratch_init(adev);
>                         if (r) {
>                                 DRM_ERROR("amdgpu_vram_scratch_init failed %d\n", r);
> -                               return r;
> +                               goto init_failed;
>                         }
>                         r = adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
>                         if (r) {
>                                 DRM_ERROR("hw_init %d failed %d\n", i, r);
> -                               return r;
> +                               goto init_failed;
>                         }
>                         r = amdgpu_device_wb_init(adev);
>                         if (r) {
>                                 DRM_ERROR("amdgpu_device_wb_init failed %d\n", r);
> -                               return r;
> +                               goto init_failed;
>                         }
>                         adev->ip_blocks[i].status.hw = true;
>
> @@ -1336,7 +1336,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>                                 r = amdgpu_allocate_static_csa(adev);
>                                 if (r) {
>                                         DRM_ERROR("allocate CSA failed %d\n", r);
> -                                       return r;
> +                                       goto init_failed;
>                                 }
>                         }
>                 }
> @@ -1351,17 +1351,17 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>                 if (r) {
>                         DRM_ERROR("hw_init of IP block <%s> failed %d\n",
>                                   adev->ip_blocks[i].version->funcs->name, r);
> -                       return r;
> +                       goto init_failed;
>                 }
>                 adev->ip_blocks[i].status.hw = true;
>         }
>
>         amdgpu_amdkfd_device_init(adev);
> -
> +init_failed:
>         if (amdgpu_sriov_vf(adev))
>                 amdgpu_virt_release_full_gpu(adev, true);
>
> -       return 0;
> +       return r;
>  }
>
>  static void amdgpu_device_fill_reset_magic(struct amdgpu_device *adev)
> @@ -1978,7 +1978,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>                 }
>                 dev_err(adev->dev, "amdgpu_device_ip_init failed\n");
>                 amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
> -               amdgpu_device_ip_fini(adev);
> +               if (!amdgpu_virt_request_full_gpu(adev, false))
> +                       amdgpu_device_ip_fini(adev);
>                 goto failed;
>         }
>
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver
       [not found]     ` <1519622300-8990-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:27       ` Alex Deucher
       [not found]         ` <CADnq5_M9z3T0ihL9cDVbgEjaKp0AM=Utvs_J-=q7szT2NXtvrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-27  5:52       ` Liu, Monk
  1 sibling, 1 reply; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:27 UTC (permalink / raw)
  To: Monk Liu; +Cc: Emily Deng, amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> From: Emily Deng <Emily.Deng@amd.com>
>
> Need to call amdgpu_ucode_fini_bo to release the ucode bo for the
> psp firmware load type.

Are you sure this is right?  I think this is handled in amdgpu_psp.c already.

Alex

>
> Change-Id: I1c7be8135993e11076c9d46b3cd87615514a9abb
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 69fb5e50..61696c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1490,6 +1490,7 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>
>         /* disable all interrupts */
>         amdgpu_irq_disable_all(adev);
> +       amdgpu_ucode_fini_bo(adev);
>
>         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>                 if (!adev->ip_blocks[i].status.sw)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/22] drm/amdgpu: change gfx9 ib test to use WB
       [not found]     ` <1519622300-8990-15-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:30       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:30 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> two reasons to switch from the SCRATCH reg method to the WB method:
>
> 1) When doing the IB test we don't want the KIQ health status to
> affect the result; since SCRATCH register access goes through KIQ,
> the GFX IB test would fail whenever KIQ fails.
>
> 2) Accessing a SCRATCH register costs much more time than the WB method,
> because SCRATCH register access runs through KIQ, which at the earliest can
> begin after the GPU world-switches back to the current guest VF.
>
> Change-Id: Iac7ef394c1b3aef9f9eca0ea4cb0f889227801d5
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Acked-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 107 ++++++++++++++++++----------------
>  1 file changed, 57 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 848008e..e9cc03e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -271,58 +271,65 @@ static int gfx_v9_0_ring_test_ring(struct amdgpu_ring *ring)
>
>  static int gfx_v9_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>  {
> -        struct amdgpu_device *adev = ring->adev;
> -        struct amdgpu_ib ib;
> -        struct dma_fence *f = NULL;
> -        uint32_t scratch;
> -        uint32_t tmp = 0;
> -        long r;
> -
> -        r = amdgpu_gfx_scratch_get(adev, &scratch);
> -        if (r) {
> -                DRM_ERROR("amdgpu: failed to get scratch reg (%ld).\n", r);
> -                return r;
> -        }
> -        WREG32(scratch, 0xCAFEDEAD);
> -        memset(&ib, 0, sizeof(ib));
> -        r = amdgpu_ib_get(adev, NULL, 256, &ib);
> -        if (r) {
> -                DRM_ERROR("amdgpu: failed to get ib (%ld).\n", r);
> -                goto err1;
> -        }
> -        ib.ptr[0] = PACKET3(PACKET3_SET_UCONFIG_REG, 1);
> -        ib.ptr[1] = ((scratch - PACKET3_SET_UCONFIG_REG_START));
> -        ib.ptr[2] = 0xDEADBEEF;
> -        ib.length_dw = 3;
> -
> -        r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);
> -        if (r)
> -                goto err2;
> -
> -        r = dma_fence_wait_timeout(f, false, timeout);
> -        if (r == 0) {
> -                DRM_ERROR("amdgpu: IB test timed out.\n");
> -                r = -ETIMEDOUT;
> -                goto err2;
> -        } else if (r < 0) {
> -                DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
> -                goto err2;
> -        }
> -        tmp = RREG32(scratch);
> -        if (tmp == 0xDEADBEEF) {
> -                DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
> -                r = 0;
> -        } else {
> -                DRM_ERROR("amdgpu: ib test failed (scratch(0x%04X)=0x%08X)\n",
> -                          scratch, tmp);
> -                r = -EINVAL;
> -        }
> +       struct amdgpu_device *adev = ring->adev;
> +       struct amdgpu_ib ib;
> +       struct dma_fence *f = NULL;
> +
> +       unsigned index;
> +       uint64_t gpu_addr;
> +       uint32_t tmp;
> +       long r;
> +
> +       r = amdgpu_device_wb_get(adev, &index);
> +       if (r) {
> +               dev_err(adev->dev, "(%ld) failed to allocate wb slot\n", r);
> +               return r;
> +       }
> +
> +       gpu_addr = adev->wb.gpu_addr + (index * 4);
> +       adev->wb.wb[index] = cpu_to_le32(0xCAFEDEAD);
> +       memset(&ib, 0, sizeof(ib));
> +       r = amdgpu_ib_get(adev, NULL, 16, &ib);
> +       if (r) {
> +               DRM_ERROR("amdgpu: failed to get ib (%ld).\n", r);
> +               goto err1;
> +       }
> +       ib.ptr[0] = PACKET3(PACKET3_WRITE_DATA, 3);
> +       ib.ptr[1] = WRITE_DATA_DST_SEL(5) | WR_CONFIRM;
> +       ib.ptr[2] = lower_32_bits(gpu_addr);
> +       ib.ptr[3] = upper_32_bits(gpu_addr);
> +       ib.ptr[4] = 0xDEADBEEF;
> +       ib.length_dw = 5;
> +
> +       r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);
> +       if (r)
> +               goto err2;
> +
> +       r = dma_fence_wait_timeout(f, false, timeout);
> +       if (r == 0) {
> +                       DRM_ERROR("amdgpu: IB test timed out.\n");
> +                       r = -ETIMEDOUT;
> +                       goto err2;
> +       } else if (r < 0) {
> +                       DRM_ERROR("amdgpu: fence wait failed (%ld).\n", r);
> +                       goto err2;
> +       }
> +
> +       tmp = adev->wb.wb[index];
> +       if (tmp == 0xDEADBEEF) {
> +                       DRM_DEBUG("ib test on ring %d succeeded\n", ring->idx);
> +                       r = 0;
> +       } else {
> +                       DRM_ERROR("ib test on ring %d failed\n", ring->idx);
> +                       r = -EINVAL;
> +       }
> +
>  err2:
> -        amdgpu_ib_free(adev, &ib, NULL);
> -        dma_fence_put(f);
> +       amdgpu_ib_free(adev, &ib, NULL);
> +       dma_fence_put(f);
>  err1:
> -        amdgpu_gfx_scratch_free(adev, scratch);
> -        return r;
> +       amdgpu_device_wb_free(adev, index);
> +       return r;
>  }
>
>
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 15/22] drm/amdgpu: move WB_FREE to correct place
       [not found]     ` <1519622300-8990-16-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:36       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:36 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> WB_FREE should be done after all engines' hw_fini is
> complete, otherwise the invalid wptr/rptr_addr would still
> be used by the engines, which triggers abnormal bugs.
>
> This fixes a couple of DMAR read errors on the host side for SRIOV
> after the guest kmd is unloaded.
>
> Change-Id: If721cfd06d8c3113929306378793713fd05fc929
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f6380ed..730ff97 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1460,11 +1460,6 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>                 if (!adev->ip_blocks[i].status.hw)
>                         continue;
> -               if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
> -                       amdgpu_free_static_csa(adev);
> -                       amdgpu_device_wb_fini(adev);
> -                       amdgpu_device_vram_scratch_fini(adev);
> -               }
>
>                 if (adev->ip_blocks[i].version->type != AMD_IP_BLOCK_TYPE_UVD &&
>                         adev->ip_blocks[i].version->type != AMD_IP_BLOCK_TYPE_VCE) {
> @@ -1495,6 +1490,13 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>                 if (!adev->ip_blocks[i].status.sw)
>                         continue;
> +
> +               if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
> +                       amdgpu_free_static_csa(adev);
> +                       amdgpu_device_wb_fini(adev);
> +                       amdgpu_device_vram_scratch_fini(adev);
> +               }
> +
>                 r = adev->ip_blocks[i].version->funcs->sw_fini((void *)adev);
>                 /* XXX handle errors */
>                 if (r) {
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini
       [not found]     ` <1519622300-8990-18-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:39       ` Alex Deucher
       [not found]         ` <CADnq5_M1BkpZzNURswAh1fTpm1rZbspkZ4qq+g2pbdVjWyR2eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:39 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> otherwise DMAR read errors come out from the CP, since
> GFX is still alive and CPC's WPTR_POLL is still enabled, which
> leads to DMAR read errors.
>
> fix:
> we can halt CPG after hw_fini, but cannot halt CPC because KIQ
> still needs to stay alive for RLCV to invoke; its WPTR_POLL, however,
> can be disabled.
>
> Change-Id: Ia60ee54901531f737d09063bf2037630e7c94771
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Is this handled properly for bare metal as well?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index e9cc03e..793db9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2961,7 +2961,8 @@ static int gfx_v9_0_hw_fini(void *handle)
>                 gfx_v9_0_kcq_disable(&adev->gfx.kiq.ring, &adev->gfx.compute_ring[i]);
>
>         if (amdgpu_sriov_vf(adev)) {
> -               pr_debug("For SRIOV client, shouldn't do anything.\n");
> +               gfx_v9_0_cp_gfx_enable(adev, false);
> +               WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>                 return 0;
>         }
>         gfx_v9_0_cp_enable(adev, false);
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 20/22] drm/amdgpu: increase gart size to 512MB
       [not found]     ` <1519622300-8990-21-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 17:45       ` Alex Deucher
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Deucher @ 2018-02-26 17:45 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> 256MB is too small considering the PTE/PDE shadow copies and TTM
> eviction activity
>
> Change-Id: Ifaa30dc730eec36af47fbdeb3cce30de9067b17f
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Acked-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 6839d93..e925121 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -789,7 +789,7 @@ static int gmc_v9_0_mc_init(struct amdgpu_device *adev)
>                 switch (adev->asic_type) {
>                 case CHIP_VEGA10:  /* all engines support GPUVM */
>                 default:
> -                       adev->gmc.gart_size = 256ULL << 20;
> +                       adev->gmc.gart_size = 512ULL << 20;
>                         break;
>                 case CHIP_RAVEN:   /* DCE SG support */
>                         adev->gmc.gart_size = 1024ULL << 20;
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]         ` <a89aeb3e-1687-e7e6-bef4-9371ecb13a70-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-27  4:45           ` Liu, Monk
       [not found]             ` <BLUPR12MB044911AD4E7442812B7D589084C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  4:45 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> In this case I think it would be much better to wait for the idle work before trying to unload the driver.

I already did it:
> +	if (!amdgpu_sriov_vf(adev))
> +		while (cancel_delayed_work_sync(&adev->late_init_work))
> +			schedule(); /* to make sure late_init_work really stopped */

What do you mean by never calling "schedule()" this way?
Please show me how to achieve the same goal and I will modify my patch.

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 26 February 2018 18:08
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal

On 26.02.2018 at 06:18, Monk Liu wrote:
> issue:
> on bare-metal, when doing the kmd reload test, there is a chance that the kernel
> hits a fatal error after the driver is unloaded/reloaded
>
> fix:
> the cause is that those "idle work" items are not really stopped, and if the kmd
> is unloaded too quickly there is a chance that an "idle work" item runs after the
> driver structures are already released, which introduces this issue.

In this case I think it would be much better to wait for the idle work before trying to unload the driver.

>
> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>   3 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 54145ec..69fb5e50 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>   		}
>   	}
>   
> -	mod_delayed_work(system_wq, &adev->late_init_work,
> +	if (!amdgpu_sriov_vf(adev))
> +		mod_delayed_work(system_wq, &adev->late_init_work,
>   			msecs_to_jiffies(AMDGPU_RESUME_MS));
>   
>   	amdgpu_device_fill_reset_magic(adev);
> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>   		adev->firmware.gpu_info_fw = NULL;
>   	}
>   	adev->accel_working = false;
> -	cancel_delayed_work_sync(&adev->late_init_work);
> +
> +	if (!amdgpu_sriov_vf(adev))
> +		while (cancel_delayed_work_sync(&adev->late_init_work))
> +			schedule(); /* to make sure late_init_work really stopped */

Never use schedule() like that.

Regards,
Christian.

> +
>   	/* free i2c buses */
>   	if (!amdgpu_device_has_dc_support(adev))
>   		amdgpu_i2c_fini(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index caba610..337db57 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>   		return 0;
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
> +			schedule(); /* to make sure idle work really stopped */
>   
>   	for (i = 0; i < adev->uvd.max_handles; ++i)
>   		if (atomic_read(&adev->uvd.handles[i]))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index a829350..2874fda 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>   		return 0;
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		cancel_delayed_work_sync(&adev->vce.idle_work);
> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
> +			schedule(); /* to make sure the idle_work really stopped */
> +
>   	/* TODO: suspending running encoding sessions isn't supported */
>   	return -EINVAL;
>   }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV
       [not found]         ` <994af188-1283-44ea-f868-73f4002a7cb3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-27  5:26           ` Liu, Monk
       [not found]             ` <BLUPR12MB04493445BAD30EA2B4F14A8184C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:26 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> I would rather avoid calling the function in the first place.

I already did that in patch 08, and you also rejected that patch.

So I'll consider patch 08 still valid, and drop this one.
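
For context, "avoid calling the function in the first place" presumably means never
arming the MM idle work under SR-IOV, roughly along these lines (an editorial sketch;
the hook name and timeout macro are taken from the surrounding driver code and may
not match patch 08 exactly):

	static void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
	{
		/* assumption: under SR-IOV the idle/DPM handling is done on
		 * the host side, so the guest never schedules the idle work
		 * and the handler is never entered at all
		 */
		if (!amdgpu_sriov_vf(ring->adev))
			schedule_delayed_work(&ring->adev->uvd.idle_work,
					      UVD_IDLE_TIMEOUT);
	}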

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 26 February 2018 18:09
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV

I would rather avoid calling the function in the first place.

Christian.

On 26.02.2018 at 06:18, Monk Liu wrote:
> Change-Id: I370966acd0f1925a99dfde888678e6e0fd093b15
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 337db57..5fb4357 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -1117,11 +1117,13 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
>   {
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, uvd.idle_work.work);
> -	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
> +	unsigned fences;
>   
>   	if (amdgpu_sriov_vf(adev))
>   		BUG();
>   
> +	fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
> +
>   	if (fences == 0) {
>   		if (adev->pm.dpm_enabled) {
>   			amdgpu_dpm_enable_uvd(adev, false);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver
       [not found]         ` <CADnq5_M9z3T0ihL9cDVbgEjaKp0AM=Utvs_J-=q7szT2NXtvrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-27  5:27           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449DF1E4689A22636F01AB484C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:27 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deng, Emily, amd-gfx list

I'm sure it is right; the kmemleak is fixed by this patch.

-----Original Message-----
From: Alex Deucher [mailto:alexdeucher@gmail.com] 
Sent: 27 February 2018 1:28
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>; Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> From: Emily Deng <Emily.Deng@amd.com>
>
> Need to call amdgpu_ucode_fini_bo to release the ucode bo for the psp
> firmware load type.

Are you sure this is right?  I think this is handled in amdgpu_psp.c already.

Alex

>
> Change-Id: I1c7be8135993e11076c9d46b3cd87615514a9abb
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 69fb5e50..61696c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1490,6 +1490,7 @@ static int amdgpu_device_ip_fini(struct 
> amdgpu_device *adev)
>
>         /* disable all interrupts */
>         amdgpu_irq_disable_all(adev);
> +       amdgpu_ucode_fini_bo(adev);
>
>         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>                 if (!adev->ip_blocks[i].status.sw)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini
       [not found]         ` <CADnq5_M1BkpZzNURswAh1fTpm1rZbspkZ4qq+g2pbdVjWyR2eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-27  5:31           ` Liu, Monk
       [not found]             ` <BLUPR12MB044966DA7194FE030265D15D84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:31 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx list

This is only for virtualization; bare-metal goes through another path.

-----Original Message-----
From: Alex Deucher [mailto:alexdeucher@gmail.com] 
Sent: 27 February 2018 1:40
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> otherwise DMAR read errors come out from the CP, since GFX
> is still alive and CPC's WPTR_POLL is still enabled, which
> leads to DMAR read errors.
>
> fix:
> we can halt CPG after hw_fini, but cannot halt CPC because KIQ
> still needs to stay alive for RLCV to invoke; its WPTR_POLL, however,
> can be disabled.
>
> Change-Id: Ia60ee54901531f737d09063bf2037630e7c94771
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Is this handled properly for bare metal as well?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index e9cc03e..793db9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2961,7 +2961,8 @@ static int gfx_v9_0_hw_fini(void *handle)
>                 gfx_v9_0_kcq_disable(&adev->gfx.kiq.ring, 
> &adev->gfx.compute_ring[i]);
>
>         if (amdgpu_sriov_vf(adev)) {
> -               pr_debug("For SRIOV client, shouldn't do anything.\n");
> +               gfx_v9_0_cp_gfx_enable(adev, false);
> +               WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>                 return 0;
>         }
>         gfx_v9_0_cp_enable(adev, false);
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr
       [not found]         ` <b9826bed-1b38-cbee-536a-1bd9d50dca6b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-27  5:35           ` Liu, Monk
  0 siblings, 0 replies; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:35 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Deng, Emily

Yeah, agreed 

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 26 February 2018 18:22
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr

On 26.02.2018 at 06:18, Monk Liu wrote:
> From: Emily Deng <Emily.Deng@amd.com>
>
> the original method will change the wptr value in wb.
>
> Change-Id: I984fabca35d9dcf1f5fa8ef7779b2afb7f7d7370
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c 
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index 3d5385d..8c6e408 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -240,13 +240,14 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
>   	struct amdgpu_device *adev = ring->adev;
>   	u64 *wptr = NULL;
>   	uint64_t local_wptr = 0;
> -
> +	u64 tmp = 0;

Please keep an empty line between the declaration and the code.

>   	if (ring->use_doorbell) {
>   		/* XXX check if swapping is necessary on BE */
>   		wptr = ((u64 *)&adev->wb.wb[ring->wptr_offs]);
>   		DRM_DEBUG("wptr/doorbell before shift == 0x%016llx\n", *wptr);
> -		*wptr = (*wptr) >> 2;
> -		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", *wptr);
> +		tmp = (*wptr) >> 2;
> +		DRM_DEBUG("wptr/doorbell after shift == 0x%016llx\n", tmp);
> +		return tmp;

Well, I completely agree that this code is absolute nonsense and could result in completely random hardware behavior.

But I think we need further cleanup than what you already did. How about the following instead?

> static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring) {
>         struct amdgpu_device *adev = ring->adev;
>         uint64_t wptr;
>
>         if (ring->use_doorbell) {
>                 /* XXX check if swapping is necessary on BE */
>                 wptr = READ_ONCE(*((u64 
> *)&adev->wb.wb[ring->wptr_offs]));
>                 DRM_DEBUG("wptr/doorbell == 0x%016llx\n", wptr);
>         } else {
>                 int me = (ring == &adev->sdma.instance[0].ring) ? 0 : 
> 1;
>                 u32 lowbit, highbit;
>
>                 lowbit = RREG32(sdma_v4_0_get_reg_offset(adev, me, 
> mmSDMA0_GFX_RB_WPTR));
>                 highbit = RREG32(sdma_v4_0_get_reg_offset(adev, me, 
> mmSDMA0_GFX_RB_WPTR_HI));
>
>                 DRM_DEBUG("wptr [%i]high== 0x%08x low==0x%08x\n",
>                                 me, highbit, lowbit);
>                 wptr = highbit;
>                 wptr = wptr << 32;
>                 wptr |= lowbit;
>         }
>
>         return wptr >> 2;
> }

Good catch,
Christian.

>   	} else {
>   		u32 lowbit, highbit;
>   		int me = (ring == &adev->sdma.instance[0].ring) ? 0 : 1; @@ -260,9 
> +261,8 @@ static uint64_t sdma_v4_0_ring_get_wptr(struct amdgpu_ring *ring)
>   		*wptr = highbit;
>   		*wptr = (*wptr) << 32;
>   		*wptr |= lowbit;
> +		return *wptr;
>   	}
> -
> -	return *wptr;
>   }
>   
>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver
       [not found]     ` <1519622300-8990-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26 17:27       ` Alex Deucher
@ 2018-02-27  5:52       ` Liu, Monk
  1 sibling, 0 replies; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:52 UTC (permalink / raw)
  To: Deucher, Alexander, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Oh, Alex

This patch was verified on an older branch. I checked the latest staging and PSP already includes ucode_fini_bo, so I'll drop it right now and verify this memleak later.

/Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Monk Liu
Sent: 2018年2月26日 13:18
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>
Subject: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver

From: Emily Deng <Emily.Deng@amd.com>

Need to call amdgpu_ucode_fini_bo to release the ucode bo for the psp firmware load type.

Change-Id: I1c7be8135993e11076c9d46b3cd87615514a9abb
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 69fb5e50..61696c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1490,6 +1490,7 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 
 	/* disable all interrupts */
 	amdgpu_irq_disable_all(adev);
+	amdgpu_ucode_fini_bo(adev);
 
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.sw)
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* RE: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver
       [not found]             ` <BLUPR12MB0449DF1E4689A22636F01AB484C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27  5:53               ` Liu, Monk
  0 siblings, 0 replies; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  5:53 UTC (permalink / raw)
  To: Liu, Monk, Alex Deucher; +Cc: Deng, Emily, amd-gfx list

Hi Alex
This patch applies to another branch, and staging doesn't look like it needs it. My bad.

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 27 February 2018 13:27
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Deng, Emily <Emily.Deng@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver

I'm sure it is right, the kmemleak is fixed by this patch

-----Original Message-----
From: Alex Deucher [mailto:alexdeucher@gmail.com]
Sent: 2018年2月27日 1:28
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx list <amd-gfx@lists.freedesktop.org>; Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> From: Emily Deng <Emily.Deng@amd.com>
>
> Need to call amdgpu_ucode_fini_bo to release the ucode bo for the psp
> firmware load type.

Are you sure this is right?  I think this is handled in amdgpu_psp.c already.

Alex

>
> Change-Id: I1c7be8135993e11076c9d46b3cd87615514a9abb
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 69fb5e50..61696c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1490,6 +1490,7 @@ static int amdgpu_device_ip_fini(struct 
> amdgpu_device *adev)
>
>         /* disable all interrupts */
>         amdgpu_irq_disable_all(adev);
> +       amdgpu_ucode_fini_bo(adev);
>
>         for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>                 if (!adev->ip_blocks[i].status.sw)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]             ` <BLUPR12MB044911AD4E7442812B7D589084C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27  8:42               ` Christian König
       [not found]                 ` <1b9d2cbf-1ba5-1be6-11ef-4723e08e9699-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-27  8:42 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 27.02.2018 at 05:45, Liu, Monk wrote:
>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
> I already did it:
>> +	if (!amdgpu_sriov_vf(adev))
>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>> +			schedule(); /* to make sure late_init_work really stopped */
> What do you mean never call "schedule()" this way ?

Never use schedule() in a while loop as long as you don't work on core 
locking primitives or interrupt handling or stuff like that. In this 
particular case a single call to cancel_delayed_work_sync() should be 
sufficient.

Additional to that I think you can drop the "if 
(!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something 
like that. In other words if there is an error we need to cancel the 
late init work independent if we are on SRIOV or bare metal.

Regards,
Christian.
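
A minimal sketch of what that suggestion could look like, for illustration
only (the _sketch suffix marks it as hypothetical; this is not the change
that was eventually committed):

	/* Hypothetical sketch, relying on the amdgpu.h declarations used in
	 * the quoted patch: cancel the late init work once, unconditionally.
	 */
	static void amdgpu_device_fini_sketch(struct amdgpu_device *adev)
	{
		adev->accel_working = false;

		/* Dequeues the work if it is pending and waits for a running
		 * instance to finish, so a single call is enough and no
		 * SRIOV/bare-metal distinction is needed here.
		 */
		cancel_delayed_work_sync(&adev->late_init_work);

		/* ... remainder of the amdgpu_device_fini() teardown ... */
	}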

> Please show me how to do it to achieve the same goal and I can modify my patch
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: February 26, 2018 18:08
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
>
> On 26.02.2018 at 06:18, Monk Liu wrote:
>> issue:
>> on bare-metal when doing kmd reload test, there are chance that kernel
>> hit fatal error afer driver unloaded/reloaded
>>
>> fix:
>> the cause is that those "idle work" not really stopped and if kmd was
>> is unloaded too quick that were chance that "idle work" run after
>> driver structures already released which introduces this issue.
> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>
>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>    3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 54145ec..69fb5e50 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>    		}
>>    	}
>>    
>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>> +	if (!amdgpu_sriov_vf(adev))
>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>    			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>    
>>    	amdgpu_device_fill_reset_magic(adev);
>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>    		adev->firmware.gpu_info_fw = NULL;
>>    	}
>>    	adev->accel_working = false;
>> -	cancel_delayed_work_sync(&adev->late_init_work);
>> +
>> +	if (!amdgpu_sriov_vf(adev))
>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>> +			schedule(); /* to make sure late_init_work really stopped */
> Never use schedule() like that.
>
> Regards,
> Christian.
>
>> +
>>    	/* free i2c buses */
>>    	if (!amdgpu_device_has_dc_support(adev))
>>    		amdgpu_i2c_fini(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> index caba610..337db57 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>    		return 0;
>>    
>>    	if (!amdgpu_sriov_vf(adev))
>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>> +			schedule(); /* to make sure idle work really stopped */
>>    
>>    	for (i = 0; i < adev->uvd.max_handles; ++i)
>>    		if (atomic_read(&adev->uvd.handles[i]))
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> index a829350..2874fda 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>    		return 0;
>>    
>>    	if (!amdgpu_sriov_vf(adev))
>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>> +			schedule(); /* to make sure the idle_work really stopped */
>> +
>>    	/* TODO: suspending running encoding sessions isn't supported */
>>    	return -EINVAL;
>>    }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV
       [not found]             ` <BLUPR12MB04493445BAD30EA2B4F14A8184C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27  8:47               ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-27  8:47 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 27.02.2018 at 06:26, Liu, Monk wrote:
>> I would rather avoid calling the function in the first place.
> I already did it in patch 08, and you also rejected this patch
>
> So I'll consider patch 08 still valid, and drop this one

Well, a good part of patch 08 is still valid. I only rejected waiting for
the create message alone, because that is known to cause problems.

But patch 08 is about VCE and this change is about UVD. So you should 
probably go into amdgpu_uvd_ring_end_use() and disable the 
schedule_delayed_work() under SRIOV.

Regards,
Christian.
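
A rough sketch of that direction, for illustration only: the actual body of
amdgpu_uvd_ring_end_use() is not shown in this thread, so everything below
except the function name and the SRIOV check is an assumption (including
the UVD_IDLE_TIMEOUT name).

	/* Sketch: never queue the UVD idle work under SRIOV, so the idle
	 * handler cannot run there in the first place.
	 */
	void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
	{
		if (!amdgpu_sriov_vf(ring->adev))
			schedule_delayed_work(&ring->adev->uvd.idle_work,
					      UVD_IDLE_TIMEOUT);
	}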

>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: February 26, 2018 18:09
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV
>
> I would rather avoid calling the function in the first place.
>
> Christian.
>
> On 26.02.2018 at 06:18, Monk Liu wrote:
>> Change-Id: I370966acd0f1925a99dfde888678e6e0fd093b15
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 4 +++-
>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> index 337db57..5fb4357 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> @@ -1117,11 +1117,13 @@ static void amdgpu_uvd_idle_work_handler(struct work_struct *work)
>>    {
>>    	struct amdgpu_device *adev =
>>    		container_of(work, struct amdgpu_device, uvd.idle_work.work);
>> -	unsigned fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
>> +	unsigned fences;
>>    
>>    	if (amdgpu_sriov_vf(adev))
>>    		BUG();
>>    
>> +	fences = amdgpu_fence_count_emitted(&adev->uvd.ring);
>> +
>>    	if (fences == 0) {
>>    		if (adev->pm.dpm_enabled) {
>>    			amdgpu_dpm_enable_uvd(adev, false);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]                 ` <1b9d2cbf-1ba5-1be6-11ef-4723e08e9699-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-27  9:09                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB0449EE0E7CD1DAE5C60290C684C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27  9:09 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.

I thought that a single call to cancel_delayed_work_sync() cannot guarantee that the work has finished by the time the function returns, so if the work is still pending (it returns TRUE) and the caller just continues and releases the driver, we would run into a page fault in that work, since all the structure pointers it accesses are invalid by then.

That's why I used a while() loop to keep calling cancel_delayed_work_sync() until the work is no longer pending.

Is my understanding wrong ?...
/Monk
-----Original Message-----
From: Koenig, Christian 
Sent: February 27, 2018 16:42
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal

On 27.02.2018 at 05:45, Liu, Monk wrote:
>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
> I already did it:
>> +	if (!amdgpu_sriov_vf(adev))
>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>> +			schedule(); /* to make sure late_init_work really stopped */
> What do you mean never call "schedule()" this way ?

Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.

Additional to that I think you can drop the "if (!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something like that. In other words if there is an error we need to cancel the late init work independent if we are on SRIOV or bare metal.

Regards,
Christian.

> Please show me how to do it to achieve the same goal and I can modify 
> my patch
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: February 26, 2018 18:08
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on 
> bare-metal
>
> On 26.02.2018 at 06:18, Monk Liu wrote:
>> issue:
>> on bare-metal when doing kmd reload test, there are chance that 
>> kernel hit fatal error afer driver unloaded/reloaded
>>
>> fix:
>> the cause is that those "idle work" not really stopped and if kmd was 
>> is unloaded too quick that were chance that "idle work" run after 
>> driver structures already released which introduces this issue.
> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>
>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>    3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 54145ec..69fb5e50 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>    		}
>>    	}
>>    
>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>> +	if (!amdgpu_sriov_vf(adev))
>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>    			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>    
>>    	amdgpu_device_fill_reset_magic(adev);
>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>    		adev->firmware.gpu_info_fw = NULL;
>>    	}
>>    	adev->accel_working = false;
>> -	cancel_delayed_work_sync(&adev->late_init_work);
>> +
>> +	if (!amdgpu_sriov_vf(adev))
>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>> +			schedule(); /* to make sure late_init_work really stopped */
> Never use schedule() like that.
>
> Regards,
> Christian.
>
>> +
>>    	/* free i2c buses */
>>    	if (!amdgpu_device_has_dc_support(adev))
>>    		amdgpu_i2c_fini(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> index caba610..337db57 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>    		return 0;
>>    
>>    	if (!amdgpu_sriov_vf(adev))
>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>> +			schedule(); /* to make sure idle work really stopped */
>>    
>>    	for (i = 0; i < adev->uvd.max_handles; ++i)
>>    		if (atomic_read(&adev->uvd.handles[i]))
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> index a829350..2874fda 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>    		return 0;
>>    
>>    	if (!amdgpu_sriov_vf(adev))
>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>> +			schedule(); /* to make sure the idle_work really stopped */
>> +
>>    	/* TODO: suspending running encoding sessions isn't supported */
>>    	return -EINVAL;
>>    }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]                     ` <BLUPR12MB0449EE0E7CD1DAE5C60290C684C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27  9:52                       ` Christian König
       [not found]                         ` <07b98381-cbea-7352-7cf3-d69020da6395-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Christian König @ 2018-02-27  9:52 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Is my understanding wrong ?...
The first thing that cancel_delayed_work_sync() does is it takes the 
work item from the TODO list.

Then it makes sure that the work is currently not running on another CPU.

The return value just indicates whether the work was still on the TODO list
or not: false is returned when the work had already finished before the call
was made, true otherwise.

So the while loop isn't necessary here.

Regards,
Christian.
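
For illustration, a minimal generic sketch of the semantics described above
(not amdgpu code, just the plain workqueue API):

	#include <linux/workqueue.h>

	static void teardown_sketch(struct delayed_work *dwork)
	{
		bool was_queued;

		/* Dequeues the work if it is still pending and waits until any
		 * already-running instance has finished.  After this returns,
		 * the handler will not run again unless it is rescheduled.
		 */
		was_queued = cancel_delayed_work_sync(dwork);

		/* The return value only reports whether the work was still
		 * queued (true) or had already run / was idle (false); either
		 * way it is now safe to free whatever the handler uses, so no
		 * retry loop around the call is needed.
		 */
		(void)was_queued;
	}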

On 27.02.2018 at 10:09, Liu, Monk wrote:
>> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
> I thought that only one call on cancel_delayed_work_sync() cannot guarantee that after this function returned, the work had been finished, so if the work is still pending (return you TRUE) there
> And the caller just continue and release driver, we would run to page fault in that work since all structures pointers accessed in that work is invalid now
>
> That's why I use a while() loop to keep calling cancel_delayed_work_sync() till the work is not pending any more
>
> Is my understanding wrong ?...
> /Monk
> -----Original Message-----
> From: Koenig, Christian
> Sent: February 27, 2018 16:42
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
>
> On 27.02.2018 at 05:45, Liu, Monk wrote:
>>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>> I already did it:
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> What do you mean never call "schedule()" this way ?
> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
>
> Additional to that I think you can drop the "if (!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something like that. In other words if there is an error we need to cancel the late init work independent if we are on SRIOV or bare metal.
>
> Regards,
> Christian.
>
>> Please show me how to do it to achieve the same goal and I can modify
>> my patch
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: February 26, 2018 18:08
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on
>> bare-metal
>>
>> On 26.02.2018 at 06:18, Monk Liu wrote:
>>> issue:
>>> on bare-metal when doing kmd reload test, there are chance that
>>> kernel hit fatal error afer driver unloaded/reloaded
>>>
>>> fix:
>>> the cause is that those "idle work" not really stopped and if kmd was
>>> is unloaded too quick that were chance that "idle work" run after
>>> driver structures already released which introduces this issue.
>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>>
>>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>>     3 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 54145ec..69fb5e50 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>>     		}
>>>     	}
>>>     
>>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>>     			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>>     
>>>     	amdgpu_device_fill_reset_magic(adev);
>>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>>     		adev->firmware.gpu_info_fw = NULL;
>>>     	}
>>>     	adev->accel_working = false;
>>> -	cancel_delayed_work_sync(&adev->late_init_work);
>>> +
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> Never use schedule() like that.
>>
>> Regards,
>> Christian.
>>
>>> +
>>>     	/* free i2c buses */
>>>     	if (!amdgpu_device_has_dc_support(adev))
>>>     		amdgpu_i2c_fini(adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> index caba610..337db57 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>>> +			schedule(); /* to make sure idle work really stopped */
>>>     
>>>     	for (i = 0; i < adev->uvd.max_handles; ++i)
>>>     		if (atomic_read(&adev->uvd.handles[i]))
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> index a829350..2874fda 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>>> +			schedule(); /* to make sure the idle_work really stopped */
>>> +
>>>     	/* TODO: suspending running encoding sessions isn't supported */
>>>     	return -EINVAL;
>>>     }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]                         ` <07b98381-cbea-7352-7cf3-d69020da6395-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-27 10:19                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB0449F97B88BE9D312DD5479784C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 58+ messages in thread
From: Liu, Monk @ 2018-02-27 10:19 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

So returning true only means the work was still on the TODO list, but the work will *never* get executed after this cancel_delayed_work_sync() call returned, right?

-----Original Message-----
From: Koenig, Christian 
Sent: February 27, 2018 17:53
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal

> Is my understanding wrong ?...
The first thing that cancel_delayed_work_sync() does is it takes the work item from the TODO list.

Then it makes sure that the work is currently not running on another CPU.

The return value just indicates whether the work was still on the TODO list or not: false is returned when the work had already finished before the call was made, true otherwise.

So the while loop isn't necessary here.

Regards,
Christian.

On 27.02.2018 at 10:09, Liu, Monk wrote:
>> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
> I thought that only one call on cancel_delayed_work_sync() cannot 
> guarantee that after this function returned, the work had been 
> finished, so if the work is still pending (return you TRUE) there And 
> the caller just continue and release driver, we would run to page 
> fault in that work since all structures pointers accessed in that work 
> is invalid now
>
> That's why I use a while() loop to keep calling 
> cancel_delayed_work_sync() till the work is not pending any more
>
> Is my understanding wrong ?...
> /Monk
> -----Original Message-----
> From: Koenig, Christian
> Sent: February 27, 2018 16:42
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on 
> bare-metal
>
> On 27.02.2018 at 05:45, Liu, Monk wrote:
>>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>> I already did it:
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> What do you mean never call "schedule()" this way ?
> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
>
> Additional to that I think you can drop the "if (!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something like that. In other words if there is an error we need to cancel the late init work independent if we are on SRIOV or bare metal.
>
> Regards,
> Christian.
>
>> Please show me how to do it to achieve the same goal and I can modify 
>> my patch
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: February 26, 2018 18:08
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on 
>> bare-metal
>>
>> On 26.02.2018 at 06:18, Monk Liu wrote:
>>> issue:
>>> on bare-metal when doing kmd reload test, there are chance that 
>>> kernel hit fatal error afer driver unloaded/reloaded
>>>
>>> fix:
>>> the cause is that those "idle work" not really stopped and if kmd 
>>> was is unloaded too quick that were chance that "idle work" run 
>>> after driver structures already released which introduces this issue.
>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>>
>>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>>     3 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 54145ec..69fb5e50 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>>     		}
>>>     	}
>>>     
>>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>>     			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>>     
>>>     	amdgpu_device_fill_reset_magic(adev);
>>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>>     		adev->firmware.gpu_info_fw = NULL;
>>>     	}
>>>     	adev->accel_working = false;
>>> -	cancel_delayed_work_sync(&adev->late_init_work);
>>> +
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> Never use schedule() like that.
>>
>> Regards,
>> Christian.
>>
>>> +
>>>     	/* free i2c buses */
>>>     	if (!amdgpu_device_has_dc_support(adev))
>>>     		amdgpu_i2c_fini(adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> index caba610..337db57 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>>> +			schedule(); /* to make sure idle work really stopped */
>>>     
>>>     	for (i = 0; i < adev->uvd.max_handles; ++i)
>>>     		if (atomic_read(&adev->uvd.handles[i]))
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> index a829350..2874fda 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>>> +			schedule(); /* to make sure the idle_work really stopped */
>>> +
>>>     	/* TODO: suspending running encoding sessions isn't supported */
>>>     	return -EINVAL;
>>>     }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
       [not found]                             ` <BLUPR12MB0449F97B88BE9D312DD5479784C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27 10:29                               ` Christian König
  0 siblings, 0 replies; 58+ messages in thread
From: Christian König @ 2018-02-27 10:29 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 27.02.2018 at 11:19, Liu, Monk wrote:
> So returning true only means the work was still on the TODO list, but the work will *never* get executed after this cancel_delayed_work_sync() call returned, right?

Correct, yes.

Christian.

>
> -----Original Message-----
> From: Koenig, Christian
> Sent: February 27, 2018 17:53
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
>
>> Is my understanding wrong ?...
> The first thing that cancel_delayed_work_sync() does is it takes the work item from the TODO list.
>
> Then it makes sure that the work is currently not running on another CPU.
>
> The return value just indicates whether the work was still on the TODO list or not: false is returned when the work had already finished before the call was made, true otherwise.
>
> So the while loop isn't necessary here.
>
> Regards,
> Christian.
>
> On 27.02.2018 at 10:09, Liu, Monk wrote:
>>> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
>> I thought that only one call on cancel_delayed_work_sync() cannot
>> guarantee that after this function returned, the work had been
>> finished, so if the work is still pending (return you TRUE) there And
>> the caller just continue and release driver, we would run to page
>> fault in that work since all structures pointers accessed in that work
>> is invalid now
>>
>> That's why I use a while() loop to keep calling
>> cancel_delayed_work_sync() till the work is not pending any more
>>
>> Is my understanding wrong ?...
>> /Monk
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: February 27, 2018 16:42
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on
>> bare-metal
>>
>> On 27.02.2018 at 05:45, Liu, Monk wrote:
>>>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>>> I already did it:
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>>> +			schedule(); /* to make sure late_init_work really stopped */
>>> What do you mean never call "schedule()" this way ?
>> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
>>
>> Additional to that I think you can drop the "if (!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something like that. In other words if there is an error we need to cancel the late init work independent if we are on SRIOV or bare metal.
>>
>> Regards,
>> Christian.
>>
>>> Please show me how to do it to achieve the same goal and I can modify
>>> my patch
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: February 26, 2018 18:08
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on
>>> bare-metal
>>>
>>> On 26.02.2018 at 06:18, Monk Liu wrote:
>>>> issue:
>>>> on bare-metal when doing kmd reload test, there are chance that
>>>> kernel hit fatal error afer driver unloaded/reloaded
>>>>
>>>> fix:
>>>> the cause is that those "idle work" not really stopped and if kmd
>>>> was is unloaded too quick that were chance that "idle work" run
>>>> after driver structures already released which introduces this issue.
>>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>>>
>>>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>>>      3 files changed, 12 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 54145ec..69fb5e50 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>>>      		}
>>>>      	}
>>>>      
>>>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>>>      			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>>>      
>>>>      	amdgpu_device_fill_reset_magic(adev);
>>>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>>>      		adev->firmware.gpu_info_fw = NULL;
>>>>      	}
>>>>      	adev->accel_working = false;
>>>> -	cancel_delayed_work_sync(&adev->late_init_work);
>>>> +
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>>> +			schedule(); /* to make sure late_init_work really stopped */
>>> Never use schedule() like that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> +
>>>>      	/* free i2c buses */
>>>>      	if (!amdgpu_device_has_dc_support(adev))
>>>>      		amdgpu_i2c_fini(adev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> index caba610..337db57 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>>      		return 0;
>>>>      
>>>>      	if (!amdgpu_sriov_vf(adev))
>>>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>>>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>>>> +			schedule(); /* to make sure idle work really stopped */
>>>>      
>>>>      	for (i = 0; i < adev->uvd.max_handles; ++i)
>>>>      		if (atomic_read(&adev->uvd.handles[i]))
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> index a829350..2874fda 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>>>      		return 0;
>>>>      
>>>>      	if (!amdgpu_sriov_vf(adev))
>>>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>>>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>>>> +			schedule(); /* to make sure the idle_work really stopped */
>>>> +
>>>>      	/* TODO: suspending running encoding sessions isn't supported */
>>>>      	return -EINVAL;
>>>>      }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini
       [not found]             ` <BLUPR12MB044966DA7194FE030265D15D84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27 14:28               ` Deucher, Alexander
  0 siblings, 0 replies; 58+ messages in thread
From: Deucher, Alexander @ 2018-02-27 14:28 UTC (permalink / raw)
  To: Liu, Monk, Alex Deucher; +Cc: amd-gfx list


Please add a comment above the CP_PQ_WPTR_POLL_CNTL write to explain that it's to disable the polling.  With that fixed:

Acked-by: Alex Deucher <alexander.deucher-5C7GfCeVMHo@public.gmane.org>
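
For illustration, the requested comment could read roughly like this (a
sketch based on the commit message quoted below, not the wording that was
eventually committed):

	if (amdgpu_sriov_vf(adev)) {
		gfx_v9_0_cp_gfx_enable(adev, false);
		/* KIQ has to stay alive so RLCV can still be invoked, but its
		 * wptr polling is disabled here; otherwise the CP keeps
		 * polling a buffer that is freed after hw_fini, which shows
		 * up as DMAR read errors.
		 */
		WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
		return 0;
	}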

________________________________
From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on behalf of Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
Sent: Tuesday, February 27, 2018 12:31:35 AM
To: Alex Deucher
Cc: amd-gfx list
Subject: RE: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini

This is only for virtualization; for bare metal it goes through another path

-----Original Message-----
From: Alex Deucher [mailto:alexdeucher-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
Sent: February 27, 2018 1:40
To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
Cc: amd-gfx list <amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini

On Mon, Feb 26, 2018 at 12:18 AM, Monk Liu <Monk.Liu-5C7GfCeVMHo@public.gmane.org> wrote:
> otherwise a DMAR read error comes out from the CP, since GFX
> is still alive and CPC's WPTR_POLL is still enabled.
>
> fix:
> we can halt CPG after hw_fini, but cannot halt CPC because KIQ still
> needs to be alive to let RLCV invoke; its WPTR_POLL, however, can be
> disabled.
>
> Change-Id: Ia60ee54901531f737d09063bf2037630e7c94771
> Signed-off-by: Monk Liu <Monk.Liu-5C7GfCeVMHo@public.gmane.org>

Is this handled properly for bare metal as well?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index e9cc03e..793db9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2961,7 +2961,8 @@ static int gfx_v9_0_hw_fini(void *handle)
>                 gfx_v9_0_kcq_disable(&adev->gfx.kiq.ring,
> &adev->gfx.compute_ring[i]);
>
>         if (amdgpu_sriov_vf(adev)) {
> -               pr_debug("For SRIOV client, shouldn't do anything.\n");
> +               gfx_v9_0_cp_gfx_enable(adev, false);
> +               WREG32_FIELD15(GC, 0, CP_PQ_WPTR_POLL_CNTL, EN, 0);
>                 return 0;
>         }
>         gfx_v9_0_cp_enable(adev, false);
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2018-02-27 14:28 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-26  5:17 [PATCH 00/22] *** bug fixing serials *** Monk Liu
     [not found] ` <1519622300-8990-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  5:17   ` [PATCH 01/22] drm/amdgpu: fix&cleanups for wb_clear Monk Liu
     [not found]     ` <1519622300-8990-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:18       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 02/22] drm/amdgpu: remove duplicated job_free_resources Monk Liu
     [not found]     ` <1519622300-8990-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  9:51       ` Christian König
2018-02-26  5:18   ` [PATCH 03/22] drm/amdgpu: skip ECC for SRIOV in gmc late_init Monk Liu
     [not found]     ` <1519622300-8990-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:19       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 04/22] drm/amdgpu: only flush hotplug work without DC Monk Liu
     [not found]     ` <1519622300-8990-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:19       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 05/22] drm/amdgpu: cond_exec only for schedule with a job Monk Liu
     [not found]     ` <1519622300-8990-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  9:53       ` Christian König
2018-02-26  5:18   ` [PATCH 06/22] drm/amdgpu: don't use MM idle_work for SRIOV Monk Liu
     [not found]     ` <1519622300-8990-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  9:57       ` Christian König
2018-02-26  5:18   ` [PATCH 07/22] drm/amdgpu: fix fence_emit NULL pointer ref risk Monk Liu
     [not found]     ` <1519622300-8990-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  9:58       ` Christian König
2018-02-26  5:18   ` [PATCH 08/22] drm/amdgpu: cleanups VCE/UVD ib test part Monk Liu
     [not found]     ` <1519622300-8990-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:06       ` Christian König
2018-02-26  5:18   ` [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal Monk Liu
     [not found]     ` <1519622300-8990-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:08       ` Christian König
     [not found]         ` <a89aeb3e-1687-e7e6-bef4-9371ecb13a70-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-27  4:45           ` Liu, Monk
     [not found]             ` <BLUPR12MB044911AD4E7442812B7D589084C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27  8:42               ` Christian König
     [not found]                 ` <1b9d2cbf-1ba5-1be6-11ef-4723e08e9699-5C7GfCeVMHo@public.gmane.org>
2018-02-27  9:09                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB0449EE0E7CD1DAE5C60290C684C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27  9:52                       ` Christian König
     [not found]                         ` <07b98381-cbea-7352-7cf3-d69020da6395-5C7GfCeVMHo@public.gmane.org>
2018-02-27 10:19                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB0449F97B88BE9D312DD5479784C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27 10:29                               ` Christian König
2018-02-26  5:18   ` [PATCH 10/22] drm/amdgpu: no need to count emitted for SRIOV Monk Liu
     [not found]     ` <1519622300-8990-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:09       ` Christian König
     [not found]         ` <994af188-1283-44ea-f868-73f4002a7cb3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-27  5:26           ` Liu, Monk
     [not found]             ` <BLUPR12MB04493445BAD30EA2B4F14A8184C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27  8:47               ` Christian König
2018-02-26  5:18   ` [PATCH 11/22] drm/amdgpu: Remove the memory leak after unload amdgpu driver Monk Liu
     [not found]     ` <1519622300-8990-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:27       ` Alex Deucher
     [not found]         ` <CADnq5_M9z3T0ihL9cDVbgEjaKp0AM=Utvs_J-=q7szT2NXtvrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-27  5:27           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449DF1E4689A22636F01AB484C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27  5:53               ` Liu, Monk
2018-02-27  5:52       ` Liu, Monk
2018-02-26  5:18   ` [PATCH 12/22] drm/amdgpu: send rel event first after init failed Monk Liu
     [not found]     ` <1519622300-8990-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:24       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 13/22] drm/amdgpu: fix vce_ring test memleak Monk Liu
     [not found]     ` <1519622300-8990-14-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:10       ` Christian König
2018-02-26  5:18   ` [PATCH 14/22] drm/amdgpu: change gfx9 ib test to use WB Monk Liu
     [not found]     ` <1519622300-8990-15-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:30       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 15/22] drm/amdgpu: move WB_FREE to correct place Monk Liu
     [not found]     ` <1519622300-8990-16-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:36       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 16/22] drm/amdgpu: cleanup SA inti and fini Monk Liu
     [not found]     ` <1519622300-8990-17-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:12       ` Christian König
     [not found]         ` <51af9fc4-7827-9f40-da2b-0fd8753bc321-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-26 10:41           ` Liu, Monk
     [not found]             ` <SN1PR12MB04625A0F831F186440F4A81684C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-26 10:48               ` Liu, Monk
     [not found]                 ` <SN1PR12MB04620B6D208218E5EB23824284C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-26 11:29                   ` Christian König
2018-02-26  5:18   ` [PATCH 17/22] drm/amdgpu: disable GFX ring and disable PQ wptr in hw_fini Monk Liu
     [not found]     ` <1519622300-8990-18-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:39       ` Alex Deucher
     [not found]         ` <CADnq5_M1BkpZzNURswAh1fTpm1rZbspkZ4qq+g2pbdVjWyR2eQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-27  5:31           ` Liu, Monk
     [not found]             ` <BLUPR12MB044966DA7194FE030265D15D84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27 14:28               ` Deucher, Alexander
2018-02-26  5:18   ` [PATCH 18/22] drm/amdgpu: Correct sdma_v4 get_wptr Monk Liu
     [not found]     ` <1519622300-8990-19-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:22       ` Christian König
     [not found]         ` <b9826bed-1b38-cbee-536a-1bd9d50dca6b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-27  5:35           ` Liu, Monk
2018-02-26  5:18   ` [PATCH 19/22] drm/amdgpu: adjust timeout for ib_ring_tests Monk Liu
     [not found]     ` <1519622300-8990-20-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 15:55       ` Alex Deucher
2018-02-26  5:18   ` [PATCH 20/22] drm/amdgpu: increase gart size to 512MB Monk Liu
     [not found]     ` <1519622300-8990-21-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26 17:45       ` Alex Deucher
