that breaks the device list in gpu recovery. ________________________________ From: Pan, Xinhui Sent: Friday, April 17, 2020 7:11:40 PM To: Chen, Guchun ; amd-gfx@lists.freedesktop.org ; Zhang, Hawking ; Li, Dennis ; Clements, John ; Koenig, Christian Subject: Re: [PATCH] drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU This patch shluld fix the panic. but I would like you do NOT add adev xgmi head to the local device list. if ras ue occurs while the gpu is already in gpu recovery. ________________________________ From: amd-gfx on behalf of Christian König Sent: Friday, April 17, 2020 5:17:22 PM To: Chen, Guchun ; amd-gfx@lists.freedesktop.org ; Zhang, Hawking ; Li, Dennis ; Clements, John Subject: Re: [PATCH] drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU Am 16.04.20 um 17:47 schrieb Guchun Chen: > When running ras uncorrectable error injection and trigger GPU > reset on sGPU, below issue is observed. It's caused by the list > uninitialized when accessing. > > [ 80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750 > [ 80.047300] #PF: supervisor write access in kernel mode > [ 80.047351] #PF: error_code(0x0003) - permissions violation > [ 80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061 > [ 80.047477] Oops: 0003 [#1] SMP PTI > [ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1 > [ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 > [ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu] > > Signed-off-by: Guchun Chen > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index b27d9d62c9df..260b4a42e0ae 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -1448,9 +1448,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work) > struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false); > > /* Build list of devices to query RAS related errors */ > - if (hive && adev->gmc.xgmi.num_physical_nodes > 1) { > + if (hive && adev->gmc.xgmi.num_physical_nodes > 1) > device_list_handle = &hive->device_list; > - } else { > + else { The coding style here is incorrect. If one branch of an if/else uses {} the other(s) should use it as well. > + INIT_LIST_HEAD(&device_list); That was suggested before but then reverted, but I'm not sure why. Regards, Christian. > list_add_tail(&adev->gmc.xgmi.head, &device_list); > device_list_handle = &device_list; > } _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cxinhui.pan%40amd.com%7Ca2bb160328c942b51b4608d7e2b02579%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637227118528983724&sdata=6bJVduxNT%2FVXKPDWDBAnJfVSe3CJI9mdGfwi5V89Kgw%3D&reserved=0