From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kuehling, Felix"
To: "Koenig, Christian", "Kim, Jonathan", "Deucher, Alexander"
Cc: "Russell, Kent", "amd-gfx@lists.freedesktop.org"
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
Date: Wed, 15 Apr 2020 15:02:04 +0000
References: <20200413182026.2561-1-kent.russell@amd.com> <85fcb568-b0d8-b6c9-4e62-3866aa2da0c9@gmail.com> <146d9570-724e-423d-931e-24c96821aaae@email.android.com>
List-Id: Discussion list for AMD gfx
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1112019713=="

--===============1112019713==
Content-Language: en-US
Content-Type: multipart/alternative; boundary="_000_SN1PR12MB2414C7A4F1A0133BB0AE3A4492DB0SN1PR12MB2414namp_"

--_000_SN1PR12MB2414C7A4F1A0133BB0AE3A4492DB0SN1PR12MB2414namp_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

[AMD Official Use Only - Internal Distribution Only]

The test does not access outside of the allocated memory. But it
deliberately crosses a boundary where memory can be allocated
non-contiguously. This is meant to catch problems where the access function
doesn't handle non-contiguous VRAM allocations correctly. Given the way VRAM
allocation has been optimized, however, I expect that most allocations are
contiguous nowadays.

The more interesting aspect of the test, though, is that it performs
misaligned memory accesses. The MMIO method of accessing VRAM explicitly
handles misaligned accesses and breaks them down into dword-aligned accesses
with proper masking and shifting.

Could the unaligned nature of the memory access have something to do with
hitting RAS errors? That's something unique to this test that we wouldn't
see on a normal page table update or memory eviction.

Regards,
  Felix

________________________________
From: Koenig, Christian
Sent: Wednesday, April 15, 2020 6:58 AM
To: Kim, Jonathan; Kuehling, Felix; Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
mapped memory and 2 DWORDs outside that boundary (it's only about 4MB to the
boundary).
Then we POKE to swap the DWORD positions across the boundary. The RAS
event on the single failing machine happens on the out-of-boundary PEEK.

Well, when you access outside of an allocated buffer, I would expect that we
never get as far as even touching the hardware, because the kernel should
block the access with -EPERM or -EFAULT. So it sounds like I'm not
understanding something correctly here.

Apart from that, I completely agree that we need to sort out any other RAS
event first, to make sure that the system is not simply failing randomly.

Regards,
Christian.

On 15.04.20 at 11:49, Kim, Jonathan wrote:

[AMD Public Use]

Hi Christian,

That could potentially be it. With additional testing, 2 of 3 Vega20
machines never hit an error over BAR access with the PTRACE test. 3 of 3
machines (from the same pool) always hit an error with CWSR.

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
mapped memory and 2 DWORDs outside that boundary (it's only about 4MB to the
boundary). Then we POKE to swap the DWORD positions across the boundary. The
RAS event on the single failing machine happens on the out-of-boundary PEEK.

Felix mentioned we don't hit errors over general HDP access, but that may
not be true. An Arcturus failure syslog that was posted (which wasn't tested
by me) shows that someone launched the rocm bandwidth test, hit a VM fault,
and a RAS event ensued during evictions (I can point to the internal ticket
or log snippet offline if interested). Whether the RAS event is triggered by
BAR access or is the result of HW instability is beyond me, since I don't
have access to the machine.

Thanks,

Jon

From: Koenig, Christian
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan; Kuehling, Felix; Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

Hi Jon,

Also cwsr tests fail on Vega20 with or without the revert with the same RAS
error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some
hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

On 14.04.20 at 22:30, Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

If we're passing the test on the revert, then the only thing that's
different is that we're no longer invalidating HDP and doing a copy to host
in amdgpu_device_vram_access, since the function is still called in ttm
access_memory with BAR.

Also cwsr tests fail on Vega20 with or without the revert with the same RAS
error.

Thanks,

Jon

From: Kuehling, Felix
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan; Koenig, Christian; Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

I wouldn't call it premature. Reverting is the usual practice when there is a
serious regression that isn't fully understood or root-caused. As far as I
can tell, the problem has been reproduced on multiple systems and different
GPUs, and it clearly regressed to Christian's commit. I think that justifies
reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS
errors would potentially cause problems in other tests as well. For example,
common operations like GART table updates, GPUVM page table updates, and
PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing
obvious problems from those.
So we need to understand what's special about this test. I asked questions
to that effect on our other email thread.

Regards,
  Felix

On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests, or sometimes
none at all, on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

From: Koenig, Christian
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org; Kuehling, Felix; Kim, Jonathan
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there
doesn't seem to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case.

Christian.

On 14.04.2020 at 16:35, "Deucher, Alexander" wrote:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an
issue.

Alex
________________________________
From: amd-gfx on behalf of Russell, Kent
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian; amd-gfx@lists.freedesktop.org
Cc: Kuehling, Felix; Kim, Jonathan
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output
below, and then the kernel ends up hanging. I don't know enough about the
test itself to know why this is occurring, but Jon Kim and Felix were
discussing it on a separate thread when the issue was first reported, so
they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> On 13.04.20 at 20:20, Kent Russell wrote:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > Failed to evict process queues
> > Failed to suspend process 0x8000
> > Finished evicting pasid 0x8000
> > Started restoring pasid 0x8000
> > Finished restoring pasid 0x8000
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -5
>
> Do you have more information on what's going wrong here? This is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell
>
> Reviewed-by: Christian König
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >  	uint32_t hi = ~0;
> >  	uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > -	if (last > pos) {
> > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -		size_t count = last - pos;
> > -
> > -		if (write) {
> > -			memcpy_toio(addr, buf, count);
> > -			mb();
> > -			amdgpu_asic_flush_hdp(adev, NULL);
> > -		} else {
> > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > -			mb();
> > -			memcpy_fromio(buf, addr, count);
> > -		}
> > -
> > -		if (count == size)
> > -			return;
> > -
> > -		pos += count;
> > -		buf += count / 4;
> > -		size -= count;
> > -	}
> > -#endif
> > -
> >  	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >  	for (last = pos + size; pos < last; pos += 4) {
> >  		uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0
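
For readers following the misalignment question raised at the top of this
thread: the MMIO loop left in place by the revert steps pos by 4, so any
byte-granular request has to be decomposed into aligned dword accesses
somewhere above it. Below is a minimal user-space sketch of that "masking
and shifting" idea. It is not the driver's code. The names vram_model,
dword_read, dword_write and access_bytes are hypothetical, the backing
store is a plain array, and little-endian byte lanes are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VRAM that is reachable only one aligned dword
 * at a time, as through an index/data register pair. Not driver code. */
static uint32_t vram_model[8];

static uint32_t dword_read(uint64_t pos)
{
	assert(pos % 4 == 0 && pos / 4 < 8);
	return vram_model[pos / 4];
}

static void dword_write(uint64_t pos, uint32_t value)
{
	assert(pos % 4 == 0 && pos / 4 < 8);
	vram_model[pos / 4] = value;
}

/* Copy len bytes between buf and byte offset pos using only aligned dword
 * accesses. Assumes little-endian byte lanes: byte 0 lives in bits 7:0. */
void access_bytes(uint64_t pos, uint8_t *buf, size_t len, int write)
{
	assert(buf != NULL);

	while (len > 0) {
		uint64_t aligned = pos & ~(uint64_t)3;
		unsigned int shift = (unsigned int)(pos & 3) * 8;
		size_t bytes = 4 - (size_t)(pos & 3);
		uint32_t mask, value;
		size_t i;

		if (bytes > len)
			bytes = len;
		/* Mask covering exactly the byte lanes touched in this dword. */
		mask = (bytes == 4) ? 0xffffffffu
				    : ((1u << (bytes * 8)) - 1u) << shift;

		value = dword_read(aligned);
		if (write) {
			uint32_t in = 0;

			for (i = 0; i < bytes; i++)
				in |= (uint32_t)buf[i] << (8 * i);
			/* Read-modify-write: keep the untouched lanes. */
			value = (value & ~mask) | ((in << shift) & mask);
			dword_write(aligned, value);
		} else {
			uint32_t out = (value & mask) >> shift;

			for (i = 0; i < bytes; i++)
				buf[i] = (uint8_t)(out >> (8 * i));
		}

		pos += bytes;
		buf += bytes;
		len -= bytes;
	}
}
```

In this model, reading 6 bytes at offset 3 touches three dwords: one byte
from the first, all four from the second, and one from the third. The
removed BAR path, by contrast, passes pos and count straight to
memcpy_toio()/memcpy_fromio(). Whether that difference has anything to do
with the RAS event is still an open question in this thread.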

--_000_SN1PR12MB2414C7A4F1A0133BB0AE3A4492DB0SN1PR12MB2414namp_
Content-Type: text/html; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

[AMD Official Use Only - Internal Distribution Only]


The test does not access outside of the allocated memory. But it deliberately crosses a boundary where memory can be allocated non-contiguously. This is meant to catch problems where the access function doesn't handle non-contiguous VRAM allocations correctly. However, the way that VRAM allocation has been optimized, I expect that most allocations are contiguous nowadays. The more interesting aspect of the test is that it performs misaligned memory accesses. The MMIO method of accessing VRAM explicitly handles misaligned accesses and breaks them down into dword-aligned accesses with proper masking and shifting.

Could the unaligned nature of the memory access have something to do with hitting RAS errors? That's something unique to this test that we wouldn't see on a normal page table update or memory eviction.
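
For anyone following the alignment argument, here is a minimal user-space sketch of what "dword-aligned accesses with proper masking and shifting" can look like. It is not the amdgpu MMIO code: `fake_vram`, `dword_read`, `dword_write` and `misaligned_access` are hypothetical names, the array stands in for the real indexed register access, and byte lanes are assumed to be little-endian within each dword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VRAM that only accepts aligned 32-bit accesses. */
static uint32_t fake_vram[16];

static uint32_t dword_read(uint64_t pos)
{
	assert((pos & 3) == 0 && pos / 4 < 16);
	return fake_vram[pos / 4];
}

static void dword_write(uint64_t pos, uint32_t val)
{
	assert((pos & 3) == 0 && pos / 4 < 16);
	fake_vram[pos / 4] = val;
}

/*
 * Copy `size` bytes between `buf` and the fake VRAM at an arbitrary byte
 * offset `pos`, using only aligned dword accesses.
 */
static void misaligned_access(uint64_t pos, uint8_t *buf, size_t size, int write)
{
	while (size) {
		uint64_t aligned = pos & ~(uint64_t)3;	/* dword containing pos */
		unsigned int shift = (pos & 3) * 8;	/* bit offset of first byte */
		size_t bytes = 4 - (pos & 3);		/* bytes left in this dword */
		uint32_t mask, val;
		size_t i;

		if (bytes > size)
			bytes = size;
		/* bytes == 4 implies shift == 0; avoid an undefined 32-bit shift */
		mask = bytes == 4 ? 0xffffffffu
				  : ((1u << (bytes * 8)) - 1) << shift;

		if (write) {
			uint32_t data = 0;

			for (i = 0; i < bytes; i++)
				data |= (uint32_t)buf[i] << (8 * i);
			/* read-modify-write so neighbouring bytes survive */
			val = dword_read(aligned);
			val = (val & ~mask) | ((data << shift) & mask);
			dword_write(aligned, val);
		} else {
			val = (dword_read(aligned) & mask) >> shift;
			for (i = 0; i < bytes; i++)
				buf[i] = (uint8_t)(val >> (8 * i));
		}

		pos += bytes;
		buf += bytes;
		size -= bytes;
	}
}
```

Writing six bytes at offset 3, for example, touches three dwords: one byte in the first, all four in the second, and one in the third. Each partial dword needs a read-modify-write, whereas a plain memcpy through the BAR leaves the access sizes and alignment to the CPU.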

Regards,
  Felix


From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 6:58 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
 

> To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated mapped memory and 2 DWORDs outside that boundary (it's only about 4MB to the boundary). Then we POKE to swap the DWORD positions across the boundary. The RAS event on the single failing machine happens on the out-of-boundary PEEK.


Well, when you access outside of an allocated buffer I would expect that we never get as far as even touching the hardware because the kernel should block the access with an -EPERM or -EFAULT. So it sounds like I'm not understanding something correctly here.

Apart from that I completely agree that we need to sort out any other RAS event first to make sure that the system is simply not failing randomly.

Regards,
Christian.

On 15.04.20 at 11:49, Kim, Jonathan wrote:

[AMD Public Use]

Hi Christian,

That could potentially be it. With additional testing, 2 of 3 Vega20 machines never hit an error over BAR access with the PTRACE test. 3 of 3 machines (from the same pool) always hit an error with CWSR.

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated mapped memory and 2 DWORDs outside that boundary (it's only about 4MB to the boundary). Then we POKE to swap the DWORD positions across the boundary. The RAS event on the single failing machine happens on the out-of-boundary PEEK.

Felix mentioned we don't hit errors over general HDP access, but that may not be true. An Arcturus failure syslog that was posted (which wasn't tested by me) shows someone launched the rocm bandwidth test, hit a VM fault, and a RAS event ensued during evictions (I can point to the internal ticket or log snippet offline if interested). Whether the RAS event is BAR access triggered or the result of HW instability is beyond me since I don't have access to the machine.

Thanks,

Jon

 

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

Hi Jon,

> Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.


That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

On 14.04.20 at 22:30, Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

If we're passing the test on the revert, then the only thing that's different is we're not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

Thanks,

Jon

 

From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

I wouldn't call it premature. A revert is usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems and different GPUs, and clearly traced back to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example, common operations like GART table updates, GPUVM page table updates, and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix

On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests or sometimes none at all on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

 

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seem to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

 

On 14.04.2020 at 16:35, "Deucher, Alexander" <Alexander.Deucher@amd.com> wrote:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> On 13.04.20 at 20:20, Kent Russell wrote:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > Failed to evict process queues
> > Failed to suspend process 0x8000
> > Finished evicting pasid 0x8000
> > Started restoring pasid 0x8000
> > Finished restoring pasid 0x8000
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >  	uint32_t hi = ~0;
> >  	uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -	last = min(pos + size, adev->gmc.visible_vram_size);
> > -	if (last > pos) {
> > -		void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -		size_t count = last - pos;
> > -
> > -		if (write) {
> > -			memcpy_toio(addr, buf, count);
> > -			mb();
> > -			amdgpu_asic_flush_hdp(adev, NULL);
> > -		} else {
> > -			amdgpu_asic_invalidate_hdp(adev, NULL);
> > -			mb();
> > -			memcpy_fromio(buf, addr, count);
> > -		}
> > -
> > -		if (count == size)
> > -			return;
> > -
> > -		pos += count;
> > -		buf += count / 4;
> > -		size -= count;
> > -	}
> > -#endif
> > -
> >  	spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >  	for (last = pos + size; pos < last; pos += 4) {
> >  		uint32_t tmp = pos >> 31;
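
The removed hunk is easier to reason about once its range arithmetic is pulled out. The sketch below, using the hypothetical helper `bar_bytes` and made-up sizes, shows how `[pos, pos + size)` is split into a part below `visible_vram_size` (which the fast path copied through the BAR with `memcpy_toio`/`memcpy_fromio`) and a remainder left for the MMIO loop. It only mirrors the `min()` and `last - pos` logic quoted above and touches no hardware.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the removed hunk: number of bytes of
 * [pos, pos + size) that lie below visible_size, i.e. the part the fast
 * path would have copied through the BAR.
 */
static uint64_t bar_bytes(uint64_t pos, uint64_t size, uint64_t visible_size)
{
	uint64_t last;

	assert(pos + size >= pos);	/* sketch assumes no wrap-around */
	last = pos + size < visible_size ? pos + size : visible_size;

	return last > pos ? last - pos : 0;
}
```

With a 256 MiB visible window, a 16-byte access starting 8 bytes below the limit yields 8 bytes for the BAR path and 8 for MMIO; an access entirely above the limit yields 0, so the fast path is skipped. The hunk also advances `buf` by `count / 4`, which suggests `buf` is a dword pointer there; that is an inference from the quoted lines, not something stated in the thread.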
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0

 

 

Am 14.04.2020 16:35 schrieb "Deucher, Alexand= er" <Alexander.Deucher= @amd.com>:

[AMD Public Use]

 

If this causes an= issue, any access to vram via the BAR could cause an issue.

 

Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org&g= t; on behalf of Russell, Kent <K= ent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.= org <amd-gfx@lists.= freedesktop.org>
Cc: Kuehling, Felix <Fe= lix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possibl= e in amdgpu_device_vram_access v2"

 

[AMD Official Use Only - Internal Distribution Onl= y]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output be= low, and then the kernel ends up hanging. I don't know enough about the tes= t itself to know why this is occurring, but Jon Kim and Felix were discussi= ng it on a separate thread when the issue was first reported, so they can hopefully provide some additiona= l information.

 Kent

> -----Original Message-----
> From: Christian K=F6nig <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Rus= sell@amd.com>; amd-gfx@lists.freedesktop.= org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible = in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-= hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 = and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful<= br> > > queues preemption Failed to evict process queues Failed to suspen= d
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring p= asid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may l= ost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all= smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu f= eatures!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of= IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is = a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian K=F6nig <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 -----= -----------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct > amdgpu_device *adev, loff_t pos,
> >      uint32_t hi =3D ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last =3D min(pos + size, adev->gmc.visible_v= ram_size);
> > -   if (last > pos) {
> > -           voi= d __iomem *addr =3D adev->mman.aper_base_kaddr + pos;
> > -           siz= e_t count =3D last - pos;
> > -
> > -           if = (write) {
> > -          &nbs= p;        memcpy_toio(addr, buf, count);=
> > -          &nbs= p;        mb();
> > -          &nbs= p;        amdgpu_asic_flush_hdp(adev, NU= LL);
> > -           } e= lse {
> > -          &nbs= p;        amdgpu_asic_invalidate_hdp(ade= v, NULL);
> > -          &nbs= p;        mb();
> > -          &nbs= p;        memcpy_fromio(buf, addr, count= );
> > -           } > > -
> > -           if = (count =3D=3D size)
> > -          &nbs= p;        return;
> > -
> > -           pos= +=3D count;
> > -           buf= +=3D count / 4;
> > -           siz= e -=3D count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmi= o_idx_lock, flags);
> >      for (last =3D pos + size; pos &= lt; last; pos +=3D 4) {
> >           =    uint32_t tmp =3D pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.= org
https:= //nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.freedes= ktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=3D02%7C01%7Calexander.= deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11= a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=3DttNOHJt0IwywpOIWa= hKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=3D0

 

 

Am 14.04.2020 16:35 schrieb "Deucher, Alexand= er" <Alexander.Deucher= @amd.com>:

[AMD Public Use]

 

If this causes an= issue, any access to vram via the BAR could cause an issue.

 

Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org&g= t; on behalf of Russell, Kent <K= ent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.= org <amd-gfx@lists.= freedesktop.org>
Cc: Kuehling, Felix <Fe= lix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possibl= e in amdgpu_device_vram_access v2"

 

[AMD Official Use Only - Internal Distribution Onl= y]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output be= low, and then the kernel ends up hanging. I don't know enough about the tes= t itself to know why this is occurring, but Jon Kim and Felix were discussi= ng it on a separate thread when the issue was first reported, so they can hopefully provide some additiona= l information.

 Kent

> -----Original Message-----
> From: Christian K=F6nig <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Rus= sell@amd.com>; amd-gfx@lists.freedesktop.= org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible = in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-= hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 = and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful<= br> > > queues preemption Failed to evict process queues Failed to suspen= d
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring p= asid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may l= ost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all= smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu f= eatures!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of= IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is = a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian K=F6nig <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 -----= -----------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct > amdgpu_device *adev, loff_t pos,
> >      uint32_t hi =3D ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last =3D min(pos + size, adev->gmc.visible_v= ram_size);
> > -   if (last > pos) {
> > -           voi= d __iomem *addr =3D adev->mman.aper_base_kaddr + pos;
> > -           siz= e_t count =3D last - pos;
> > -
> > -           if = (write) {
> > -          &nbs= p;        memcpy_toio(addr, buf, count);=
> > -          &nbs= p;        mb();
> > -          &nbs= p;        amdgpu_asic_flush_hdp(adev, NU= LL);
> > -           } e= lse {
> > -          &nbs= p;        amdgpu_asic_invalidate_hdp(ade= v, NULL);
> > -          &nbs= p;        mb();
> > -          &nbs= p;        memcpy_fromio(buf, addr, count= );
> > -           } > > -
> > -           if = (count =3D=3D size)
> > -          &nbs= p;        return;
> > -
> > -           pos= +=3D count;
> > -           buf= +=3D count / 4;
> > -           siz= e -=3D count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmi= o_idx_lock, flags);
> >      for (last =3D pos + size; pos &= lt; last; pos +=3D 4) {
> >           =    uint32_t tmp =3D pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.= org
https:= //nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.freedes= ktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=3D02%7C01%7Calexander.= deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11= a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=3DttNOHJt0IwywpOIWa= hKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=3D0

 

 

Am 14.04.2020 16:35 schrieb "Deucher, Alexand= er" <Alexander.Deucher= @amd.com>:

[AMD Public Use]

 

If this causes an= issue, any access to vram via the BAR could cause an issue.

 

Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org&g= t; on behalf of Russell, Kent <K= ent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.= org <amd-gfx@lists.= freedesktop.org>
Cc: Kuehling, Felix <Fe= lix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possibl= e in amdgpu_device_vram_access v2"

 

[AMD Official Use Only - Internal Distribution Onl= y]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output be= low, and then the kernel ends up hanging. I don't know enough about the tes= t itself to know why this is occurring, but Jon Kim and Felix were discussi= ng it on a separate thread when the issue was first reported, so they can hopefully provide some additiona= l information.

 Kent

> -----Original Message-----
> From: Christian K=F6nig <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Rus= sell@amd.com>; amd-gfx@lists.freedesktop.= org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible = in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-= hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 = and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful<= br> > > queues preemption Failed to evict process queues Failed to suspen= d
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring p= asid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may l= ost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all= smu
> features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu f= eatures!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of= IP
> > block <powerplay> failed -5
>
> Do you have more information on what's going wrong here since this is = a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian K=F6nig <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 -----= -----------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct > amdgpu_device *adev, loff_t pos,
> >      uint32_t hi =3D ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last =3D min(pos + size, adev->gmc.visible_v= ram_size);
> > -   if (last > pos) {
> > -           voi= d __iomem *addr =3D adev->mman.aper_base_kaddr + pos;
> > -           siz= e_t count =3D last - pos;
> > -
> > -           if = (write) {
> > -          &nbs= p;        memcpy_toio(addr, buf, count);=
> > -          &nbs= p;        mb();
> > -          &nbs= p;        amdgpu_asic_flush_hdp(adev, NU= LL);
> > -           } e= lse {
> > -          &nbs= p;        amdgpu_asic_invalidate_hdp(ade= v, NULL);
> > -          &nbs= p;        mb();
> > -          &nbs= p;        memcpy_fromio(buf, addr, count= );
> > -           } > > -
> > -           if = (count =3D=3D size)
> > -          &nbs= p;        return;
> > -
> > -           pos= +=3D count;
> > -           buf= +=3D count / 4;
> > -           siz= e -=3D count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmi= o_idx_lock, flags);
> >      for (last =3D pos + size; pos &= lt; last; pos +=3D 4) {
> >           =    uint32_t tmp =3D pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.= org
https:= //nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.freedes= ktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=3D02%7C01%7Calexander.= deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11= a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=3DttNOHJt0IwywpOIWa= hKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=3D0

 

 

Am 14.04.2020 16:35 schrieb "Deucher, Alexand= er" <Alexander.Deucher= @amd.com>:

[AMD Public Use]

 

If this causes an= issue, any access to vram via the BAR could cause an issue.

 

Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org&g= t; on behalf of Russell, Kent <K= ent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.= org <amd-gfx@lists.= freedesktop.org>
Cc: Kuehling, Felix <Fe= lix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possibl= e in amdgpu_device_vram_access v2"

 

[AMD Official Use Only - Internal Distribution Onl= y]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output be= low, and then the kernel ends up hanging. I don't know enough about the tes= t itself to know why this is occurring, but Jon Kim and Felix were discussi= ng it on a separate thread when the issue was first reported, so they can hopefully provide some additiona= l information.

 Kent

> -----Original Message-----
> From: Christian K=F6nig <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Rus= sell@amd.com>; amd-gfx@lists.freedesktop.= org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible = in
> amdgpu_device_vram_access v2"
>
> Am 13.04.20 um 20:20 schrieb Kent Russell:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-= hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 = and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > Failed to evict process queues
> > Failed to suspend process 0x8000
> > Finished evicting pasid 0x8000
> > Started restoring pasid 0x8000
> > Finished restoring pasid 0x8000
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>
> Do you have more information on what's going wrong here? This is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -   last = min(pos + size, adev->gmc.visible_vram_size);
> > -   if (last > pos) {
> > -           void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -           size_t count = last - pos;
> > -
> > -           if (write) {
> > -                   memcpy_toio(addr, buf, count);
> > -                   mb();
> > -                   amdgpu_asic_flush_hdp(adev, NULL);
> > -           } else {
> > -                   amdgpu_asic_invalidate_hdp(adev, NULL);
> > -                   mb();
> > -                   memcpy_fromio(buf, addr, count);
> > -           }
> > -
> > -           if (count == size)
> > -                   return;
> > -
> > -           pos += count;
> > -           buf += count / 4;
> > -           size -= count;
> > -   }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >              uint32_t tmp = pos >> 31;
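
As a reading aid for the hunk quoted above: the removed block first serves whatever part of the request lies below adev->gmc.visible_vram_size through the BAR mapping, and leaves the remainder to the indirect loop that runs under mmio_idx_lock. What follows is only a minimal userspace sketch of that range arithmetic, assuming byte offsets and sizes that are multiples of 4. The names struct vram_split and split_vram_access() are hypothetical, chosen for illustration; they are not part of amdgpu, and no MMIO access, memory barrier, or HDP flush is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type, not an amdgpu structure. */
struct vram_split {
	uint64_t bar_bytes;      /* bytes that fall inside the CPU-visible window */
	uint64_t indirect_pos;   /* offset where the indirect loop would start */
	uint64_t indirect_bytes; /* bytes left over for the indirect loop */
	size_t buf_words;        /* uint32_t elements to advance the buffer by */
};

/*
 * Split the request [pos, pos + size) at visible_size, mirroring the
 * "last = min(pos + size, visible_vram_size)" step of the reverted hunk.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
					   uint64_t visible_size)
{
	struct vram_split s;
	uint64_t end = pos + size;
	uint64_t last;

	assert(end >= pos); /* the sketch does not handle wrap-around */

	last = end < visible_size ? end : visible_size;

	/* Nothing goes through the BAR if the request starts past the window. */
	s.bar_bytes = last > pos ? last - pos : 0;
	s.indirect_pos = pos + s.bar_bytes;
	s.indirect_bytes = size - s.bar_bytes;
	/* The driver buffer is uint32_t *, hence the division by 4. */
	s.buf_words = (size_t)(s.bar_bytes / 4);
	return s;
}
```

With a 256-byte visible window, a 16-byte request at offset 248 would be split into 8 bytes through the BAR and 8 bytes through the indirect path starting at offset 256. A result with indirect_bytes equal to 0 corresponds to the early return on count == size in the removed code.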

 


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx