From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.4 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,INCLUDES_PATCH,MAILING_LIST_MULTI, MIME_HTML_MOSTLY,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,T_KAM_HTML_FONT_INVALID autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8829C3815B for ; Wed, 15 Apr 2020 09:49:32 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 76BAD20936 for ; Wed, 15 Apr 2020 09:49:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=amdcloud.onmicrosoft.com header.i=@amdcloud.onmicrosoft.com header.b="wEwhdRA4" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 76BAD20936 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=amd.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=amd-gfx-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2646B6E936; Wed, 15 Apr 2020 09:49:32 +0000 (UTC) Received: from NAM10-MW2-obe.outbound.protection.outlook.com (mail-mw2nam10on2088.outbound.protection.outlook.com [40.107.94.88]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3E5F96E937 for ; Wed, 15 Apr 2020 09:49:30 +0000 (UTC) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; 
b=bNgBIgcn4VsEX75UD9k1W97xh4zYF+iYyohPYl/os8/A48iOnhzTFC5DQygK+4c27PzKCoBzyHqvn7kv6IS498PNLbsrrm6aDnNH7oMER58Wl5TcHZpOo4BuE5cHfzqPbjSyc0BDLOuAAgmiRhWZ9qmx59hb4SVt5KxIeZMM2dq20B9uYeY8ddQOPiSTSQONRIm+TOkHf4+ltZQA1vN1usCtqSWirlWoRh43SH+rIB4P/C772kR/vaba8x3tvdwcn6mnaaesdjyZaBThYJLbQUDyApJRBzJ07dx78LCh4wM+JvEjDjJIvjtnOVBwXDTr1Q++/yV/dKbfxWjo5wmq5w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=5D3Vua4vzg3xi7PRa4sQlPbpM9f2EHP4/jEaG23HM94=; b=XcwImzq5g4I+Qr6K8eCkSdPd0uMh+Zmpc4kmrrvo2mWlgaRddyDa3BZ48ZwwnkdJ2pPKNqQCaWNUwst3Q1KTU3e75G3GqxEhM8RG6GAO1E27NCceUdXsLWKC2ROh6XI0YLL4kMK52Hu7QXWOYL6gmoyFahMR4zDfWb+1PKhvsXzhla+6nSXAYSJEbzUXYLBUmJqyhMro9XuV2OXLXxiMVdHRka7UjAWEbQbQQZoqePbPKwowhe/4l7FC+MM6eYcHZnNyJt4vbo2/Q5d/qNf9m6w1bLi7ii5jENT+PpVZObgKR0A+GWX0J4/S68xX2CyC9PPQL3lx6MFmO3SIb4pceg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=amd.com; dmarc=pass action=none header.from=amd.com; dkim=pass header.d=amd.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amdcloud.onmicrosoft.com; s=selector2-amdcloud-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=5D3Vua4vzg3xi7PRa4sQlPbpM9f2EHP4/jEaG23HM94=; b=wEwhdRA4Sr2m86hA8UZ18NDN5dIlCXT3Yog/UaLYBdQqURQ1n3KndXRuVWypJw3rKpRV/ydlo72QFBncGgHaM3gRVQ2PHXgieRzw6j22WAwHqScYZvPqsZ6iuR67HNoGi9g0+a3KHjKOkhP58D4Qa8Rz28EPhvhMwOXbkKtA9d4= Received: from MN2PR12MB4518.namprd12.prod.outlook.com (2603:10b6:208:266::19) by MN2PR12MB3104.namprd12.prod.outlook.com (2603:10b6:208:cc::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2900.19; Wed, 15 Apr 2020 09:49:26 +0000 Received: from MN2PR12MB4518.namprd12.prod.outlook.com ([fe80::4cd:783:ca8:7af0]) by MN2PR12MB4518.namprd12.prod.outlook.com ([fe80::4cd:783:ca8:7af0%5]) with mapi id 15.20.2900.028; Wed, 15 
Apr 2020 09:49:25 +0000 From: "Kim, Jonathan" To: "Koenig, Christian" , "Kuehling, Felix" , "Deucher, Alexander" Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2" Thread-Topic: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2" Thread-Index: AQHWEcA9UrBaY2ldnUWtIBCEF2kBZ6h4pBqAgAAHvgCAAASbgIAAAxkAgAAAdnCAAD5rgIAAHPDwgADH+4CAAAzOAA== Date: Wed, 15 Apr 2020 09:49:25 +0000 Message-ID: References: <20200413182026.2561-1-kent.russell@amd.com> <85fcb568-b0d8-b6c9-4e62-3866aa2da0c9@gmail.com> <146d9570-724e-423d-931e-24c96821aaae@email.android.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: msip_labels: MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_Enabled=true; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_SetDate=2020-04-15T09:24:10Z; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_Method=Privileged; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_Name=Public_0; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_SiteId=3dd8961f-e488-4e60-8e11-a82d994e183d; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_ActionId=5ae7294a-6278-4216-bb1f-0000a4912758; MSIP_Label_0d814d60-469d-470c-8cb0-58434e2bf457_ContentBits=1 msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_enabled: true msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_setdate: 2020-04-15T09:49:20Z msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_method: Privileged msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_name: Public_0 msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_siteid: 3dd8961f-e488-4e60-8e11-a82d994e183d msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_actionid: 2a8db11d-8d0c-4d43-b07c-0000a6a43563 msip_label_0d814d60-469d-470c-8cb0-58434e2bf457_contentbits: 0 authentication-results: spf=none (sender IP is ) smtp.mailfrom=Jonathan.Kim@amd.com; x-originating-ip: [2607:fea8:7a0:3a96:6877:77b2:30d4:9082] x-ms-publictraffictype: Email x-ms-office365-filtering-ht: Tenant 
x-ms-office365-filtering-correlation-id: f9665446-f8c3-4527-5c9c-08d7e1224888 x-ms-traffictypediagnostic: MN2PR12MB3104:|MN2PR12MB3104: x-ms-exchange-transport-forked: True x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:10000; x-forefront-prvs: 0374433C81 x-forefront-antispam-report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:MN2PR12MB4518.namprd12.prod.outlook.com; PTR:; CAT:NONE; SFTY:; SFS:(10009020)(4636009)(346002)(366004)(396003)(136003)(39860400002)(376002)(54906003)(186003)(5660300002)(55016002)(2906002)(6636002)(6506007)(30864003)(66574012)(53546011)(8676002)(81156014)(8936002)(45080400002)(478600001)(9686003)(66556008)(33656002)(66946007)(316002)(86362001)(110136005)(76116006)(966005)(71200400001)(66476007)(7696005)(64756008)(52536014)(4326008)(66446008)(579004)(559001); DIR:OUT; SFP:1101; received-spf: None (protection.outlook.com: amd.com does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam: BCL:0; x-microsoft-antispam-message-info: a3SIx+WEI8iNzkU58UaTGOc+IBnSWQdKg0ZuWf0jZmUTOF4QLWulyxYEJ0qnN48UpYpEPD4ClZxZcOJxAt02yNOfW5r1YjhYBvmHDn6cEjmC5Ccj5GJbcRHlI3lq7fOsSJmBIuMYEC58CQPyJImkeAupr4j67/cYW+r80pb2ILAhJImcOkkqNBjWNPuzbGj4G3aVUpKlDP5iOpcPz65VedyAokr3p9Oy6lg3eNWdanHlo19fin5pwMcNEulsIEz1xr3Ae8a66VnY88k6BVwK8oNLQgC7sehTl7g30GJvbqfrVMD8FPvRbuAlU0AwtE1PHbQ4p8xaw4uiqCppmEoZPJ/eBp+uPDz69WgI7m8l/+/rD6abc2VDNgqef9Oj9VCsmXNhGBpWNboG+X3Of5rxKjIXF48L0GsNB0DOlHu0jIB5KZo5k3yw5Ic5DASgC0n1oeUh49EV7forsFiWbOzhxj9ACdpv4fd3W96HJAzWh/Q= x-ms-exchange-antispam-messagedata: fAdRlgAlCzDVRwT5AdW2ybLajKF07s5lwwregGmDir70ToDUIsob6IOLy8PKEXhfxJvjQom7FlTggtXMQI6O6wcPw+2KVjwse0Wu410ipv/Cu1jqVTnJFlZNfCCNEZ8a13jVyh7IqWyfPnGvdnksDyfpN9t9XxWe1+vGJboZEpfI1xkn06zyNpcvgTs4USn6sTs4wicbmAjT6sP3rKjfHw== MIME-Version: 1.0 X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-Network-Message-Id: f9665446-f8c3-4527-5c9c-08d7e1224888 X-MS-Exchange-CrossTenant-originalarrivaltime: 15 Apr 2020 09:49:25.6436 (UTC) 
X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-CrossTenant-userprincipalname: NGFOdDofOSMKXR3FcE2NK+/xF1g7D4vSmWQyt9soJThfLs2EkMpBpr36maARuC8a X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN2PR12MB3104 X-BeenThere: amd-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion list for AMD gfx List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "Russell, Kent" , "amd-gfx@lists.freedesktop.org" Content-Type: multipart/mixed; boundary="===============0659252215==" Errors-To: amd-gfx-bounces@lists.freedesktop.org Sender: "amd-gfx" --===============0659252215== Content-Language: en-US Content-Type: multipart/alternative; boundary="_000_MN2PR12MB4518963F186CF8528A620A7D85DB0MN2PR12MB4518namp_" --_000_MN2PR12MB4518963F186CF8528A620A7D85DB0MN2PR12MB4518namp_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

[AMD Public Use]

Hi Christian,

That could potentially be it. With additional testing, 2 of 3 Vega20 machines never hit an error over BAR access with the PTRACE test. 3 of 3 machines (from the same pool) always hit an error with CWSR.

To elaborate on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated mapped memory and 2 DWORDs outside that boundary (it's only about 4MB to the boundary). Then we POKE to swap the DWORD positions across the boundary. The RAS event on the single failing machine happens on the out-of-boundary PEEK.

Felix mentioned we don't hit errors over general HDP access, but that may not be true. The sys logs posted for an Arcturus failure (which wasn't tested by me) show that someone launched the rocm bandwidth test, hit a VM fault, and a RAS event ensued during evictions (I can point to the internal ticket or log snippet offline if interested).
Whether the RAS event is triggered by BAR access or is the result of HW instability is beyond me since I don't have access to the machine.

Thanks,

Jon

From: Koenig, Christian
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan; Kuehling, Felix; Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

Hi Jon,

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?

Thanks,
Christian.

On 14.04.20 at 22:30, Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

If we're passing the test on the revert, then the only thing that's different is we're not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access since the function is still called in ttm access_memory with BAR.

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

Thanks,

Jon

From: Kuehling, Felix
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan; Koenig, Christian; Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well.
For example, common operations like GART table updates, GPUVM page table updates, and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
Felix

On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

I think it's premature to push this revert.

With more testing, I'm getting failures from different tests or sometimes none at all on my machine.

Kent, let's continue the discussion on the original thread.

Thanks,

Jon

From: Koenig, Christian
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander
Cc: Russell, Kent; amd-gfx@lists.freedesktop.org; Kuehling, Felix; Kim, Jonathan
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

That's exactly my concern as well.

This looks a bit like the test creates erroneous data somehow, but there doesn't seem to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

Christian.

On 14.04.2020 at 16:35, "Deucher, Alexander" wrote:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.

Alex
________________________________
From: amd-gfx on behalf of Russell, Kent
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian; amd-gfx@lists.freedesktop.org
Cc: Kuehling, Felix; Kim, Jonathan
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging.
I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

Kent

> -----Original Message-----
> From: Christian König
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> On 13.04.20 at 20:20, Kent Russell wrote:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful
> > queues preemption Failed to evict process queues Failed to suspend
> > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid
> > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost
> > due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP
> > block failed -5
>
> Do you have more information on what's going wrong here since this is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell
> > Reviewed-by: Christian König
> >
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >  1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -    last = min(pos + size, adev->gmc.visible_vram_size);
> > -    if (last > pos) {
> > -        void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -        size_t count = last - pos;
> > -
> > -        if (write) {
> > -            memcpy_toio(addr, buf, count);
> > -            mb();
> > -            amdgpu_asic_flush_hdp(adev, NULL);
> > -        } else {
> > -            amdgpu_asic_invalidate_hdp(adev, NULL);
> > -            mb();
> > -            memcpy_fromio(buf, addr, count);
> > -        }
> > -
> > -        if (count == size)
> > -            return;
> > -
> > -        pos += count;
> > -        buf += count / 4;
> > -        size -= count;
> > -    }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >          uint32_t tmp = pos >> 31;
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.f=
reedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=3D02%7C01%7Calexande=
r.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e=
11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=3DttNOHJt0IwywpOIWahK=
jjuC6OkT1jxduc6iMzYzndpg%3D&reserved=3D0 --_000_MN2PR12MB4518963F186CF8528A620A7D85DB0MN2PR12MB4518namp_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

[AMD Public Use]

 

Hi Christian,

 

That could potentially be it.  With additional testing, 2 of 3 Vega20 machines never hit an error over BAR access with the PTRACE test, while 3 of 3 machines (from the same pool) always hit an error with CWSR.

To elaborate on the PTRACE test: we PEEK 2 DWORDs inside thunk-allocated mapped memory and 2 DWORDs outside that boundary (it’s only about 4MB to the boundary).  Then we POKE to swap the DWORD positions across the boundary.  The RAS event on the single failing machine happens on the out-of-boundary PEEK.
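
For readers who have not seen the test, the access pattern described above can be modelled roughly as follows. This is only a sketch under stated assumptions: `peek_dword`, `poke_dword` and `swap_across_boundary` are hypothetical helpers operating on an ordinary buffer, not the KFD test code and not real ptrace calls, and the exact swap order used by the test is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a PEEK / POKE of one DWORD. */
static uint32_t peek_dword(const uint32_t *mem, size_t idx)
{
    return mem[idx];
}

static void poke_dword(uint32_t *mem, size_t idx, uint32_t val)
{
    mem[idx] = val;
}

/*
 * PEEK two DWORDs on each side of `boundary` (a DWORD index into a buffer
 * of `n` DWORDs), then POKE them back with the two pairs exchanged.
 */
static void swap_across_boundary(uint32_t *mem, size_t n, size_t boundary)
{
    uint32_t inside[2], outside[2];

    assert(boundary >= 2 && boundary + 2 <= n);

    inside[0]  = peek_dword(mem, boundary - 2);
    inside[1]  = peek_dword(mem, boundary - 1);
    outside[0] = peek_dword(mem, boundary);      /* first out-of-boundary PEEK */
    outside[1] = peek_dword(mem, boundary + 1);

    poke_dword(mem, boundary - 2, outside[0]);
    poke_dword(mem, boundary - 1, outside[1]);
    poke_dword(mem, boundary,     inside[0]);
    poke_dword(mem, boundary + 1, inside[1]);
}
```

In the real test each PEEK and POKE goes through the kernel's VRAM access path, which is why the first access past the boundary is the interesting one.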

 

Felix mentioned we don’t hit errors over general HDP access, but that may not be true.  The sys logs posted for an Arcturus failure (which wasn’t tested by me) show that someone launched the rocm bandwidth test, hit a VM fault, and a RAS event ensued during evictions (I can point to the internal ticket or a log snippet offline if interested).  Whether the RAS event is triggered by BAR access or is the result of HW instability is beyond me, since I don’t have access to the machine.

 

Thanks,

 

Jon

 

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, April 15, 2020 4:11 AM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

Hi Jon,

> Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.


That sounds like the system/setup has a more general problem.

Could it be that we are seeing RAS errors because there really is some hardware failure, but with the MM path we don't trigger a RAS interrupt?
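
For context, the "MM path" is the loop that remains in amdgpu_device_vram_access after the revert. The quoted diff only shows its first lines (`uint32_t hi = ~0;`, the `for` header and `uint32_t tmp = pos >> 31;`), so the following is a user-space model under assumptions rather than the driver code: the `fake_gpu` struct, its `mm_index` / `mm_index_hi` fields and the bit-31 flag are hypothetical stand-ins for an index/data register pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_VRAM_DWORDS 64

/* Simulated device state; every name here is hypothetical. */
struct fake_gpu {
    uint32_t vram[FAKE_VRAM_DWORDS]; /* stand-in for device memory */
    uint32_t mm_index;               /* low 31 address bits, bit 31 set */
    uint32_t mm_index_hi;            /* address bits 31 and up */
};

/* Address currently selected by the index register pair. */
static uint64_t fake_addr(const struct fake_gpu *g)
{
    return ((uint64_t)g->mm_index_hi << 31) | (g->mm_index & 0x7fffffffu);
}

/* One DWORD per iteration: program the index, then touch the data port. */
static void fake_vram_access(struct fake_gpu *g, uint64_t pos,
                             uint32_t *buf, size_t size, bool write)
{
    uint32_t hi = ~0u;
    uint64_t last;

    for (last = pos + size; pos < last; pos += 4) {
        uint32_t tmp = (uint32_t)(pos >> 31);
        uint64_t idx;

        g->mm_index = (uint32_t)pos | 0x80000000u;
        if (tmp != hi) {             /* only rewrite the high part on change */
            g->mm_index_hi = tmp;
            hi = tmp;
        }

        idx = fake_addr(g) / 4;
        assert(idx < FAKE_VRAM_DWORDS);
        if (write)
            g->vram[idx] = *buf++;
        else
            *buf++ = g->vram[idx];
    }
}
```

The point of the model is only that this path moves data through register accesses one DWORD at a time, whereas the reverted code copied through the BAR mapping with memcpy_toio/memcpy_fromio.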

Thanks,
Christian.

On 14.04.20 at 22:30, Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

 

If we’re passing the test on the revert, then the only thing that’s different is we’re not invalidating HDP and doing a copy to host anymore in amdgpu_device_vram_access, since the function is still called in ttm access_memory with BAR.

 

Also cwsr tests fail on Vega20 with or without the revert with the same RAS error.

 

Thanks,

 

Jon

 

From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Tuesday, April 14, 2020 2:32 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

I wouldn't call it premature. Revert is a usual practice when there is a serious regression that isn't fully understood or root-caused. As far as I can tell, the problem has been reproduced on multiple systems, different GPUs, and clearly regressed to Christian's commit. I think that justifies reverting it for now.

I agree with Christian that a general HDP memory access problem causing RAS errors would potentially cause problems in other tests as well. For example, common operations like GART table updates, GPUVM page table updates, and PCIe peer2peer accesses in ROCm applications use HDP. But we're not seeing obvious problems from those. So we need to understand what's special about this test. I asked questions to that effect on our other email thread.

Regards,
  Felix

On 2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]

 

I think it’s premature to push this revert.

 

With more testing, I’m getting failures from different tests or sometimes none at all on my machine.

 

Kent, let’s continue the discussion on the original thread.

 

Thanks,

 

Jon

 

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Tuesday, April 14, 2020 10:47 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

 

That's exactly my concern as well.

 

This looks a bit like the test creates erroneous data somehow, but there doesn't seem to be a RAS check in the MM data path.

And now that we use the BAR path it goes up in flames.

I just don't see how we can create erroneous data in a test case?

 

Christian.

 

On 14.04.2020 at 16:35, "Deucher, Alexander" <Alexander.Deucher@amd.com> wrote:

[AMD Public Use]

If this causes an issue, any access to vram via the BAR could cause an issue.


Alex


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com>
Sent: Tuesday, April 14, 2020 10:19 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
Subject: RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"

[AMD Official Use Only - Internal Distribution Only]

On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.

 Kent

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, April 14, 2020 9:52 AM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in
> amdgpu_device_vram_access v2"
>
> On 13.04.20 at 20:20, Kent Russell wrote:
> > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.
> > The original patch causes a RAS event and subsequent kernel hard-hang
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
> > Arcturus
> >
> > dmesg output at hang time:
> > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
> > amdgpu 0000:67:00.0: GPU reset begin!
> > Evicting PASID 0x8000 queues
> > Started evicting pasid 0x8000
> > qcm fence wait loop timeout expired
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption
> > Failed to evict process queues
> > Failed to suspend process 0x8000
> > Finished evicting pasid 0x8000
> > Started restoring pasid 0x8000
> > Finished restoring pasid 0x8000
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
> > amdgpu: [powerplay] Failed to send message 0x26, response 0x0
> > amdgpu: [powerplay] Failed to set soft min gfxclk !
> > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
> > amdgpu: [powerplay] Failed to send message 0x7, response 0x0
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
> > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
> > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
>
> Do you have more information on what's going wrong here? This is a really
> important patch for KFD debugging.
>
> >
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------
> >  1 file changed, 26 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index cf5d6e585634..a3f997f84020 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >      uint32_t hi = ~0;
> >      uint64_t last;
> >
> > -
> > -#ifdef CONFIG_64BIT
> > -    last = min(pos + size, adev->gmc.visible_vram_size);
> > -    if (last > pos) {
> > -        void __iomem *addr = adev->mman.aper_base_kaddr + pos;
> > -        size_t count = last - pos;
> > -
> > -        if (write) {
> > -            memcpy_toio(addr, buf, count);
> > -            mb();
> > -            amdgpu_asic_flush_hdp(adev, NULL);
> > -        } else {
> > -            amdgpu_asic_invalidate_hdp(adev, NULL);
> > -            mb();
> > -            memcpy_fromio(buf, addr, count);
> > -        }
> > -
> > -        if (count == size)
> > -            return;
> > -
> > -        pos += count;
> > -        buf += count / 4;
> > -        size -= count;
> > -    }
> > -#endif
> > -
> >      spin_lock_irqsave(&adev->mmio_idx_lock, flags);
> >      for (last = pos + size; pos < last; pos += 4) {
> >          uint32_t tmp = pos >> 31;
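
The arithmetic of the removed hunk may be easier to follow in isolation. The sketch below mirrors only the range split (`last = min(pos + size, visible_vram_size)`, `count = last - pos`); `struct vram_split` and `split_vram_access` are hypothetical names introduced for illustration and do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, not part of the driver. */
struct vram_split {
    uint64_t bar_bytes;   /* bytes below visible_size, copied via the BAR */
    uint64_t mmio_bytes;  /* bytes left over for the register-based loop */
};

/*
 * Split an access of `size` bytes starting at `pos` at the end of
 * CPU-visible VRAM, the same way the removed hunk does.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_size)
{
    struct vram_split s = { 0, size };
    uint64_t end = pos + size;
    uint64_t last = end < visible_size ? end : visible_size;

    if (last > pos) {
        s.bar_bytes = last - pos;          /* "count" in the patch */
        s.mmio_bytes = size - s.bar_bytes; /* "size -= count" */
    }

    assert(s.bar_bytes + s.mmio_bytes == size);
    return s;
}
```

When `bar_bytes` equals `size` the removed code returned early; otherwise it advanced `pos` by `count` and `buf` by `count / 4` (which suggests `buf` points at 32-bit words) before falling through to the loop under `mmio_idx_lock`.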

 

 

--_000_MN2PR12MB4518963F186CF8528A620A7D85DB0MN2PR12MB4518namp_--

--===============0659252215==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

--===============0659252215==--