From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7e9f45be-41a0-0764-8f4d-2787319477fb@amd.com>
Date: Thu, 5 May 2022 12:09:42 +0200
Subject: Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
From: Christian König
To: Andrey Grodzovsky, amd-gfx@lists.freedesktop.org
In-Reply-To: <20220504161841.24669-1-andrey.grodzovsky@amd.com>
References: <20220504161841.24669-1-andrey.grodzovsky@amd.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Cc: Bai Zoy, lijo.lazar@amd.com

On 04.05.22 18:18, Andrey Grodzovsky wrote:
> Problem:
> During a hive reset caused by a command timing out on a ring,
> extra resets are triggered by KFD, which is unable to access
> registers on the resetting ASIC.
>
> Fix: Rework GPU reset to use a list of pending reset jobs,
> such that the first reset job that actually resets the entire
> reset domain will cancel all the pending redundant resets.
>
> This is in line with what we already do for redundant TDRs
> in scheduler code.

Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.

Regards,
Christian.
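
For readers following the thread without opening the full diff below, the scheme the commit message describes boils down to something like the hypothetical, simplified sketch here (the demo_* names are illustrative stand-ins, not the actual amdgpu structures added by the patch): each queued reset work is remembered on a per-domain pending list, and the work item that finally performs the reset walks that list and cancels the now-redundant entries.

/*
 * Simplified, hypothetical sketch of the pending-reset-list idea.
 * dom->wq, dom->lock and dom->pending are assumed to be initialized
 * elsewhere (e.g. alloc_ordered_workqueue(), mutex_init(), INIT_LIST_HEAD()).
 */
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct demo_reset_work {                 /* illustrative, not amdgpu's type */
	struct delayed_work base;        /* what the workqueue actually runs */
	struct list_head node;           /* membership in the pending list */
};

struct demo_reset_domain {               /* illustrative, not amdgpu's type */
	struct workqueue_struct *wq;
	struct list_head pending;
	struct mutex lock;
};

/* Queue a reset request and remember it as pending. */
static bool demo_reset_schedule(struct demo_reset_domain *dom,
				struct demo_reset_work *w)
{
	mutex_lock(&dom->lock);
	if (!queue_delayed_work(dom->wq, &w->base, 0)) {
		mutex_unlock(&dom->lock);
		return false;            /* already queued, nothing to track */
	}
	list_add_tail(&w->node, &dom->pending);
	mutex_unlock(&dom->lock);
	return true;
}

/*
 * Called by the work item that actually resets the hardware: every other
 * reset still sitting on the queue behind it has become redundant.
 */
static void demo_reset_cancel_pending(struct demo_reset_domain *dom)
{
	struct demo_reset_work *entry, *tmp;

	mutex_lock(&dom->lock);
	list_for_each_entry_safe(entry, tmp, &dom->pending, node) {
		list_del_init(&entry->node);
		cancel_delayed_work(&entry->base);
	}
	mutex_unlock(&dom->lock);
}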
>
> Signed-off-by: Andrey Grodzovsky
> Tested-by: Bai Zoy
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 11 +---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 +++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  7 ++-
>  8 files changed, 104 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 4264abc5604d..99efd8317547 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -109,6 +109,7 @@
>  #include "amdgpu_fdinfo.h"
>  #include "amdgpu_mca.h"
>  #include "amdgpu_ras.h"
> +#include "amdgpu_reset.h"
>
>  #define MAX_GPU_INSTANCE 16
>
> @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
>          bool grbm_indexed;
>  };
>
> -enum amd_reset_method {
> -        AMD_RESET_METHOD_NONE = -1,
> -        AMD_RESET_METHOD_LEGACY = 0,
> -        AMD_RESET_METHOD_MODE0,
> -        AMD_RESET_METHOD_MODE1,
> -        AMD_RESET_METHOD_MODE2,
> -        AMD_RESET_METHOD_BACO,
> -        AMD_RESET_METHOD_PCI,
> -};
> -
>  struct amdgpu_video_codec_info {
>          u32 codec_type;
>          u32 max_width;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e582f1044c0f..7fa82269c30f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>          }
>
>          tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
> +
> +        /* Drop all pending resets since we will reset now anyway */
> +        tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
> +                                    reset_list);
> +        amdgpu_reset_pending_list(tmp_adev->reset_domain);
> +
>          /* Actual ASIC resets if needed.*/
>          /* Host driver will handle XGMI hive reset for SRIOV */
>          if (amdgpu_sriov_vf(adev)) {
> @@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>  }
>
>  struct amdgpu_recover_work_struct {
> -        struct work_struct base;
> +        struct amdgpu_reset_work_struct base;
>          struct amdgpu_device *adev;
>          struct amdgpu_job *job;
>          int ret;
> @@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
>
>  static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>  {
> -        struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> +        struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.base.work);
>
>          recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>  }
> @@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  {
>          struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
>
> -        INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> +        INIT_DELAYED_WORK(&work.base.base, amdgpu_device_queue_gpu_recover_work);
> +        INIT_LIST_HEAD(&work.base.node);
>
>          if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
>                  return -EAGAIN;
>
> -        flush_work(&work.base);
> +        flush_delayed_work(&work.base.base);
> +
> +        amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &work.base);
>
>          return work.ret;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index c80af0889773..ffddd419c351 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -134,6 +134,9 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
>          atomic_set(&reset_domain->in_gpu_reset, 0);
>          init_rwsem(&reset_domain->sem);
>
> +        INIT_LIST_HEAD(&reset_domain->pending_works);
> +        mutex_init(&reset_domain->reset_lock);
> +
>          return reset_domain;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 1949dbe28a86..863ec5720fc1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -24,7 +24,18 @@
>  #ifndef __AMDGPU_RESET_H__
>  #define __AMDGPU_RESET_H__
>
> -#include "amdgpu.h"
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +struct amdgpu_device;
> +struct amdgpu_job;
> +struct amdgpu_hive_info;
> +
>
>  enum AMDGPU_RESET_FLAGS {
>
> @@ -32,6 +43,17 @@ enum AMDGPU_RESET_FLAGS {
>          AMDGPU_SKIP_HW_RESET = 1,
>  };
>
> +
> +enum amd_reset_method {
> +        AMD_RESET_METHOD_NONE = -1,
> +        AMD_RESET_METHOD_LEGACY = 0,
> +        AMD_RESET_METHOD_MODE0,
> +        AMD_RESET_METHOD_MODE1,
> +        AMD_RESET_METHOD_MODE2,
> +        AMD_RESET_METHOD_BACO,
> +        AMD_RESET_METHOD_PCI,
> +};
> +
>  struct amdgpu_reset_context {
>          enum amd_reset_method method;
>          struct amdgpu_device *reset_req_dev;
> @@ -40,6 +62,8 @@ struct amdgpu_reset_context {
>          unsigned long flags;
>  };
>
> +struct amdgpu_reset_control;
> +
>  struct amdgpu_reset_handler {
>          enum amd_reset_method reset_method;
>          struct list_head handler_list;
> @@ -76,12 +100,21 @@ enum amdgpu_reset_domain_type {
>          XGMI_HIVE
>  };
>
> +
> +struct amdgpu_reset_work_struct {
> +        struct delayed_work base;
> +        struct list_head node;
> +};
> +
>  struct amdgpu_reset_domain {
>          struct kref refcount;
>          struct workqueue_struct *wq;
>          enum amdgpu_reset_domain_type type;
>          struct rw_semaphore sem;
>          atomic_t in_gpu_reset;
> +
> +        struct list_head pending_works;
> +        struct mutex reset_lock;
>  };
>
>
> @@ -113,9 +146,43 @@ static inline void amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain *dom
>  }
>
>  static inline bool amdgpu_reset_domain_schedule(struct amdgpu_reset_domain *domain,
> -                                                struct work_struct *work)
> +                                                struct amdgpu_reset_work_struct *work)
>  {
> -        return queue_work(domain->wq, work);
> +        mutex_lock(&domain->reset_lock);
> +
> +        if (!queue_delayed_work(domain->wq, &work->base, 0)) {
> +                mutex_unlock(&domain->reset_lock);
> +                return false;
> +        }
> +
> +        list_add_tail(&work->node, &domain->pending_works);
> +        mutex_unlock(&domain->reset_lock);
> +
> +        return true;
> +}
> +
> +static inline void amdgpu_reset_domain_del_pendning_work(struct amdgpu_reset_domain *domain,
> +                                                         struct amdgpu_reset_work_struct *work)
> +{
> +        mutex_lock(&domain->reset_lock);
> +        list_del_init(&work->node);
> +        mutex_unlock(&domain->reset_lock);
> +}
> +
> +static inline void amdgpu_reset_pending_list(struct amdgpu_reset_domain *domain)
> +{
> +        struct amdgpu_reset_work_struct *entry, *tmp;
> +
> +        mutex_lock(&domain->reset_lock);
> +        list_for_each_entry_safe(entry, tmp, &domain->pending_works, node) {
> +
> +                list_del_init(&entry->node);
> +
> +                /* Stop any other related pending resets */
> +                cancel_delayed_work(&entry->base);
> +        }
> +
> +        mutex_unlock(&domain->reset_lock);
>  }
>
>  void amdgpu_device_lock_reset_domain(struct amdgpu_reset_domain *reset_domain);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index 239f232f9c02..574e870d3064 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -25,6 +25,7 @@
>  #define AMDGPU_VIRT_H
>
>  #include "amdgv_sriovmsg.h"
> +#include "amdgpu_reset.h"
>
>  #define AMDGPU_SRIOV_CAPS_SRIOV_VBIOS (1 << 0) /* vBIOS is sr-iov ready */
>  #define AMDGPU_SRIOV_CAPS_ENABLE_IOV (1 << 1) /* sr-iov is enabled on this GPU */
> @@ -230,7 +231,7 @@ struct amdgpu_virt {
>          uint32_t reg_val_offs;
>          struct amdgpu_irq_src ack_irq;
>          struct amdgpu_irq_src rcv_irq;
> -        struct work_struct flr_work;
> +        struct amdgpu_reset_work_struct flr_work;
>          struct amdgpu_mm_table mm_table;
>          const struct amdgpu_virt_ops *ops;
>          struct amdgpu_vf_error_buffer vf_errors;
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index b81acf59870c..f3d1c2be9292 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -251,7 +251,7 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
>
>  static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>  {
> -        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.base.work);
>          struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>          int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>
> @@ -380,7 +380,8 @@ int xgpu_ai_mailbox_get_irq(struct amdgpu_device *adev)
>                  return r;
>          }
>
> -        INIT_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
> +        INIT_DELAYED_WORK(&adev->virt.flr_work.base, xgpu_ai_mailbox_flr_work);
> +        INIT_LIST_HEAD(&adev->virt.flr_work.node);
>
>          return 0;
>  }
> @@ -389,6 +390,8 @@ void xgpu_ai_mailbox_put_irq(struct amdgpu_device *adev)
>  {
>          amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>          amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
> +
> +        amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &adev->virt.flr_work);
>  }
>
>  static int xgpu_ai_request_init_data(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index 22c10b97ea81..927b3d5bb1d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -275,7 +275,7 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
>
>  static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>  {
> -        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.base.work);
>          struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>          int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>
> @@ -407,7 +407,8 @@ int xgpu_nv_mailbox_get_irq(struct amdgpu_device *adev)
>                  return r;
>          }
>
> -        INIT_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
> +        INIT_DELAYED_WORK(&adev->virt.flr_work.base, xgpu_nv_mailbox_flr_work);
> +        INIT_LIST_HEAD(&adev->virt.flr_work.node);
>
>          return 0;
>  }
> @@ -416,6 +417,8 @@ void xgpu_nv_mailbox_put_irq(struct amdgpu_device *adev)
>  {
>          amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>          amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
> +
> +        amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &adev->virt.flr_work);
>  }
>
>  const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 7b63d30b9b79..1d4ef5c70730 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -512,7 +512,7 @@ static int xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device *adev,
>
>  static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>  {
> -        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +        struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.base.work);
>          struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>
>          /* wait until RCV_MSG become 3 */
> @@ -610,7 +610,8 @@ int xgpu_vi_mailbox_get_irq(struct amdgpu_device *adev)
>                  return r;
>          }
>
> -        INIT_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
> +        INIT_DELAYED_WORK(&adev->virt.flr_work.base, xgpu_vi_mailbox_flr_work);
> +        INIT_LIST_HEAD(&adev->virt.flr_work.node);
>
>          return 0;
>  }
> @@ -619,6 +620,8 @@ void xgpu_vi_mailbox_put_irq(struct amdgpu_device *adev)
>  {
>          amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>          amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
> +
> +        amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &adev->virt.flr_work);
>  }
>
>  const struct amdgpu_virt_ops xgpu_vi_virt_ops = {