Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
To: Christian König, JingWen Chen, Christian König, "Liu, Monk",
 "Chen, JingWen", "Deng, Emily", dri-devel@lists.freedesktop.org,
 amd-gfx@lists.freedesktop.org, "Chen, Horace"
References: <20211222220506.789133-1-andrey.grodzovsky@amd.com>
 <20211222221400.790842-1-andrey.grodzovsky@amd.com>
 <20211222221400.790842-4-andrey.grodzovsky@amd.com>
 <9125ac3a-e578-6b34-1533-7622ec0274f1@amd.com>
 <2dee6f65-9ca9-a332-7206-f24021fb4c44@gmail.com>
 <20363a4e-b282-232d-34d0-14867bad4931@amd.com>
 <23bebf13-c622-7c61-af88-0e0970b90389@amd.com>
 <821c0b66-8c9c-9dff-a328-bfbc2233d4ef@gmail.com>
From: Andrey Grodzovsky
Date: Wed, 5 Jan 2022 13:24:42 -0500
In-Reply-To: <821c0b66-8c9c-9dff-a328-bfbc2233d4ef@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed

On 2022-01-05 2:59 a.m., Christian König wrote:
> On 05.01.22 at 08:34, JingWen Chen wrote:
>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source
>>>>>>> of signaling the need for a reset, similar to each job timeout
>>>>>>> on each queue. Otherwise you have a race condition between the
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF
>>>>> FLR is about to start or was already executed, but host will do
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>> Then we have a major design issue in the SRIOV protocol and really
>>>> need to question this.
>>>>
>>>> How do you want to prevent a race between the hypervisor resetting
>>>> the hardware and the client trying the same because of a timeout?
>>>>
>>>> As far as I can see the procedure should be:
>>>> 1. We detect that a reset is necessary, either because of a fault,
>>>> a timeout, or a signal from the hypervisor.
>>>> 2. For each of those potential reset sources a work item is sent to
>>>> the single workqueue.
>>>> 3. One of those work items executes first and prepares the reset.
>>>> 4. We either do the reset ourselves or notify the hypervisor that
>>>> we are ready for the reset.
>>>> 5. Cleanup after the reset, eventually resubmit jobs etc.
>>>> 6. Cancel work items which might have been scheduled from other
>>>> reset sources.
>>>>
>>>> It does make sense that the hypervisor resets the hardware without
>>>> waiting for the clients for too long, but if we don't follow these
>>>> general steps we will always have a race between the different
>>>> components.
>>>
>>> Monk, just to add to this - if indeed, as you say, 'FLR from
>>> hypervisor is just to notify guest the hw VF FLR is about to start
>>> or was already executed, but host will do FLR anyway without waiting
>>> for guest too long', and there is no strict waiting by the
>>> hypervisor for IDH_READY_TO_RESET to be received from the guest
>>> before starting the reset, then setting in_gpu_reset and locking
>>> reset_sem on the guest side is not really foolproof protection
>>> against MMIO accesses by the guest - it only truly helps if the
>>> hypervisor waits for that message before initiating the HW reset.
>>>
>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and
>> never has the chance to send the response back, then the other VFs
>> will have to wait for its reset and will all hang in this case. Or
>> sometimes the mailbox has some delay and the other VFs will also
>> wait. The users of the other VFs would be affected in this case.
>
> Yeah, agree completely with JingWen. The hypervisor is the one in
> charge here, not the guest.
>
> What the hypervisor should do (and it already seems to be designed
> that way) is to send the guest a message that a reset is about to
> happen and give it some time to respond appropriately.
>
> The guest on the other hand then tells the hypervisor that all
> processing has stopped and it is ready to restart. If that doesn't
> happen in time the hypervisor should eliminate the guest and probably
> trigger even more severe consequences, e.g. restart the whole VM etc...
>
> Christian.

So what's the end conclusion here regarding dropping this particular
patch? Seems to me we still need to drop it to prevent the driver's
MMIO accesses to the GPU during reset from various places in the code.
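
To make sure we mean the same thing by "single workqueue" serialization,
below is a very rough sketch of how I read Christian's steps 1-6 above.
Illustrative only - reset_domain_sketch and the prepare_reset() /
do_reset_or_notify_hypervisor() / post_reset_cleanup() hooks are
placeholder names, not the actual amdgpu implementation:

#include <linux/kernel.h>
#include <linux/workqueue.h>

struct reset_domain_sketch {
	struct workqueue_struct *wq;     /* ordered => at most one item runs */
	struct work_struct timeout_work; /* queued on job timeout            */
	struct work_struct flr_work;     /* queued on hypervisor FLR message */
};

/* Placeholder hooks standing in for the real reset steps. */
static void prepare_reset(struct reset_domain_sketch *d) { }
static void do_reset_or_notify_hypervisor(struct reset_domain_sketch *d) { }
static void post_reset_cleanup(struct reset_domain_sketch *d) { }

static void handle_reset(struct reset_domain_sketch *d)
{
	/* Step 3: whoever runs first on the ordered queue owns the reset. */
	prepare_reset(d);
	/* Step 4: reset ourselves or tell the host we are ready for VF FLR. */
	do_reset_or_notify_hypervisor(d);
	/* Step 5: cleanup, resubmit jobs, etc. */
	post_reset_cleanup(d);
	/* Step 6: drop items queued by the other, now redundant, sources.
	 * cancel_work() only removes pending work, so no self-deadlock. */
	cancel_work(&d->timeout_work);
	cancel_work(&d->flr_work);
}

static void timeout_work_func(struct work_struct *w)
{
	handle_reset(container_of(w, struct reset_domain_sketch, timeout_work));
}

static void flr_work_func(struct work_struct *w)
{
	handle_reset(container_of(w, struct reset_domain_sketch, flr_work));
}

/* Steps 1-2: each reset source just queues its own item; the ordered
 * workqueue guarantees they never prepare the reset concurrently.
 * (Error handling for the allocation is elided for brevity.) */
static void init_reset_domain_sketch(struct reset_domain_sketch *d)
{
	d->wq = alloc_ordered_workqueue("reset_domain_sketch", 0);
	INIT_WORK(&d->timeout_work, timeout_work_func);
	INIT_WORK(&d->flr_work, flr_work_func);
}

static void signal_job_timeout(struct reset_domain_sketch *d)
{
	queue_work(d->wq, &d->timeout_work);
}

static void signal_hypervisor_flr(struct reset_domain_sketch *d)
{
	queue_work(d->wq, &d->flr_work);
}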

Andrey

>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source
>>>>>>> of signaling the need for a reset, similar to each job timeout
>>>>>>> on each queue. Otherwise you have a race condition between the
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF
>>>>> FLR is about to start or was already executed, but host will do
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset
>>>>>>> implementation is severely broken and what Andrey is doing is
>>>>>>> actually fixing it.
>>>>> It makes the code crash ... how could it be a fix?
>>>>>
>>>>> I'm afraid the patch is a NAK from me, but it is welcome if the
>>>>> cleanup does not ruin the logic; Andrey or Jingwen can try it if
>>>>> needed.
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian
>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>> To: Chen, JingWen; Christian König; Grodzovsky, Andrey; Deng, Emily;
>>>>> Liu, Monk; dri-devel@lists.freedesktop.org;
>>>>> amd-gfx@lists.freedesktop.org; Chen, Horace; Chen, JingWen
>>>>> Cc: daniel@ffwll.ch
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>> protection for SRIOV
>>>>>
>>>>> Hi Jingwen,
>>>>>
>>>>> well what I mean is that we need to adjust the implementation in
>>>>> amdgpu to actually match the requirements.
>>>>>
>>>>> Could be that the reset sequence is questionable in general, but I
>>>>> doubt so, at least for now.
>>>>>
>>>>> See, the FLR request from the hypervisor is just another source of
>>>>> signaling the need for a reset, similar to each job timeout on
>>>>> each queue. Otherwise you have a race condition between the
>>>>> hypervisor and the scheduler.
>>>>>
>>>>> Properly setting in_gpu_reset is indeed mandatory, but it should
>>>>> happen at a central place and not in the SRIOV-specific code.
>>>>>
>>>>> In other words I strongly think that the current SRIOV reset
>>>>> implementation is severely broken and what Andrey is doing is
>>>>> actually fixing it.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>> Hi Christian,
>>>>>> I'm not sure what you mean by "we need to change SRIOV, not the
>>>>>> driver".
>>>>>>
>>>>>> Do you mean we should change the reset sequence in SRIOV? This
>>>>>> would be a huge change for our SRIOV solution.
>>>>>>
>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock,
>>>>>> since no one will conflict with this thread once reset_domain is
>>>>>> introduced. But we do need the reset_sem and adev->in_gpu_reset to
>>>>>> keep the device untouched by user space.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jingwen Chen
>>>>>>
>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>> Please don't. This patch is vital to the cleanup of the reset
>>>>>>> procedure.
>>>>>>>
>>>>>>> If SRIOV doesn't work with that, we need to change SRIOV and not
>>>>>>> the driver.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hangs
>>>>>>>>> first and does the FLR, the guest-side thread may not know this
>>>>>>>>> and still try to access HW (e.g. kfd uses amdgpu_in_reset and
>>>>>>>>> reset_sem a lot to identify the reset status), and this may
>>>>>>>>> lead to very bad results.
>>>>>>>>>
>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>> These patches look good to me. JingWen will pull these
>>>>>>>>>> patches, do some basic TDR tests in an SRIOV environment, and
>>>>>>>>>> give feedback.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Liu, Monk
>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>> To: Koenig, Christian; Grodzovsky, Andrey;
>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>>>>>>>>>>> Chen, Horace; Chen, JingWen; Deng, Emily
>>>>>>>>>>> Cc: daniel@ffwll.ch
>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>>>>> reset protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>
>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian
>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>> To: Grodzovsky, Andrey; dri-devel@lists.freedesktop.org;
>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Cc: daniel@ffwll.ch; Liu, Monk; Chen, Horace
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>>>>> reset protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>> Since now flr work is serialized against GPU resets there
>>>>>>>>>>>> is no need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky
>>>>>>>>>>> Acked-by: Christian König
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>  2 files changed, 22 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>      int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>>      amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>      xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>      } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>      /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>      if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>          && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>      int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>>      amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>      xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>      } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>      /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>      if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>          && (!amdgpu_device_has_job_running(adev) ||
>
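
P.S. For anyone skimming the thread, here is a condensed, illustrative
sketch of the guest-side FLR handshake being discussed. All names below
(vf_sketch, send_msg(), poll_msg(), FLR_POLL_TIMEOUT_MS, ...) are made-up
placeholders, not the real mxgpu_ai/nv mailbox API; the actual code lives
in xgpu_{ai,nv}_mailbox_flr_work() shown in the diff above:

#include <linux/delay.h>
#include <linux/types.h>

#define FLR_POLL_TIMEOUT_MS 5000

enum vf_msg { READY_TO_RESET, FLR_NOTIFICATION_CMPL };

struct vf_sketch { int dummy; /* stand-in for struct amdgpu_device */ };

/* Placeholder helpers; the real driver talks through the PF mailbox. */
static void stop_data_exchange(struct vf_sketch *vf) { }
static void send_msg(struct vf_sketch *vf, enum vf_msg m) { }
static bool poll_msg(struct vf_sketch *vf, enum vf_msg m) { return true; }
static void trigger_gpu_recovery(struct vf_sketch *vf) { }

static void flr_work_sketch(struct vf_sketch *vf)
{
	int timeout = FLR_POLL_TIMEOUT_MS;

	/* The host announced a VF FLR; stop guest<->host data exchange. */
	stop_data_exchange(vf);

	/* Tell the host we are ready.  The host resets the VF after its
	 * own grace period even if this message never arrives, which is
	 * why guest-side reset_sem/in_gpu_reset alone cannot fence off
	 * MMIO during the reset. */
	send_msg(vf, READY_TO_RESET);

	/* Wait (bounded) for the host to report FLR completion. */
	do {
		if (poll_msg(vf, FLR_NOTIFICATION_CMPL))
			break;
		msleep(10);
		timeout -= 10;
	} while (timeout > 1);

	/* Hand off to the common recovery path, which - with the reset
	 * domain from this series - serializes FLR handling against
	 * job-timeout triggered resets. */
	trigger_gpu_recovery(vf);
}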