From: Alex Deucher
Date: Wed, 29 Nov 2023 13:52:26 -0500
Subject: Re: Radeon regression in 6.6 kernel
To: Luben Tuikov
Cc: Phillip Susi, Linux regressions mailing list, Christian König, linux-kernel@vger.kernel.org, "amd-gfx@lists.freedesktop.org", dri-devel@lists.freedesktop.org, Alex Deucher, Christian König, Danilo Krummrich
X-Mailing-List: regressions@lists.linux.dev
References: <87edgv4x3i.fsf@vps.thesusis.net> <559d0fa5-953a-4a97-b03b-5eb1287c83d8@leemhuis.info> <96e2e13c-f01c-4baf-a9a3-cbaa48fb10c7@amd.com> <87jzq2ixtm.fsf@vps.thesusis.net> <95fe9b5b-05ce-4462-9973-9aca306bc44f@gmail.com> <9595b8bf-e64d-4926-9263-97e18bcd7d05@gmail.com>
In-Reply-To: <9595b8bf-e64d-4926-9263-97e18bcd7d05@gmail.com>

On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote:
>
> On 2023-11-29 10:22, Alex Deucher wrote:
> > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher wrote:
> >>
> >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote:
> >>>
> >>> On 2023-11-28 17:13, Alex Deucher wrote:
> >>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote:
> >>>>>
> >>>>> Alex Deucher writes:
> >>>>>
> >>>>>>> In that case those are the already known problems with the scheduler
> >>>>>>> changes, aren't they?
> >>>>>>
> >>>>>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm
> >>>>>> misunderstanding what the original report was actually testing. If it
> >>>>>> was 6.7, then try reverting:
> >>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
> >>>>>> b70438004a14f4d0f9890b3297cd66248728546c
> >>>>>
> >>>>> At some point it was suggested that I file a gitlab issue, but I took
> >>>>> this to mean it was already known and being worked on. -rc3 came out
> >>>>> today and still has the problem. Is there a known issue I could track?
> >>>>>
> >>>>
> >>>> At this point, unless there are any objections, I think we should just
> >>>> revert the two patches
> >>> Uhm, no.
> >>>
> >>> Why "the two" patches?
> >>>
> >>> This email, part of this thread,
> >>>
> >>> https://lore.kernel.org/all/87r0kircdo.fsf@vps.thesusis.net/
> >>>
> >>> clearly states that reverting *only* this commit,
> >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues
> >>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't really change
> >>> anything operational, other than using an allocated array, instead of a static one, in DRM,
> >>> while the 2nd patch is solely contained within the amdgpu driver code.)
> >>>
> >>> Leaving us with only this change,
> >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
> >>> to be at fault, as the kernel log attached in the linked email above shows.
> >>>
> >>> The conclusion is that only b70438004a14f4 needs reverting.
> >>
> >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4,
> >> 56e449603f0ac5 breaks amdgpu.
> >
> > We can try and re-enable it in the next kernel. I'm just not sure
> > we'll be able to fix this in time for 6.7 with the holidays and all
> > and I don't want to cause a lot of scheduler churn at the end of the
> > 6.7 cycle if we hold off and try and fix it. Reverting seems like the
> > best short term solution.
>
> A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
> the opportunity for a 1-to-1 relationship between an entity and a scheduler.
> (Should've always been the case, from the outset. Not sure why it was coded as
> a fixed-size array.)
>
> Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem
> is wholly contained in amdgpu, and no other driver has this problem, there is
> no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
> an init bug in amdgpu. See the response I just sent in @this thread:
> https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba43e2@gmail.com
>
> And it's not like this issue is unknown. I first posted about it on 2023-10-16.
>
> Ideally, amdgpu would just fix their init code.

You can't make changes to core code that break other drivers.
Arguably 56e449603f0ac5 should not have gone in in the first place if
it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init
code, but as a side effect it seems to have broken suspend for some
users.

Alex