From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DA497C433E4 for ; Fri, 17 Jul 2020 19:04:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B67722074B for ; Fri, 17 Jul 2020 19:04:17 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ANuGBtdK" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728298AbgGQTEQ (ORCPT ); Fri, 17 Jul 2020 15:04:16 -0400 Received: from us-smtp-1.mimecast.com ([207.211.31.81]:52768 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727821AbgGQTEP (ORCPT ); Fri, 17 Jul 2020 15:04:15 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1595012654; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VKQ3UPhST3OvzdKUW3dae3IGuD+EOft4yvMH2GzGai4=; b=ANuGBtdKw2VZjmvzvIePyyMeK/+DxxVu7KPI4ecH/ygLCOUkB5XgfXa0y7Dr2O4UAeJYo9 qLAaXnQvqvqS/R/uYUlYHmsW61AdLc9mACN/UNtpVCmR13HeipREUa7BsivQDGpvL3BRUV /zPKCSG7G6nNSMy9jAc3I7el4butJD8= Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com [209.85.222.197]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-179-iH3o28nQMJWax7GB-1EH9g-1; Fri, 17 Jul 2020 15:04:13 -0400 X-MC-Unique: 
iH3o28nQMJWax7GB-1EH9g-1 Received: by mail-qk1-f197.google.com with SMTP id u186so6673736qka.4 for ; Fri, 17 Jul 2020 12:04:12 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:subject:from:reply-to:to:cc:date :in-reply-to:references:organization:user-agent:mime-version :content-transfer-encoding; bh=VKQ3UPhST3OvzdKUW3dae3IGuD+EOft4yvMH2GzGai4=; b=A5Imb77u+0+DWFYtinoNIEk7TDNtfQrqpl73H30PT1AfrgYEmO3HupdMQOrObxPbWy C/fQGW062gmqyltkIS8ayOD6F3VIbhm4nbhgJtlGI2u6uS1P4iKyZFHTWze55UH0baHm LiUEUbTkE1GHhVK+MmzX1RJFHuXgL5qU5rBIKj0TBc1oNZXl70y6L2QRMT/ttZ+HGK7E tHkq9Xp20E6tcP5xtEJxHB5mrkKZrxyAs6THPu3z92HmsldZ0ZqV4FLhiwDkbWndzMi8 455LmTaCeujFl1KeHAYdmYBU6MY6b2UZ+lv7IBlSCxzUVJZhiK0CFKBWguH8M0uPTKJs xdng== X-Gm-Message-State: AOAM531496tzPGc9A6aguRLTz87tzn5kJmnAlyCCnnh6szmR7eDz7mcF xdBbPsGpzUTKAg9z5K+bp9DMBe1oXzP8OIPfp5ZCFNJ6DcwciG57WcM3NLpdy6hLaPKbZPA7tS5 paQVht+gFY+oQFZjkXhN4zrDe X-Received: by 2002:a37:a046:: with SMTP id j67mr9892605qke.395.1595012652218; Fri, 17 Jul 2020 12:04:12 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwgtOiKIUn5ao5bCnaTTskTcrkgF6EP4gHfCCcCRRv0vg7ItdvSMNS0FlNgUH04ZXSmho073g== X-Received: by 2002:a37:a046:: with SMTP id j67mr9892576qke.395.1595012651933; Fri, 17 Jul 2020 12:04:11 -0700 (PDT) Received: from Whitewolf.lyude.net (pool-108-49-102-102.bstnma.fios.verizon.net. 
        [108.49.102.102]) by smtp.gmail.com with ESMTPSA id e9sm11311764qtq.70.2020.07.17.12.04.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 17 Jul 2020 12:04:11 -0700 (PDT)
Message-ID:
Subject: Re: nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"
From: Lyude Paul
Reply-To: lyude@redhat.com
To: Bjorn Helgaas, Karol Herbst
Cc: Linux PCI, Mika Westerberg, Ben Skeggs, Bjorn Helgaas, nouveau, dri-devel, Patrick Volkerding, linux-kernel@vger.kernel.org, Kai-Heng Feng, Sasha Levin
Date: Fri, 17 Jul 2020 15:04:10 -0400
In-Reply-To: <20200716235440.GA675421@bjorn-Precision-5520>
References: <20200716235440.GA675421@bjorn-Precision-5520>
Organization: Red Hat
Content-Type: text/plain; charset="UTF-8"
User-Agent: Evolution 3.36.3 (3.36.3-1.fc32)
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2020-07-16 at 18:54 -0500, Bjorn Helgaas wrote:
> [+cc Sasha -- stable kernel regression]
> [+cc Patrick, Kai-Heng, LKML]
>
> On Fri, Jul 17, 2020 at 12:10:39AM +0200, Karol Herbst wrote:
> > On Tue, Jul 7, 2020 at 9:30 PM Karol Herbst wrote:
> > > Hi everybody,
> > >
> > > with the mentioned commit Nouveau isn't able to load firmware onto the
> > > GPU on one of my systems here. Even though the issue doesn't always
> > > happen I am quite confident this is the commit breaking it.
> > >
> > > I am still digging into the issue and trying to figure out what
> > > exactly breaks, but it shows up in different ways. Either we are not
> > > able to boot the engines on the GPU or the GPU becomes unresponsive.
> > > Btw, this is also a system where our runtime power management issue
> > > shows up, so maybe there is indeed something funky with the bridge
> > > controller.
> > >
> > > Just pinging you in case you have an idea on how this could break Nouveau
> > >
> > > most of the times it shows up like this:
> > > nouveau 0000:01:00.0: acr: AHESASC binary failed
> > >
> > > Sometimes it works at boot and fails at runtime resuming with random
> > > faults. So I will be investigating a bit more, but yeah... I am super
> > > sure the commit triggered this issue, no idea if it actually causes
> > > it.
> >
> > so yeah.. I reverted that locally and never ran into issues again.
> > Still valid on latest 5.7. So can we get this reverted or properly
> > fixed? This breaks runtime pm for us on at least some hardware.
>
> Yeah, that stinks. We had another similar report from Patrick:
>
> https://lore.kernel.org/r/CAErSpo5sTeK_my1dEhWp7aHD0xOp87+oHYWkTjbL7ALgDbXo-Q@mail.gmail.com
>
> Apparently the problem is ec411e02b7a2 ("PCI/PM: Assume ports without
> DLL Link Active train links in 100 ms"), which Patrick found was
> backported to v5.4.49 as 828b192c57e8, and you found was backported to
> v5.7.6 as afaff825e3a4.
>
> Oddly, Patrick reported that v5.7.7 worked correctly, even though it
> still contains afaff825e3a4.
>
> I guess in the absence of any other clues we'll have to revert it.
> I hate to do that because that means we'll have slow resume of
> Thunderbolt-connected devices again, but that's better than having
> GPUs completely broken.
>
> Could you and Patrick open bugzilla.kernel.org reports, attach dmesg
> logs and "sudo lspci -vv" output, and add the URLs to Kai-Heng's
> original report at https://bugzilla.kernel.org/show_bug.cgi?id=206837
> and to this thread?
>
> There must be a way to fix the slow resume problem without breaking
> the GPUs.

Isn't it possible to tell whether a PCI device is connected through
thunderbolt or not? We could probably get away with just defaulting to
100ms for thunderbolt devices without DLL Link Active specified, and then
default to the old delay value for non-thunderbolt devices.
>
> Bjorn
>
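
The suggestion in the reply above (100 ms for Thunderbolt ports without DLL Link Active reporting, the old delay for everything else) can be sketched as a small decision function. This is only an illustrative user-space model, not kernel code: struct port_info, resume_delay_ms() and the legacy_delay_ms parameter are hypothetical names, and the old delay is supplied by the caller because the thread does not state its value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a downstream port.
 * This is NOT the kernel's struct pci_dev. */
struct port_info {
    bool link_active_reporting; /* port reports DLL Link Active state */
    bool thunderbolt_attached;  /* port is reached through Thunderbolt */
};

/* Short delay named in the commit subject ("train links in 100 ms"). */
#define SHORT_LINK_DELAY_MS 100

/*
 * Return how long to wait, in milliseconds, before accessing the device
 * below the port. legacy_delay_ms is the pre-regression delay; its value
 * is an input here, not something this sketch asserts.
 */
int resume_delay_ms(const struct port_info *port, int legacy_delay_ms)
{
    assert(port != NULL);
    assert(legacy_delay_ms >= 0);

    /* Ports that report link state are outside the proposal; returning
     * the short delay for them is a placeholder choice. */
    if (port->link_active_reporting)
        return SHORT_LINK_DELAY_MS;

    /* No link reporting: keep the fast path only for Thunderbolt. */
    if (port->thunderbolt_attached)
        return SHORT_LINK_DELAY_MS;

    /* Everything else falls back to the old, longer delay. */
    return legacy_delay_ms;
}
```

With this shape, a Thunderbolt-attached port without link reporting keeps the fast resume, while a port such as the one in front of the affected GPU gets whatever legacy value the caller passes in.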