From: "Alex G."
Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
To: Bjorn Helgaas, Alex_Gagniuc@Dellteam.com
Cc: bhelgaas@google.com, Austin.Bolen@dell.com, Shyam.Iyer@dell.com,
    keith.busch@intel.com, linux-pci@vger.kernel.org,
    linux-kernel@vger.kernel.org, jeffrey.t.kirsher@intel.com,
    ariel.elior@cavium.com, michael.chan@broadcom.com,
    ganeshgr@chelsio.com, tariqt@mellanox.com,
    jakub.kicinski@netronome.com, talgi@mellanox.com,
    airlied@gmail.com, alexander.deucher@amd.com, Mike Marciniszyn
References: <20180604155523.14906-1-mr.nuke.me@gmail.com>
 <20180716211706.GB12391@bhelgaas-glaptop.roam.corp.google.com>
 <97a70a71e1034bafbcabc6c4e23577c0@ausx13mps321.AMER.DELL.COM>
 <20180718215359.GG128988@bhelgaas-glaptop.roam.corp.google.com>
Message-ID: <8baf16fb-7e7c-ec59-19ef-a709e164dc94@gmail.com>
Date: Thu, 19 Jul 2018 10:46:43 -0500
In-Reply-To: <20180718215359.GG128988@bhelgaas-glaptop.roam.corp.google.com>

On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:
> [+cc Mike (hfi1)]
>
> On Mon, Jul 16, 2018 at 10:28:35PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
>>>> ...
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's
>>>> not a perfect solution, but it works extremely well in most cases.
>>>
>>> This is an interesting idea. I have two concerns:
>>>
>>> Some drivers already do this on their own, and we probably don't want
>>> duplicate output for those devices. In most cases (ixgbe and mlx* are
>>> exceptions), the drivers do this unconditionally so we *could* remove
>>> it from the driver if we add it to the core. The dmesg order would
>>> change, and the message wouldn't be associated with the driver as it
>>> now is.
>>
>> Oh, there are only 8 users of that. Even I could patch up the drivers
>> to remove the call, assuming we reach agreement about this change.
>>
>>> Also, I think some of the GPU devices might come up at a lower speed,
>>> then download firmware, then reset the device so it comes up at a
>>> higher speed. I think this patch will make us complain about the low
>>> initial speed, which might confuse users.
>>
>> I spoke to one of the PCIe spec writers. It's allowable for a device
>> to downtrain speed or width. It would also be extremely dumb to
>> downtrain with the intent to re-train at a higher speed later, but
>> it's possible devices do dumb stuff like that. That's why it's an
>> informational message, instead of a warning.
>
> FWIW, here's some of the discussion related to hfi1 from [1]:
>
> > Btw, why is the driver configuring the PCIe link speed? Isn't
> > this something we should be handling in the PCI core?
>
> The device comes out of reset at the 5GT/s speed. The driver
> downloads device firmware, programs PCIe registers, and co-ordinates
> the transition to 8GT/s.
>
> This recipe is device specific and is therefore implemented in the
> hfi1 driver built on top of PCI core functions and macros.
>
> Also several DRM drivers seem to do this (see si_pcie_gen3_enable());
> from [2]:
>
> My understanding was that some platforms only bring up the link in
> gen 1 mode for compatibility reasons.
>
> [1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
> [2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

Downtraining a link "for compatibility reasons" is one of those dumb
things that devices do. I'm SURPRISED AMD HW does it, although it is
perfectly permissible by the PCIe spec.

>> Another case: Some devices (lower-end GPUs) use silicon (and
>> marketing) that advertises x16, but they're only routed for x8. I'm
>> okay with seeing an informational message in this case. In fact, I
>> didn't know that the Quadro card I'd been using for three years is
>> only wired for x8 until I was testing this patch.
>
> Yeah, it's probably OK. I don't want bug reports from people who
> think something's broken when it's really just a hardware limitation
> of their system. But hopefully the message is not alarming.

It looks fairly innocent:

[    0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth,
limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with
8 GT/s x2 link)

>>> So I'm not sure whether it's better to do this in the core for all
>>> devices, or if we should just add it to the high-performance drivers
>>> that really care.
>>
>> You're thinking "do I really need that bandwidth" because I'm using a
>> function called "_bandwidth_". The point of the change is very far
>> from that: it is to help in system troubleshooting by detecting
>> downtraining conditions.
>
> I'm not sure what you think I'm thinking :) My question is whether
> it's worthwhile to print this extra information for *every* PCIe
> device, given that your use case is the tiny percentage of broken
> systems.

I think this information is a lot more useful than a bunch of other
info that's printed. Is "type 00 class 0x088000" more valuable? What
about "reg 0x20: [mem 0x9d950000-0x9d95ffff 64bit pref]", which is also
available under /proc/iomem for those curious?

> If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
> when the device is capable of more than it's getting, that would make
> a lot of sense to me. The normal case line is more questionable. I
> think the reason that's there is because the network drivers are very
> performance sensitive and like to see that info all the time.

I agree that can be an acceptable compromise.

> Maybe we need something like this:
>
>     pcie_print_link_status(struct pci_dev *dev, int verbose)
>     {
>         ...
>         if (bw_avail >= bw_cap) {
>             if (verbose)
>                 pci_info(dev, "... available PCIe bandwidth ...");
>         } else
>             pci_info(dev, "... available PCIe bandwidth, limited by ...");
>     }
>
> So the core could print only the potential problems with:
>
>     pcie_print_link_status(dev, 0);
>
> and drivers that really care even if there's no problem could do:
>
>     pcie_print_link_status(dev, 1);

Sounds good. I'll try to push out an updated patch early next week. (A
rough sketch of that two-level shape is at the end of this mail.)

>>>> Signed-off-by: Alexandru Gagniuc
>> [snip]
>>>> +	/* Look from the device up to avoid downstream ports with no devices. */
>>>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>>> +		return;
>>>
>>> Do we care about Upstream Ports here?
>>
>> YES! Switches. e.g. an x16 switch with 4x downstream ports could
>> downtrain at 8x and 4x, and we'd never catch it.
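To spell out why Upstream Ports matter, here is the quoted check again
as a self-contained sketch, with the reasoning in comments. This is
only an illustration: pcie_check_downtraining() is a made-up name, not
one from the patch, while pci_pcie_type(), the PCI_EXP_TYPE_* constants
and the one-argument pcie_print_link_status() are the existing
<linux/pci.h> interfaces:

    #include <linux/pci.h>

    static void pcie_check_downtraining(struct pci_dev *dev)
    {
            int type = pci_pcie_type(dev);

            /*
             * Look from the device side of each link up: Endpoints,
             * Legacy Endpoints, and Switch Upstream Ports. Including
             * Upstream Ports catches a switch whose *own* link came up
             * degraded (e.g. an x16-capable switch trained at x8 or
             * x4). The endpoints below such a switch would never
             * report it, because their links to the switch's
             * Downstream Ports can still train at full width.
             */
            if (type != PCI_EXP_TYPE_ENDPOINT &&
                type != PCI_EXP_TYPE_LEG_END &&
                type != PCI_EXP_TYPE_UPSTREAM)
                    return;

            pcie_print_link_status(dev);
    }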
>
> OK, I think I see your point: if the upstream port *could* do 16x but
> only trains to 4x, and two endpoints below it are both capable of 4x,
> the endpoints *think* they're happy but in fact they have to share 4x
> when they could use more.
>
> Bjorn
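For concreteness, the two-level print function proposed above might
look roughly like this inside drivers/pci/pci.c. This is a sketch of
the suggestion, not the actual implementation: the "verbose" parameter
does not exist in the current kernel API, while pcie_bandwidth_capable(),
pcie_bandwidth_available() and PCIE_SPEED2STR() are existing pci.c/pci.h
internals, and the message format mirrors the dmesg line quoted earlier:

    /* Sketch for drivers/pci/pci.c; "verbose" is the proposed addition. */
    void pcie_print_link_status(struct pci_dev *dev, int verbose)
    {
            enum pcie_link_width width, width_cap;
            enum pci_bus_speed speed, speed_cap;
            struct pci_dev *limiting_dev = NULL;
            u32 bw_avail, bw_cap;

            bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
            bw_avail = pcie_bandwidth_available(dev, &limiting_dev,
                                                &speed, &width);

            if (bw_avail >= bw_cap) {
                    /* Running at full capability: log only if asked. */
                    if (verbose)
                            pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
                                     bw_avail / 1000, bw_avail % 1000,
                                     PCIE_SPEED2STR(speed), width);
            } else {
                    /* Downtrained somewhere along the chain: always log. */
                    pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
                             bw_avail / 1000, bw_avail % 1000,
                             PCIE_SPEED2STR(speed), width,
                             limiting_dev ? pci_name(limiting_dev) : "<unknown>",
                             bw_cap / 1000, bw_cap % 1000,
                             PCIE_SPEED2STR(speed_cap), width_cap);
            }
    }

The core would then report only potential problems during enumeration
with pcie_print_link_status(dev, 0), and performance-sensitive drivers
(ixgbe, mlx*, etc.) that want the line unconditionally would call
pcie_print_link_status(dev, 1) from their probe paths.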