From: "Alex G." <mr.nuke.me@gmail.com>
To: Bjorn Helgaas <helgaas@kernel.org>, Alex_Gagniuc@Dellteam.com
Cc: bhelgaas@google.com, Austin.Bolen@dell.com, Shyam.Iyer@dell.com,
keith.busch@intel.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, jeffrey.t.kirsher@intel.com,
ariel.elior@cavium.com, michael.chan@broadcom.com,
ganeshgr@chelsio.com, tariqt@mellanox.com,
jakub.kicinski@netronome.com, talgi@mellanox.com,
airlied@gmail.com, alexander.deucher@amd.com,
Mike Marciniszyn <mike.marciniszyn@intel.com>
Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
Date: Thu, 19 Jul 2018 10:46:43 -0500 [thread overview]
Message-ID: <8baf16fb-7e7c-ec59-19ef-a709e164dc94@gmail.com> (raw)
In-Reply-To: <20180718215359.GG128988@bhelgaas-glaptop.roam.corp.google.com>
On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:
> [+cc Mike (hfi1)]
>
> On Mon, Jul 16, 2018 at 10:28:35PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
>>>> ...
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>> a perfect solution, but it works extremely well in most cases.
>>>
>>> This is an interesting idea. I have two concerns:
>>>
>>> Some drivers already do this on their own, and we probably don't want
>>> duplicate output for those devices. In most cases (ixgbe and mlx* are
>>> exceptions), the drivers do this unconditionally so we *could* remove
>>> it from the driver if we add it to the core. The dmesg order would
>>> change, and the message wouldn't be associated with the driver as it
>>> now is.
>>
>> Oh, there are only 8 users of that. Even I could patch up the drivers to
>> remove the call, assuming we reach agreement about this change.
>>
>>> Also, I think some of the GPU devices might come up at a lower speed,
>>> then download firmware, then reset the device so it comes up at a
>>> higher speed. I think this patch will make us complain about about
>>> the low initial speed, which might confuse users.
>>
>> I spoke to one of the PCIe spec writers. It's allowable for a device to
>> downtrain speed or width. It would also be extremely dumb to downtrain
>> with the intent to re-train at a higher speed later, but it's possible
>> devices do dumb stuff like that. That's why it's an informational
>> message, instead of a warning.
>
> FWIW, here's some of the discussion related to hfi1 from [1]:
>
> > Btw, why is the driver configuring the PCIe link speed? Isn't
> > this something we should be handling in the PCI core?
>
> The device comes out of reset at the 5GT/s speed. The driver
> downloads device firmware, programs PCIe registers, and co-ordinates
> the transition to 8GT/s.
>
> This recipe is device specific and is therefore implemented in the
> hfi1 driver built on top of PCI core functions and macros.
>
> Also several DRM drivers seem to do this (see ),
> si_pcie_gen3_enable()); from [2]:
>
> My understanding was that some platfoms only bring up the link in gen 1
> mode for compatibility reasons.
>
> [1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
> [2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com
Downtraining a link "for compatibility reasons" is one of those dumb
things that devices do. I'm SURPRISED AMD HW does it, although it is
perfectly permissible by PCIe spec.
>> Another case: Some devices (lower-end GPUs) use silicon (and marketing)
>> that advertises x16, but they're only routed for x8. I'm okay with
>> seeing an informational message in this case. In fact, I didn't know
>> that my Quadro card for three years is only wired for x8 until I was
>> testing this patch.
>
> Yeah, it's probably OK. I don't want bug reports from people who
> think something's broken when it's really just a hardware limitation
> of their system. But hopefully the message is not alarming.
It looks fairly innocent:
[ 0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth,
limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with 8
GT/s x2 link)
>>> So I'm not sure whether it's better to do this in the core for all
>>> devices, or if we should just add it to the high-performance drivers
>>> that really care.
>>
>> You're thinking "do I really need that bandwidth" because I'm using a
>> function called "_bandwidth_". The point of the change is very far from
>> that: it is to help in system troubleshooting by detecting downtraining
>> conditions.
>
> I'm not sure what you think I'm thinking :) My question is whether
> it's worthwhile to print this extra information for *every* PCIe
> device, given that your use case is the tiny percentage of broken
> systems.
I think this information is a lot more useful than a bunch of other info
that's printed. Is "type 00 class 0x088000" more valuable? What about
"reg 0x20: [mem 0x9d950000-0x9d95ffff 64bit pref]", which is also
available under /proc/iomem for those curious?
> If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
> when the device is capable of more than it's getting, that would make
> a lot of sense to me. The normal case line is more questionable. I
> think the reason that's there is because the network drivers are very
> performance sensitive and like to see that info all the time.
I agree that can be an acceptable compromise.
> Maybe we need something like this:
>
> pcie_print_link_status(struct pci_dev *dev, int verbose)
> {
> ...
> if (bw_avail >= bw_cap) {
> if (verbose)
> pci_info(dev, "... available PCIe bandwidth ...");
> } else
> pci_info(dev, "... available PCIe bandwidth, limited by ...");
> }
>
> So the core could print only the potential problems with:
>
> pcie_print_link_status(dev, 0);
>
> and drivers that really care even if there's no problem could do:
>
> pcie_print_link_status(dev, 1);
Sounds good. I'll try to push out updated PATCH early next week.
>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> [snip]
>>>> + /* Look from the device up to avoid downstream ports with no devices. */
>>>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>>> + return;
>>>
>>> Do we care about Upstream Ports here?
>>
>> YES! Switches. e.g. an x16 switch with 4x downstream ports could
>> downtrain at 8x and 4x, and we'd never catch it.
>
> OK, I think I see your point: if the upstream port *could* do 16x but
> only trains to 4x, and two endpoints below it are both capable of 4x,
> the endpoints *think* they're happy but in fact they have to share 4x
> when they could use more.
>
> Bjorn
>
next prev parent reply other threads:[~2018-07-19 15:46 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-06-04 15:55 [PATCH v3] PCI: Check for PCIe downtraining conditions Alexandru Gagniuc
2018-06-05 12:27 ` Andy Shevchenko
2018-06-05 13:04 ` Andy Shevchenko
2018-07-16 21:17 ` Bjorn Helgaas
2018-07-16 22:28 ` Alex_Gagniuc
2018-07-18 21:53 ` Bjorn Helgaas
2018-07-19 15:46 ` Alex G. [this message]
2018-07-23 20:01 ` [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER Alexandru Gagniuc
2018-07-25 1:24 ` kbuild test robot
2018-07-23 20:03 ` [PATCH v5] PCI: Check for PCIe downtraining conditions Alexandru Gagniuc
2018-07-23 21:01 ` Jakub Kicinski
2018-07-23 21:52 ` Tal Gilboa
2018-07-23 22:14 ` Jakub Kicinski
2018-07-23 23:59 ` Alex G.
2018-07-24 13:39 ` Tal Gilboa
2018-07-30 23:26 ` Alex_Gagniuc
2018-07-31 6:40 ` Tal Gilboa
2018-07-31 15:10 ` Alex G.
2018-08-05 7:05 ` Tal Gilboa
2018-08-06 18:39 ` Alex_Gagniuc
2018-08-06 19:46 ` Bjorn Helgaas
2018-08-06 23:25 ` [PATCH v6 1/9] " Alexandru Gagniuc
2018-08-06 23:25 ` [PATCH v6 2/9] bnx2x: Do not call pcie_print_link_status() Alexandru Gagniuc
2018-08-06 23:25 ` [PATCH v6 3/9] bnxt_en: " Alexandru Gagniuc
2018-08-06 23:25 ` [PATCH v6 4/9] cxgb4: " Alexandru Gagniuc
2018-08-06 23:25 ` [PATCH v6 5/9] fm10k: " Alexandru Gagniuc
2018-08-07 17:52 ` Jeff Kirsher
2018-08-06 23:25 ` [PATCH v6 6/9] ixgbe: " Alexandru Gagniuc
2018-08-07 17:51 ` Jeff Kirsher
2018-08-06 23:25 ` [PATCH v6 7/9] net/mlx4: " Alexandru Gagniuc
2018-08-08 6:10 ` Leon Romanovsky
2018-08-06 23:25 ` [PATCH v6 8/9] net/mlx5: " Alexandru Gagniuc
2018-08-08 6:08 ` Leon Romanovsky
2018-08-08 14:23 ` Tal Gilboa
2018-08-08 15:41 ` Leon Romanovsky
2018-08-08 15:56 ` Tal Gilboa
2018-08-08 16:33 ` Alex G.
2018-08-08 17:27 ` Leon Romanovsky
2018-08-09 14:02 ` Bjorn Helgaas
2018-08-06 23:25 ` [PATCH v6 9/9] nfp: " Alexandru Gagniuc
2018-08-07 19:44 ` [PATCH v6 1/9] PCI: Check for PCIe downtraining conditions David Miller
2018-08-07 21:41 ` Bjorn Helgaas
2018-07-18 13:38 ` [PATCH v3] " Tal Gilboa
2018-07-19 15:49 ` Alex G.
2018-07-23 5:21 ` Tal Gilboa
2018-07-23 17:01 ` Alex G.
2018-07-23 21:35 ` Tal Gilboa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8baf16fb-7e7c-ec59-19ef-a709e164dc94@gmail.com \
--to=mr.nuke.me@gmail.com \
--cc=Alex_Gagniuc@Dellteam.com \
--cc=Austin.Bolen@dell.com \
--cc=Shyam.Iyer@dell.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=ariel.elior@cavium.com \
--cc=bhelgaas@google.com \
--cc=ganeshgr@chelsio.com \
--cc=helgaas@kernel.org \
--cc=jakub.kicinski@netronome.com \
--cc=jeffrey.t.kirsher@intel.com \
--cc=keith.busch@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=michael.chan@broadcom.com \
--cc=mike.marciniszyn@intel.com \
--cc=talgi@mellanox.com \
--cc=tariqt@mellanox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).