linux-kernel.vger.kernel.org archive mirror
From: "Alex G." <mr.nuke.me@gmail.com>
To: Bjorn Helgaas <helgaas@kernel.org>, Alex_Gagniuc@Dellteam.com
Cc: bhelgaas@google.com, Austin.Bolen@dell.com, Shyam.Iyer@dell.com,
	keith.busch@intel.com, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, jeffrey.t.kirsher@intel.com,
	ariel.elior@cavium.com, michael.chan@broadcom.com,
	ganeshgr@chelsio.com, tariqt@mellanox.com,
	jakub.kicinski@netronome.com, talgi@mellanox.com,
	airlied@gmail.com, alexander.deucher@amd.com,
	Mike Marciniszyn <mike.marciniszyn@intel.com>
Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
Date: Thu, 19 Jul 2018 10:46:43 -0500
Message-ID: <8baf16fb-7e7c-ec59-19ef-a709e164dc94@gmail.com>
In-Reply-To: <20180718215359.GG128988@bhelgaas-glaptop.roam.corp.google.com>



On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:
> [+cc Mike (hfi1)]
> 
> On Mon, Jul 16, 2018 at 10:28:35PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
>>>> ...
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>> a perfect solution, but it works extremely well in most cases.
>>>
>>> This is an interesting idea.  I have two concerns:
>>>
>>> Some drivers already do this on their own, and we probably don't want
>>> duplicate output for those devices.  In most cases (ixgbe and mlx* are
>>> exceptions), the drivers do this unconditionally so we *could* remove
>>> it from the driver if we add it to the core.  The dmesg order would
>>> change, and the message wouldn't be associated with the driver as it
>>> now is.
>>
>> Oh, there are only 8 users of that. Even I could patch up the drivers to
>> remove the call, assuming we reach agreement about this change.
>>
>>> Also, I think some of the GPU devices might come up at a lower speed,
>>> then download firmware, then reset the device so it comes up at a
>>> higher speed.  I think this patch will make us complain about
>>> the low initial speed, which might confuse users.
>>
>> I spoke to one of the PCIe spec writers. It's allowable for a device to
>> downtrain speed or width. It would also be extremely dumb to downtrain
>> with the intent to re-train at a higher speed later, but it's possible
>> devices do dumb stuff like that. That's why it's an informational
>> message, instead of a warning.
> 
> FWIW, here's some of the discussion related to hfi1 from [1]:
> 
>    > Btw, why is the driver configuring the PCIe link speed?  Isn't
>    > this something we should be handling in the PCI core?
> 
>    The device comes out of reset at the 5GT/s speed. The driver
>    downloads device firmware, programs PCIe registers, and co-ordinates
>    the transition to 8GT/s.
> 
>    This recipe is device specific and is therefore implemented in the
>    hfi1 driver built on top of PCI core functions and macros.
> 
> Also several DRM drivers seem to do this (see, e.g.,
> si_pcie_gen3_enable()); from [2]:
> 
>    My understanding was that some platforms only bring up the link in gen 1
>    mode for compatibility reasons.
> 
> [1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
> [2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

Downtraining a link "for compatibility reasons" is one of those dumb 
things that devices do. I'm surprised AMD HW does it, although it is 
perfectly permissible under the PCIe spec.

>> Another case: Some devices (lower-end GPUs) use silicon (and marketing)
>> that advertises x16, but they're only routed for x8. I'm okay with
>> seeing an informational message in this case. In fact, I didn't know
>> that the Quadro card I've had for three years is only wired for x8
>> until I tested this patch.
> 
> Yeah, it's probably OK.  I don't want bug reports from people who
> think something's broken when it's really just a hardware limitation
> of their system.  But hopefully the message is not alarming.

It looks fairly innocent:

[    0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with 8 GT/s x2 link)
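
For anyone wondering where those two figures come from: they follow from 
the raw transfer rate, the line encoding, and the lane count. Below is a 
rough userspace sketch, assuming 8b/10b encoding at 2.5 and 5 GT/s, 
128b/130b at 8 GT/s and above, and integer division, which happens to 
reproduce the numbers in the message. The helper names are made up for 
illustration and are not kernel API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: usable Mb/s per lane after encoding overhead.
 * Assumption: 8b/10b below 8 GT/s, 128b/130b at 8 GT/s and above.
 * The argument is the raw rate in mega-transfers per second.
 */
static unsigned int lane_mbps(unsigned int mega_transfers)
{
	if (mega_transfers >= 8000)
		return mega_transfers * 128 / 130;	/* 128b/130b */
	return mega_transfers * 8 / 10;			/* 8b/10b */
}

/* Usable Mb/s for a whole link: per-lane rate times lane count. */
static unsigned int link_mbps(unsigned int mega_transfers, unsigned int width)
{
	return lane_mbps(mega_transfers) * width;
}

/* Format Mb/s as "X.YYY Gb/s", matching the style of the dmesg line. */
static void format_gbps(unsigned int mbps, char *buf, size_t len)
{
	snprintf(buf, len, "%u.%03u Gb/s", mbps / 1000, mbps % 1000);
}
```

With these, link_mbps(5000, 1) gives 4000 (the 5 GT/s x1 link) and 
link_mbps(8000, 2) gives 15752 (the 8 GT/s x2 capability).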

>>> So I'm not sure whether it's better to do this in the core for all
>>> devices, or if we should just add it to the high-performance drivers
>>> that really care.
>>
>> You're thinking "do I really need that bandwidth" because I'm using a
>> function called "_bandwidth_". The point of the change is very far from
>> that: it is to help in system troubleshooting by detecting downtraining
>> conditions.
> 
> I'm not sure what you think I'm thinking :)  My question is whether
> it's worthwhile to print this extra information for *every* PCIe
> device, given that your use case is the tiny percentage of broken
> systems.

I think this information is a lot more useful than a bunch of other info 
that's printed. Is "type 00 class 0x088000" more valuable? What about 
"reg 0x20: [mem 0x9d950000-0x9d95ffff 64bit pref]", which is also 
available under /proc/iomem for those curious?

> If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
> when the device is capable of more than it's getting, that would make
> a lot of sense to me.  The normal case line is more questionable.  I
> think the reason that's there is because the network drivers are very
> performance sensitive and like to see that info all the time.

I agree that can be an acceptable compromise.

> Maybe we need something like this:
> 
>    pcie_print_link_status(struct pci_dev *dev, int verbose)
>    {
>      ...
>      if (bw_avail >= bw_cap) {
>        if (verbose)
>          pci_info(dev, "... available PCIe bandwidth ...");
>      } else
>        pci_info(dev, "... available PCIe bandwidth, limited by ...");
>    }
> 
> So the core could print only the potential problems with:
> 
>    pcie_print_link_status(dev, 0);
> 
> and drivers that really care even if there's no problem could do:
> 
>    pcie_print_link_status(dev, 1);

Sounds good. I'll try to push out an updated patch early next week.
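
To make sure we mean the same thing, here is the decision from the 
snippet above as a standalone userspace sketch. It is not the kernel 
function: link_status_msg() is a hypothetical stand-in that takes 
bandwidths in Mb/s and writes the message into a buffer instead of 
calling pci_info(), and the message wording is only approximate.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the proposed verbose flag.
 * Returns 0 when nothing would be printed, otherwise the length of the
 * message written to buf.
 */
static int link_status_msg(char *buf, size_t len, unsigned int bw_avail,
			   unsigned int bw_cap, bool verbose)
{
	if (bw_avail >= bw_cap) {
		/* Healthy link: only callers that ask for it get a line. */
		if (!verbose)
			return 0;
		return snprintf(buf, len,
				"%u.%03u Gb/s available PCIe bandwidth",
				bw_avail / 1000, bw_avail % 1000);
	}

	/* Downtrained link: always reported, regardless of verbose. */
	return snprintf(buf, len,
			"%u.%03u Gb/s available PCIe bandwidth, limited (capable of %u.%03u Gb/s)",
			bw_avail / 1000, bw_avail % 1000,
			bw_cap / 1000, bw_cap % 1000);
}
```

So a core caller passing verbose = false stays silent unless the link is 
getting less than it is capable of, while a driver passing true gets a 
line in both cases.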

>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> [snip]
>>>> +	/* Look from the device up to avoid downstream ports with no devices. */
>>>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>>> +		return;
>>>
>>> Do we care about Upstream Ports here?
>>
>> YES! Switches. e.g. an x16 switch with 4x downstream ports could
>> downtrain at 8x and 4x, and we'd never catch it.
> 
> OK, I think I see your point: if the upstream port *could* do 16x but
> only trains to 4x, and two endpoints below it are both capable of 4x,
> the endpoints *think* they're happy but in fact they have to share 4x
> when they could use more.
> 
> Bjorn
> 
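
Right. A toy model of that scenario, as a sketch only: the struct and 
helpers below are hypothetical, each entry describes a device's own 
link, and bandwidths are in Mb/s (31504 for x4 and 126016 for x16, 
assuming 8 GT/s with 128b/130b encoding).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one device's own link. */
struct link {
	const char *name;
	unsigned int cur_mbps;	/* bandwidth at the negotiated speed/width */
	unsigned int cap_mbps;	/* bandwidth the link is capable of */
};

/* A link is downtrained when it runs below what it is capable of. */
static bool is_downtrained(const struct link *l)
{
	return l->cur_mbps < l->cap_mbps;
}

/* Count how many links in the array are downtrained. */
static size_t count_downtrained(const struct link *links, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (is_downtrained(&links[i]))
			count++;
	return count;
}
```

With two endpoints at { 31504, 31504 } and the switch upstream port at 
{ 31504, 126016 }, is_downtrained() is false for both endpoints and true 
only for the upstream port. If only endpoints are checked, the condition 
is never reported.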
