* Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
@ 2019-06-28 10:23 Timur Kristóf
  2019-06-28 10:32 ` Mika Westerberg
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-06-28 10:23 UTC (permalink / raw)
  To: Mika Westerberg, michael.jamet; +Cc: dri-devel

Hi guys,

I use an AMD RX 570 in a Thunderbolt 3 external GPU box.
dmesg gives me the following message:
pci 0000:3a:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x4 link at 0000:04:04.0 (capable of 31.504 Gb/s with 8 GT/s x4 link)
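
(For reference, assuming standard PCIe encoding: 2.5 GT/s x 4 lanes with
8b/10b encoding works out to the 8 Gb/s figure above, while 8 GT/s x 4
lanes with 128b/130b encoding gives roughly 31.5 Gb/s.)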

Here is a tree view of the devices as well as the output of lspci -vvv:
https://pastebin.com/CSsS2akZ

The critical path of the device tree looks like this:

00:1c.4 Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
03:00.0 Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
04:04.0 Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
3a:00.0 Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015]
3b:01.0 Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015]
3c:00.0 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)

Here is the weird part:

According to lspci, all of these devices report in their LnkCap that
they support 8 GT/s, except 04:04.0 and 3a:00.0, which say they only
support 2.5 GT/s. sysfs, on the other hand, contradicts lspci and says
that both of them are capable of 8 GT/s as well:
"/sys/bus/pci/devices/0000:04:04.0/max_link_speed" and
"/sys/bus/pci/devices/0000:3a:00.0/max_link_speed" are 8 GT/s.
It seems that there is a discrepancy between what lspci thinks and what
the devices are actually capable of.

Questions:

1. Why are there four bridge devices? 04:00.0, 04:01.0 and 04:02.0 look
superfluous to me and nothing is connected to them. It actually gives
me the feeling that the TB3 driver creates 4 devices with 2.5 GT/s
each, instead of one device that can do the full 8 GT/s.

2. Why are some of the bridge devices only capable of 2.5 GT/s
according to lspci?

3. Is it possible to manually set them to 8 GT/s?

Thanks in advance for your answers!

Best regards,
Tim







* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 10:23 Why is Thunderbolt 3 limited to 2.5 GT/s on Linux? Timur Kristóf
@ 2019-06-28 10:32 ` Mika Westerberg
  2019-06-28 11:08   ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Mika Westerberg @ 2019-06-28 10:32 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, dri-devel

On Fri, Jun 28, 2019 at 12:23:09PM +0200, Timur Kristóf wrote:
> Hi guys,
> 
> I use an AMD RX 570 in a Thunderbolt 3 external GPU box.
> dmesg gives me the following message:
> pci 0000:3a:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x4 link at 0000:04:04.0 (capable of 31.504 Gb/s with 8 GT/s x4 link)
> 
> Here is a tree view of the devices as well as the output of lspci -vvv:
> https://pastebin.com/CSsS2akZ
> 
> The critical path of the device tree looks like this:
> 
> 00:1c.4 Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
> 03:00.0 Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
> 04:04.0 Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
> 3a:00.0 Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015]
> 3b:01.0 Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015]
> 3c:00.0 Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
> 
> Here is the weird part:
> 
> Accoding to lspci, all of these devices report in their LnkCap that
> they support 8 GT/s, except the 04:04.0 and 3a:00.0 which say they only
> support 2.5 GT/s. Contradictory to lspci, sysfs on the other hand says
> that both of them are capable of 8 GT/s as well:
> "/sys/bus/pci/devices/0000:04:04.0/max_link_speed" and
> "/sys/bus/pci/devices/0000:3a:00.0/max_link_speed" are 8 GT/s.
> It seems that there is a discrepancy between what lspci thinks and what
> the devices are actually capable of.
> 
> Questions:
> 
> 1. Why are there four bridge devices? 04:00.0, 04:01.0 and 04:02.0 look
> superfluous to me and nothing is connected to them. It actually gives
> me the feeling that the TB3 driver creates 4 devices with 2.5 GT/s
> each, instead of one device that can do the full 8 GT/s.

Because it is a standard PCIe switch with one upstream port and n
downstream ports.

> 2. Why are some of the bridge devices only capable of 2.5 GT/s
> according to lspci?

You need to talk to the lspci maintainer.

> 3. Is it possible to manually set them to 8 GT/s?

No idea.

Are you actually seeing some performance issue because of this or are
you just curious?

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 10:32 ` Mika Westerberg
@ 2019-06-28 11:08   ` Timur Kristóf
  2019-06-28 11:34     ` Mika Westerberg
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-06-28 11:08 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: michael.jamet, dri-devel

Hi Mika,

Thanks for your quick reply.

> > 1. Why are there four bridge devices? 04:00.0, 04:01.0 and 04:02.0
> > look
> > superfluous to me and nothing is connected to them. It actually
> > gives
> > me the feeling that the TB3 driver creates 4 devices with 2.5 GT/s
> > each, instead of one device that can do the full 8 GT/s.
> 
> Because it is standard PCIe switch with one upstream port and n
> downstream ports.

Sure, though in this case 3 of those downstream ports are not exposed
by the hardware, so it's a bit surprising to see them there.

The reason I asked is that I suspect the bandwidth may be allocated
equally between the 4 downstream ports, even though only one of them is
used.

> 
> > 2. Why are some of the bridge devices only capable of 2.5 GT/s
> > according to lspci?
> 
> You need to talk to lspci maintainer.

Sorry if the question was unclear.
It's not only lspci; the kernel also prints a warning about it.

Like I said, the device really is limited to 2.5 GT/s even though it
should be able to do 8 GT/s.

> 
> > 3. Is it possible to manually set them to 8 GT/s?
> 
> No idea.
> 
> Are you actually seeing some performance issue because of this or are
> you just curious?

Yes, I see a noticeable performance hit: some games have very low frame
rates while neither the CPU nor the GPU is fully utilized.

(Side note: mesa 19.1 has a radeonsi patch that reduces the bandwidth
use, which does help. However, it doesn't solve the underlying problem
of the slow TB3 interface.)

Best regards,
Tim


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 11:08   ` Timur Kristóf
@ 2019-06-28 11:34     ` Mika Westerberg
  2019-06-28 12:21       ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Mika Westerberg @ 2019-06-28 11:34 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, dri-devel

On Fri, Jun 28, 2019 at 01:08:07PM +0200, Timur Kristóf wrote:
> Hi Mika,
> 
> Thanks for your quick reply.
> 
> > > 1. Why are there four bridge devices? 04:00.0, 04:01.0 and 04:02.0
> > > look
> > > superfluous to me and nothing is connected to them. It actually
> > > gives
> > > me the feeling that the TB3 driver creates 4 devices with 2.5 GT/s
> > > each, instead of one device that can do the full 8 GT/s.
> > 
> > Because it is standard PCIe switch with one upstream port and n
> > downstream ports.
> 
> Sure, though in this case 3 of those downstream ports are not exposed
> by the hardware, so it's a bit surprising to see them there.

They lead to other peripherals on the TBT host router, such as the TBT
controller and xHCI. There are also two downstream ports for extension,
of which your eGPU is using one.

> Why I asked about it is because I have a suspicion that maybe the
> bandwidth is allocated equally between the 4 downstream ports, even
> though only one of them is used.
> 
> > 
> > > 2. Why are some of the bridge devices only capable of 2.5 GT/s
> > > according to lspci?
> > 
> > You need to talk to lspci maintainer.
> 
> Sorry if the question was unclear.
> It's not only lspci, the kernel also prints a warning about it.
> 
> Like I said the device really is limited to 2.5 GT/s even though it
> should be able to do 8 GT/s.

There is a Thunderbolt link between the host router (your host system)
and the eGPU box. That link is not limited to 2.5 GT/s, so even if the
slot claims it is PCIe gen1, the actual bandwidth can be much higher
because of the virtual link.

> > > 3. Is it possible to manually set them to 8 GT/s?
> > 
> > No idea.
> > 
> > Are you actually seeing some performance issue because of this or are
> > you just curious?
> 
> Yes, I see a noticable performance hit: some games have very low frame
> rate while neither the CPU nor the GPU are fully utilized.

Is that problem in Linux only or do you see the same issue in Windows as
well?

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 11:34     ` Mika Westerberg
@ 2019-06-28 12:21       ` Timur Kristóf
  2019-06-28 12:53         ` Mika Westerberg
                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Timur Kristóf @ 2019-06-28 12:21 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: michael.jamet, dri-devel


> > Sure, though in this case 3 of those downstream ports are not
> > exposed
> > by the hardware, so it's a bit surprising to see them there.
> 
> They lead to other peripherals on the TBT host router such as the TBT
> controller and xHCI. Also there are two downstream ports for
> extension
> from which you eGPU is using one.

If you look at the device tree from my first email, you can see that
both the GPU and the xHCI use the same port: 04:04.0 - in fact, I can
even remove the other 3 ports from the system without any consequences.

> > Like I said the device really is limited to 2.5 GT/s even though it
> > should be able to do 8 GT/s.
> 
> There is Thunderbolt link between the host router (your host system)
> and
> the eGPU box. That link is not limited to 2.5 GT/s so even if the
> slot
> claims it is PCI gen1 the actual bandwidth can be much higher because
> of
> the virtual link.

Not sure I understand correctly: are you saying that TB3 can do 40
Gbit/sec even though the kernel thinks it can only do 8 Gbit/sec?

I haven't found a good way to measure the maximum PCIe throughput
between the CPU and GPU, but I did take a look at AMD's sysfs interface
at /sys/class/drm/card1/device/pcie_bw while running the bottlenecked
game. The highest throughput I saw there was only 2.43 Gbit/sec.

One more thought. I've also looked at
/sys/class/drm/card1/device/pp_dpm_pcie - which tells me that amdgpu
thinks it is running on a 2.5GT/s x8 link (as opposed to the expected 8
GT/s x4). Can this be a problem?

> 
> > > > 3. Is it possible to manually set them to 8 GT/s?
> > > 
> > > No idea.
> > > 
> > > Are you actually seeing some performance issue because of this or
> > > are
> > > you just curious?
> > 
> > Yes, I see a noticable performance hit: some games have very low
> > frame
> > rate while neither the CPU nor the GPU are fully utilized.
> 
> Is that problem in Linux only or do you see the same issue in Windows
> as
> well?


I admit I don't have Windows on this computer now and it has been some
time since I last tried it, but when I did, I didn't see this problem.

Best regards,
Tim


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 12:21       ` Timur Kristóf
@ 2019-06-28 12:53         ` Mika Westerberg
  2019-06-28 13:33           ` Timur Kristóf
  2019-07-01 14:28         ` Alex Deucher
  2019-07-01 14:54         ` Michel Dänzer
  2 siblings, 1 reply; 34+ messages in thread
From: Mika Westerberg @ 2019-06-28 12:53 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, dri-devel

On Fri, Jun 28, 2019 at 02:21:36PM +0200, Timur Kristóf wrote:
> 
> > > Sure, though in this case 3 of those downstream ports are not
> > > exposed
> > > by the hardware, so it's a bit surprising to see them there.
> > 
> > They lead to other peripherals on the TBT host router such as the TBT
> > controller and xHCI. Also there are two downstream ports for
> > extension
> > from which you eGPU is using one.
> 
> If you look at the device tree from my first email, you can see that
> both the GPU and the XHCI uses the same port: 04:04.0 - in fact I can
> even remove the other 3 ports from the system without any consequences.

Well that's the extension PCIe downstream port. The other one is
04:01.0.

Typically 04:00.0 and 04:00.2 are used to connect TBT (05:00.0) and xHCI
(39:00.0), but in your case you don't seem to have any USB 3 devices
connected, so the xHCI is not present. If you plug in a USB-C device
(non-TBT), you should see the host router xHCI appearing as well.

This is pretty standard topology.

> > > Like I said the device really is limited to 2.5 GT/s even though it
> > > should be able to do 8 GT/s.
> > 
> > There is Thunderbolt link between the host router (your host system)
> > and
> > the eGPU box. That link is not limited to 2.5 GT/s so even if the
> > slot
> > claims it is PCI gen1 the actual bandwidth can be much higher because
> > of
> > the virtual link.
> 
> Not sure I understand correctly, are you saying that TB3 can do 40
> Gbit/sec even though the kernel thinks it can only do 8 Gbit / sec?

Yes, the PCIe switch upstream port (3a:00.0) is connected back to the
host router over a virtual 40 Gb/s Thunderbolt link, so the PCIe gen1
speeds it reports do not really matter here (same goes for the
downstream port).

The topology looks like this, if I got it right from the lspci output:

  00:1c.4 (root port) 8 GT/s x 4
    ^
    | real PCIe link
    v
  03:00.0 (upstream port) 8 GT/s x 4
  04:04.0 (downstream port) 2.5 GT/s x 4
    ^
    |  virtual link 40 Gb/s
    v
  3a:00.0 (upstream port) 2.5 GT/s x 4
  3b:01.0 (downstream port) 8 GT/s x 4
    ^
    | real PCIe link
    v
  3c:00.0 (eGPU) 8 GT/s x 4

In other words, all the real PCIe links run at the full 8 GT/s x 4,
which is what is expected, I think.

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 12:53         ` Mika Westerberg
@ 2019-06-28 13:33           ` Timur Kristóf
  2019-06-28 14:14             ` Mika Westerberg
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-06-28 13:33 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: michael.jamet, dri-devel


> Well that's the extension PCIe downstream port. The other one is
> 04:01.0.
> 
> Typically 04:00.0 and 04:00.2 are used to connect TBT (05:00.0) and
> xHCI
> (39:00.0) but in your case you don't seem to have USB 3 devices
> connected to that so it is not present. If you plug in USB-C device
> (non-TBT) you should see the host router xHCI appearing as well.
> 
> This is pretty standard topology.
> > 
> > Not sure I understand correctly, are you saying that TB3 can do 40
> > Gbit/sec even though the kernel thinks it can only do 8 Gbit / sec?
> 
> Yes the PCIe switch upstream port (3a:00.0) is connected back to the
> host router over virtual Thunderbolt 40Gb/s link so the PCIe gen1
> speeds
> it reports do not really matter here (same goes for the downstream).
> 
> The topology looks like bellow if I got it right from the lspci
> output:
> 
>   00:1c.4 (root port) 8 GT/s x 4
>     ^
>     | real PCIe link
>     v
>   03:00.0 (upstream port) 8 GT/s x 4
>   04:04.0 (downstream port) 2.5 GT/s x 4
>     ^
>     |  virtual link 40 Gb/s
>     v
>   3a:00.0 (upstream port) 2.5 GT/s x 4
>   3b:01.0 (downstream port) 8 GT/s x 4
>     ^
>     | real PCIe link
>     v
>   3c:00.0 (eGPU) 8 GT/s x 4
> 
> In other words all the real PCIe links run at full 8 GT/s x 4 which
> is
> what is expected, I think.


It makes sense now. This is hands down the best explanation I've seen
about how TB3 hangs together. Thanks for taking the time to explain it!

I have two more questions:

1. What is the best way to test that the virtual link is indeed capable
of 40 Gbit / sec? So far I've been unable to figure out how to measure
its maximum throughput.

2. Why is it that the game can only utilize as much as 2.5 Gbit / sec
when it gets bottlenecked? The same problem is not present on a desktop
computer with a "normal" PCIe port.



* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 13:33           ` Timur Kristóf
@ 2019-06-28 14:14             ` Mika Westerberg
  2019-06-28 14:53               ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Mika Westerberg @ 2019-06-28 14:14 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, dri-devel

On Fri, Jun 28, 2019 at 03:33:56PM +0200, Timur Kristóf wrote:
> I have two more questions:
> 
> 1. What is the best way to test that the virtual link is indeed capable
> of 40 Gbit / sec? So far I've been unable to figure out how to measure
> its maximum throughput.

I don't think there is any good way to test it, but the Thunderbolt gen 3
link is pretty much always 40 Gb/s (20 Gb/s x 2), and that bandwidth is
shared dynamically between the different tunnels (virtual links).

> 2. Why is it that the game can only utilize as much as 2.5 Gbit / sec
> when it gets bottlenecked? The same problem is not present on a desktop
> computer with a "normal" PCIe port.

This is outside of my knowledge, sorry. How would that game even know it
can "utilize" only 2.5 Gbit/s? Does it go over the output of "lspci" as
well? :-)

The PCIe links themselves should get you the 8 GT/s x 4, and I'm quite
sure the underlying TBT link works fine as well, so my guess is that the
issue lies somewhere else - but where, I have no idea.

Maybe the problem is in the game itself?

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 14:14             ` Mika Westerberg
@ 2019-06-28 14:53               ` Timur Kristóf
  2019-07-01 11:44                 ` Mika Westerberg
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-06-28 14:53 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: michael.jamet, dri-devel

On Fri, 2019-06-28 at 17:14 +0300, Mika Westerberg wrote:
> On Fri, Jun 28, 2019 at 03:33:56PM +0200, Timur Kristóf wrote:
> > I have two more questions:
> > 
> > 1. What is the best way to test that the virtual link is indeed
> > capable
> > of 40 Gbit / sec? So far I've been unable to figure out how to
> > measure
> > its maximum throughput.
> 
> I don't think there is any good way to test it but the Thunderbolt
> gen 3
> link is pretty much always 40 Gb/s (20 Gb/s x 2) from which the
> bandwidth is shared dynamically between different tunnels (virtual
> links).

That's unfortunate, I would have expected there to be some sort of PCIe
speed test utility.

Now that I gave it a try, I can measure ~20 Gbit/sec when I run Gnome
Wayland on this system (which forces the eGPU to send the framebuffer
back and forth all the time - for two 4K monitors). But it still
doesn't give me 40 Gbit/sec.

> 
> > 2. Why is it that the game can only utilize as much as 2.5 Gbit /
> > sec
> > when it gets bottlenecked? The same problem is not present on a
> > desktop
> > computer with a "normal" PCIe port.
> 
> This is outside of my knowledge, sorry. How that game even knows it
> can
> "utilize" only 2.5 Gbit/s. Does it go over the output of "lspci" as
> well? :-)
> 
> The PCIe links itself should to get you the 8 GT/s x 4 and I'm quite
> sure the underlying TBT link works fine as well so my guess is that
> the
> issue lies somewhere else but where, I have no idea.
> 
> Maybe the problem is in the game itself?

I had a brief discussion with Marek about this earlier, and he said
that this has to do with latency too, not just bandwidth, but he didn't
explain any further.


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 14:53               ` Timur Kristóf
@ 2019-07-01 11:44                 ` Mika Westerberg
  2019-07-01 14:25                   ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Mika Westerberg @ 2019-07-01 11:44 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, dri-devel

On Fri, Jun 28, 2019 at 04:53:02PM +0200, Timur Kristóf wrote:
> On Fri, 2019-06-28 at 17:14 +0300, Mika Westerberg wrote:
> > On Fri, Jun 28, 2019 at 03:33:56PM +0200, Timur Kristóf wrote:
> > > I have two more questions:
> > > 
> > > 1. What is the best way to test that the virtual link is indeed
> > > capable
> > > of 40 Gbit / sec? So far I've been unable to figure out how to
> > > measure
> > > its maximum throughput.
> > 
> > I don't think there is any good way to test it but the Thunderbolt
> > gen 3
> > link is pretty much always 40 Gb/s (20 Gb/s x 2) from which the
> > bandwidth is shared dynamically between different tunnels (virtual
> > links).
> 
> That's unfortunate, I would have expected there to be some sort of PCIe
> speed test utility.
> 
> Now that I gave it a try, I can measure ~20 Gbit/sec when I run Gnome
> Wayland on this system (which forces the eGPU to send the framebuffer
> back and forth all the time - for two 4K monitors). But it still
> doesn't give me 40 Gbit/sec.

How do you measure that? Is there a DP stream also? As I said, the
bandwidth is dynamically shared between the consumers, so you probably
do not get the full bandwidth for PCIe alone because it needs to reserve
something for possible DP streams and so on.

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 11:44                 ` Mika Westerberg
@ 2019-07-01 14:25                   ` Timur Kristóf
  0 siblings, 0 replies; 34+ messages in thread
From: Timur Kristóf @ 2019-07-01 14:25 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: michael.jamet, dri-devel

> > 
> > That's unfortunate, I would have expected there to be some sort of
> > PCIe
> > speed test utility.
> > 
> > Now that I gave it a try, I can measure ~20 Gbit/sec when I run
> > Gnome
> > Wayland on this system (which forces the eGPU to send the
> > framebuffer
> > back and forth all the time - for two 4K monitors). But it still
> > doesn't give me 40 Gbit/sec.
> 
> How do you measure that? Is there a DP stream also? As I said the
> bandwidth is dynamically shared between the consumers so you probably
> do
> not get the full bandwidth for PCIe only because it needs to reserve
> something for possible DP streams and so on.

I'm measuring it using AMD's pcie_bw sysfs interface, which shows how
many packets were sent and received by the GPU, and the max packet
size. So it's not an exact measurement, but a good estimate.
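
For illustration, a rough user-space sketch of that kind of estimate,
assuming pcie_bw prints the two packet counters followed by the max
payload size in bytes over a roughly one-second window (the exact
format of the file may differ):

#include <stdio.h>

/* Rough upper-bound estimate of PCIe throughput from amdgpu's pcie_bw.
 * Assumes: sent packets, received packets, max payload size in bytes,
 * accumulated over about one second. Real packets may be smaller than
 * the max payload, so this overestimates. */
int main(void)
{
	unsigned long long sent, received;
	int mps;
	FILE *f = fopen("/sys/class/drm/card1/device/pcie_bw", "r");

	if (!f || fscanf(f, "%llu %llu %d", &sent, &received, &mps) != 3) {
		perror("pcie_bw");
		return 1;
	}
	fclose(f);

	printf("~%.2f Gbit/s upper bound\n",
	       (sent + received) * (double)mps * 8.0 / 1e9);
	return 0;
}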

AFAIK there is no DP stream. Only the eGPU is connected to the TB3 port
and nothing else. The graphics card inside the TB3 enclosure does have
a DP connector which is in use, but I assume that's not what you mean.

It also doesn't seem to make a difference whether or not anything is
plugged into the USB ports provided by the eGPU. (Some online posts
suggest that not using those ports would allow higher throughput to the
eGPU, but I don't see that it would make any difference here.)


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 12:21       ` Timur Kristóf
  2019-06-28 12:53         ` Mika Westerberg
@ 2019-07-01 14:28         ` Alex Deucher
  2019-07-01 14:38           ` Timur Kristóf
  2019-07-01 14:54         ` Michel Dänzer
  2 siblings, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2019-07-01 14:28 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, Mika Westerberg, dri-devel

On Sun, Jun 30, 2019 at 2:27 PM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
>
> > > Sure, though in this case 3 of those downstream ports are not
> > > exposed
> > > by the hardware, so it's a bit surprising to see them there.
> >
> > They lead to other peripherals on the TBT host router such as the TBT
> > controller and xHCI. Also there are two downstream ports for
> > extension
> > from which you eGPU is using one.
>
> If you look at the device tree from my first email, you can see that
> both the GPU and the XHCI uses the same port: 04:04.0 - in fact I can
> even remove the other 3 ports from the system without any consequences.
>
> > > Like I said the device really is limited to 2.5 GT/s even though it
> > > should be able to do 8 GT/s.
> >
> > There is Thunderbolt link between the host router (your host system)
> > and
> > the eGPU box. That link is not limited to 2.5 GT/s so even if the
> > slot
> > claims it is PCI gen1 the actual bandwidth can be much higher because
> > of
> > the virtual link.
>
> Not sure I understand correctly, are you saying that TB3 can do 40
> Gbit/sec even though the kernel thinks it can only do 8 Gbit / sec?
>
> I haven't found a good way to measure the maximum PCIe throughput
> between the CPU and GPU, but I did take a look at AMD's sysfs interface
> at /sys/class/drm/card1/device/pcie_bw which while running the
> bottlenecked game. The highest throughput I saw there was only 2.43
> Gbit /sec.
>
> One more thought. I've also looked at
> /sys/class/drm/card1/device/pp_dpm_pcie - which tells me that amdgpu
> thinks it is running on a 2.5GT/s x8 link (as opposed to the expected 8
> GT/s x4). Can this be a problem?

We limit the speed of the link in the driver to the max speed of any
upstream links.  So if there are any links upstream limited to 2.5
GT/s, it doesn't make sense to clock the local link up to faster
speeds.

Alex

>
> >
> > > > > 3. Is it possible to manually set them to 8 GT/s?
> > > >
> > > > No idea.
> > > >
> > > > Are you actually seeing some performance issue because of this or
> > > > are
> > > > you just curious?
> > >
> > > Yes, I see a noticable performance hit: some games have very low
> > > frame
> > > rate while neither the CPU nor the GPU are fully utilized.
> >
> > Is that problem in Linux only or do you see the same issue in Windows
> > as
> > well?
>
>
> I admit I don't have Windows on this computer now and it has been some
> time since I last tried it, but when I did, I didn't see this problem.
>
> Best regards,
> Tim
>

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 14:28         ` Alex Deucher
@ 2019-07-01 14:38           ` Timur Kristóf
  2019-07-01 14:46             ` Alex Deucher
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-01 14:38 UTC (permalink / raw)
  To: Alex Deucher; +Cc: michael.jamet, Mika Westerberg, dri-devel

> > > > Like I said the device really is limited to 2.5 GT/s even
> > > > though it
> > > > should be able to do 8 GT/s.
> > > 
> > > There is Thunderbolt link between the host router (your host
> > > system)
> > > and
> > > the eGPU box. That link is not limited to 2.5 GT/s so even if the
> > > slot
> > > claims it is PCI gen1 the actual bandwidth can be much higher
> > > because
> > > of
> > > the virtual link.
> > 
> > Not sure I understand correctly, are you saying that TB3 can do 40
> > Gbit/sec even though the kernel thinks it can only do 8 Gbit / sec?
> > 
> > I haven't found a good way to measure the maximum PCIe throughput
> > between the CPU and GPU, but I did take a look at AMD's sysfs
> > interface
> > at /sys/class/drm/card1/device/pcie_bw which while running the
> > bottlenecked game. The highest throughput I saw there was only 2.43
> > Gbit /sec.
> > 
> > One more thought. I've also looked at
> > /sys/class/drm/card1/device/pp_dpm_pcie - which tells me that
> > amdgpu
> > thinks it is running on a 2.5GT/s x8 link (as opposed to the
> > expected 8
> > GT/s x4). Can this be a problem?
> 
> We limit the speed of the link the the driver to the max speed of any
> upstream links.  So if there are any links upstream limited to 2.5
> GT/s, it doesn't make sense to clock the local link up to faster
> speeds.
> 
> Alex

Hi Alex,

I have two concerns about it:

1. Why does amdgpu think that the link has 8 lanes, when it only has 4?

2. As far as I understood what Mika said, there isn't really a 2.5 GT/s
limitation there, since the virtual link should be running at 40 Gb/s
regardless of the reported speed of that device. Would it be possible
to run the AMD GPU at 8 GT/s in this case?

Best regards,
Tim


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 14:38           ` Timur Kristóf
@ 2019-07-01 14:46             ` Alex Deucher
  2019-07-01 15:10               ` Mika Westerberg
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2019-07-01 14:46 UTC (permalink / raw)
  To: Timur Kristóf; +Cc: michael.jamet, Mika Westerberg, dri-devel

On Mon, Jul 1, 2019 at 10:38 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
> > > > > Like I said the device really is limited to 2.5 GT/s even
> > > > > though it
> > > > > should be able to do 8 GT/s.
> > > >
> > > > There is Thunderbolt link between the host router (your host
> > > > system)
> > > > and
> > > > the eGPU box. That link is not limited to 2.5 GT/s so even if the
> > > > slot
> > > > claims it is PCI gen1 the actual bandwidth can be much higher
> > > > because
> > > > of
> > > > the virtual link.
> > >
> > > Not sure I understand correctly, are you saying that TB3 can do 40
> > > Gbit/sec even though the kernel thinks it can only do 8 Gbit / sec?
> > >
> > > I haven't found a good way to measure the maximum PCIe throughput
> > > between the CPU and GPU, but I did take a look at AMD's sysfs
> > > interface
> > > at /sys/class/drm/card1/device/pcie_bw which while running the
> > > bottlenecked game. The highest throughput I saw there was only 2.43
> > > Gbit /sec.
> > >
> > > One more thought. I've also looked at
> > > /sys/class/drm/card1/device/pp_dpm_pcie - which tells me that
> > > amdgpu
> > > thinks it is running on a 2.5GT/s x8 link (as opposed to the
> > > expected 8
> > > GT/s x4). Can this be a problem?
> >
> > We limit the speed of the link the the driver to the max speed of any
> > upstream links.  So if there are any links upstream limited to 2.5
> > GT/s, it doesn't make sense to clock the local link up to faster
> > speeds.
> >
> > Alex
>
> Hi Alex,
>
> I have two concerns about it:
>
> 1. Why does amdgpu think that the link has 8 lanes, when it only has 4?

Not sure.  We use pcie_bandwidth_available() on kernel 5.3 and newer
to determine the max speed and links when we set up the GPU.  For
older kernels we use an open-coded version of it in the driver.
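
A minimal sketch of that kind of query via the PCI core helper (not the
actual amdgpu code):

#include <linux/pci.h>

/* Ask the PCI core for the bandwidth actually available to the device,
 * i.e. the minimum of speed * width over every link between the device
 * and the root port. Returns Mb/s and reports the limiting link. */
static void report_available_bw(struct pci_dev *pdev)
{
	enum pci_bus_speed speed;
	enum pcie_link_width width;
	struct pci_dev *limiting = NULL;
	u32 bw;

	bw = pcie_bandwidth_available(pdev, &limiting, &speed, &width);
	pci_info(pdev, "available bandwidth: %u Mb/s (limited by %s)\n",
		 bw, limiting ? pci_name(limiting) : "local link");
}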

>
> 2. As far as I understood what Mika said, there isn't really a 2.5 GT/s
> limitation there, since the virtual link should be running at 40 Gb/s
> regardless of the reported speed of that device. Would it be possible
> to run the AMD GPU at 8 GT/s in this case?

If there is really a faster link here then we need some way to pass
that information to the drivers.  We rely on the information from the
upstream bridges and the pcie core helper functions.

Alex

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-06-28 12:21       ` Timur Kristóf
  2019-06-28 12:53         ` Mika Westerberg
  2019-07-01 14:28         ` Alex Deucher
@ 2019-07-01 14:54         ` Michel Dänzer
  2019-07-01 16:01           ` Timur Kristóf
  2 siblings, 1 reply; 34+ messages in thread
From: Michel Dänzer @ 2019-07-01 14:54 UTC (permalink / raw)
  To: Timur Kristóf, Mika Westerberg; +Cc: michael.jamet, dri-devel

On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
> 
> I haven't found a good way to measure the maximum PCIe throughput
> between the CPU and GPU,

amdgpu.benchmark=3

on the kernel command line will measure throughput for various transfer
sizes during driver initialization.


> but I did take a look at AMD's sysfs interface at
> /sys/class/drm/card1/device/pcie_bw which while running the bottlenecked
> game. The highest throughput I saw there was only 2.43 Gbit /sec.

PCIe bandwidth generally isn't a bottleneck for games, since they don't
constantly transfer large data volumes across PCIe, but store them in
the GPU's local VRAM, which is connected at much higher bandwidth.


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 14:46             ` Alex Deucher
@ 2019-07-01 15:10               ` Mika Westerberg
  0 siblings, 0 replies; 34+ messages in thread
From: Mika Westerberg @ 2019-07-01 15:10 UTC (permalink / raw)
  To: Alex Deucher; +Cc: michael.jamet, dri-devel, Timur Kristóf

On Mon, Jul 01, 2019 at 10:46:34AM -0400, Alex Deucher wrote:
> > 2. As far as I understood what Mika said, there isn't really a 2.5 GT/s
> > limitation there, since the virtual link should be running at 40 Gb/s
> > regardless of the reported speed of that device. Would it be possible
> > to run the AMD GPU at 8 GT/s in this case?
> 
> If there is really a faster link here then we need some way to pass
> that information to the drivers.  We rely on the information from the
> upstream bridges and the pcie core helper functions.

I think you may use "pci_dev->is_thunderbolt" in the GPU driver and then
just use whatever the real PCI link speed & width is between the GPU and
the downstream port it connects to.
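
A rough sketch of that idea (an illustration, not actual amdgpu code):

#include <linux/pci.h>

/* If any bridge above the GPU belongs to a Thunderbolt controller, the
 * reported upstream link speeds are not meaningful, so only look at the
 * GPU's own local link. */
static bool gpu_behind_thunderbolt(struct pci_dev *pdev)
{
	struct pci_dev *bridge;

	for (bridge = pci_upstream_bridge(pdev); bridge;
	     bridge = pci_upstream_bridge(bridge))
		if (bridge->is_thunderbolt)
			return true;

	return false;
}

static void report_local_link(struct pci_dev *pdev)
{
	u16 lnksta;

	/* Current speed/width of the link directly above the GPU only. */
	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
	pci_info(pdev, "local link: speed code %u, width x%u\n",
		 lnksta & PCI_EXP_LNKSTA_CLS,
		 (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT);
}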

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 14:54         ` Michel Dänzer
@ 2019-07-01 16:01           ` Timur Kristóf
  2019-07-02  8:09             ` Michel Dänzer
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-01 16:01 UTC (permalink / raw)
  To: Michel Dänzer, Mika Westerberg; +Cc: michael.jamet, dri-devel

On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
> On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
> > I haven't found a good way to measure the maximum PCIe throughput
> > between the CPU and GPU,
> 
> amdgpu.benchmark=3
> 
> on the kernel command line will measure throughput for various
> transfer
> sizes during driver initialization.

Thanks, I will definitely try that.
Is this the only way to do this, or is there a way to benchmark it
after the system has already booted?

> > but I did take a look at AMD's sysfs interface at
> > /sys/class/drm/card1/device/pcie_bw which while running the
> > bottlenecked
> > game. The highest throughput I saw there was only 2.43 Gbit /sec.
> 
> PCIe bandwidth generally isn't a bottleneck for games, since they
> don't
> constantly transfer large data volumes across PCIe, but store them in
> the GPU's local VRAM, which is connected at much higher bandwidth.

There are reasons why I think the problem is the bandwidth:
1. The same issues don't happen when the GPU is not used with a TB3
enclosure.
2. In the case of radeonsi, the problem was mitigated once Marek's SDMA
patch was merged, which hugely reduces the PCIe bandwidth use.
3. In less optimized cases (for example D9VK), the problem is still
very noticeable.

Best regards,
Tim


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-01 16:01           ` Timur Kristóf
@ 2019-07-02  8:09             ` Michel Dänzer
  2019-07-02  9:49               ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Michel Dänzer @ 2019-07-02  8:09 UTC (permalink / raw)
  To: Timur Kristóf, Mika Westerberg; +Cc: michael.jamet, dri-devel

On 2019-07-01 6:01 p.m., Timur Kristóf wrote:
> On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
>> On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
>>> I haven't found a good way to measure the maximum PCIe throughput
>>> between the CPU and GPU,
>>
>> amdgpu.benchmark=3
>>
>> on the kernel command line will measure throughput for various
>> transfer
>> sizes during driver initialization.
> 
> Thanks, I will definitely try that.
> Is this the only way to do this, or is there a way to benchmark it
> after it already booted?

The former. At least in theory, it's possible to unload the amdgpu
module while nothing is using it, then load it again.


>>> but I did take a look at AMD's sysfs interface at
>>> /sys/class/drm/card1/device/pcie_bw which while running the
>>> bottlenecked
>>> game. The highest throughput I saw there was only 2.43 Gbit /sec.
>>
>> PCIe bandwidth generally isn't a bottleneck for games, since they
>> don't
>> constantly transfer large data volumes across PCIe, but store them in
>> the GPU's local VRAM, which is connected at much higher bandwidth.
> 
> There are reasons why I think the problem is the bandwidth:
> 1. The same issues don't happen when the GPU is not used with a TB3
> enclosure.
> 2. In case of radeonsi, the problem was mitigated once Marek's SDMA
> patch was merged, which hugely reduces the PCIe bandwidth use.
> 3. In less optimized cases (for example D9VK), the problem is still
> very noticable.

However, since you saw as much as ~20 Gbit/s under different
circumstances, the 2.43 Gbit/s used by this game clearly isn't a hard
limit; there must be other limiting factors.


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-02  8:09             ` Michel Dänzer
@ 2019-07-02  9:49               ` Timur Kristóf
  2019-07-03  8:07                 ` Michel Dänzer
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-02  9:49 UTC (permalink / raw)
  To: Michel Dänzer, Mika Westerberg; +Cc: michael.jamet, dri-devel

On Tue, 2019-07-02 at 10:09 +0200, Michel Dänzer wrote:
> On 2019-07-01 6:01 p.m., Timur Kristóf wrote:
> > On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
> > > On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
> > > > I haven't found a good way to measure the maximum PCIe
> > > > throughput
> > > > between the CPU and GPU,
> > > 
> > > amdgpu.benchmark=3
> > > 
> > > on the kernel command line will measure throughput for various
> > > transfer
> > > sizes during driver initialization.
> > 
> > Thanks, I will definitely try that.
> > Is this the only way to do this, or is there a way to benchmark it
> > after it already booted?
> 
> The former. At least in theory, it's possible to unload the amdgpu
> module while nothing is using it, then load it again.

Okay, so I booted my system with amdgpu.benchmark=3
You can find the full dmesg log here: https://pastebin.com/zN9FYGw4

The result is between 1-5 Gbit/sec depending on the transfer size (the
larger the transfer, the better), which corresponds to neither the 8
Gbit/sec that the kernel thinks it is limited to, nor the 20 Gbit/sec
which I measured earlier with pcie_bw. Since pcie_bw only shows the
maximum PCIe packet size (and not the actual size), could it be that
it's so inaccurate that the 20 Gbit/sec was a fluke?

Side note: after booting with amdgpu.benchmark=3, the graphical session
was useless and outright hung the system after I logged in. So I had to
reboot into runlevel 3 to be able to save the above dmesg log.

> 
> > > > but I did take a look at AMD's sysfs interface at
> > > > /sys/class/drm/card1/device/pcie_bw which while running the
> > > > bottlenecked
> > > > game. The highest throughput I saw there was only 2.43 Gbit
> > > > /sec.
> > > 
> > > PCIe bandwidth generally isn't a bottleneck for games, since they
> > > don't
> > > constantly transfer large data volumes across PCIe, but store
> > > them in
> > > the GPU's local VRAM, which is connected at much higher
> > > bandwidth.
> > 
> > There are reasons why I think the problem is the bandwidth:
> > 1. The same issues don't happen when the GPU is not used with a TB3
> > enclosure.
> > 2. In case of radeonsi, the problem was mitigated once Marek's SDMA
> > patch was merged, which hugely reduces the PCIe bandwidth use.
> > 3. In less optimized cases (for example D9VK), the problem is still
> > very noticable.
> 
> However, since you saw as much as ~20 Gbit/s under different
> circumstances, the 2.43 Gbit/s used by this game clearly isn't a hard
> limit; there must be other limiting factors.

There may be other factors, yes. I can't offer a good explanation on
what exactly is happening, but it's pretty clear that amdgpu can't take
full advantage of the TB3 link, so it seemed like a good idea to start
investigating this first.


* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-02  9:49               ` Timur Kristóf
@ 2019-07-03  8:07                 ` Michel Dänzer
  2019-07-03 11:04                   ` Timur Kristóf
  2019-07-03 18:44                   ` Marek Olšák
  0 siblings, 2 replies; 34+ messages in thread
From: Michel Dänzer @ 2019-07-03  8:07 UTC (permalink / raw)
  To: Timur Kristóf, Mika Westerberg; +Cc: michael.jamet, dri-devel

On 2019-07-02 11:49 a.m., Timur Kristóf wrote:
> On Tue, 2019-07-02 at 10:09 +0200, Michel Dänzer wrote:
>> On 2019-07-01 6:01 p.m., Timur Kristóf wrote:
>>> On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
>>>> On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
>>>>> I haven't found a good way to measure the maximum PCIe
>>>>> throughput
>>>>> between the CPU and GPU,
>>>>
>>>> amdgpu.benchmark=3
>>>>
>>>> on the kernel command line will measure throughput for various
>>>> transfer
>>>> sizes during driver initialization.
>>>
>>> Thanks, I will definitely try that.
>>> Is this the only way to do this, or is there a way to benchmark it
>>> after it already booted?
>>
>> The former. At least in theory, it's possible to unload the amdgpu
>> module while nothing is using it, then load it again.
> 
> Okay, so I booted my system with amdgpu.benchmark=3
> You can find the full dmesg log here: https://pastebin.com/zN9FYGw4
> 
> The result is between 1-5 Gbit / sec depending on the transfer size
> (the higher the better), which corresponds to neither the 8 Gbit / sec
> that the kernel thinks it is limited to, nor the 20 Gbit / sec which I
> measured earlier with pcie_bw.

5 Gbit/s throughput could be consistent with 8 Gbit/s theoretical
bandwidth, due to various overhead.
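
(Rough arithmetic: 2.5 GT/s x 4 lanes with 8b/10b encoding gives 8 Gb/s
of raw link bandwidth; TLP/DLLP framing and per-transfer setup overhead
can plausibly account for the rest, so ~5 Gbit/s of payload is within
the expected range.)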


> Since pcie_bw only shows the maximum PCIe packet size (and not the
> actual size), could it be that it's so inaccurate that the 20 Gbit /
> sec is a fluke?

Seems likely or at least plausible.


>>>>> but I did take a look at AMD's sysfs interface at
>>>>> /sys/class/drm/card1/device/pcie_bw which while running the
>>>>> bottlenecked
>>>>> game. The highest throughput I saw there was only 2.43 Gbit
>>>>> /sec.
>>>>
>>>> PCIe bandwidth generally isn't a bottleneck for games, since they
>>>> don't
>>>> constantly transfer large data volumes across PCIe, but store
>>>> them in
>>>> the GPU's local VRAM, which is connected at much higher
>>>> bandwidth.
>>>
>>> There are reasons why I think the problem is the bandwidth:
>>> 1. The same issues don't happen when the GPU is not used with a TB3
>>> enclosure.
>>> 2. In case of radeonsi, the problem was mitigated once Marek's SDMA
>>> patch was merged, which hugely reduces the PCIe bandwidth use.
>>> 3. In less optimized cases (for example D9VK), the problem is still
>>> very noticable.
>>
>> However, since you saw as much as ~20 Gbit/s under different
>> circumstances, the 2.43 Gbit/s used by this game clearly isn't a hard
>> limit; there must be other limiting factors.
> 
> There may be other factors, yes. I can't offer a good explanation on
> what exactly is happening, but it's pretty clear that amdgpu can't take
> full advantage of the TB3 link, so it seemed like a good idea to start
> investigating this first.

Yeah, actually it would be consistent with ~16-32 KB granularity
transfers based on your measurements above, which is plausible. So
making sure that the driver doesn't artificially limit the PCIe
bandwidth might indeed help.

OTOH this also indicates a similar potential for improvement by using
larger transfers in Mesa and/or the kernel.


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-03  8:07                 ` Michel Dänzer
@ 2019-07-03 11:04                   ` Timur Kristóf
  2019-07-04  8:26                     ` Michel Dänzer
  2019-07-03 18:44                   ` Marek Olšák
  1 sibling, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-03 11:04 UTC (permalink / raw)
  To: Michel Dänzer, Mika Westerberg; +Cc: michael.jamet, dri-devel


> > Okay, so I booted my system with amdgpu.benchmark=3
> > You can find the full dmesg log here: https://pastebin.com/zN9FYGw4
> > 
> > The result is between 1-5 Gbit / sec depending on the transfer size
> > (the higher the better), which corresponds to neither the 8 Gbit /
> > sec
> > that the kernel thinks it is limited to, nor the 20 Gbit / sec
> > which I
> > measured earlier with pcie_bw.
> 
> 5 Gbit/s throughput could be consistent with 8 Gbit/s theoretical
> bandwidth, due to various overhead.

Okay, that's good to know.

> > Since pcie_bw only shows the maximum PCIe packet size (and not the
> > actual size), could it be that it's so inaccurate that the 20 Gbit
> > /
> > sec is a fluke?
> 
> Seems likely or at least plausible.

Thanks for the confirmation. It also looks like it is slowest with
small transfers, which is what I assume mesa is doing for this game.

> > 
> > There may be other factors, yes. I can't offer a good explanation
> > on
> > what exactly is happening, but it's pretty clear that amdgpu can't
> > take
> > full advantage of the TB3 link, so it seemed like a good idea to
> > start
> > investigating this first.
> 
> Yeah, actually it would be consistent with ~16-32 KB granularity
> transfers based on your measurements above, which is plausible. So
> making sure that the driver doesn't artificially limit the PCIe
> bandwidth might indeed help.

Can you point me to the place where amdgpu decides the PCIe link speed?
I'd like to try to tweak it a little bit to see if that helps at all.

> OTOH this also indicates a similar potential for improvement by using
> larger transfers in Mesa and/or the kernel.

Yes, that sounds like it would be worth looking into.

Out of curiosity, is there a performance decrease with small transfers
on a "normal" PCIe port too, or is this specific to TB3?

Best regards,
Tim



* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-03  8:07                 ` Michel Dänzer
  2019-07-03 11:04                   ` Timur Kristóf
@ 2019-07-03 18:44                   ` Marek Olšák
  2019-07-05  9:27                     ` Timur Kristóf
  1 sibling, 1 reply; 34+ messages in thread
From: Marek Olšák @ 2019-07-03 18:44 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: michael.jamet, Mika Westerberg, dri-devel, Timur Kristóf



You can run:
AMD_DEBUG=testdmaperf glxgears

It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
different methods of transferring data.

Marek

On Wed, Jul 3, 2019 at 4:07 AM Michel Dänzer <michel@daenzer.net> wrote:

> On 2019-07-02 11:49 a.m., Timur Kristóf wrote:
> > On Tue, 2019-07-02 at 10:09 +0200, Michel Dänzer wrote:
> >> On 2019-07-01 6:01 p.m., Timur Kristóf wrote:
> >>> On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
> >>>> On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
> >>>>> I haven't found a good way to measure the maximum PCIe
> >>>>> throughput
> >>>>> between the CPU and GPU,
> >>>>
> >>>> amdgpu.benchmark=3
> >>>>
> >>>> on the kernel command line will measure throughput for various
> >>>> transfer
> >>>> sizes during driver initialization.
> >>>
> >>> Thanks, I will definitely try that.
> >>> Is this the only way to do this, or is there a way to benchmark it
> >>> after it already booted?
> >>
> >> The former. At least in theory, it's possible to unload the amdgpu
> >> module while nothing is using it, then load it again.
> >
> > Okay, so I booted my system with amdgpu.benchmark=3
> > You can find the full dmesg log here: https://pastebin.com/zN9FYGw4
> >
> > The result is between 1-5 Gbit / sec depending on the transfer size
> > (the higher the better), which corresponds to neither the 8 Gbit / sec
> > that the kernel thinks it is limited to, nor the 20 Gbit / sec which I
> > measured earlier with pcie_bw.
>
> 5 Gbit/s throughput could be consistent with 8 Gbit/s theoretical
> bandwidth, due to various overhead.
>
>
> > Since pcie_bw only shows the maximum PCIe packet size (and not the
> > actual size), could it be that it's so inaccurate that the 20 Gbit /
> > sec is a fluke?
>
> Seems likely or at least plausible.
>
>
> >>>>> but I did take a look at AMD's sysfs interface at
> >>>>> /sys/class/drm/card1/device/pcie_bw which while running the
> >>>>> bottlenecked
> >>>>> game. The highest throughput I saw there was only 2.43 Gbit
> >>>>> /sec.
> >>>>
> >>>> PCIe bandwidth generally isn't a bottleneck for games, since they
> >>>> don't
> >>>> constantly transfer large data volumes across PCIe, but store
> >>>> them in
> >>>> the GPU's local VRAM, which is connected at much higher
> >>>> bandwidth.
> >>>
> >>> There are reasons why I think the problem is the bandwidth:
> >>> 1. The same issues don't happen when the GPU is not used with a TB3
> >>> enclosure.
> >>> 2. In case of radeonsi, the problem was mitigated once Marek's SDMA
> >>> patch was merged, which hugely reduces the PCIe bandwidth use.
> >>> 3. In less optimized cases (for example D9VK), the problem is still
> >>> very noticable.
> >>
> >> However, since you saw as much as ~20 Gbit/s under different
> >> circumstances, the 2.43 Gbit/s used by this game clearly isn't a hard
> >> limit; there must be other limiting factors.
> >
> > There may be other factors, yes. I can't offer a good explanation on
> > what exactly is happening, but it's pretty clear that amdgpu can't take
> > full advantage of the TB3 link, so it seemed like a good idea to start
> > investigating this first.
>
> Yeah, actually it would be consistent with ~16-32 KB granularity
> transfers based on your measurements above, which is plausible. So
> making sure that the driver doesn't artificially limit the PCIe
> bandwidth might indeed help.
>
> OTOH this also indicates a similar potential for improvement by using
> larger transfers in Mesa and/or the kernel.
>
>
> --
> Earthling Michel Dänzer               |              https://www.amd.com
> Libre software enthusiast             |             Mesa and X developer

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-03 11:04                   ` Timur Kristóf
@ 2019-07-04  8:26                     ` Michel Dänzer
  2019-07-05  9:17                       ` Timur Kristóf
  2019-07-05 13:36                       ` Alex Deucher
  0 siblings, 2 replies; 34+ messages in thread
From: Michel Dänzer @ 2019-07-04  8:26 UTC (permalink / raw)
  To: Timur Kristóf, Mika Westerberg, Alex Deucher
  Cc: michael.jamet, dri-devel

On 2019-07-03 1:04 p.m., Timur Kristóf wrote:
> 
>>> There may be other factors, yes. I can't offer a good explanation
>>> on
>>> what exactly is happening, but it's pretty clear that amdgpu can't
>>> take
>>> full advantage of the TB3 link, so it seemed like a good idea to
>>> start
>>> investigating this first.
>>
>> Yeah, actually it would be consistent with ~16-32 KB granularity
>> transfers based on your measurements above, which is plausible. So
>> making sure that the driver doesn't artificially limit the PCIe
>> bandwidth might indeed help.
> 
> Can you point me to the place where amdgpu decides the PCIe link speed?
> I'd like to try to tweak it a little bit to see if that helps at all.

I'm not sure offhand, Alex or anyone?


>> OTOH this also indicates a similar potential for improvement by using
>> larger transfers in Mesa and/or the kernel.
> 
> Yes, that sounds like it would be worth looking into.
> 
> Out of curiosity, is there a performance decrease with small transfers
> on a "normal" PCIe port too, or is this specific to TB3?

It's not TB3 specific. With a "normal" 8 GT/s x16 port, I get between
~256 MB/s for 4 KB transfers and ~12 GB/s for 4 MB transfers (even
larger transfers seem slightly slower again). This also looks consistent
with your measurements in that the practical limit seems to be around
75% of the theoretical bandwidth.
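
As a rough check of that figure (my own arithmetic, taking GB as 10^9
bytes and including PCIe 3.0's 128b/130b encoding):

$8\,\mathrm{GT/s} \times 16\ \mathrm{lanes} \times \tfrac{128}{130} \approx 126\,\mathrm{Gbit/s} \approx 15.75\,\mathrm{GB/s}$
$12\,\mathrm{GB/s} \,/\, 15.75\,\mathrm{GB/s} \approx 0.76$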


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-04  8:26                     ` Michel Dänzer
@ 2019-07-05  9:17                       ` Timur Kristóf
  2019-07-05 13:36                       ` Alex Deucher
  1 sibling, 0 replies; 34+ messages in thread
From: Timur Kristóf @ 2019-07-05  9:17 UTC (permalink / raw)
  To: Michel Dänzer, Mika Westerberg, Alex Deucher
  Cc: michael.jamet, dri-devel


> > Can you point me to the place where amdgpu decides the PCIe link
> > speed?
> > I'd like to try to tweak it a little bit to see if that helps at
> > all.
> 
> I'm not sure offhand, Alex or anyone?

I started by looking at how the pp_dpm_pcie sysfs interface
works, and found that smu7_hwmgr seems to be the only hwmgr that
actually outputs anything on PP_PCIE:
https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c#L4462

However, its output is definitely incorrect. It tells me that the
supported PCIe modes are:
cat /sys/class/drm/card1/device/pp_dpm_pcie 
0: 2.5GT/s, x8 
1: 8.0GT/s, x16

It allows me to change between these two modes, but the change doesn't
seem to have any actual effect on the transfer speeds.

Neither of those modes actually makes sense. Amdgpu doesn't seem to be
aware of the fact that it runs on an x4 link. In fact, the
smu7_get_current_pcie_lane_number function even has an assertion:
PP_ASSERT_WITH_CODE((7 >= link_width),

On the other hand:
cat /sys/class/drm/card1/device/current_link_width
4

So I don't understand how it can even work with PCIe x4, why doesn't
that assertion get triggered on my system?

> > Out of curiosity, is there a performance decrease with small
> > transfers
> > on a "normal" PCIe port too, or is this specific to TB3?
> 
> It's not TB3 specific. With a "normal" 8 GT/s x16 port, I get between
> ~256 MB/s for 4 KB transfers and ~12 GB/s for 4 MB transfers (even
> larger transfers seem slightly slower again). This also looks
> consistent
> with your measurements in that the practical limit seems to be around
> 75% of the theoretical bandwidth.

Sounds like your suggestion to optimize Mesa to use larger transfers
is worth pursuing, then.

Best regards,
Tim

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-03 18:44                   ` Marek Olšák
@ 2019-07-05  9:27                     ` Timur Kristóf
  2019-07-05 15:35                       ` Marek Olšák
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-05  9:27 UTC (permalink / raw)
  To: Marek Olšák, Michel Dänzer
  Cc: michael.jamet, Mika Westerberg, dri-devel

On Wed, 2019-07-03 at 14:44 -0400, Marek Olšák wrote:
> You can run:
> AMD_DEBUG=testdmaperf glxgears
> 
> It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
> different methods of transferring data.
> 
> Marek


Thanks Marek, I didn't know about that option.
Tried it, here is the output: https://pastebin.com/raw/9SAAbbAA

I'm not quite sure how to interpret the numbers; they are inconsistent
with the results from both pcie_bw and amdgpu.benchmark. For example,
GTT->VRAM at 128 KB is around 1400 MB/s (I assume that is megabytes per
second, right?).

It is also weird that, unlike with amdgpu.benchmark, the larger-than-128 KB
transfers didn't actually get slightly slower.

Michel, can you make sense of this?

Best regards,
Tim

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-04  8:26                     ` Michel Dänzer
  2019-07-05  9:17                       ` Timur Kristóf
@ 2019-07-05 13:36                       ` Alex Deucher
  2019-07-18  9:11                         ` Timur Kristóf
  1 sibling, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2019-07-05 13:36 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: michael.jamet, Mika Westerberg, dri-devel, Timur Kristóf

On Thu, Jul 4, 2019 at 6:55 AM Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-07-03 1:04 p.m., Timur Kristóf wrote:
> >
> >>> There may be other factors, yes. I can't offer a good explanation
> >>> on
> >>> what exactly is happening, but it's pretty clear that amdgpu can't
> >>> take
> >>> full advantage of the TB3 link, so it seemed like a good idea to
> >>> start
> >>> investigating this first.
> >>
> >> Yeah, actually it would be consistent with ~16-32 KB granularity
> >> transfers based on your measurements above, which is plausible. So
> >> making sure that the driver doesn't artificially limit the PCIe
> >> bandwidth might indeed help.
> >
> > Can you point me to the place where amdgpu decides the PCIe link speed?
> > I'd like to try to tweak it a little bit to see if that helps at all.
>
> I'm not sure offhand, Alex or anyone?

amdgpu_device_get_pcie_info() in amdgpu_device.c.

>
>
> >> OTOH this also indicates a similar potential for improvement by using
> >> larger transfers in Mesa and/or the kernel.
> >
> > Yes, that sounds like it would be worth looking into.
> >
> > > Out of curiosity, is there a performance decrease with small transfers
> > on a "normal" PCIe port too, or is this specific to TB3?
>
> It's not TB3 specific. With a "normal" 8 GT/s x16 port, I get between
> ~256 MB/s for 4 KB transfers and ~12 GB/s for 4 MB transfers (even
> larger transfers seem slightly slower again). This also looks consistent
> with your measurements in that the practical limit seems to be around
> 75% of the theoretical bandwidth.
>
>
> --
> Earthling Michel Dänzer               |              https://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-05  9:27                     ` Timur Kristóf
@ 2019-07-05 15:35                       ` Marek Olšák
  2019-07-05 16:01                         ` Timur Kristóf
       [not found]                         ` <8f0c2d7780430d40dd1e17a82484d236eae3f981.camel@gmail.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Marek Olšák @ 2019-07-05 15:35 UTC (permalink / raw)
  To: Timur Kristóf
  Cc: michael.jamet, Michel Dänzer, Mika Westerberg, dri-devel



On Fri, Jul 5, 2019 at 5:27 AM Timur Kristóf <timur.kristof@gmail.com>
wrote:

> On Wed, 2019-07-03 at 14:44 -0400, Marek Olšák wrote:
> > You can run:
> > AMD_DEBUG=testdmaperf glxgears
> >
> > It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
> > different methods of transferring data.
> >
> > Marek
>
>
> Thanks Marek, I didn't know about that option.
> Tried it, here is the output: https://pastebin.com/raw/9SAAbbAA
>
> I'm not quite sure how to interpret the numbers; they are inconsistent
> with the results from both pcie_bw and amdgpu.benchmark. For example,
> GTT->VRAM at 128 KB is around 1400 MB/s (I assume that is megabytes per
> second, right?).
>

Based on the SDMA results, you have 2.4 GB/s. For 128KB, it's 2.2 GB/s for
GTT->VRAM copies.

Marek


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-05 15:35                       ` Marek Olšák
@ 2019-07-05 16:01                         ` Timur Kristóf
       [not found]                         ` <8f0c2d7780430d40dd1e17a82484d236eae3f981.camel@gmail.com>
  1 sibling, 0 replies; 34+ messages in thread
From: Timur Kristóf @ 2019-07-05 16:01 UTC (permalink / raw)
  To: maraeo; +Cc: michael.jamet, michel, mika.westerberg, dri-devel



On Friday, 5 July 2019, Marek Olšák wrote:
> On Fri, Jul 5, 2019 at 5:27 AM Timur Kristóf <timur.kristof@gmail.com>
> wrote:
> 
> > On Wed, 2019-07-03 at 14:44 -0400, Marek Olšák wrote:
> > > You can run:
> > > AMD_DEBUG=testdmaperf glxgears
> > >
> > > It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
> > > different methods of transferring data.
> > >
> > > Marek
> >
> >
> > Thanks Marek, I didn't know about that option.
> > Tried it, here is the output: https://pastebin.com/raw/9SAAbbAA
> >
> > I'm not quite sure how to interpret the numbers; they are inconsistent
> > with the results from both pcie_bw and amdgpu.benchmark. For example,
> > GTT->VRAM at 128 KB is around 1400 MB/s (I assume that is megabytes per
> > second, right?).
> >
> 
> Based on the SDMA results, you have 2.4 GB/s. For 128KB, it's 2.2 GB/s for
> GTT->VRAM copies.
> 
> Marek

That's interesting; AFAIU that would be 17.6 Gbit/sec. But how can that be so much faster than the 5 Gbit/sec result from amdgpu.benchmark?
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-05 13:36                       ` Alex Deucher
@ 2019-07-18  9:11                         ` Timur Kristóf
  2019-07-18 13:50                           ` Alex Deucher
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-18  9:11 UTC (permalink / raw)
  To: Alex Deucher, Michel Dänzer
  Cc: michael.jamet, Mika Westerberg, dri-devel

On Fri, 2019-07-05 at 09:36 -0400, Alex Deucher wrote:
> On Thu, Jul 4, 2019 at 6:55 AM Michel Dänzer <michel@daenzer.net>
> wrote:
> > On 2019-07-03 1:04 p.m., Timur Kristóf wrote:
> > > > > There may be other factors, yes. I can't offer a good
> > > > > explanation
> > > > > on
> > > > > what exactly is happening, but it's pretty clear that amdgpu
> > > > > can't
> > > > > take
> > > > > full advantage of the TB3 link, so it seemed like a good idea
> > > > > to
> > > > > start
> > > > > investigating this first.
> > > > 
> > > > Yeah, actually it would be consistent with ~16-32 KB
> > > > granularity
> > > > transfers based on your measurements above, which is plausible.
> > > > So
> > > > making sure that the driver doesn't artificially limit the PCIe
> > > > bandwidth might indeed help.
> > > 
> > > Can you point me to the place where amdgpu decides the PCIe link
> > > speed?
> > > I'd like to try to tweak it a little bit to see if that helps at
> > > all.
> > 
> > I'm not sure offhand, Alex or anyone?
> 
> amdgpu_device_get_pcie_info() in amdgpu_device.c.


Hi Alex,

I took a look at amdgpu_device_get_pcie_info() and found that it uses
pcie_bandwidth_available to determine the capabilities of the PCIe
port. However, pcie_bandwidth_available gives you only the current
bandwidth as set by the PCIe link status register, not the maximum
capability.

I think something along these lines would fix it:
https://pastebin.com/LscEMKMc

It seems to me that the PCIe capabilities are only used in a few places
in the code, so this patch fixes pp_dpm_pcie. However, it doesn't affect
the actual performance.
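
To make the distinction concrete, here is a minimal sketch of what I
mean (this is not the pastebin patch, only an illustration using PCI
core helpers that already exist in mainline; the function name is made
up):

#include <linux/pci.h>

/*
 * Illustration only: contrast the bandwidth negotiated along the whole
 * upstream chain (what pcie_bandwidth_available() reports, i.e. the
 * speed/width of the slowest link between the device and the root port,
 * read from the link status registers) with the capability of the
 * device's own link, taken from PCI_EXP_LNKCAP.
 */
static void pcie_compare_current_vs_cap(struct pci_dev *pdev)
{
        enum pci_bus_speed cur_speed, cap_speed;
        enum pcie_link_width cur_width, cap_width;
        u32 avail_mbps;

        /* Negotiated state of the chain, limited by its slowest link. */
        avail_mbps = pcie_bandwidth_available(pdev, NULL, &cur_speed,
                                              &cur_width);

        /* Capability of the local link only. */
        cap_speed = pcie_get_speed_cap(pdev);
        cap_width = pcie_get_width_cap(pdev);

        dev_info(&pdev->dev,
                 "chain: %u Mb/s (speed enum %d, x%d); local cap: speed enum %d, x%d\n",
                 avail_mbps, cur_speed, cur_width, cap_speed, cap_width);
}

If the two disagree, some link between the GPU and the root port
(possibly the GPU's own) was negotiated below what the GPU's link is
capable of.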

What do you think?

Best regards,
Tim

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
       [not found]                         ` <8f0c2d7780430d40dd1e17a82484d236eae3f981.camel@gmail.com>
@ 2019-07-18 10:29                           ` Michel Dänzer
  2019-07-22  9:39                             ` Timur Kristóf
  0 siblings, 1 reply; 34+ messages in thread
From: Michel Dänzer @ 2019-07-18 10:29 UTC (permalink / raw)
  To: Timur Kristóf, Marek Olšák
  Cc: michael.jamet, Mika Westerberg, dri-devel

On 2019-07-18 11:06 a.m., Timur Kristóf wrote:
>>> Thanks Marek, I didn't know about that option.
>>> Tried it, here is the output: https://pastebin.com/raw/9SAAbbAA
>>>
>>> I'm not quite sure how to interpret the numbers, they are
>>> inconsistent
>>> with the results from both pcie_bw and amdgpu.benchmark, for
>>> example
>>> GTT->VRAM at a 128 KB is around 1400 MB/s (I assume that is
>>> megabytes /
>>> sec, right?).
>>
>> Based on the SDMA results, you have 2.4 GB/s. For 128KB, it's 2.2
>> GB/s for GTT->VRAM copies.
> 
> In the meantime I had a chat with Michel on IRC and he suggested that
> maybe amdgpu.benchmark=3 gives lower results because it uses a less
> than optimal way to do the benchmark.
> 
> Looking at the results from the mesa benchmark a bit more closely, I
> see that the SDMA can do:
> VRAM->GTT: 3087 MB/s = 24 Gbit/sec
> GTT->VRAM: 2433 MB/s = 19 Gbit/sec
> 
> So on Polaris at least, the SDMA is the fastest, and the other transfer
> methods can't match it. I also did the same test on Navi, where it's
> different: all other transfer methods are much closer to the SDMA, but
> the max speed is still around 20-24 Gbit / sec.
> 
> I still have a few questions:
> 
> 1. Why is the GTT->VRAM copy so much slower than the VRAM->GTT copy?
> 
> 2. Why is the bus limited to 24 Gbit/sec? I would expect the
> Thunderbolt port to give me at least 32 Gbit/sec for PCIe traffic.

That's unrealistic I'm afraid. As I said on IRC, from the GPU POV
there's an 8 GT/s x4 PCIe link, so ~29.8 Gbit/s (= 32 billion bit/s; I
missed this nuance on IRC) is the theoretical raw bandwidth. However, in
practice that's not achievable due to various overhead[0], and I'm only
seeing up to ~90% utilization of the theoretical bandwidth with a
"normal" x16 link as well. I wouldn't expect higher utilization without
seeing some evidence to suggest it's possible.


[0] According to
https://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
, PCIe 3.0 uses 1.54% of the raw bandwidth for its internal encoding.
Also keep in mind all CPU<->GPU communication has to go through the PCIe
link, e.g. for programming the transfers, in-band signalling from the
GPU to the PCIe port where the data is being transferred to/from, ...
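
For completeness, the arithmetic behind those numbers (treating Gibit/s
as 2^30 bit/s, consistent with the figures above):

$8\,\mathrm{GT/s} \times 4\ \mathrm{lanes} = 32 \times 10^{9}\,\mathrm{bit/s} \approx 29.8\,\mathrm{Gibit/s}$ (raw)
$32 \times 10^{9}\,\mathrm{bit/s} \times \tfrac{128}{130} \approx 31.5 \times 10^{9}\,\mathrm{bit/s} \approx 29.3\,\mathrm{Gibit/s}$ (128b/130b payload)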

-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-18  9:11                         ` Timur Kristóf
@ 2019-07-18 13:50                           ` Alex Deucher
       [not found]                             ` <172a41d97d383a8989ebd213bb4230a2df4d636d.camel@gmail.com>
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Deucher @ 2019-07-18 13:50 UTC (permalink / raw)
  To: Timur Kristóf
  Cc: michael.jamet, Michel Dänzer, Mika Westerberg, dri-devel

On Thu, Jul 18, 2019 at 5:11 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
> On Fri, 2019-07-05 at 09:36 -0400, Alex Deucher wrote:
> > On Thu, Jul 4, 2019 at 6:55 AM Michel Dänzer <michel@daenzer.net>
> > wrote:
> > > On 2019-07-03 1:04 p.m., Timur Kristóf wrote:
> > > > > > There may be other factors, yes. I can't offer a good
> > > > > > explanation
> > > > > > on
> > > > > > what exactly is happening, but it's pretty clear that amdgpu
> > > > > > can't
> > > > > > take
> > > > > > full advantage of the TB3 link, so it seemed like a good idea
> > > > > > to
> > > > > > start
> > > > > > investigating this first.
> > > > >
> > > > > Yeah, actually it would be consistent with ~16-32 KB
> > > > > granularity
> > > > > transfers based on your measurements above, which is plausible.
> > > > > So
> > > > > making sure that the driver doesn't artificially limit the PCIe
> > > > > bandwidth might indeed help.
> > > >
> > > > Can you point me to the place where amdgpu decides the PCIe link
> > > > speed?
> > > > I'd like to try to tweak it a little bit to see if that helps at
> > > > all.
> > >
> > > I'm not sure offhand, Alex or anyone?
> >
> > amdgpu_device_get_pcie_info() in amdgpu_device.c.
>
>
> Hi Alex,
>
> I took a look at amdgpu_device_get_pcie_info() and found that it uses
> pcie_bandwidth_available to determine the capabilities of the PCIe
> port. However, pcie_bandwidth_available gives you only the current
> bandwidth as set by the PCIe link status register, not the maximum
> capability.
>
> I think something along these lines would fix it:
> https://pastebin.com/LscEMKMc
>
> It seems to me that the PCIe capabilities are only used in a few places
> in the code, so this patch fixes pp_dpm_pcie. However it doesn't affect
> the actual performance.
>
> What do you think?

I think we want the current bandwidth.  The GPU can only control the
speed of its local link.  If there are upstream links that are slower
than its local link, it doesn't make sense to run the local link at
faster speeds because it will burn extra power and just run into a
bottleneck at the next link.  In general, most systems negotiate the
fastest link speed supported by both ends at power up.

Alex

>
> Best regards,
> Tim
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
       [not found]                             ` <172a41d97d383a8989ebd213bb4230a2df4d636d.camel@gmail.com>
@ 2019-07-19 14:29                               ` Alex Deucher
  0 siblings, 0 replies; 34+ messages in thread
From: Alex Deucher @ 2019-07-19 14:29 UTC (permalink / raw)
  To: Timur Kristóf
  Cc: michael.jamet, Michel Dänzer, Mika Westerberg, dri-devel

On Thu, Jul 18, 2019 at 10:38 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
>
> > >
> > > I took a look at amdgpu_device_get_pcie_info() and found that it
> > > uses
> > > pcie_bandwidth_available to determine the capabilities of the PCIe
> > > port. However, pcie_bandwidth_available gives you only the current
> > > bandwidth as set by the PCIe link status register, not the maximum
> > > capability.
> > >
> > > I think something along these lines would fix it:
> > > https://pastebin.com/LscEMKMc
> > >
> > > It seems to me that the PCIe capabilities are only used in a few
> > > places
> > > in the code, so this patch fixes pp_dpm_pcie. However it doesn't
> > > affect
> > > the actual performance.
> > >
> > > What do you think?
> >
> > I think we want the current bandwidth.  The GPU can only control the
> > speed of its local link.  If there are upstream links that are slower
> > than its local link, it doesn't make sense to run the local link at
> > faster speeds because it will burn extra power it will just run into
> > a
> > bottleneck at the next link.  In general, most systems negotiate the
> > fastest link speed supported by both ends at power up.
> >
> > Alex
>
> Currently, if the GPU is connected to a TB3 port, the driver thinks that
> 2.5 GT/s is the best speed that it can use, even though the hardware
> itself uses 8 GT/s. So what the driver thinks is inconsistent with what
> the hardware does. This messes up pp_dpm_pcie.
>
> As far as I understand, PCIe bridge devices can change their link speed
> at runtime based on how they are used or what power state they are in,
> so it makes sense here to request the best speed they are capable of. I
> might be wrong about that.

I don't know of any bridges offhand that change their link speeds on
demand.  That said, I'm certainly not a PCI expert.  Our GPUs for
instance have a micro-controller on them which changes the speed on
demand.  Presumably other devices would need something similar.

>
> If you think this change is undesirable, then maybe it would be worth
> following Mika's suggestion and adding something along the lines of
> dev->is_thunderbolt so that the correct available bandwidth could still
> be determined.

Ideally, it would be added to the core pci helpers so that each driver
that uses them doesn't have to duplicate the same functionality.
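
Roughly along these lines, as a sketch only (the name and exact
semantics are made up; nothing like this exists in mainline as-is):

#include <linux/pci.h>

/*
 * Hypothetical core helper: the weakest *capability* (not negotiated
 * state) of any link between @dev and the root port, so a driver can
 * tell what the chain could do at best.
 */
static void pcie_chain_caps(struct pci_dev *dev, enum pci_bus_speed *speed,
                            enum pcie_link_width *width)
{
        *speed = PCI_SPEED_UNKNOWN;
        *width = PCIE_LNK_WIDTH_UNKNOWN;

        while (dev) {
                enum pci_bus_speed s = pcie_get_speed_cap(dev);
                enum pcie_link_width w = pcie_get_width_cap(dev);

                /* Track the slowest/narrowest capability along the chain;
                 * the PCIe values of enum pci_bus_speed are ordered by speed. */
                if (*speed == PCI_SPEED_UNKNOWN || s < *speed)
                        *speed = s;
                if (*width == PCIE_LNK_WIDTH_UNKNOWN || w < *width)
                        *width = w;

                dev = pci_upstream_bridge(dev);
        }
}

A driver could then compare that against what pcie_bandwidth_available()
returns for the negotiated state.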

Alex


>
> Tim
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-18 10:29                           ` Michel Dänzer
@ 2019-07-22  9:39                             ` Timur Kristóf
  2019-07-23  8:11                               ` Michel Dänzer
  0 siblings, 1 reply; 34+ messages in thread
From: Timur Kristóf @ 2019-07-22  9:39 UTC (permalink / raw)
  To: Michel Dänzer, Marek Olšák
  Cc: michael.jamet, Mika Westerberg, dri-devel


> > 
> > 1. Why is the GTT->VRAM copy so much slower than the VRAM->GTT
> > copy?
> > 
> > 2. Why is the bus limited to 24 Gbit/sec? I would expect the
> > Thunderbolt port to give me at least 32 Gbit/sec for PCIe traffic.
> 
> That's unrealistic I'm afraid. As I said on IRC, from the GPU POV
> there's an 8 GT/s x4 PCIe link, so ~29.8 Gbit/s (= 32 billion bit/s;
> I
> missed this nuance on IRC) is the theoretical raw bandwidth. However,
> in
> practice that's not achievable due to various overhead[0], and I'm
> only
> seeing up to ~90% utilization of the theoretical bandwidth with a
> "normal" x16 link as well. I wouldn't expect higher utilization
> without
> seeing some evidence to suggest it's possible.
> 
> 
> [0] According to
> https://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
> , PCIe 3.0 uses 1.54% of the raw bandwidth for its internal encoding.
> Also keep in mind all CPU<->GPU communication has to go through the
> PCIe
> link, e.g. for programming the transfers, in-band signalling from the
> GPU to the PCIe port where the data is being transferred to/from, ...

Good point, I used 1024 and not 1000. My mistake.

There is something else:
In the same benchmark there is a "fill->GTT  ,SDMA" row which has a
4035 MB/s number. If that traffic goes through the TB3 interface then
we just found our 32 Gbit/sec.

Now the question is, if I understand this correctly and the SDMA can
indeed do 32 Gbit/sec for "fill->GTT", then why can't it do the same
with other kinds of transfers? Not sure if there is a good answer to
that question though.

Also I still don't fully understand why GTT->VRAM is slower than
VRAM->GTT, when the bandwidth is clearly available.

Best regards,
Tim



Side note: with regards to that 1.5% figure, the TB3 tech brief[0]
explicitly mentions this and says that it isn't carried over: "the
underlying protocol uses some data to provide encoding overhead which
is not carried over the Thunderbolt 3 link reducing the consumed
bandwidth by roughly 20 percent (DisplayPort) or 1.5 percent (PCI
Express Gen 3)"

[0] https://thunderbolttechnology.net/sites/default/files/Thunderbolt3_TechBrief_FINAL.pdf

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?
  2019-07-22  9:39                             ` Timur Kristóf
@ 2019-07-23  8:11                               ` Michel Dänzer
  0 siblings, 0 replies; 34+ messages in thread
From: Michel Dänzer @ 2019-07-23  8:11 UTC (permalink / raw)
  To: Timur Kristóf, Marek Olšák
  Cc: michael.jamet, Mika Westerberg, dri-devel

On 2019-07-22 11:39 a.m., Timur Kristóf wrote:
>>>
>>> 1. Why is the GTT->VRAM copy so much slower than the VRAM->GTT
>>> copy?
>>>
>>> 2. Why is the bus limited to 24 Gbit/sec? I would expect the
>>> Thunderbolt port to give me at least 32 Gbit/sec for PCIe traffic.
>>
>> That's unrealistic I'm afraid. As I said on IRC, from the GPU POV
>> there's an 8 GT/s x4 PCIe link, so ~29.8 Gbit/s (= 32 billion bit/s;
>> I
>> missed this nuance on IRC) is the theoretical raw bandwidth. However,
>> in
>> practice that's not achievable due to various overhead[0], and I'm
>> only
>> seeing up to ~90% utilization of the theoretical bandwidth with a
>> "normal" x16 link as well. I wouldn't expect higher utilization
>> without
>> seeing some evidence to suggest it's possible.
>>
>>
>> [0] According to
>> https://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
>> , PCIe 3.0 uses 1.54% of the raw bandwidth for its internal encoding.
>> Also keep in mind all CPU<->GPU communication has to go through the
>> PCIe
>> link, e.g. for programming the transfers, in-band signalling from the
>> GPU to the PCIe port where the data is being transferred to/from, ...
> 
> Good point, I used 1024 and not 1000. My mistake.
> 
> There is something else:
> In the same benchmark there is a "fill->GTT  ,SDMA" row which has a
> 4035 MB/s number. If that traffic goes through the TB3 interface then
> we just found our 32 Gbit/sec.

The GPU is only connected to the host via PCIe, there's nowhere else it
could go through.


> Now the question is, if I understand this correctly and the SDMA can
> indeed do 32 Gbit/sec for "fill->GTT", then why can't it do the same
> with other kinds of transfers? Not sure if there is a good answer to
> that question though.
> 
> Also I still don't fully understand why GTT->VRAM is slower than
> VRAM->GTT, when the bandwidth is clearly available.

While those are interesting questions at some level, I don't think they
will get us closer to solving your problem. It comes down to identifying
inefficient transfers across PCIe and optimizing them.


> Side note: with regards to that 1.5% figure, the TB3 tech brief[0]
> explicitly mentions this and says that it isn't carried over: "the
> underlying protocol uses some data to provide encoding overhead which
> is not carried over the Thunderbolt 3 link reducing the consumed
> bandwidth by roughly 20 percent (DisplayPort) or 1.5 percent (PCI
> Express Gen 3)"

That just means the internal TB3 link only carries the payload data from
the PCIe link, not the 1.5% of bits used for the PCIe encoding. TB3
cannot magically make the PCIe link itself work without the encoding.


-- 
Earthling Michel Dänzer               |              https://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-07-23  8:11 UTC | newest]

Thread overview: 34+ messages
2019-06-28 10:23 Why is Thunderbolt 3 limited to 2.5 GT/s on Linux? Timur Kristóf
2019-06-28 10:32 ` Mika Westerberg
2019-06-28 11:08   ` Timur Kristóf
2019-06-28 11:34     ` Mika Westerberg
2019-06-28 12:21       ` Timur Kristóf
2019-06-28 12:53         ` Mika Westerberg
2019-06-28 13:33           ` Timur Kristóf
2019-06-28 14:14             ` Mika Westerberg
2019-06-28 14:53               ` Timur Kristóf
2019-07-01 11:44                 ` Mika Westerberg
2019-07-01 14:25                   ` Timur Kristóf
2019-07-01 14:28         ` Alex Deucher
2019-07-01 14:38           ` Timur Kristóf
2019-07-01 14:46             ` Alex Deucher
2019-07-01 15:10               ` Mika Westerberg
2019-07-01 14:54         ` Michel Dänzer
2019-07-01 16:01           ` Timur Kristóf
2019-07-02  8:09             ` Michel Dänzer
2019-07-02  9:49               ` Timur Kristóf
2019-07-03  8:07                 ` Michel Dänzer
2019-07-03 11:04                   ` Timur Kristóf
2019-07-04  8:26                     ` Michel Dänzer
2019-07-05  9:17                       ` Timur Kristóf
2019-07-05 13:36                       ` Alex Deucher
2019-07-18  9:11                         ` Timur Kristóf
2019-07-18 13:50                           ` Alex Deucher
     [not found]                             ` <172a41d97d383a8989ebd213bb4230a2df4d636d.camel@gmail.com>
2019-07-19 14:29                               ` Alex Deucher
2019-07-03 18:44                   ` Marek Olšák
2019-07-05  9:27                     ` Timur Kristóf
2019-07-05 15:35                       ` Marek Olšák
2019-07-05 16:01                         ` Timur Kristóf
     [not found]                         ` <8f0c2d7780430d40dd1e17a82484d236eae3f981.camel@gmail.com>
2019-07-18 10:29                           ` Michel Dänzer
2019-07-22  9:39                             ` Timur Kristóf
2019-07-23  8:11                               ` Michel Dänzer
