linux-pci.vger.kernel.org archive mirror
* Re: Understanding P2P DMA related errors
       [not found] <CAFGSPrzM_pRZ-JEWimKYDPzv76t_Nw2Q6od19S_3dzbG_0-bDA@mail.gmail.com>
@ 2022-10-04 23:44 ` Logan Gunthorpe
       [not found]   ` <CAFGSPrz2ym5oEot9gLi3Z38PWS5A_wCFM4OWk36U_RazDMR67A@mail.gmail.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Logan Gunthorpe @ 2022-10-04 23:44 UTC (permalink / raw)
  To: Ramesh Errabolu, linux-pci



On 2022-10-04 17:23, Ramesh Errabolu wrote:
> Could I request some help in understanding some PCIe P2P related errors.
> 
>     [72.896624] amdgpu 0000:67:00.0: cannot be used for peer-to-peer DMA
>     as the client and provider (0000:19:00.0) do not share an upstream
>     bridge or whitelisted host bridge
> 
> 
> *System Information*:
> 
>   * The kernel is tagged as 5.14.21
>   * The last entries in the whitelist are {PCI_VENDOR_ID_INTEL, <id>, 0}
>     for device IDs 0x2030, 0x2031, 0x2032, 0x2033 and 0x2020
>       o p2pdma.c - LINK
>         <https://elixir.bootlin.com/linux/v5.14.21/source/drivers/pci/p2pdma.c>
>   * Output of PCIe device on the system that might reference root
>     complex is:
>       o fe:00.3 Host bridge [0600]: Intel Corporation Device [8086:0998]
>       o Could you confirm whether the command I ran is correct? I am not sure:
>       o *sudo lspci -nn | grep -C 1 -i host*
>       o If the above command is not correct, how can I get the root complex
>         device's ID correctly?
> 
> I tried to reason if the two AMD devices are connected to two different
> root complex devices. Looking at the PCIe device tree, I don't see that
> to be the case. Perhaps I am not interpreting the PCIe device tree
> correctly. Including below a short fragment:
> 
> 
>      +-[0000:e2]-+-00.0  Intel Corporation Device 09a2
>      |           +-00.1  Intel Corporation Device 09a4
>      |           +-00.2  Intel Corporation Device 09a3
>      |           +-00.4  Intel Corporation Device 0998
>      |           \-02.0-[e3-e5]----00.0-[e4-e5]----00.0-[e5]----00.0  Advanced Micro Devices, Inc. [AMD/ATI]
> 
>     I am reading this as follows:
> 
>       o Device E2:02.0, an Intel PCI bridge, is connected to Domain 0000
>       o Device E3:00.0, a PCI bridge from AMD, is connected to Intel PCI
>         bridge device E2:02.0
>       o Device E4:00.0, a PCI bridge from AMD is connected to AMD PCI
>         bridge device E3:00.0
>       o Device E5:00.0, a Display controller is connected to AMD PCI
>         bridge E4:00.0
> 
> Per my reading, in the above tree devices *E2:02.0* (*8086:347A*)
> and *E2:00.4* (*8086:09A2*) are not connected to each other directly.
> More importantly they should be considered as PEERs / SIBLINGs.
> Downstream from E2:02.0 is the AMD device E5:00.0 (*1002:740F*). In this
> reading AMD device is not connected to the root complex device. A
> similar pattern is seen with regards to other AMD devices. Basically all
> of the AMD devices connect to the domain (*0000*) via different buses.
> Importantly in their connection to the domain there is no root complex
> device. *Is my reading WRONG*? What is also not clear is how adding the
> device *8086:09A2* to the whitelist helps as the packets do not go
> through that device?
> 

Hmm, looks like a really new Ice-Lake system. Doesn't even have proper
PCI database entries yet. The topology seems a bit unusual, but those
have been getting ever stranger with each new generation.

09a2 looks like the host bridge device id. I'd probably try adding that
to the white list and see what happens.

Logan
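
For readers following along, here is a minimal Python model of the
host-bridge whitelist check being discussed. It is a sketch, not the kernel
code. It assumes (from a reading of drivers/pci/p2pdma.c that should be
verified against the kernel tree in use) that the lookup uses the
vendor/device ID of function 00.0 on each root bus, and that two devices
under different host bridges need both bridges listed. The names
root_bus_allowed, p2p_allowed, BUS_16 and BUS_64 are hypothetical. The
layout of bus 16 is taken from the lspci output later in this thread; bus 64
is assumed to look the same.

```python
INTEL = 0x8086
REQ_SAME_HOST_BRIDGE = 1 << 0  # entry is only valid when both devices share one host bridge

# (vendor, device, flags). These are the Intel 0x20xx entries as read from the
# whitelist quoted at the top of the thread; 0x09a2 is deliberately absent.
WHITELIST = [
    (INTEL, 0x2030, 0),
    (INTEL, 0x2031, 0),
    (INTEL, 0x2032, 0),
    (INTEL, 0x2033, 0),
    (INTEL, 0x2020, 0),
]

# Root bus contents: function -> (vendor, device).
BUS_16 = {
    "00.0": (INTEL, 0x09A2),
    "00.1": (INTEL, 0x09A4),
    "00.2": (INTEL, 0x09A3),
    "00.4": (INTEL, 0x0998),
}
BUS_64 = dict(BUS_16)  # assumption: same layout as bus 16


def root_bus_allowed(bus, whitelist, same_host_bridge):
    """Look up the 00.0 function of a root bus in the whitelist (assumed rule)."""
    ident = bus.get("00.0")
    if ident is None:
        return False
    for vendor, device, flags in whitelist:
        if (vendor, device) != ident:
            continue
        if flags & REQ_SAME_HOST_BRIDGE and not same_host_bridge:
            return False
        return True
    return False


def p2p_allowed(bus_a, bus_b, whitelist):
    """Model of the fallback used when no shared upstream bridge exists."""
    if bus_a is bus_b:  # same root bus object means same host bridge
        return root_bus_allowed(bus_a, whitelist, True)
    return (root_bus_allowed(bus_a, whitelist, False)
            and root_bus_allowed(bus_b, whitelist, False))


print(p2p_allowed(BUS_16, BUS_64, WHITELIST))
print(p2p_allowed(BUS_16, BUS_64, WHITELIST + [(INTEL, 0x09A2, 0)]))
```

Under these assumptions, listing 0x0998 (the function that lspci labels
"Host bridge") would not change the result, because only the 00.0 function
is consulted. That is consistent with the suggestion to add 09a2.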




* Re: Understanding P2P DMA related errors
       [not found]   ` <CAFGSPrz2ym5oEot9gLi3Z38PWS5A_wCFM4OWk36U_RazDMR67A@mail.gmail.com>
@ 2022-10-05 15:54     ` Logan Gunthorpe
  2022-10-05 17:09       ` Bjorn Helgaas
                         ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Logan Gunthorpe @ 2022-10-05 15:54 UTC (permalink / raw)
  To: Ramesh Errabolu; +Cc: linux-pci



On 2022-10-04 22:42, Ramesh Errabolu wrote:
> Hi,
> 
> Thanks for taking a look at this. I will see if I can add 0x09A2 to the
> whitelist and see what happens. But could you clarify my reading of the
> device tree. In the tree I don't see an AMD device attached to the
> 0x09A2 device. Is that a misread on my part? Would appreciate it if you
> could shed light on this aspect.

The two AMD devices are connected to the [0000:16] and [0000:64] buses,
respectively, and both buses have an Intel 09a2 on them. I'm not sure if the
whitelist code will handle this topology; you may need to make more
substantial changes to handle it. If adding the device to the whitelist
doesn't work, you can try disabling the check. If that all works, then we'll
need to somehow add support for this topology.

Logan
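
One way to see which root bus a device hangs off, without reading the
"lspci -tv" tree by eye, is to resolve the device's sysfs symlink, for
example with os.path.realpath("/sys/bus/pci/devices/0000:e5:00.0"). On a
typical x86 system the resolved path lists the root bus and every bridge on
the way down. The sketch below only does the string handling, so it can be
tried without the hardware. E5_PATH is reconstructed from the tree fragment
quoted earlier in the thread; GPU_67_PATH and GPU_19_PATH are hypothetical
paths modelled on that fragment, and the function names are made up for
illustration.

```python
import re

ROOT = re.compile(r"^pci([0-9a-f]{4}:[0-9a-f]{2})$")
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

# Reconstructed from the [0000:e2] tree fragment quoted in the first message.
E5_PATH = "/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0"
# Hypothetical: the two GPUs named in the error message, assuming the same shape.
GPU_67_PATH = "/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0"
GPU_19_PATH = "/sys/devices/pci0000:16/0000:16:02.0/0000:17:00.0/0000:18:00.0/0000:19:00.0"


def split_sysfs_path(resolved):
    """Return (root bus, [bridge, ..., device]) for a resolved PCI sysfs path."""
    parts = [p for p in resolved.split("/") if p]
    for i, part in enumerate(parts):
        m = ROOT.match(part)
        if m:
            chain = [p for p in parts[i + 1:] if BDF.match(p)]
            return m.group(1), chain
    raise ValueError("no pciDDDD:BB component in " + resolved)


def shared_upstream_bridge(path_a, path_b):
    """Return the deepest bridge above both devices, or None if there is none."""
    root_a, chain_a = split_sysfs_path(path_a)
    root_b, chain_b = split_sysfs_path(path_b)
    if root_a != root_b:
        return None
    shared = None
    # Drop the last element of each chain: that is the device itself.
    for a, b in zip(chain_a[:-1], chain_b[:-1]):
        if a != b:
            break
        shared = a
    return shared


print(split_sysfs_path(E5_PATH))
print(shared_upstream_bridge(GPU_67_PATH, GPU_19_PATH))
```

With paths like the hypothetical ones above, the two GPUs sit under
different root buses and share no upstream bridge, which is the situation
the error message describes and the reason the host-bridge whitelist comes
into play.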


* Re: Understanding P2P DMA related errors
  2022-10-05 15:54     ` Logan Gunthorpe
@ 2022-10-05 17:09       ` Bjorn Helgaas
  2022-10-06  2:56       ` Ramesh Errabolu
  2022-10-06 19:33       ` Ramesh Errabolu
  2 siblings, 0 replies; 6+ messages in thread
From: Bjorn Helgaas @ 2022-10-05 17:09 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Ramesh Errabolu, linux-pci

On Wed, Oct 05, 2022 at 09:54:50AM -0600, Logan Gunthorpe wrote:
> On 2022-10-04 22:42, Ramesh Errabolu wrote:
> > Hi,
> > 
> > Thanks for taking a look at this. I will see if I can add 0x09A2 to the
> > whitelist and see what happens. But could you clarify my reading of the
> > device tree. In the tree I don't see an AMD device attached to the
> > 0x09A2 device. Is that a misread on my part? Would appreciate it if you
> > could shed light on this aspect.

Ramesh, just FYI, your emails aren't making it to the mailing list,
probably because they're "too fancy," e.g., they are multi-part or
contain HTML.  See http://vger.kernel.org/majordomo-info.html

You can see the effect at
https://lore.kernel.org/all/5d3b257a-c125-fdd6-e29f-229e54679f45@deltatee.com/
The archive contains Logan's responses, but not the emails from you
that Logan is responding to.

Bjorn


* Understanding P2P DMA related errors
  2022-10-05 15:54     ` Logan Gunthorpe
  2022-10-05 17:09       ` Bjorn Helgaas
@ 2022-10-06  2:56       ` Ramesh Errabolu
  2022-10-06 15:26         ` Logan Gunthorpe
  2022-10-06 19:33       ` Ramesh Errabolu
  2 siblings, 1 reply; 6+ messages in thread
From: Ramesh Errabolu @ 2022-10-06  2:56 UTC (permalink / raw)
  To: logang; +Cc: linux-pci, ramesh.errabolu


Logan,

You are right about AMD devices connecting to buses [0000:16] and [0000:64].
However I am unable to understand as to how you extend that to mean they
belong to Intel 0x09A2.

Per my understanding I am expecting Root Complex enumerated as a device,
with various other devices hanging off one or more ports/buses. In the
PCIe device tree, I don't see that.

I see the [domain::bus] as the root of the AMD device. Furthermore I see
Intel devices 0x09A2 hanging off the same domain::bus. I will take your
word, but the way the root complex is reported could be less confusing.

If I could make a request, it would be very helpful for folks who don't
dabble in this area to have a simple cheat sheet plus a write-up explaining,
with examples, the various root complexes and the various endpoints hanging
off of them.

Let me know if I could help in this effort.

Regards,
Ramesh



* Re: Understanding P2P DMA related errors
  2022-10-06  2:56       ` Ramesh Errabolu
@ 2022-10-06 15:26         ` Logan Gunthorpe
  0 siblings, 0 replies; 6+ messages in thread
From: Logan Gunthorpe @ 2022-10-06 15:26 UTC (permalink / raw)
  To: Ramesh Errabolu; +Cc: linux-pci, ramesh.errabolu




On 2022-10-05 20:56, Ramesh Errabolu wrote:
> 
> Logan,
> 
> You are right about AMD devices connecting to buses [0000:16] and [0000:64].
> However I am unable to understand as to how you extend that to mean they
> belong to Intel 0x09A2.

Well, the root bus in your tree is 09A2, and the 16 and 64 buses each
have a 09A2. So my guess is that 09A2 is the root complex; it just shows
up multiple times.

> Per my understanding I am expecting Root Complex enumerated as a device,
> with various other devices hanging off one or more ports/buses. In the
> PCIe device tree, I don't see that.
> 
> I see the [domain::bus] as the root of the AMD device. Furthermore I see
> Intel devices 0x09A2 hanging off the same domain::bus. I will take your
> word, but the way the root complex is reported could be less confusing.

Yup. Like I said, this is a bit strange. 

> If I could make a request, it would be very helpful for folks who don't
> dabble in this area to have a simple cheat sheet plus a write-up explaining,
> with examples, the various root complexes and the various endpoints hanging
> off of them.

I don't really know any more than you do here. You'd have to ask Intel what
their newer topologies imply. They keep coming up with new ways to organize
things and it's not clear what it means from a P2P perspective.

But really what needs to happen is to verify P2PDMA works between ports and
find a way for the whitelist code to accept it if it does.

Logan


* Understanding P2P DMA related errors
  2022-10-05 15:54     ` Logan Gunthorpe
  2022-10-05 17:09       ` Bjorn Helgaas
  2022-10-06  2:56       ` Ramesh Errabolu
@ 2022-10-06 19:33       ` Ramesh Errabolu
  2 siblings, 0 replies; 6+ messages in thread
From: Ramesh Errabolu @ 2022-10-06 19:33 UTC (permalink / raw)
  To: logang; +Cc: linux-pci, ramesh.errabolu


Logan,

Wanted to thank you for all the time you have given me. I now understand
the PCIe device tree better. The thing that was throwing me off is the
way I was looking at the "lspci -tv" output.

I realized that Root Complex should be understood to mean a set of special
devices hanging off a bus. The most important member of this set is the device
that acts as the "HOST BRIDGE". This becomes apparent when the device tree is
sketched out on a piece of paper. One can then circumscribe this set to form
a logical device: the "ROOT COMPLEX". I wish I could share my sketch via email.

Including below lspci output of this device set:

    localhost:~ # 
    localhost:~ # lspci -vs 0000:16:00.0
    16:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
        Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, NUMA node 0
        Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00

    localhost:~ # 
    localhost:~ # lspci -vs 0000:16:00.1
    16:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
        Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, NUMA node 0
        Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00

    localhost:~ # 
    localhost:~ # lspci -vs 0000:16:00.2
    16:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
        Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, NUMA node 0
        Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00

    localhost:~ # 
    localhost:~ # lspci -vs 0000:16:00.4
    16:00.4 Host bridge: Intel Corporation Device 0998
        Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, NUMA node 0
        Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00

    localhost:~ # 
    localhost:~ # 

In the above log one can see the device 0000:16:00.4 playing the role of
"HOST BRIDGE" while the remaining devices 0000:16:00.0/1/2 play the role
of devices that act as ENDPOINT. I suspect, but am not sure, that these
devices play a role in inter-root-complex transactions. If so, the whitelist
should include all of these devices to support P2P traffic.

Interestingly, in the one patch I could find, only 0x09A2 is added. Perhaps
the intention was to support only those systems that have 0x09A2 and not
the other devices, such as 0x09A3 and 0x09A4, which may be newer.

Regards,
Ramesh
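
To compare the functions on a root bus side by side, the "lspci -v" output
can be reduced to a small table. The following Python sketch parses text in
the format shown in the log above; it handles only the header and
"Capabilities:" lines that appear there and is not a general lspci parser.
The function name parse_lspci and the record fields are hypothetical. The
sample text is copied from the log for 16:00.0 and 16:00.4.

```python
import re

# "16:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)"
HEADER = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?P<cls>[^:]+): "
    r"(?P<name>.*?)(?: \(rev (?P<rev>[0-9a-f]+)\))?$"
)
# lspci prints "Device xxxx" when the ID has no name in its database.
DEVICE_ID = re.compile(r"Device ([0-9a-f]{4})$")

LOG = """\
localhost:~ # lspci -vs 0000:16:00.0
16:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
    Subsystem: Intel Corporation Device 0000
    Flags: fast devsel, NUMA node 0
    Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00

localhost:~ # lspci -vs 0000:16:00.4
16:00.4 Host bridge: Intel Corporation Device 0998
    Subsystem: Intel Corporation Device 0000
    Flags: fast devsel, NUMA node 0
    Capabilities: [40] Express Root Complex Integrated Endpoint, MSI 00
"""


def parse_lspci(text):
    """Collect slot, class, device ID and the RCiEP capability per function."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            dev = DEVICE_ID.search(m.group("name"))
            records.append({
                "slot": m.group("slot"),
                "class": m.group("cls"),
                "device_id": dev.group(1) if dev else None,
                "rciep": False,
            })
        elif records and line.startswith("Capabilities:"):
            if "Root Complex Integrated Endpoint" in line:
                records[-1]["rciep"] = True
    return records


for rec in parse_lspci(LOG):
    print(rec["slot"], rec["class"], rec["device_id"], rec["rciep"])
```

Laid out this way, the log shows two separate things: the function whose
class is "Host bridge" is 00.4 (0998), the function at 00.0 is 09a2, and
every function in the log reports the Root Complex Integrated Endpoint
capability.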

