* Re: [Qemu-devel] About virtio device hotplug in Q35! [External email - handle with caution]
  From: Marcel Apfelbaum @ 2017-07-26  6:21 UTC (26+ messages in thread)
  To: 陈博, alex.williamson, Michael Tsirkin; +Cc: qemu-devel

On 25/07/2017 11:53, 陈博 wrote:
> To accelerate data traversing between devices under the same PCIe Root
> Port or Switch.
>
> See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

Hi,

It may be possible, but PCIe Switch assignment may not be the only way
to go.

Adding Alex and Michael for their input on this matter.
More info at:
https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

Thanks,
Marcel

> 陈博
>
> On 24 Jul 2017, at 7:36 PM, marcel@redhat.com wrote:
>
>> On 24/07/2017 13:24, 陈博 wrote:
>>> Is there any chance we could pass through a real PCIe Root Port
>>> device into a VM?
>>
>> Hi,
>>
>> That is an interesting thought, I hadn't considered it yet. What is
>> the scenario?
>>
>> Please be sure to CC qemu-devel next time :)
>>
>> Thanks,
>> Marcel
>>
>>> 陈博
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-07-26 15:29 UTC
  To: Marcel Apfelbaum; +Cc: 陈博, Michael Tsirkin, qemu-devel

On Wed, 26 Jul 2017 09:21:38 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> On 25/07/2017 11:53, 陈博 wrote:
> > To accelerate data traversing between devices under the same PCIe Root
> > Port or Switch.
> >
> > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
>
> Hi,
>
> It may be possible, but maybe PCIe Switch assignment is not
> the only way to go.
>
> Adding Alex and Michael for their input on this matter.
> More info at:
> https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

I think you need to look at where the IOMMU is in the topology and what
address space the devices are working in when assigned to a VM to
realize that it doesn't make any sense to assign switch ports to a VM.
GPUs cannot do switch-level peer to peer when assigned because they are
operating in an I/O virtual address space.  This is why we configure
ACS on downstream ports to prevent peer to peer.  Peer-to-peer
transactions must be forwarded upstream by the switch ports in order to
reach the IOMMU for translation.  Note however that we do populate
peer-to-peer mappings within the IOMMU, so if the hardware supports it,
the IOMMU can reflect the transaction back out to the I/O bus to reach
the other device without CPU involvement.

Therefore I think the better solution, if it encourages the NVIDIA
driver to do the right thing, is to use emulated switches.  Assigning
the physical switch would really do nothing more than make the PCIe
link information more correct in the VM; everything else about the
switch would be emulated.

Even still, unless you have an I/O topology which integrates the IOMMU
into the switch itself, the data flow still needs to go all the way to
the root complex to hit the IOMMU before being reflected to the other
device.  Direct peer to peer between downstream switch ports operates
in the wrong address space.  Thanks,

Alex
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Michael S. Tsirkin @ 2017-07-26 16:06 UTC
  To: Alex Williamson; +Cc: Marcel Apfelbaum, 陈博, qemu-devel

On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> On Wed, 26 Jul 2017 09:21:38 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
> > On 25/07/2017 11:53, 陈博 wrote:
> > > To accelerate data traversing between devices under the same PCIe Root
> > > Port or Switch.
> > >
> > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> >
> > Hi,
> >
> > It may be possible, but maybe PCIe Switch assignment is not
> > the only way to go.
> >
> > Adding Alex and Michael for their input on this matter.
> > More info at:
> > https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
>
> I think you need to look at where the IOMMU is in the topology and what
> address space the devices are working in when assigned to a VM to
> realize that it doesn't make any sense to assign switch ports to a VM.
> GPUs cannot do switch-level peer to peer when assigned because they are
> operating in an I/O virtual address space.  This is why we configure
> ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> transactions must be forwarded upstream by the switch ports in order to
> reach the IOMMU for translation.  Note however that we do populate
> peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> the IOMMU can reflect the transaction back out to the I/O bus to reach
> the other device without CPU involvement.
>
> Therefore I think the better solution, if it encourages the NVIDIA
> driver to do the right thing, is to use emulated switches.  Assigning
> the physical switch would really do nothing more than make the PCIe
> link information more correct in the VM; everything else about the
> switch would be emulated.  Even still, unless you have an I/O topology
> which integrates the IOMMU into the switch itself, the data flow still
> needs to go all the way to the root complex to hit the IOMMU before
> being reflected to the other device.  Direct peer to peer between
> downstream switch ports operates in the wrong address space.  Thanks,
>
> Alex

That's true of course.  What would make sense would be for hardware
vendors to add ATS support to their cards.

Then peer to peer should be allowed by the hypervisor for translated
transactions.

That gives you the performance benefit without the security issues.

Does anyone know whether any hardware implements this?

Of course that would mostly be transparent to guests, so you would
still use an emulated switch as Alex suggested.

--
MST
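One way to answer that question for a specific card is to look for the ATS
extended capability from the host. A quick sketch; the BDF below is a
placeholder for an actual GPU, not a device from this thread:

```shell
# Check whether a device advertises the Address Translation Service
# extended capability. 0000:05:00.0 is a placeholder BDF.
gpu=0000:05:00.0
sudo lspci -s "$gpu" -vvv | grep -A2 'Address Translation Service'
# No output means the device has no ATS capability; otherwise lspci
# decodes the ATSCap/ATSCtl registers, including the Enable bit.
```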
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-07-26 17:32 UTC
  To: Michael S. Tsirkin; +Cc: Marcel Apfelbaum, 陈博, qemu-devel

On Wed, 26 Jul 2017 19:06:58 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > I think you need to look at where the IOMMU is in the topology and what
> > address space the devices are working in when assigned to a VM to
> > realize that it doesn't make any sense to assign switch ports to a VM.
> > GPUs cannot do switch-level peer to peer when assigned because they are
> > operating in an I/O virtual address space.  This is why we configure
> > ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> > transactions must be forwarded upstream by the switch ports in order to
> > reach the IOMMU for translation.  Note however that we do populate
> > peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> > the IOMMU can reflect the transaction back out to the I/O bus to reach
> > the other device without CPU involvement.
> >
> > Therefore I think the better solution, if it encourages the NVIDIA
> > driver to do the right thing, is to use emulated switches.  Assigning
> > the physical switch would really do nothing more than make the PCIe
> > link information more correct in the VM; everything else about the
> > switch would be emulated.  Even still, unless you have an I/O topology
> > which integrates the IOMMU into the switch itself, the data flow still
> > needs to go all the way to the root complex to hit the IOMMU before
> > being reflected to the other device.  Direct peer to peer between
> > downstream switch ports operates in the wrong address space.  Thanks,
> >
> > Alex
>
> That's true of course.  What would make sense would be for hardware
> vendors to add ATS support to their cards.
>
> Then peer to peer should be allowed by the hypervisor for translated
> transactions.
>
> Gives you the performance benefit without the security issues.
>
> Does anyone know whether any hardware implements this?

GPUs often do implement ATS, and the ACS DT (Direct Translated P2P)
capability should handle routing requests with the Address Type field
indicating a translated address directly between downstream ports.  DT
is however not part of the standard set of ACS bits that we enable.  It
seems like it might be fairly easy to poke the DT enable bit with
setpci from userspace to test whether this "just works", providing of
course you can get the driver to attempt to do peer to peer and ATS is
already functioning on the GPU.  If so, then we should look at where
in the code to do that enabling automatically.  Thanks,

Alex
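That userspace experiment might look like the following sketch. The BDF is a
placeholder for an actual downstream switch port, the `ECAP_ACS` register name
assumes a pciutils build that knows the ACS extended capability, and the DT
enable bit position (bit 6 of ACS Control, offset +6 in the capability) is
taken from the PCIe spec:

```shell
# Placeholder BDF of the downstream port whose ACS control we want to poke.
port=0000:03:08.0

# Show the current ACS state as decoded by lspci.
sudo lspci -s "$port" -vvv | grep -A3 'Access Control Services'

# ACS Control is at offset +6 in the capability; Direct Translated P2P
# enable is bit 6 (mask 0x0040). Read, OR in the bit, write back.
val=$(sudo setpci -s "$port" ECAP_ACS+6.w)
sudo setpci -s "$port" ECAP_ACS+6.w=$(printf '%04x' $(( 0x$val | 0x40 )))
```

This only changes the host-side routing policy for translated requests; the
GPU driver still has to attempt peer to peer with ATS active for it to matter.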
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Bob Chen @ 2017-08-01  5:04 UTC
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Hi,

This is a sketch of my hardware topology.

             CPU0  <- QPI ->  CPU1
              |                 |
   Root Port (at PCIe.0)   Root Port (at PCIe.1)
       /         \             /         \
   Switch      Switch      Switch      Switch
    /  \        /  \        /  \        /  \
  GPU  GPU   GPU  GPU    GPU  GPU    GPU  GPU

And below are the p2p bandwidth test results.

Host:
   D\D     0      1      2      3      4      5      6      7
     0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
     1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
     2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
     3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
     4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
     5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
     6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
     7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15

VM:
   D\D     0      1      2      3      4      5      6      7
     0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
     1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
     2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
     3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
     4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
     5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
     6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
     7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23

In the VM, the bandwidth between two GPUs under the same physical
switch is obviously lower, as per the reasons you gave in the earlier
threads.

But what confused me most is that GPUs under different switches could
achieve the same speed as in the host. Does that mean that after IOMMU
address translation, data traversal utilizes the QPI bus by default,
even though the two devices do not belong to the same PCIe hierarchy?

In a word, I'm trying to build a massive deep-learning/HPC
infrastructure for the cloud environment. Nvidia itself released a
solution based on Docker, and I believe QEMU/VMs could also do it.
Hopefully I could get some help from the community.

The emulated switch you suggested looks like a good option to me, I
will give it a try.

Thanks,
Bob

2017-07-27 1:32 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Wed, 26 Jul 2017 19:06:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > > On Wed, 26 Jul 2017 09:21:38 +0300
> > > Marcel Apfelbaum <marcel@redhat.com> wrote:
> > >
> > > > On 25/07/2017 11:53, 陈博 wrote:
> > > > > To accelerate data traversing between devices under the same
> > > > > PCIe Root Port or Switch.
> > > > >
> > > > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > > >
> > > > Hi,
> > > >
> > > > It may be possible, but maybe PCIe Switch assignment is not
> > > > the only way to go.
> > > >
> > > > Adding Alex and Michael for their input on this matter.
> > > > More info at:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > >
> > > I think you need to look at where the IOMMU is in the topology and what
> > > address space the devices are working in when assigned to a VM to
> > > realize that it doesn't make any sense to assign switch ports to a VM.
> > > GPUs cannot do switch-level peer to peer when assigned because they are
> > > operating in an I/O virtual address space.  This is why we configure
> > > ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> > > transactions must be forwarded upstream by the switch ports in order to
> > > reach the IOMMU for translation.  Note however that we do populate
> > > peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> > > the IOMMU can reflect the transaction back out to the I/O bus to reach
> > > the other device without CPU involvement.
> > >
> > > Therefore I think the better solution, if it encourages the NVIDIA
> > > driver to do the right thing, is to use emulated switches.  Assigning
> > > the physical switch would really do nothing more than make the PCIe
> > > link information more correct in the VM; everything else about the
> > > switch would be emulated.  Even still, unless you have an I/O topology
> > > which integrates the IOMMU into the switch itself, the data flow still
> > > needs to go all the way to the root complex to hit the IOMMU before
> > > being reflected to the other device.  Direct peer to peer between
> > > downstream switch ports operates in the wrong address space.  Thanks,
> > >
> > > Alex
> >
> > That's true of course.  What would make sense would be for hardware
> > vendors to add ATS support to their cards.
> >
> > Then peer to peer should be allowed by the hypervisor for translated
> > transactions.
> >
> > Gives you the performance benefit without the security issues.
> >
> > Does anyone know whether any hardware implements this?
>
> GPUs often do implement ATS and the ACS DT (Direct Translated P2P)
> capability should handle routing requests with the Address Type field
> indicating a translated address directly between downstream ports.  DT
> is however not part of the standard set of ACS bits that we enable.  It
> seems like it might be fairly easy to poke the DT enable bit with
> setpci from userspace to test whether this "just works", providing of
> course you can get the driver to attempt to do peer to peer and ATS is
> already functioning on the GPU.  If so, then we should look at where
> in the code to do that enabling automatically.  Thanks,
>
> Alex
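The emulated-switch topology discussed above can be sketched as a QEMU command
line. This is only an illustration under assumptions: it uses QEMU's emulated
ioh3420 root port and TI X3130 switch devices (current in the QEMU of that
era), and the host BDFs are placeholders for GPUs already bound to vfio-pci:

```shell
# Q35 guest with an emulated PCIe switch; two assigned GPUs sit under the
# emulated downstream ports, so the guest sees them as switch siblings.
# Host BDFs 0000:05:00.0 / 0000:06:00.0 are placeholders.
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -m 16G \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device x3130-upstream,id=sw1,bus=rp1 \
  -device xio3130-downstream,id=sw1dp0,bus=sw1,chassis=2,slot=0 \
  -device xio3130-downstream,id=sw1dp1,bus=sw1,chassis=2,slot=1 \
  -device vfio-pci,host=0000:05:00.0,bus=sw1dp0 \
  -device vfio-pci,host=0000:06:00.0,bus=sw1dp1
```

The point of the exercise is purely topological: a driver that keys its peer
to peer behavior on the PCI hierarchy sees both GPUs behind one switch, while
the actual DMA routing is still governed by the physical ACS/IOMMU setup.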
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-08-01  5:46 UTC
  To: Bob Chen; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 1 Aug 2017 13:04:46 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Hi,
>
> This is a sketch of my hardware topology.
>
>              CPU0  <- QPI ->  CPU1
>               |                 |
>    Root Port (at PCIe.0)   Root Port (at PCIe.1)
>        /         \             /         \

Are each of these lines above separate root ports?  ie. each root
complex hosts two root ports, each with a two-port switch downstream of
it?

>    Switch      Switch      Switch      Switch
>     /  \        /  \        /  \        /  \
>   GPU  GPU   GPU  GPU    GPU  GPU    GPU  GPU
>
> And below are the p2p bandwidth test results.
>
> Host:
>    D\D     0      1      2      3      4      5      6      7
>      0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
>      1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
>      2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
>      3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
>      4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
>      5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
>      6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
>      7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
>
> VM:
>    D\D     0      1      2      3      4      5      6      7
>      0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
>      1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
>      2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
>      3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
>      4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
>      5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
>      6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
>      7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23

Interesting test, how do you get these numbers?  What are the units,
GB/s?

> In the VM, the bandwidth between two GPUs under the same physical switch
> is obviously lower, as per the reasons you said in former threads.

Hmm, I'm not sure I can explain why the number is lower than to more
remote GPUs though.  Is the test simultaneously reading and writing and
therefore we overload the link to the upstream switch port?  Otherwise
I'd expect the bidirectional support in PCIe to be able to handle the
bandwidth.  Does the test have a read-only or write-only mode?

> But what confused me most is that GPUs under different switches could
> achieve the same speed, as well as in the Host. Does that mean after
> IOMMU address translation, data traversing has utilized QPI bus by
> default? Even these two devices do not belong to the same PCIe bus?

Yes, of course.  Once the transaction is translated by the IOMMU it's
just a matter of routing the resulting address, whether that's back
down the I/O hierarchy under the same root complex or across the QPI
link to the other root complex.  The translated address could just as
easily be to RAM that lives on the other side of the QPI link.  Also,
it seems like the IOMMU overhead is perhaps negligible here, unless the
IOMMU is actually being used in both cases.

In the host test, is the IOMMU still enabled?  The routing of PCIe
transactions is going to be governed by ACS, which Linux enables
whenever the IOMMU is enabled, not just when a device is assigned to a
VM.  It would be interesting to see if another performance tier is
exposed if the IOMMU is entirely disabled, or perhaps it might better
expose the overhead of the IOMMU translation.  It would also be
interesting to see the ACS settings in lspci for each downstream port
for each test.  Thanks,

Alex
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Bob Chen @ 2017-08-01  9:35 UTC
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 13:04:46 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > Hi,
> >
> > This is a sketch of my hardware topology.
> >
> >              CPU0  <- QPI ->  CPU1
> >               |                 |
> >    Root Port (at PCIe.0)   Root Port (at PCIe.1)
> >        /         \             /         \
>
> Are each of these lines above separate root ports?  ie. each root
> complex hosts two root ports, each with a two-port switch downstream of
> it?

Not quite sure if root complex is a concept or a real physical device ...

But according to my observation with `lspci -vt`, there are indeed 4
Root Ports in the system. So the sketch needs a tiny update.

             CPU0  <- QPI ->  CPU1
              |                 |
  Root Complex (device?)   Root Complex (device?)
      /        \               /        \
  Root Port  Root Port    Root Port  Root Port
      |          |            |          |
   Switch     Switch       Switch     Switch
    /  \       /  \         /  \       /  \
  GPU  GPU   GPU  GPU     GPU  GPU   GPU  GPU

> > Switch      Switch      Switch      Switch
> >  /  \        /  \        /  \        /  \
> > GPU GPU    GPU GPU     GPU GPU     GPU GPU
> >
> > And below are the p2p bandwidth test results.
> >
> > Host:
> >    D\D     0      1      2      3      4      5      6      7
> >      0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
> >      1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
> >      2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
> >      3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
> >      4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
> >      5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
> >      6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
> >      7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
> >
> > VM:
> >    D\D     0      1      2      3      4      5      6      7
> >      0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
> >      1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
> >      2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
> >      3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
> >      4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
> >      5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
> >      6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
> >      7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23
>
> Interesting test, how do you get these numbers?  What are the units,
> GB/s?

The p2pBandwidthLatencyTest from the Nvidia CUDA sample code. Units are
GB/s. Asynchronous read and write, bidirectional.

However, the unidirectional test shows a different result: the sibling
bandwidth does not fall to half.

VM:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
     1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
     2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
     3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
     4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
     5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
     6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
     7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75

Host:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
     1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
     2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
     3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
     4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
     5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
     6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
     7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13

> > In the VM, the bandwidth between two GPUs under the same physical
> > switch is obviously lower, as per the reasons you said in former
> > threads.
>
> Hmm, I'm not sure I can explain why the number is lower than to more
> remote GPUs though.  Is the test simultaneously reading and writing and
> therefore we overload the link to the upstream switch port?  Otherwise
> I'd expect the bidirectional support in PCIe to be able to handle the
> bandwidth.  Does the test have a read-only or write-only mode?
>
> > But what confused me most is that GPUs under different switches could
> > achieve the same speed, as well as in the Host. Does that mean after
> > IOMMU address translation, data traversing has utilized QPI bus by
> > default? Even these two devices do not belong to the same PCIe bus?
>
> Yes, of course.  Once the transaction is translated by the IOMMU it's
> just a matter of routing the resulting address, whether that's back
> down the I/O hierarchy under the same root complex or across the QPI
> link to the other root complex.  The translated address could just as
> easily be to RAM that lives on the other side of the QPI link.  Also,
> it seems like the IOMMU overhead is perhaps negligible here, unless the
> IOMMU is actually being used in both cases.

Yes, the bandwidth overhead is negligible, but the latency is not as
good as we expected. I assume IOMMU address translation is to blame.

I ran this twice with the IOMMU on/off on the host; the results were
the same.

VM:
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7
     0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
     1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
     2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
     3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
     4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
     5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
     6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
     7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46

Host:
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7
     0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
     1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
     2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
     3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
     4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
     5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
     6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
     7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56

> In the host test, is the IOMMU still enabled?  The routing of PCIe
> transactions is going to be governed by ACS, which Linux enables
> whenever the IOMMU is enabled, not just when a device is assigned to a
> VM.  It would be interesting to see if another performance tier is
> exposed if the IOMMU is entirely disabled, or perhaps it might better
> expose the overhead of the IOMMU translation.  It would also be
> interesting to see the ACS settings in lspci for each downstream port
> for each test.  Thanks,
>
> Alex

How do I display a GPU's ACS settings? Like this?

[420 v2] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Michael S. Tsirkin @ 2017-08-01 14:39 UTC
  To: Bob Chen; +Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, Aug 01, 2017 at 05:35:40PM +0800, Bob Chen wrote:
> How to display GPU's ACS settings? Like this?
>
> [420 v2] Advanced Error Reporting
>     UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

Right, but that's AER.  You want ACS (Access Control Services).

--
MST
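A quick way to dump the ACS capability (rather than AER) on every port in the
system; a sketch, assuming pciutils is installed (class 0604 is the
PCI-to-PCI bridge class, which covers root ports and switch up/downstream
ports):

```shell
# Print the Access Control Services capability for every PCI bridge.
for dev in $(lspci -D -d ::0604 | awk '{print $1}'); do
  echo "== $dev =="
  sudo lspci -s "$dev" -vvv | grep -A2 'Access Control Services' \
    || echo "   (no ACS capability)"
done
```

In the decoded output, ACSCap lists what the port supports and ACSCtl what is
currently enabled (SrcValid, TransBlk, ReqRedir, CmpltRedir, UpstreamFwd,
EgressCtrl, DirectTrans).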
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-01 9:35 ` Bob Chen 2017-08-01 14:39 ` Michael S. Tsirkin @ 2017-08-01 15:01 ` Alex Williamson 2017-08-07 13:00 ` Bob Chen 2017-08-07 13:04 ` Bob Chen 1 sibling, 2 replies; 26+ messages in thread From: Alex Williamson @ 2017-08-01 15:01 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Tue, 1 Aug 2017 17:35:40 +0800 Bob Chen <a175818323@gmail.com> wrote: > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>: > > > On Tue, 1 Aug 2017 13:04:46 +0800 > > Bob Chen <a175818323@gmail.com> wrote: > > > > > Hi, > > > > > > This is a sketch of my hardware topology. > > > > > > CPU0 <- QPI -> CPU1 > > > | | > > > Root Port(at PCIe.0) Root Port(at PCIe.1) > > > / \ / \ > > > > Are each of these lines above separate root ports? ie. each root > > complex hosts two root ports, each with a two-port switch downstream of > > it? > > > > Not quite sure if root complex is a concept or a real physical device ... > > But according to my observation by `lspci -vt`, there are indeed 4 Root > Ports in the system. So the sketch might need a tiny update. > > > CPU0 <- QPI -> CPU1 > > | | > > Root Complex(device?) Root Complex(device?) > > / \ / \ > > Root Port Root Port Root Port Root Port > > / \ / \ > > Switch Switch Switch Switch > > / \ / \ / \ / \ > > GPU GPU GPU GPU GPU GPU GPU GPU Yes, that's what I expected. So the numbers make sense, the immediate sibling GPU would share bandwidth between the root port and upstream switch port, any other GPU should not double-up on any single link. > > > Switch Switch Switch Switch > > > / \ / \ / \ / \ > > > GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > > > > And below are the p2p bandwidth test results. 
> > > > > > Host: > > > D\D 0 1 2 3 4 5 6 7 > > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > > > VM: > > > D\D 0 1 2 3 4 5 6 7 > > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > > > Interesting test, how do you get these numbers? What are the units, > > GB/s? > > > > > > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are > GB/s. Asynchronous read and write. Bidirectional. > > However, the Unidirectional test had shown a different result. Didn't fall > down to a half. 
> VM:
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3      4      5      6      7
>      0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
>      1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
>      2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
>      3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
>      4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
>      5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
>      6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
>      7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75
>
> Host:
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3      4      5      6      7
>      0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
>      1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
>      2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
>      3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
>      4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
>      5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
>      6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
>      7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13

Looks right, a unidirectional test would create bidirectional data
flows on the root port to upstream switch link and should be able to
saturate that link. With the bidirectional test, that link becomes a
bottleneck.

> > > In the VM, the bandwidth between two GPUs under the same physical
> > > switch is obviously lower, as per the reasons you said in former
> > > threads.
> >
> > Hmm, I'm not sure I can explain why the number is lower than to more
> > remote GPUs though. Is the test simultaneously reading and writing and
> > therefore we overload the link to the upstream switch port? Otherwise
> > I'd expect the bidirectional support in PCIe to be able to handle the
> > bandwidth. Does the test have a read-only or write-only mode?
> >
> > > But what confused me most is that GPUs under different switches could
> > > achieve the same speed, as well as in the Host. Does that mean after
> > > IOMMU address translation, data traversing has utilized QPI bus by
> > > default? Even these two devices do not belong to the same PCIe bus?
> >
> > Yes, of course. Once the transaction is translated by the IOMMU it's
> > just a matter of routing the resulting address, whether that's back
> > down the I/O hierarchy under the same root complex or across the QPI
> > link to the other root complex. The translated address could just as
> > easily be to RAM that lives on the other side of the QPI link. Also, it
> > seems like the IOMMU overhead is perhaps negligible here, unless the
> > IOMMU is actually being used in both cases.
>
> Yes, the overhead of bandwidth is negligible, but the latency is not as
> good as we expected. I assume it is IOMMU address translation to blame.
>
> I ran this twice with IOMMU on/off on Host, the results were the same.
>
> VM:
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3      4      5      6      7
>      0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
>      1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
>      2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
>      3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
>      4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
>      5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
>      6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
>      7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46
>
> Host:
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3      4      5      6      7
>      0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
>      1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
>      2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
>      3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
>      4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
>      5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
>      6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
>      7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56

Yes, the IOMMU is not free, page table walks are occurring here. Are
you using 1G pages for the VM? 2M? Does this platform support 1G
super pages on the IOMMU? (cat /sys/class/iommu/*/intel-iommu/cap, bit
34 is 2MB page support, bit 35 is 1G).
All modern Xeons should support 1G so you'll want to use 1G hugepages
in the VM to take advantage of that.

> > In the host test, is the IOMMU still enabled? The routing of PCIe
> > transactions is going to be governed by ACS, which Linux enables
> > whenever the IOMMU is enabled, not just when a device is assigned to a
> > VM. It would be interesting to see if another performance tier is
> > exposed if the IOMMU is entirely disabled, or perhaps it might better
> > expose the overhead of the IOMMU translation. It would also be
> > interesting to see the ACS settings in lspci for each downstream port
> > for each test. Thanks,
> >
> > Alex
>
> How to display GPU's ACS settings? Like this?
>
> [420 v2] Advanced Error Reporting
> 	UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 	UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 	UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

As Michael notes, this is AER, ACS is Access Control Services. It
should be another capability in lspci. Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
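Alex's capability check above can be scripted. A minimal sketch, where the register value is a typical sample for illustration, not one read from the machine in this thread:

```shell
# Decode the superpage bits of an Intel IOMMU capability register.
# Bit 34 = 2MB superpage support, bit 35 = 1GB superpage support.
# On a live system the value comes from /sys/class/iommu/*/intel-iommu/cap
# (the file prints bare hex, so prefix it with 0x before using it here).
cap=0xd2078c106f0466   # sample register value for illustration only
echo "2M superpages: $(( (cap >> 34) & 1 ))"
echo "1G superpages: $(( (cap >> 35) & 1 ))"
```

With this sample value both bits are set, so the output is `2M superpages: 1` and `1G superpages: 1`; a zero would mean the IOMMU cannot map that superpage size and guest hugepages would still be walked at a smaller granularity.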
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-01 15:01   ` Alex Williamson
@ 2017-08-07 13:00     ` Bob Chen
  2017-08-07 15:52       ` Alex Williamson
  2017-08-07 13:04     ` Bob Chen
  1 sibling, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-07 13:00 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Bad news... The performance had dropped dramatically when using emulated
switches.

I was referring to the PCIe doc at
https://github.com/qemu/qemu/blob/master/docs/pcie.txt

# qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine q35,accel=kvm -nodefaults -nodefconfig \
-device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
-device x3130-upstream,id=upstream_port1,bus=root_port1 \
-device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
-device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
-device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
-device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
-device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
-device x3130-upstream,id=upstream_port2,bus=root_port2 \
-device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
-device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
-device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
-device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
...

Not 8 GPUs this time, only 4.

*1. Attached to pcie bus directly (former situation):*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.93  10.03  11.07  11.09
     1  10.04 425.05  11.08  10.97
     2  11.17  11.17 425.07  10.07
     3  11.25  11.25  10.07 423.64
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 425.98  10.03  11.07  11.09
     1   9.99 426.43  11.07  11.07
     2  11.04  11.20 425.98   9.89
     3  11.21  11.21  10.06 425.97
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 430.67  10.45  19.59  19.58
     1  10.44 428.81  19.49  19.53
     2  19.62  19.62 429.52  10.57
     3  19.60  19.66  10.43 427.38
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 429.47  10.47  19.52  19.39
     1  10.48 427.15  19.64  19.52
     2  19.64  19.59 429.02  10.42
     3  19.60  19.64  10.47 427.81
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.50  13.72  14.49  14.44
     1  13.65   4.53  14.52  14.33
     2  14.22  13.82   4.52  14.50
     3  13.87  13.75  14.53   4.55
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.44  13.56  14.58  14.45
     1  13.56   4.48  14.39  14.45
     2  13.85  13.93   4.86  14.80
     3  14.51  14.23  14.70   4.72

*2. Attached to emulated Root Port and Switches:*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.48   3.15   3.12   3.12
     1   3.13 422.31   3.12   3.12
     2   3.08   3.09 421.40   3.13
     3   3.10   3.10   3.13 418.68
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 418.68   3.14   3.12   3.12
     1   3.15 420.03   3.12   3.12
     2   3.11   3.10 421.39   3.14
     3   3.11   3.08   3.13 419.13
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 424.36   5.36   5.35   5.34
     1   5.36 424.36   5.34   5.34
     2   5.35   5.36 425.52   5.35
     3   5.36   5.36   5.34 425.29
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 422.98   5.35   5.35   5.35
     1   5.35 423.44   5.34   5.33
     2   5.35   5.35 425.29   5.35
     3   5.35   5.34   5.34 423.21
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.79  16.59  16.38  16.22
     1  16.62   4.77  16.35  16.69
     2  16.77  16.66   4.03  16.68
     3  16.54  16.56  16.78   4.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.51  16.56  16.58  16.66
     1  15.65   3.87  16.74  16.61
     2  16.59  16.81   3.96  16.70
     3  16.47  16.28  16.68   4.03

Is it because the heavy load of CPU emulation had caused a bottleneck?

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >          CPU0  <- QPI ->  CPU1
> > > >           |                |
> > > > Root Port(at PCIe.0)  Root Port(at PCIe.1)
> > > >        /    \              /    \
> > >
> > > Are each of these lines above separate root ports? ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
> >
> > Not quite sure if root complex is a concept or a real physical device ...
> >
> > But according to my observation by `lspci -vt`, there are indeed 4 Root
> > Ports in the system. So the sketch might need a tiny update.
> >              CPU0  <- QPI ->  CPU1
> >               |                 |
> >  Root Complex(device?)   Root Complex(device?)
> >      /         \             /         \
> > Root Port  Root Port    Root Port  Root Port
> >     |          |            |          |
> >  Switch     Switch       Switch     Switch
> >   /  \       /  \         /  \       /  \
> > GPU  GPU   GPU  GPU     GPU  GPU   GPU  GPU
>
> Yes, that's what I expected. So the numbers make sense, the immediate
> sibling GPU would share bandwidth between the root port and upstream
> switch port, any other GPU should not double-up on any single link.
>
> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
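For reference, the 1G-hugepage suggestion quoted earlier in this message would translate into roughly the following host setup. This is a sketch: the page count, mount point, and guest memory size are illustrative, not values taken from the system in this thread, and it assumes a CPU/kernel with 1G hugepage support (`pdpe1gb` in /proc/cpuinfo):

```shell
# Reserve sixteen 1G hugepages and expose them through hugetlbfs
# (requires root; numbers are illustrative).
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
mkdir -p /dev/hugepages1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G

# Then back the guest RAM with them, keeping the rest of the
# command line (the -device vfio-pci options etc.) unchanged:
#   qemu-system-x86_64 ... -m 16G -mem-path /dev/hugepages1G -mem-prealloc
```

With guest memory on 1G pages, the IOMMU can map it with 1G superpages (when the capability register advertises them), shortening the page table walks Alex points at.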
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 13:00 ` Bob Chen
@ 2017-08-07 15:52   ` Alex Williamson
  2017-08-08  1:44     ` Bob Chen
  2017-08-08 20:07     ` Michael S. Tsirkin
  0 siblings, 2 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-07 15:52 UTC (permalink / raw)
To: Bob Chen
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Mon, 7 Aug 2017 21:00:04 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Bad news... The performance had dropped dramatically when using emulated
> switches.
>
> [...]
>
> Is it because the heavy load of CPU emulation had caused a bottleneck?

QEMU should really not be involved in the data flow, once the memory
slots are configured in KVM, we really should not be exiting out to
QEMU regardless of the topology. I wonder if it has something to do
with the link speed/width advertised on the switch port. I don't think
the endpoint can actually downshift the physical link, so lspci on the
host should probably still show the full bandwidth capability, but
maybe the driver is somehow doing rate limiting. PCIe gets a little
more complicated as we go to newer versions, so it's not quite as
simple as exposing a different bit configuration to advertise 8GT/s,
x16. Last I tried to do link matching it was deemed too complicated
for something I couldn't prove at the time had measurable value. This
might be a good way to prove that value if it makes a difference here.
I can't think why else you'd see such a performance difference, but
testing to see if the KVM exit rate is significantly different could
still be an interesting verification. Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
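One quick way to chase the link-training question Alex raises is to compare the advertised (LnkCap) and negotiated (LnkSta) speed and width of each port, on the host and inside the guest. A sketch that runs against a captured lspci excerpt rather than live hardware; the sample values are illustrative:

```shell
# Compare advertised vs. negotiated PCIe link parameters.
# On a live system the input would come from: lspci -vvv -s <slot>
sample='LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L1
LnkSta: Speed 8GT/s, Width x16'
cap=$(printf '%s\n' "$sample" | grep 'LnkCap' | grep -o 'Speed [^,]*, Width x[0-9]*')
sta=$(printf '%s\n' "$sample" | grep 'LnkSta' | grep -o 'Speed [^,]*, Width x[0-9]*')
if [ "$cap" = "$sta" ]; then
    echo "link negotiated at full capability: $sta"
else
    echo "degraded link: cap=$cap sta=$sta"
fi
```

If the guest-visible downstream port reports a lower speed or width than the physical PLX port, that would support the rate-limiting theory; if both match, the bottleneck is elsewhere.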
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 15:52 ` Alex Williamson
@ 2017-08-08  1:44   ` Bob Chen
  2017-08-08  8:06     ` Bob Chen
  2017-08-08 16:53     ` Alex Williamson
  1 sibling, 2 replies; 26+ messages in thread
From: Bob Chen @ 2017-08-08 1:44 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

1. How to test the KVM exit rate?

2. The switches are separate devices of PLX Technology

# lspci -s 07:08.0 -nn
07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)

# This is one of the Root Ports in the system.
[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
          +-01.0-[01]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
          +-02.0-[02-05]--
          +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
          |                               |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
          |                               \-10.0-[09]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
          |                                            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller

3. ACS

It seemed that I had misunderstood your point? I finally found ACS
information on switches, not on GPUs.

Capabilities: [f24 v1] Access Control Services
	ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
	ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08  1:44 ` Bob Chen
@ 2017-08-08  8:06   ` Bob Chen
  2017-08-08 16:53   ` Alex Williamson
  1 sibling, 0 replies; 26+ messages in thread
From: Bob Chen @ 2017-08-08 8:06 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Plus: 1 GB hugepages neither improved bandwidth nor latency. Results
remained the same.

2017-08-08 9:44 GMT+08:00 Bob Chen <a175818323@gmail.com>:

> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08  1:44                     ` Bob Chen
  2017-08-08  8:06                       ` Bob Chen
@ 2017-08-08 16:53                       ` Alex Williamson
  1 sibling, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-08 16:53 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 8 Aug 2017 09:44:56 +0800
Bob Chen <a175818323@gmail.com> wrote:

> 1. How to test the KVM exit rate?

You can use tracing: http://www.linux-kvm.org/page/Tracing

> 2. The switches are separate devices of PLX Technology
>
> # lspci -s 07:08.0 -nn
> 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port
> PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)
>
> # This is one of the Root Ports in the system.
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
>           +-01.0-[01]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
>           +-02.0-[02-05]--
>           +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                               |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>           |                               \-10.0-[09]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                                            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>
> 3. ACS
>
> It seemed that I had misunderstood your point? I finally found ACS
> information on switches, not on GPUs.
>
> Capabilities: [f24 v1] Access Control Services
>   ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
>   ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Yes, NVIDIA uses the same PLX PEX 8747 on the switches on the cards I
have access to.  Unfortunately the endpoints in my case do not support
ATS, so the endpoint cannot generate a pre-translated address that
would take advantage of the DT capability on the switch port if we were
to enable it.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
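Building on the tracing pointer above, a minimal host-side sketch of comparing KVM exit rates between the two VM topologies. The `perf` invocation in the comment is the standard way to sample the `kvm:kvm_exit` tracepoint; the counter values below are placeholders for the arithmetic, not measured data from this thread:

```shell
# On the host, sample host-wide KVM exits over a fixed window, once per
# VM topology, e.g.:
#   perf stat -e 'kvm:kvm_exit' -a sleep 10
# If QEMU/KVM is really out of the data path, the per-second rate should
# be similar for both topologies. Placeholder counter samples:
exits_before=120000
exits_after=1800000
interval=10
echo "kvm exits/sec: $(( (exits_after - exits_before) / interval ))"
```

A large rate difference between the direct-attach and emulated-switch configurations would point at exits (and thus QEMU) in the data path; similar rates would support the link-speed theory instead.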
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-07 15:52 ` Alex Williamson 2017-08-08 1:44 ` Bob Chen @ 2017-08-08 20:07 ` Michael S. Tsirkin 2017-08-22 7:04 ` Bob Chen 1 sibling, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2017-08-08 20:07 UTC (permalink / raw) To: Alex Williamson; +Cc: Bob Chen, Marcel Apfelbaum, 陈博, qemu-devel On Mon, Aug 07, 2017 at 09:52:24AM -0600, Alex Williamson wrote: > I wonder if it has something to do > with the link speed/width advertised on the switch port. I don't think > the endpoint can actually downshift the physical link, so lspci on the > host should probably still show the full bandwidth capability, but > maybe the driver is somehow doing rate limiting. PCIe gets a little > more complicated as we go to newer versions, so it's not quite as > simple as exposing a different bit configuration to advertise 8GT/s, > x16. Last I tried to do link matching it was deemed too complicated > for something I couldn't prove at the time had measurable value. This > might be a good way to prove that value if it makes a difference here. > I can't think why else you'd see such a performance difference, but > testing to see if the KVM exit rate is significantly different could > still be an interesting verification. It might be easiest to just dust off that patch and see whether it helps. -- MST ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08 20:07                       ` Michael S. Tsirkin
@ 2017-08-22  7:04                         ` Bob Chen
  2017-08-22 16:56                           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-22  7:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2804 bytes --]

Hi,

I got a spec from Nvidia which illustrates how to enable GPU p2p in
virtualization environment. (See attached)

The key is to append the legacy pci capabilities list when setting up the
hypervisor, with a Nvidia customized capability config.

I added some hack in hw/vfio/pci.c and managed to implement that.

Then I found the GPU was able to recognize its peer, and the latency has
dropped. ✅

However the bandwidth didn't improve, but decreased instead. ❌

Any suggestions?

# p2pBandwidthLatencyTest in VM
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla M60, pciBusID: 0, pciDeviceID: 15, pciDomainID:0
Device: 1, Tesla M60, pciBusID: 0, pciDeviceID: 16, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 114.04   5.33
     1   5.42 113.91
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 113.93   4.13
     1   4.13 119.65
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 120.50   5.55
     1   5.55 134.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 135.45   5.03   # Even worse, used to be 10
     1   5.02 135.30
P2P=Disabled Latency Matrix (us)
   D\D     0      1
     0   5.74  15.61
     1  16.05   5.75
P2P=Enabled Latency Matrix (us)
   D\D     0      1
     0   5.47   8.23   # Improved, used to be 18
     1   8.06   5.46

2017-08-09 4:07 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:

> On Mon, Aug 07, 2017 at 09:52:24AM -0600, Alex Williamson wrote:
> > I wonder if it has something to do
> > with the link speed/width advertised on the switch port.  I don't think
> > the endpoint can actually downshift the physical link, so lspci on the
> > host should probably still show the full bandwidth capability, but
> > maybe the driver is somehow doing rate limiting.  PCIe gets a little
> > more complicated as we go to newer versions, so it's not quite as
> > simple as exposing a different bit configuration to advertise 8GT/s,
> > x16.  Last I tried to do link matching it was deemed too complicated
> > for something I couldn't prove at the time had measurable value.  This
> > might be a good way to prove that value if it makes a difference here.
> > I can't think why else you'd see such a performance difference, but
> > testing to see if the KVM exit rate is significantly different could
> > still be an interesting verification.
>
> It might be easiest to just dust off that patch and see whether it
> helps.
>
> --
> MST

[-- Attachment #2: NVIDIAGPUDirectwithPCIPass-ThroughVirtualization.pdf --]
[-- Type: application/pdf, Size: 349330 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 7:04 ` Bob Chen @ 2017-08-22 16:56 ` Alex Williamson 2017-08-22 18:06 ` Michael S. Tsirkin 0 siblings, 1 reply; 26+ messages in thread From: Alex Williamson @ 2017-08-22 16:56 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Tue, 22 Aug 2017 15:04:55 +0800 Bob Chen <a175818323@gmail.com> wrote: > Hi, > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > virtualization environment. (See attached) Neat, looks like we should implement a new QEMU vfio-pci option, something like nvidia-gpudirect-p2p-id=. I don't think I'd want to code the policy of where to enable it into QEMU or the kernel, so we'd push it up to management layers or users to decide. > The key is to append the legacy pci capabilities list when setting up the > hypervisor, with a Nvidia customized capability config. > > I added some hack in hw/vfio/pci.c and managed to implement that. > > Then I found the GPU was able to recognize its peer, and the latency has > dropped. ✅ > > However the bandwidth didn't improve, but decreased instead. ❌ > > Any suggestions? What's the VM topology? I've found that in a Q35 configuration with GPUs downstream of an emulated root port, the NVIDIA driver in the guest will downshift the physical link rate to 2.5GT/s and never increase it back to 8GT/s. I believe this is because the virtual downstream port only advertises Gen1 link speeds. If the GPUs are on the root complex (ie. pcie.0) the physical link will run at 2.5GT/s when the GPU is idle and upshift to 8GT/s under load. This also happens if the GPU is exposed in a conventional PCI topology to the VM. Another interesting data point is that an older Kepler GRID card does not have this issue, dynamically shifting the link speed under load regardless of the VM PCI/e topology, while a new M60 using the same driver experiences this problem. 
I've filed a bug with NVIDIA as this seems to be a regression, but it
appears (untested) that the hypervisor should take the approach of
exposing full, up-to-date PCIe link capabilities and report a link
status matching the downstream devices.

I'd suggest during your testing, watch lspci info for the GPU from the
host, noting the behavior of LnkSta (Link Status) to check if the
device gets stuck at 2.5GT/s in your VM configuration, and adjust the
topology until it works, likely placing the GPUs on pcie.0 for a Q35
based machine.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
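A hypothetical sketch of the placement suggested above: a Q35 machine with both GPUs assigned straight onto pcie.0 rather than behind emulated root ports. The host BDFs, guest slot addresses, and memory size are illustrative only, not taken from this thread:

```shell
qemu-system-x86_64 -machine q35,accel=kvm -m 64G -cpu host \
    -device vfio-pci,host=09:00.0,bus=pcie.0,addr=0x10 \
    -device vfio-pci,host=0a:00.0,bus=pcie.0,addr=0x11
```

With this layout there is no virtual downstream port advertising Gen1-only speeds between the guest driver and the device, which is the configuration reported here to let the physical link shift between 2.5GT/s and 8GT/s normally.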
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 16:56 ` Alex Williamson @ 2017-08-22 18:06 ` Michael S. Tsirkin 2017-08-29 10:41 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2017-08-22 18:06 UTC (permalink / raw) To: Alex Williamson; +Cc: Bob Chen, Marcel Apfelbaum, 陈博, qemu-devel On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote: > On Tue, 22 Aug 2017 15:04:55 +0800 > Bob Chen <a175818323@gmail.com> wrote: > > > Hi, > > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > > virtualization environment. (See attached) > > Neat, looks like we should implement a new QEMU vfio-pci option, > something like nvidia-gpudirect-p2p-id=. I don't think I'd want to > code the policy of where to enable it into QEMU or the kernel, so we'd > push it up to management layers or users to decide. > > > The key is to append the legacy pci capabilities list when setting up the > > hypervisor, with a Nvidia customized capability config. > > > > I added some hack in hw/vfio/pci.c and managed to implement that. > > > > Then I found the GPU was able to recognize its peer, and the latency has > > dropped. ✅ > > > > However the bandwidth didn't improve, but decreased instead. ❌ > > > > Any suggestions? > > What's the VM topology? I've found that in a Q35 configuration with > GPUs downstream of an emulated root port, the NVIDIA driver in the > guest will downshift the physical link rate to 2.5GT/s and never > increase it back to 8GT/s. I believe this is because the virtual > downstream port only advertises Gen1 link speeds. Fixing that would be nice, and it's great that you now actually have a reproducer that can be used to test it properly. Exposing higher link speeds is a bit of work since there are now all kind of corner cases to cover as guests may play with link speeds and we must pretend we change it accordingly. 
An especially interesting question is what to do with the assigned device when guest tries to play with port link speed. It's kind of similar to AER in that respect. I guess we can just ignore it for starters. > If the GPUs are on > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s > when the GPU is idle and upshift to 8GT/s under load. This also > happens if the GPU is exposed in a conventional PCI topology to the > VM. Another interesting data point is that an older Kepler GRID card > does not have this issue, dynamically shifting the link speed under > load regardless of the VM PCI/e topology, while a new M60 using the > same driver experiences this problem. I've filed a bug with NVIDIA as > this seems to be a regression, but it appears (untested) that the > hypervisor should take the approach of exposing full, up-to-date PCIe > link capabilities and report a link status matching the downstream > devices. > I'd suggest during your testing, watch lspci info for the GPU from the > host, noting the behavior of LnkSta (Link Status) to check if the > devices gets stuck at 2.5GT/s in your VM configuration and adjust the > topology until it works, likely placing the GPUs on pcie.0 for a Q35 > based machine. Thanks, > > Alex ^ permalink raw reply [flat|nested] 26+ messages in thread
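To make "exposing a different bit configuration" concrete, here is a small sketch of the two PCIe Link Capabilities fields involved: per the PCIe spec, bits 3:0 carry the Max Link Speed code (1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s) and bits 9:4 the Maximum Link Width. This is illustrative only, not QEMU code:

```python
# Encode/decode the Max Link Speed and Maximum Link Width fields of the
# PCIe Link Capabilities register (lower 10 bits only).
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}

def lnkcap(max_speed_code, max_width):
    return (max_width << 4) | max_speed_code

def decode(reg):
    return SPEEDS[reg & 0xF], (reg >> 4) & 0x3F

gen1_x1 = lnkcap(1, 1)     # what a Gen1-only virtual port advertises
gen3_x16 = lnkcap(3, 16)   # what the physical GPU link supports
print(hex(gen1_x1), decode(gen1_x1))    # 0x11 ('2.5GT/s', 1)
print(hex(gen3_x16), decode(gen3_x16))  # 0x103 ('8GT/s', 16)
```

An emulated downstream port advertising only speed code 1 here is consistent with the Gen1-only behavior Alex describes; the corner cases MST mentions come from the guest writing the matching Link Control 2 target-speed field and expecting Link Status to follow.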
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 18:06 ` Michael S. Tsirkin @ 2017-08-29 10:41 ` Bob Chen 2017-08-29 14:13 ` Alex Williamson 0 siblings, 1 reply; 26+ messages in thread From: Bob Chen @ 2017-08-29 10:41 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel The topology is already having all GPUs directly attached to root bus 0. In this situation you can't see the LnkSta attribute in any capabilities. The other way of using emulated switch would somehow show this attribute, at 8 GT/s, although the real bandwidth is low as usual. 2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>: > On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote: > > On Tue, 22 Aug 2017 15:04:55 +0800 > > Bob Chen <a175818323@gmail.com> wrote: > > > > > Hi, > > > > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > > > virtualization environment. (See attached) > > > > Neat, looks like we should implement a new QEMU vfio-pci option, > > something like nvidia-gpudirect-p2p-id=. I don't think I'd want to > > code the policy of where to enable it into QEMU or the kernel, so we'd > > push it up to management layers or users to decide. > > > > > The key is to append the legacy pci capabilities list when setting up > the > > > hypervisor, with a Nvidia customized capability config. > > > > > > I added some hack in hw/vfio/pci.c and managed to implement that. > > > > > > Then I found the GPU was able to recognize its peer, and the latency > has > > > dropped. ✅ > > > > > > However the bandwidth didn't improve, but decreased instead. ❌ > > > > > > Any suggestions? > > > > What's the VM topology? I've found that in a Q35 configuration with > > GPUs downstream of an emulated root port, the NVIDIA driver in the > > guest will downshift the physical link rate to 2.5GT/s and never > > increase it back to 8GT/s. 
I believe this is because the virtual > > downstream port only advertises Gen1 link speeds. > > > Fixing that would be nice, and it's great that you now actually have a > reproducer that can be used to test it properly. > > Exposing higher link speeds is a bit of work since there are now all > kind of corner cases to cover as guests may play with link speeds and we > must pretend we change it accordingly. An especially interesting > question is what to do with the assigned device when guest tries to play > with port link speed. It's kind of similar to AER in that respect. > > I guess we can just ignore it for starters. > > > If the GPUs are on > > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s > > when the GPU is idle and upshift to 8GT/s under load. This also > > happens if the GPU is exposed in a conventional PCI topology to the > > VM. Another interesting data point is that an older Kepler GRID card > > does not have this issue, dynamically shifting the link speed under > > load regardless of the VM PCI/e topology, while a new M60 using the > > same driver experiences this problem. I've filed a bug with NVIDIA as > > this seems to be a regression, but it appears (untested) that the > > hypervisor should take the approach of exposing full, up-to-date PCIe > > link capabilities and report a link status matching the downstream > > devices. > > > > I'd suggest during your testing, watch lspci info for the GPU from the > > host, noting the behavior of LnkSta (Link Status) to check if the > > devices gets stuck at 2.5GT/s in your VM configuration and adjust the > > topology until it works, likely placing the GPUs on pcie.0 for a Q35 > > based machine. Thanks, > > > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-29 10:41                             ` Bob Chen
@ 2017-08-29 14:13                               ` Alex Williamson
  2017-08-30  9:41                                 ` Bob Chen
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2017-08-29 14:13 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 29 Aug 2017 18:41:44 +0800
Bob Chen <a175818323@gmail.com> wrote:

> The topology is already having all GPUs directly attached to root bus 0. In
> this situation you can't see the LnkSta attribute in any capabilities.

Right, this is why I suggested viewing the physical device lspci info
from the host.  I haven't seen the stuck link issue with devices on the
root bus, but it may be worth double checking.  Thanks,

Alex

> The other way of using emulated switch would somehow show this attribute,
> at 8 GT/s, although the real bandwidth is low as usual.
>
> 2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
>
> > On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote:
> > > On Tue, 22 Aug 2017 15:04:55 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in
> > > > virtualization environment. (See attached)
> > >
> > > Neat, looks like we should implement a new QEMU vfio-pci option,
> > > something like nvidia-gpudirect-p2p-id=.  I don't think I'd want to
> > > code the policy of where to enable it into QEMU or the kernel, so we'd
> > > push it up to management layers or users to decide.
> > >
> > > > The key is to append the legacy pci capabilities list when setting up
> > > > the hypervisor, with a Nvidia customized capability config.
> > > >
> > > > I added some hack in hw/vfio/pci.c and managed to implement that.
> > > >
> > > > Then I found the GPU was able to recognize its peer, and the latency
> > > > has dropped. ✅
> > > >
> > > > However the bandwidth didn't improve, but decreased instead. ❌
> > > >
> > > > Any suggestions?
> > >
> > > What's the VM topology?  I've found that in a Q35 configuration with
> > > GPUs downstream of an emulated root port, the NVIDIA driver in the
> > > guest will downshift the physical link rate to 2.5GT/s and never
> > > increase it back to 8GT/s.  I believe this is because the virtual
> > > downstream port only advertises Gen1 link speeds.
> >
> > Fixing that would be nice, and it's great that you now actually have a
> > reproducer that can be used to test it properly.
> >
> > Exposing higher link speeds is a bit of work since there are now all
> > kind of corner cases to cover as guests may play with link speeds and we
> > must pretend we change it accordingly.  An especially interesting
> > question is what to do with the assigned device when guest tries to play
> > with port link speed.  It's kind of similar to AER in that respect.
> >
> > I guess we can just ignore it for starters.
> >
> > > If the GPUs are on
> > > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s
> > > when the GPU is idle and upshift to 8GT/s under load.  This also
> > > happens if the GPU is exposed in a conventional PCI topology to the
> > > VM.  Another interesting data point is that an older Kepler GRID card
> > > does not have this issue, dynamically shifting the link speed under
> > > load regardless of the VM PCI/e topology, while a new M60 using the
> > > same driver experiences this problem.  I've filed a bug with NVIDIA as
> > > this seems to be a regression, but it appears (untested) that the
> > > hypervisor should take the approach of exposing full, up-to-date PCIe
> > > link capabilities and report a link status matching the downstream
> > > devices.
> > >
> > > I'd suggest during your testing, watch lspci info for the GPU from the
> > > host, noting the behavior of LnkSta (Link Status) to check if the
> > > device gets stuck at 2.5GT/s in your VM configuration and adjust the
> > > topology until it works, likely placing the GPUs on pcie.0 for a Q35
> > > based machine.  Thanks,
> > >
> > > Alex
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-29 14:13                               ` Alex Williamson
@ 2017-08-30  9:41                                 ` Bob Chen
  2017-08-30 16:43                                   ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-30  9:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

I think I have observed what you said...

The link speed on host remained 8GT/s until I finished running
p2pBandwidthLatencyTest for the first time. Then it became 2.5GT/s...

# lspci -s 09:00.0 -vvv
09:00.0 3D controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
	Subsystem: NVIDIA Corporation Device 115e
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 7
	NUMA node: 0
	Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 3b800000000 (64-bit, prefetchable) [size=8G]
	Region 3: Memory at 3ba00000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] #19
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-30 9:41 ` Bob Chen @ 2017-08-30 16:43 ` Alex Williamson 2017-09-01 9:58 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Alex Williamson @ 2017-08-30 16:43 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Wed, 30 Aug 2017 17:41:20 +0800 Bob Chen <a175818323@gmail.com> wrote: > I think I have observed what you said... > > The link speed on host remained 8GT/s until I finished running > p2pBandwidthLatencyTest > for the first time. Then it became 2.5GT/s... > > > # lspci -s 09:00.0 -vvv ... > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- So long as the device renegotiates to 8GT/s under load rather than getting stuck at 2.5GT/s, I think this is the expected behavior. This is a power saving measure by the driver. Thanks, Alex ^ permalink raw reply [flat|nested] 26+ messages in thread
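To automate the LnkSta watching described above, here is a small parser sketch. The sample text is trimmed from the Tesla M60 lspci output quoted earlier in the thread, with the 2.5GT/s status line representing a hypothetical downshifted reading; on a real host, feed it the output of `lspci -s <bdf> -vvv` instead:

```python
import re

# Trimmed sample of `lspci -vvv` output; the LnkSta speed here models a
# downshifted link, not a value measured in this thread.
SAMPLE = """\
LnkCap: Port #16, Speed 8GT/s, Width x16, ASPM not supported
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+
"""

def link_speeds(lspci_text):
    """Return (capable_gts, current_gts) parsed from lspci -vvv output."""
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+)GT/s", lspci_text)
    sta = re.search(r"LnkSta:.*?Speed ([\d.]+)GT/s", lspci_text)
    return float(cap.group(1)), float(sta.group(1))

cap, sta = link_speeds(SAMPLE)
print(f"capable {cap} GT/s, running {sta} GT/s, downshifted: {sta < cap}")
```

Run in a loop (or under `watch -n1`) while the guest executes the bandwidth test: a link that never returns to 8GT/s under load reproduces the stuck-link regression, while one that sits at 2.5GT/s idle and upshifts to 8GT/s under load is just the driver's power saving, as noted above.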
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-30 16:43 ` Alex Williamson @ 2017-09-01 9:58 ` Bob Chen 2017-11-30 8:06 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Bob Chen @ 2017-09-01 9:58 UTC (permalink / raw) To: Alex Williamson Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel More updates: 1. This behavior was found not only on M60, but also on TITAN 1080Ti or Xp. 2. When not setting up the p2p compatibility, i.e. run the original qemu with GPUs attached to the root pcie bus, the LnkSta on host always remains at 8 GT/s. Don't know why the new p2p change would cause the GPU driver in guest to re-negotiate its speed. I think it has gone beyond the community's responsibility to debug this tricky issue. So I have contacted nvidia for technical support, and they are expected to send me a reply in next few weeks. Will keep you guys updated. Bob 2017-08-31 0:43 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>: > On Wed, 30 Aug 2017 17:41:20 +0800 > Bob Chen <a175818323@gmail.com> wrote: > > > I think I have observed what you said... > > > > The link speed on host remained 8GT/s until I finished running > > p2pBandwidthLatencyTest > > for the first time. Then it became 2.5GT/s... > > > > > > # lspci -s 09:00.0 -vvv > ... > > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- > > So long as the device renegotiates to 8GT/s under load rather than > getting stuck at 2.5GT/s, I think this is the expected behavior. This > is a power saving measure by the driver. Thanks, > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-09-01  9:58                                     ` Bob Chen
@ 2017-11-30  8:06                                       ` Bob Chen
  0 siblings, 0 replies; 26+ messages in thread
From: Bob Chen @ 2017-11-30  8:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Hi,

After 3 months of work and investigation, and tedious mail discussions
with Nvidia, I think some progress has been made on GPUDirect (p2p) in a
virtual environment.

The only remaining issue is the low bidirectional bandwidth between two
sibling GPUs under the same PCIe switch. We expanded the tests to run on
even more GPU cards, so the results now look conclusive.

P40 is OK, and its hardware topology on the host is:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[03]----00.0  LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
            +-02.0-[04]----00.0  NVIDIA Corporation GP102GL [Tesla P40]
            +-03.0-[02]----00.0  NVIDIA Corporation GP102GL [Tesla P40]

M60, not OK, low bandwidth:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[06]----00.0  LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
            +-02.0-[07-0a]----00.0-[08-0a]--+-08.0-[09]----00.0  NVIDIA Corporation GM204GL [Tesla M60]
            |                               \-10.0-[0a]----00.0  NVIDIA Corporation GM204GL [Tesla M60]

V100, not OK, low bandwidth:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[01]--+-00.0  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
            |            \-00.1  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
            +-02.0-[02-05]----00.0-[03-05]--+-08.0-[04]----00.0  NVIDIA Corporation GV100 [Tesla V100 PCIe]
            |                               \-10.0-[05]----00.0  NVIDIA Corporation GV100 [Tesla V100 PCIe]

So what might be the actual effect of the PLX switch hardware on GPU
data flow, even though the switch is not visible in the guest OS?

Nvidia's tech-support engineers are not familiar with virtualization, and
asked us to consult the community first.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-01 15:01       ` Alex Williamson
  2017-08-07 13:00         ` Bob Chen
@ 2017-08-07 13:04         ` Bob Chen
  2017-08-07 16:00           ` Alex Williamson
  1 sibling, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-07 13:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Besides, I checked the lspci -vvv output, no capabilities of Access
Control are seen.

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >          CPU0  <- QPI ->  CPU1
> > > >           |                |
> > > >   Root Port(at PCIe.0)  Root Port(at PCIe.1)
> > > >       /      \              /      \
> > >
> > > Are each of these lines above separate root ports?  ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
> >
> > Not quite sure if root complex is a concept or a real physical device ...
> >
> > But according to my observation by `lspci -vt`, there are indeed 4 Root
> > Ports in the system. So the sketch might need a tiny update.
> >
> >            CPU0    <- QPI ->    CPU1
> >             |                    |
> >    Root Complex(device?)  Root Complex(device?)
> >        /        \            /        \
> >   Root Port  Root Port  Root Port  Root Port
> >       /  \                  /  \
> >    Switch     Switch     Switch     Switch
> >     /  \      /  \        /  \      /  \
> >   GPU  GPU  GPU  GPU    GPU  GPU  GPU  GPU
>
> Yes, that's what I expected.  So the numbers make sense, the immediate
> sibling GPU would share bandwidth between the root port and upstream
> switch port, any other GPU should not double-up on any single link.
> > > > > Switch Switch Switch Switch > > > > / \ / \ / \ / \ > > > > GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > > > > > > > And below are the p2p bandwidth test results. > > > > > > > > Host: > > > > D\D 0 1 2 3 4 5 6 7 > > > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > > > > > VM: > > > > D\D 0 1 2 3 4 5 6 7 > > > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > > > > > Interesting test, how do you get these numbers? What are the units, > > > GB/s? > > > > > > > > > > > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are > > GB/s. Asynchronous read and write. Bidirectional. > > > > However, the Unidirectional test had shown a different result. Didn't > fall > > down to a half. 
> > VM:
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >    D\D     0      1      2      3      4      5      6      7
> >      0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
> >      1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
> >      2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
> >      3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
> >      4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
> >      5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
> >      6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
> >      7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75
> >
> > Host:
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >    D\D     0      1      2      3      4      5      6      7
> >      0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
> >      1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
> >      2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
> >      3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
> >      4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
> >      5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
> >      6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
> >      7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13
>
> Looks right, a unidirectional test would create bidirectional data
> flows on the root port to upstream switch link and should be able to
> saturate that link.  With the bidirectional test, that link becomes a
> bottleneck.
>
> > > > In the VM, the bandwidth between two GPUs under the same physical
> > > > switch is obviously lower, as per the reasons you said in former
> > > > threads.
> > >
> > > Hmm, I'm not sure I can explain why the number is lower than to more
> > > remote GPUs though.  Is the test simultaneously reading and writing and
> > > therefore we overload the link to the upstream switch port?  Otherwise
> > > I'd expect the bidirectional support in PCIe to be able to handle the
> > > bandwidth.  Does the test have a read-only or write-only mode?
> >
> > > > But what confused me most is that GPUs under different switches could
> > > > achieve the same speed, as well as in the Host.
> > > > Does that mean after IOMMU address translation, data traversing has
> > > > utilized QPI bus by default? Even these two devices do not belong to
> > > > the same PCIe bus?
> > >
> > > Yes, of course.  Once the transaction is translated by the IOMMU it's
> > > just a matter of routing the resulting address, whether that's back
> > > down the I/O hierarchy under the same root complex or across the QPI
> > > link to the other root complex.  The translated address could just as
> > > easily be to RAM that lives on the other side of the QPI link.  Also, it
> > > seems like the IOMMU overhead is perhaps negligible here, unless the
> > > IOMMU is actually being used in both cases.
> >
> > Yes, the overhead of bandwidth is negligible, but the latency is not as
> > good as we expected. I assume it is IOMMU address translation to blame.
> >
> > I ran this twice with IOMMU on/off on Host, the results were the same.
> >
> > VM:
> > P2P=Enabled Latency Matrix (us)
> >    D\D     0      1      2      3      4      5      6      7
> >      0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
> >      1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
> >      2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
> >      3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
> >      4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
> >      5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
> >      6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
> >      7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46
> >
> > Host:
> > P2P=Enabled Latency Matrix (us)
> >    D\D     0      1      2      3      4      5      6      7
> >      0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
> >      1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
> >      2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
> >      3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
> >      4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
> >      5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
> >      6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
> >      7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56
>
> Yes, the IOMMU is not free, page table walks are occurring here.
Are > you using 1G pages for the VM? 2G? Does this platform support 1G > super pages on the IOMMU? (cat /sys/class/iommu/*/intel-iommu/cap, bit > 34 is 2MB page support, bit 35 is 1G). All modern Xeons should support > 1G so you'll want to use 1G hugepages in the VM to take advantage of > that. > > > > In the host test, is the IOMMU still enabled? The routing of PCIe > > > transactions is going to be governed by ACS, which Linux enables > > > whenever the IOMMU is enabled, not just when a device is assigned to a > > > VM. It would be interesting to see if another performance tier is > > > exposed if the IOMMU is entirely disabled, or perhaps it might better > > > expose the overhead of the IOMMU translation. It would also be > > > interesting to see the ACS settings in lspci for each downstream port > > > for each test. Thanks, > > > > > > Alex > > > > > > > > > How to display GPU's ACS settings? Like this? > > > > [420 v2] Advanced Error Reporting > > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- > > ECRC- UnsupReq- ACSViol- > > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- > > ECRC- UnsupReq- ACSViol- > > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ > > ECRC- UnsupReq- ACSViol- > > As Michael notes, this is AER, ACS is Access Control Services. It > should be another capability in lspci. Thanks, > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 13:04               ` Bob Chen
@ 2017-08-07 16:00                 ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-07 16:00 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Mon, 7 Aug 2017 21:04:16 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Besides, I checked the lspci -vvv output, no capabilities of Access
> Control are seen.

Are these switches onboard an NVIDIA card or are they separate
components?  The examples I have on NVIDIA cards do include ACS:

 +-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               +-09.0-[45]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               +-10.0-[46]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               \-11.0-[47]----00.0  NVIDIA Corporation GK107GL [GRID K1]

# lspci -vvvs 43: | grep -A 2 "Access Control Services"
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

 +-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0  NVIDIA Corporation GM204GL [Tesla M60]
 |                               \-10.0-[07]----00.0  NVIDIA Corporation GM204GL [Tesla M60]

# lspci -vvvs 5: | grep -A 2 "Access Control Services"
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Without ACS on the downstream switch ports, the GPUs sharing the switch
will be in the same IOMMU group and we have no ability to control
anything about the routing between downstream ports.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2017-11-30  8:06 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4E0AFA5F-44D6-4624-A99F-68A7FE52F397@meituan.com>
     [not found] ` <4b31a711-a52e-25d3-4a7c-1be8521097d9@redhat.com>
     [not found]   ` <F99BFE80-FC15-40A0-BB3E-1B53B6CF9B05@meituan.com>
2017-07-26  6:21     ` [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 Marcel Apfelbaum
2017-07-26 15:29       ` Alex Williamson
2017-07-26 16:06         ` Michael S. Tsirkin
2017-07-26 17:32           ` Alex Williamson
2017-08-01  5:04             ` Bob Chen
2017-08-01  5:46               ` Alex Williamson
2017-08-01  9:35                 ` Bob Chen
2017-08-01 14:39                   ` Michael S. Tsirkin
2017-08-01 15:01                   ` Alex Williamson
2017-08-07 13:00                     ` Bob Chen
2017-08-07 15:52                       ` Alex Williamson
2017-08-08  1:44                         ` Bob Chen
2017-08-08  8:06                           ` Bob Chen
2017-08-08 16:53                             ` Alex Williamson
2017-08-08 20:07                               ` Michael S. Tsirkin
2017-08-22  7:04                                 ` Bob Chen
2017-08-22 16:56                                   ` Alex Williamson
2017-08-22 18:06                                     ` Michael S. Tsirkin
2017-08-29 10:41                                       ` Bob Chen
2017-08-29 14:13                                         ` Alex Williamson
2017-08-30  9:41                                           ` Bob Chen
2017-08-30 16:43                                             ` Alex Williamson
2017-09-01  9:58                                               ` Bob Chen
2017-11-30  8:06                                                 ` Bob Chen
2017-08-07 13:04                     ` Bob Chen
2017-08-07 16:00                       ` Alex Williamson