* Re: [Qemu-devel] About virtio device hotplug in Q35! [External email - handle with caution]
  From: Marcel Apfelbaum @ 2017-07-26  6:21 UTC (26+ messages in thread)
  To: 陈博, alex.williamson, Michael Tsirkin; +Cc: qemu-devel

On 25/07/2017 11:53, 陈博 wrote:
> To accelerate data traversing between devices under the same PCIe Root
> Port or Switch.
>
> See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

Hi,

It may be possible, but PCIe Switch assignment may not be the only way
to go.

Adding Alex and Michael for their input on this matter.
More info at:
https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

Thanks,
Marcel

> 陈博
>
> On 24 Jul 2017, at 7:36 PM, marcel@redhat.com wrote:
>
>> On 24/07/2017 13:24, 陈博 wrote:
>>> Is there any chance we could pass through a real PCIe Root Port
>>> device into a VM?
>>
>> Hi,
>>
>> That is an interesting thought, I hadn't considered it yet. What is
>> the scenario?
>>
>> Please be sure to CC qemu-devel next time :)
>>
>> Thanks,
>> Marcel
>>
>>> 陈博
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-07-26 15:29 UTC
  To: Marcel Apfelbaum; +Cc: 陈博, Michael Tsirkin, qemu-devel

On Wed, 26 Jul 2017 09:21:38 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> On 25/07/2017 11:53, 陈博 wrote:
> > To accelerate data traversing between devices under the same PCIe Root
> > Port or Switch.
> >
> > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
>
> Hi,
>
> It may be possible, but maybe PCIe Switch assignment is not
> the only way to go.
>
> Adding Alex and Michael for their input on this matter.
> More info at:
> https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html

I think you need to look at where the IOMMU is in the topology and what
address space the devices are working in when assigned to a VM to
realize that it doesn't make any sense to assign switch ports to a VM.
GPUs cannot do switch-level peer to peer when assigned because they are
operating in an I/O virtual address space.  This is why we configure
ACS on downstream ports to prevent peer to peer.  Peer-to-peer
transactions must be forwarded upstream by the switch ports in order to
reach the IOMMU for translation.  Note however that we do populate
peer-to-peer mappings within the IOMMU, so if the hardware supports it,
the IOMMU can reflect the transaction back out to the I/O bus to reach
the other device without CPU involvement.

Therefore I think the better solution, if it encourages the NVIDIA
driver to do the right thing, is to use emulated switches.  Assigning
the physical switch would really do nothing more than make the PCIe
link information more correct in the VM; everything else about the
switch would be emulated.

Even still, unless you have an I/O topology which integrates the IOMMU
into the switch itself, the data flow still needs to go all the way to
the root complex to hit the IOMMU before being reflected to the other
device.  Direct peer to peer between downstream switch ports operates
in the wrong address space.  Thanks,

Alex
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Michael S. Tsirkin @ 2017-07-26 16:06 UTC
  To: Alex Williamson; +Cc: Marcel Apfelbaum, 陈博, qemu-devel

On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> On Wed, 26 Jul 2017 09:21:38 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
> > On 25/07/2017 11:53, 陈博 wrote:
> > > To accelerate data traversing between devices under the same PCIe Root
> > > Port or Switch.
> > >
> > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> >
> > Hi,
> >
> > It may be possible, but maybe PCIe Switch assignment is not
> > the only way to go.
> >
> > Adding Alex and Michael for their input on this matter.
> > More info at:
> > https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
>
> I think you need to look at where the IOMMU is in the topology and what
> address space the devices are working in when assigned to a VM to
> realize that it doesn't make any sense to assign switch ports to a VM.
> GPUs cannot do switch-level peer to peer when assigned because they are
> operating in an I/O virtual address space.  This is why we configure
> ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> transactions must be forwarded upstream by the switch ports in order to
> reach the IOMMU for translation.  Note however that we do populate
> peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> the IOMMU can reflect the transaction back out to the I/O bus to reach
> the other device without CPU involvement.
>
> Therefore I think the better solution, if it encourages the NVIDIA
> driver to do the right thing, is to use emulated switches.  Assigning
> the physical switch would really do nothing more than make the PCIe
> link information more correct in the VM; everything else about the
> switch would be emulated.  Even still, unless you have an I/O topology
> which integrates the IOMMU into the switch itself, the data flow still
> needs to go all the way to the root complex to hit the IOMMU before
> being reflected to the other device.  Direct peer to peer between
> downstream switch ports operates in the wrong address space.  Thanks,
>
> Alex

That's true of course.  What would make sense would be for hardware
vendors to add ATS support to their cards.

Then peer to peer should be allowed by the hypervisor for translated
transactions.

That gives you the performance benefit without the security issues.

Does anyone know whether any hardware implements this?

Of course that would mostly be transparent to guests, so you would
still use an emulated switch as Alex suggested.

--
MST
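One way to answer that question for a specific card is to look for the ATS
extended capability from the host. A quick sketch; the BDF below is a
placeholder for an actual GPU, not a device from this thread:

```shell
# Check whether a device advertises the Address Translation Service
# extended capability. 0000:05:00.0 is a placeholder BDF.
gpu=0000:05:00.0
sudo lspci -s "$gpu" -vvv | grep -A2 'Address Translation Service'
# No output means the device has no ATS capability; otherwise lspci
# decodes the ATSCap/ATSCtl registers, including the Enable bit.
```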
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-07-26 17:32 UTC
  To: Michael S. Tsirkin; +Cc: Marcel Apfelbaum, 陈博, qemu-devel

On Wed, 26 Jul 2017 19:06:58 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > I think you need to look at where the IOMMU is in the topology and what
> > address space the devices are working in when assigned to a VM to
> > realize that it doesn't make any sense to assign switch ports to a VM.
> > GPUs cannot do switch-level peer to peer when assigned because they are
> > operating in an I/O virtual address space.  This is why we configure
> > ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> > transactions must be forwarded upstream by the switch ports in order to
> > reach the IOMMU for translation.  Note however that we do populate
> > peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> > the IOMMU can reflect the transaction back out to the I/O bus to reach
> > the other device without CPU involvement.
> >
> > Therefore I think the better solution, if it encourages the NVIDIA
> > driver to do the right thing, is to use emulated switches.  Assigning
> > the physical switch would really do nothing more than make the PCIe
> > link information more correct in the VM; everything else about the
> > switch would be emulated.  Even still, unless you have an I/O topology
> > which integrates the IOMMU into the switch itself, the data flow still
> > needs to go all the way to the root complex to hit the IOMMU before
> > being reflected to the other device.  Direct peer to peer between
> > downstream switch ports operates in the wrong address space.  Thanks,
> >
> > Alex
>
> That's true of course.  What would make sense would be for hardware
> vendors to add ATS support to their cards.
>
> Then peer to peer should be allowed by the hypervisor for translated
> transactions.
>
> Gives you the performance benefit without the security issues.
>
> Does anyone know whether any hardware implements this?

GPUs often do implement ATS, and the ACS DT (Direct Translated P2P)
capability should handle routing requests with the Address Type field
indicating a translated address directly between downstream ports.  DT
is however not part of the standard set of ACS bits that we enable.  It
seems like it might be fairly easy to poke the DT enable bit with
setpci from userspace to test whether this "just works", providing of
course you can get the driver to attempt to do peer to peer and ATS is
already functioning on the GPU.  If so, then we should look at where
in the code to do that enabling automatically.  Thanks,

Alex
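That userspace experiment might look like the following sketch. The BDF is a
placeholder for an actual downstream switch port, the `ECAP_ACS` register name
assumes a pciutils build that knows the ACS extended capability, and the DT
enable bit position (bit 6 of ACS Control, offset +6 in the capability) is
taken from the PCIe spec:

```shell
# Placeholder BDF of the downstream port whose ACS control we want to poke.
port=0000:03:08.0

# Show the current ACS state as decoded by lspci.
sudo lspci -s "$port" -vvv | grep -A3 'Access Control Services'

# ACS Control is at offset +6 in the capability; Direct Translated P2P
# enable is bit 6 (mask 0x0040). Read, OR in the bit, write back.
val=$(sudo setpci -s "$port" ECAP_ACS+6.w)
sudo setpci -s "$port" ECAP_ACS+6.w=$(printf '%04x' $(( 0x$val | 0x40 )))
```

This only changes the host-side routing policy for translated requests; the
GPU driver still has to attempt peer to peer with ATS active for it to matter.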
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Bob Chen @ 2017-08-01  5:04 UTC
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Hi,

This is a sketch of my hardware topology.

             CPU0  <- QPI ->  CPU1
              |                 |
   Root Port (at PCIe.0)   Root Port (at PCIe.1)
       /         \             /         \
   Switch      Switch      Switch      Switch
    /  \        /  \        /  \        /  \
  GPU  GPU   GPU  GPU    GPU  GPU    GPU  GPU

And below are the p2p bandwidth test results.

Host:
   D\D     0      1      2      3      4      5      6      7
     0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
     1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
     2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
     3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
     4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
     5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
     6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
     7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15

VM:
   D\D     0      1      2      3      4      5      6      7
     0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
     1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
     2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
     3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
     4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
     5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
     6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
     7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23

In the VM, the bandwidth between two GPUs under the same physical
switch is obviously lower, as per the reasons you gave in the earlier
threads.

But what confused me most is that GPUs under different switches could
achieve the same speed as in the host. Does that mean that after IOMMU
address translation, data traversal utilizes the QPI bus by default,
even though the two devices do not belong to the same PCIe hierarchy?

In a word, I'm trying to build a massive deep-learning/HPC
infrastructure for the cloud environment. Nvidia itself released a
solution based on Docker, and I believe QEMU/VMs could also do it.
Hopefully I could get some help from the community.

The emulated switch you suggested looks like a good option to me, I
will give it a try.

Thanks,
Bob

2017-07-27 1:32 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Wed, 26 Jul 2017 19:06:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Wed, Jul 26, 2017 at 09:29:31AM -0600, Alex Williamson wrote:
> > > On Wed, 26 Jul 2017 09:21:38 +0300
> > > Marcel Apfelbaum <marcel@redhat.com> wrote:
> > >
> > > > On 25/07/2017 11:53, 陈博 wrote:
> > > > > To accelerate data traversing between devices under the same
> > > > > PCIe Root Port or Switch.
> > > > >
> > > > > See https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > > >
> > > > Hi,
> > > >
> > > > It may be possible, but maybe PCIe Switch assignment is not
> > > > the only way to go.
> > > >
> > > > Adding Alex and Michael for their input on this matter.
> > > > More info at:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-07/msg07209.html
> > >
> > > I think you need to look at where the IOMMU is in the topology and what
> > > address space the devices are working in when assigned to a VM to
> > > realize that it doesn't make any sense to assign switch ports to a VM.
> > > GPUs cannot do switch-level peer to peer when assigned because they are
> > > operating in an I/O virtual address space.  This is why we configure
> > > ACS on downstream ports to prevent peer to peer.  Peer-to-peer
> > > transactions must be forwarded upstream by the switch ports in order to
> > > reach the IOMMU for translation.  Note however that we do populate
> > > peer-to-peer mappings within the IOMMU, so if the hardware supports it,
> > > the IOMMU can reflect the transaction back out to the I/O bus to reach
> > > the other device without CPU involvement.
> > >
> > > Therefore I think the better solution, if it encourages the NVIDIA
> > > driver to do the right thing, is to use emulated switches.  Assigning
> > > the physical switch would really do nothing more than make the PCIe
> > > link information more correct in the VM; everything else about the
> > > switch would be emulated.  Even still, unless you have an I/O topology
> > > which integrates the IOMMU into the switch itself, the data flow still
> > > needs to go all the way to the root complex to hit the IOMMU before
> > > being reflected to the other device.  Direct peer to peer between
> > > downstream switch ports operates in the wrong address space.  Thanks,
> > >
> > > Alex
> >
> > That's true of course.  What would make sense would be for hardware
> > vendors to add ATS support to their cards.
> >
> > Then peer to peer should be allowed by the hypervisor for translated
> > transactions.
> >
> > Gives you the performance benefit without the security issues.
> >
> > Does anyone know whether any hardware implements this?
>
> GPUs often do implement ATS and the ACS DT (Direct Translated P2P)
> capability should handle routing requests with the Address Type field
> indicating a translated address directly between downstream ports.  DT
> is however not part of the standard set of ACS bits that we enable.  It
> seems like it might be fairly easy to poke the DT enable bit with
> setpci from userspace to test whether this "just works", providing of
> course you can get the driver to attempt to do peer to peer and ATS is
> already functioning on the GPU.  If so, then we should look at where
> in the code to do that enabling automatically.  Thanks,
>
> Alex
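The emulated-switch topology discussed above can be sketched as a QEMU command
line. This is only an illustration under assumptions: it uses QEMU's emulated
ioh3420 root port and TI X3130 switch devices (current in the QEMU of that
era), and the host BDFs are placeholders for GPUs already bound to vfio-pci:

```shell
# Q35 guest with an emulated PCIe switch; two assigned GPUs sit under the
# emulated downstream ports, so the guest sees them as switch siblings.
# Host BDFs 0000:05:00.0 / 0000:06:00.0 are placeholders.
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -m 16G \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device x3130-upstream,id=sw1,bus=rp1 \
  -device xio3130-downstream,id=sw1dp0,bus=sw1,chassis=2,slot=0 \
  -device xio3130-downstream,id=sw1dp1,bus=sw1,chassis=2,slot=1 \
  -device vfio-pci,host=0000:05:00.0,bus=sw1dp0 \
  -device vfio-pci,host=0000:06:00.0,bus=sw1dp1
```

The point of the exercise is purely topological: a driver that keys its peer
to peer behavior on the PCI hierarchy sees both GPUs behind one switch, while
the actual DMA routing is still governed by the physical ACS/IOMMU setup.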
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Alex Williamson @ 2017-08-01  5:46 UTC
  To: Bob Chen; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 1 Aug 2017 13:04:46 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Hi,
>
> This is a sketch of my hardware topology.
>
>              CPU0  <- QPI ->  CPU1
>               |                 |
>    Root Port (at PCIe.0)   Root Port (at PCIe.1)
>        /         \             /         \

Are each of these lines above separate root ports?  ie. each root
complex hosts two root ports, each with a two-port switch downstream of
it?

>    Switch      Switch      Switch      Switch
>     /  \        /  \        /  \        /  \
>   GPU  GPU   GPU  GPU    GPU  GPU    GPU  GPU
>
> And below are the p2p bandwidth test results.
>
> Host:
>    D\D     0      1      2      3      4      5      6      7
>      0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
>      1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
>      2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
>      3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
>      4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
>      5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
>      6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
>      7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
>
> VM:
>    D\D     0      1      2      3      4      5      6      7
>      0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
>      1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
>      2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
>      3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
>      4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
>      5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
>      6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
>      7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23

Interesting test, how do you get these numbers?  What are the units,
GB/s?

> In the VM, the bandwidth between two GPUs under the same physical switch
> is obviously lower, as per the reasons you said in former threads.

Hmm, I'm not sure I can explain why the number is lower than to more
remote GPUs though.  Is the test simultaneously reading and writing and
therefore we overload the link to the upstream switch port?  Otherwise
I'd expect the bidirectional support in PCIe to be able to handle the
bandwidth.  Does the test have a read-only or write-only mode?

> But what confused me most is that GPUs under different switches could
> achieve the same speed, as well as in the Host. Does that mean after
> IOMMU address translation, data traversing has utilized QPI bus by
> default? Even these two devices do not belong to the same PCIe bus?

Yes, of course.  Once the transaction is translated by the IOMMU it's
just a matter of routing the resulting address, whether that's back
down the I/O hierarchy under the same root complex or across the QPI
link to the other root complex.  The translated address could just as
easily be to RAM that lives on the other side of the QPI link.  Also,
it seems like the IOMMU overhead is perhaps negligible here, unless the
IOMMU is actually being used in both cases.

In the host test, is the IOMMU still enabled?  The routing of PCIe
transactions is going to be governed by ACS, which Linux enables
whenever the IOMMU is enabled, not just when a device is assigned to a
VM.  It would be interesting to see if another performance tier is
exposed if the IOMMU is entirely disabled, or perhaps it might better
expose the overhead of the IOMMU translation.  It would also be
interesting to see the ACS settings in lspci for each downstream port
for each test.  Thanks,

Alex
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Bob Chen @ 2017-08-01  9:35 UTC
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 13:04:46 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > Hi,
> >
> > This is a sketch of my hardware topology.
> >
> >              CPU0  <- QPI ->  CPU1
> >               |                 |
> >    Root Port (at PCIe.0)   Root Port (at PCIe.1)
> >        /         \             /         \
>
> Are each of these lines above separate root ports?  ie. each root
> complex hosts two root ports, each with a two-port switch downstream of
> it?

Not quite sure if root complex is a concept or a real physical device ...

But according to my observation with `lspci -vt`, there are indeed 4
Root Ports in the system. So the sketch needs a tiny update.

             CPU0  <- QPI ->  CPU1
              |                 |
  Root Complex (device?)   Root Complex (device?)
      /        \               /        \
  Root Port  Root Port    Root Port  Root Port
      |          |            |          |
   Switch     Switch       Switch     Switch
    /  \       /  \         /  \       /  \
  GPU  GPU   GPU  GPU     GPU  GPU   GPU  GPU

> > Switch      Switch      Switch      Switch
> >  /  \        /  \        /  \        /  \
> > GPU GPU    GPU GPU     GPU GPU     GPU GPU
> >
> > And below are the p2p bandwidth test results.
> >
> > Host:
> >    D\D     0      1      2      3      4      5      6      7
> >      0 426.91  25.32  19.72  19.72  19.69  19.68  19.75  19.66
> >      1  25.31 427.61  19.74  19.72  19.66  19.68  19.74  19.73
> >      2  19.73  19.73 429.49  25.33  19.66  19.74  19.73  19.74
> >      3  19.72  19.71  25.36 426.68  19.70  19.71  19.77  19.74
> >      4  19.72  19.72  19.73  19.75 425.75  25.33  19.72  19.71
> >      5  19.71  19.75  19.76  19.75  25.35 428.11  19.69  19.70
> >      6  19.76  19.72  19.79  19.78  19.73  19.74 425.75  25.35
> >      7  19.69  19.75  19.79  19.75  19.72  19.72  25.39 427.15
> >
> > VM:
> >    D\D     0      1      2      3      4      5      6      7
> >      0 427.38  10.52  18.99  19.11  19.75  19.62  19.75  19.71
> >      1  10.53 426.68  19.28  19.19  19.73  19.71  19.72  19.73
> >      2  18.88  19.30 426.92  10.48  19.66  19.71  19.67  19.68
> >      3  18.93  19.18  10.45 426.94  19.69  19.72  19.67  19.72
> >      4  19.60  19.66  19.69  19.70 428.13  10.49  19.40  19.57
> >      5  19.52  19.74  19.72  19.69  10.44 426.45  19.68  19.61
> >      6  19.63  19.50  19.72  19.64  19.59  19.66 426.91  10.47
> >      7  19.69  19.75  19.70  19.69  19.66  19.74  10.45 426.23
>
> Interesting test, how do you get these numbers?  What are the units,
> GB/s?

The p2pBandwidthLatencyTest from the Nvidia CUDA sample code. Units are
GB/s. Asynchronous read and write, bidirectional.

However, the unidirectional test shows a different result: the sibling
bandwidth does not fall to half.

VM:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
     1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
     2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
     3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
     4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
     5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
     6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
     7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75

Host:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
     1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
     2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
     3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
     4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
     5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
     6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
     7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13

> > In the VM, the bandwidth between two GPUs under the same physical
> > switch is obviously lower, as per the reasons you said in former
> > threads.
>
> Hmm, I'm not sure I can explain why the number is lower than to more
> remote GPUs though.  Is the test simultaneously reading and writing and
> therefore we overload the link to the upstream switch port?  Otherwise
> I'd expect the bidirectional support in PCIe to be able to handle the
> bandwidth.  Does the test have a read-only or write-only mode?
>
> > But what confused me most is that GPUs under different switches could
> > achieve the same speed, as well as in the Host. Does that mean after
> > IOMMU address translation, data traversing has utilized QPI bus by
> > default? Even these two devices do not belong to the same PCIe bus?
>
> Yes, of course.  Once the transaction is translated by the IOMMU it's
> just a matter of routing the resulting address, whether that's back
> down the I/O hierarchy under the same root complex or across the QPI
> link to the other root complex.  The translated address could just as
> easily be to RAM that lives on the other side of the QPI link.  Also,
> it seems like the IOMMU overhead is perhaps negligible here, unless the
> IOMMU is actually being used in both cases.

Yes, the bandwidth overhead is negligible, but the latency is not as
good as we expected. I assume IOMMU address translation is to blame.

I ran this twice with the IOMMU on/off on the host; the results were
the same.

VM:
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7
     0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
     1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
     2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
     3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
     4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
     5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
     6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
     7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46

Host:
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7
     0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
     1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
     2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
     3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
     4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
     5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
     6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
     7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56

> In the host test, is the IOMMU still enabled?  The routing of PCIe
> transactions is going to be governed by ACS, which Linux enables
> whenever the IOMMU is enabled, not just when a device is assigned to a
> VM.  It would be interesting to see if another performance tier is
> exposed if the IOMMU is entirely disabled, or perhaps it might better
> expose the overhead of the IOMMU translation.  It would also be
> interesting to see the ACS settings in lspci for each downstream port
> for each test.  Thanks,
>
> Alex

How do I display a GPU's ACS settings? Like this?

[420 v2] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
* Re: [Qemu-devel] About virtio device hotplug in Q35!
  From: Michael S. Tsirkin @ 2017-08-01 14:39 UTC
  To: Bob Chen; +Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, Aug 01, 2017 at 05:35:40PM +0800, Bob Chen wrote:
> How to display GPU's ACS settings? Like this?
>
> [420 v2] Advanced Error Reporting
>     UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

Right, but that's AER.  You want ACS (Access Control Services).

--
MST
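A quick way to dump the ACS capability (rather than AER) on every port in the
system; a sketch, assuming pciutils is installed (class 0604 is the
PCI-to-PCI bridge class, which covers root ports and switch up/downstream
ports):

```shell
# Print the Access Control Services capability for every PCI bridge.
for dev in $(lspci -D -d ::0604 | awk '{print $1}'); do
  echo "== $dev =="
  sudo lspci -s "$dev" -vvv | grep -A2 'Access Control Services' \
    || echo "   (no ACS capability)"
done
```

In the decoded output, ACSCap lists what the port supports and ACSCtl what is
currently enabled (SrcValid, TransBlk, ReqRedir, CmpltRedir, UpstreamFwd,
EgressCtrl, DirectTrans).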
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-01 9:35 ` Bob Chen 2017-08-01 14:39 ` Michael S. Tsirkin @ 2017-08-01 15:01 ` Alex Williamson 2017-08-07 13:00 ` Bob Chen 2017-08-07 13:04 ` Bob Chen 1 sibling, 2 replies; 26+ messages in thread From: Alex Williamson @ 2017-08-01 15:01 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Tue, 1 Aug 2017 17:35:40 +0800 Bob Chen <a175818323@gmail.com> wrote: > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>: > > > On Tue, 1 Aug 2017 13:04:46 +0800 > > Bob Chen <a175818323@gmail.com> wrote: > > > > > Hi, > > > > > > This is a sketch of my hardware topology. > > > > > > CPU0 <- QPI -> CPU1 > > > | | > > > Root Port(at PCIe.0) Root Port(at PCIe.1) > > > / \ / \ > > > > Are each of these lines above separate root ports? ie. each root > > complex hosts two root ports, each with a two-port switch downstream of > > it? > > > > Not quite sure if root complex is a concept or a real physical device ... > > But according to my observation by `lspci -vt`, there are indeed 4 Root > Ports in the system. So the sketch might need a tiny update. > > > CPU0 <- QPI -> CPU1 > > | | > > Root Complex(device?) Root Complex(device?) > > / \ / \ > > Root Port Root Port Root Port Root Port > > / \ / \ > > Switch Switch Switch Switch > > / \ / \ / \ / \ > > GPU GPU GPU GPU GPU GPU GPU GPU Yes, that's what I expected. So the numbers make sense, the immediate sibling GPU would share bandwidth between the root port and upstream switch port, any other GPU should not double-up on any single link. > > > Switch Switch Switch Switch > > > / \ / \ / \ / \ > > > GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > > > > And below are the p2p bandwidth test results. 
> > > > > > Host: > > > D\D 0 1 2 3 4 5 6 7 > > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > > > VM: > > > D\D 0 1 2 3 4 5 6 7 > > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > > > Interesting test, how do you get these numbers? What are the units, > > GB/s? > > > > > > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are > GB/s. Asynchronous read and write. Bidirectional. > > However, the Unidirectional test had shown a different result. Didn't fall > down to a half. 
> VM:
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3      4      5      6      7
>      0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
>      1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
>      2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
>      3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
>      4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
>      5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
>      6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
>      7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75
>
> Host:
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3      4      5      6      7
>      0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
>      1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
>      2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
>      3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
>      4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
>      5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
>      6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
>      7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13

Looks right, a unidirectional test would create bidirectional data
flows on the root port to upstream switch link and should be able to
saturate that link. With the bidirectional test, that link becomes a
bottleneck.

> > > In the VM, the bandwidth between two GPUs under the same physical
> > > switch is obviously lower, as per the reasons you said in former
> > > threads.
> >
> > Hmm, I'm not sure I can explain why the number is lower than to more
> > remote GPUs though. Is the test simultaneously reading and writing and
> > therefore we overload the link to the upstream switch port? Otherwise
> > I'd expect the bidirectional support in PCIe to be able to handle the
> > bandwidth. Does the test have a read-only or write-only mode?
> >
> > > But what confused me most is that GPUs under different switches could
> > > achieve the same speed, as well as in the Host. Does that mean after
> > > IOMMU address translation, data traversing has utilized QPI bus by
> > > default? Even these two devices do not belong to the same PCIe bus?
> >
> > Yes, of course. Once the transaction is translated by the IOMMU it's
> > just a matter of routing the resulting address, whether that's back
> > down the I/O hierarchy under the same root complex or across the QPI
> > link to the other root complex. The translated address could just as
> > easily be to RAM that lives on the other side of the QPI link. Also, it
> > seems like the IOMMU overhead is perhaps negligible here, unless the
> > IOMMU is actually being used in both cases.
>
> Yes, the overhead of bandwidth is negligible, but the latency is not as
> good as we expected. I assume it is IOMMU address translation to blame.
>
> I ran this twice with IOMMU on/off on Host, the results were the same.
>
> VM:
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3      4      5      6      7
>      0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
>      1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
>      2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
>      3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
>      4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
>      5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
>      6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
>      7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46
>
> Host:
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3      4      5      6      7
>      0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
>      1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
>      2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
>      3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
>      4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
>      5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
>      6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
>      7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56

Yes, the IOMMU is not free, page table walks are occurring here. Are
you using 1G pages for the VM? 2M? Does this platform support 1G
super pages on the IOMMU? (cat /sys/class/iommu/*/intel-iommu/cap, bit
34 is 2MB page support, bit 35 is 1G).
All modern Xeons should support 1G so you'll want to use 1G hugepages
in the VM to take advantage of that.

> > In the host test, is the IOMMU still enabled? The routing of PCIe
> > transactions is going to be governed by ACS, which Linux enables
> > whenever the IOMMU is enabled, not just when a device is assigned to a
> > VM. It would be interesting to see if another performance tier is
> > exposed if the IOMMU is entirely disabled, or perhaps it might better
> > expose the overhead of the IOMMU translation. It would also be
> > interesting to see the ACS settings in lspci for each downstream port
> > for each test. Thanks,
> >
> > Alex
>
> How to display GPU's ACS settings? Like this?
>
> [420 v2] Advanced Error Reporting
> 	UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 	UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 	UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

As Michael notes, this is AER, ACS is Access Control Services. It
should be another capability in lspci. Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
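Alex's capability check above can be scripted. A minimal sketch, where the register value is a typical sample for illustration, not one read from the machine in this thread:

```shell
# Decode the superpage bits of an Intel IOMMU capability register.
# Bit 34 = 2MB superpage support, bit 35 = 1GB superpage support.
# On a live system the value comes from /sys/class/iommu/*/intel-iommu/cap
# (the file prints bare hex, so prefix it with 0x before using it here).
cap=0xd2078c106f0466   # sample register value for illustration only
echo "2M superpages: $(( (cap >> 34) & 1 ))"
echo "1G superpages: $(( (cap >> 35) & 1 ))"
```

With this sample value both bits are set, so the output is `2M superpages: 1` and `1G superpages: 1`; a zero would mean the IOMMU cannot map that superpage size and guest hugepages would still be walked at a smaller granularity.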
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-01 15:01   ` Alex Williamson
@ 2017-08-07 13:00     ` Bob Chen
  2017-08-07 15:52       ` Alex Williamson
  2017-08-07 13:04     ` Bob Chen
  1 sibling, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-07 13:00 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Bad news... The performance had dropped dramatically when using emulated
switches.

I was referring to the PCIe doc at
https://github.com/qemu/qemu/blob/master/docs/pcie.txt

# qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine q35,accel=kvm -nodefaults -nodefconfig \
-device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
-device x3130-upstream,id=upstream_port1,bus=root_port1 \
-device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
-device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
-device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
-device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
-device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
-device x3130-upstream,id=upstream_port2,bus=root_port2 \
-device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
-device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
-device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
-device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
...

Not 8 GPUs this time, only 4.

*1. Attached to pcie bus directly (former situation):*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.93  10.03  11.07  11.09
     1  10.04 425.05  11.08  10.97
     2  11.17  11.17 425.07  10.07
     3  11.25  11.25  10.07 423.64
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 425.98  10.03  11.07  11.09
     1   9.99 426.43  11.07  11.07
     2  11.04  11.20 425.98   9.89
     3  11.21  11.21  10.06 425.97
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 430.67  10.45  19.59  19.58
     1  10.44 428.81  19.49  19.53
     2  19.62  19.62 429.52  10.57
     3  19.60  19.66  10.43 427.38
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 429.47  10.47  19.52  19.39
     1  10.48 427.15  19.64  19.52
     2  19.64  19.59 429.02  10.42
     3  19.60  19.64  10.47 427.81
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.50  13.72  14.49  14.44
     1  13.65   4.53  14.52  14.33
     2  14.22  13.82   4.52  14.50
     3  13.87  13.75  14.53   4.55
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.44  13.56  14.58  14.45
     1  13.56   4.48  14.39  14.45
     2  13.85  13.93   4.86  14.80
     3  14.51  14.23  14.70   4.72

*2. Attached to emulated Root Port and Switches:*

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 420.48   3.15   3.12   3.12
     1   3.13 422.31   3.12   3.12
     2   3.08   3.09 421.40   3.13
     3   3.10   3.10   3.13 418.68
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 418.68   3.14   3.12   3.12
     1   3.15 420.03   3.12   3.12
     2   3.11   3.10 421.39   3.14
     3   3.11   3.08   3.13 419.13
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 424.36   5.36   5.35   5.34
     1   5.36 424.36   5.34   5.34
     2   5.35   5.36 425.52   5.35
     3   5.36   5.36   5.34 425.29
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 422.98   5.35   5.35   5.35
     1   5.35 423.44   5.34   5.33
     2   5.35   5.35 425.29   5.35
     3   5.35   5.34   5.34 423.21
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.79  16.59  16.38  16.22
     1  16.62   4.77  16.35  16.69
     2  16.77  16.66   4.03  16.68
     3  16.54  16.56  16.78   4.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.51  16.56  16.58  16.66
     1  15.65   3.87  16.74  16.61
     2  16.59  16.81   3.96  16.70
     3  16.47  16.28  16.68   4.03

Is it because the heavy load of CPU emulation had caused a bottleneck?

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >          CPU0  <- QPI ->  CPU1
> > > >           |                |
> > > > Root Port(at PCIe.0)  Root Port(at PCIe.1)
> > > >        /    \              /    \
> > >
> > > Are each of these lines above separate root ports? ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
> >
> > Not quite sure if root complex is a concept or a real physical device ...
> >
> > But according to my observation by `lspci -vt`, there are indeed 4 Root
> > Ports in the system. So the sketch might need a tiny update.
> >              CPU0  <- QPI ->  CPU1
> >               |                 |
> >  Root Complex(device?)   Root Complex(device?)
> >      /         \             /         \
> > Root Port  Root Port    Root Port  Root Port
> >     |          |            |          |
> >  Switch     Switch       Switch     Switch
> >   /  \       /  \         /  \       /  \
> > GPU  GPU   GPU  GPU     GPU  GPU   GPU  GPU
>
> Yes, that's what I expected. So the numbers make sense, the immediate
> sibling GPU would share bandwidth between the root port and upstream
> switch port, any other GPU should not double-up on any single link.
>
> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
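For reference, the 1G-hugepage suggestion quoted earlier in this message would translate into roughly the following host setup. This is a sketch: the page count, mount point, and guest memory size are illustrative, not values taken from the system in this thread, and it assumes a CPU/kernel with 1G hugepage support (`pdpe1gb` in /proc/cpuinfo):

```shell
# Reserve sixteen 1G hugepages and expose them through hugetlbfs
# (requires root; numbers are illustrative).
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
mkdir -p /dev/hugepages1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G

# Then back the guest RAM with them, keeping the rest of the
# command line (the -device vfio-pci options etc.) unchanged:
#   qemu-system-x86_64 ... -m 16G -mem-path /dev/hugepages1G -mem-prealloc
```

With guest memory on 1G pages, the IOMMU can map it with 1G superpages (when the capability register advertises them), shortening the page table walks Alex points at.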
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 13:00 ` Bob Chen
@ 2017-08-07 15:52   ` Alex Williamson
  2017-08-08  1:44     ` Bob Chen
  2017-08-08 20:07     ` Michael S. Tsirkin
  0 siblings, 2 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-07 15:52 UTC (permalink / raw)
To: Bob Chen
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Mon, 7 Aug 2017 21:00:04 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Bad news... The performance had dropped dramatically when using emulated
> switches.
>
> [...]
>
> Is it because the heavy load of CPU emulation had caused a bottleneck?

QEMU should really not be involved in the data flow, once the memory
slots are configured in KVM, we really should not be exiting out to
QEMU regardless of the topology. I wonder if it has something to do
with the link speed/width advertised on the switch port. I don't think
the endpoint can actually downshift the physical link, so lspci on the
host should probably still show the full bandwidth capability, but
maybe the driver is somehow doing rate limiting. PCIe gets a little
more complicated as we go to newer versions, so it's not quite as
simple as exposing a different bit configuration to advertise 8GT/s,
x16. Last I tried to do link matching it was deemed too complicated
for something I couldn't prove at the time had measurable value. This
might be a good way to prove that value if it makes a difference here.
I can't think why else you'd see such a performance difference, but
testing to see if the KVM exit rate is significantly different could
still be an interesting verification. Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
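One quick way to chase the link-training question Alex raises is to compare the advertised (LnkCap) and negotiated (LnkSta) speed and width of each port, on the host and inside the guest. A sketch that runs against a captured lspci excerpt rather than live hardware; the sample values are illustrative:

```shell
# Compare advertised vs. negotiated PCIe link parameters.
# On a live system the input would come from: lspci -vvv -s <slot>
sample='LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L1
LnkSta: Speed 8GT/s, Width x16'
cap=$(printf '%s\n' "$sample" | grep 'LnkCap' | grep -o 'Speed [^,]*, Width x[0-9]*')
sta=$(printf '%s\n' "$sample" | grep 'LnkSta' | grep -o 'Speed [^,]*, Width x[0-9]*')
if [ "$cap" = "$sta" ]; then
    echo "link negotiated at full capability: $sta"
else
    echo "degraded link: cap=$cap sta=$sta"
fi
```

If the guest-visible downstream port reports a lower speed or width than the physical PLX port, that would support the rate-limiting theory; if both match, the bottleneck is elsewhere.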
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 15:52 ` Alex Williamson
@ 2017-08-08  1:44   ` Bob Chen
  2017-08-08  8:06     ` Bob Chen
  2017-08-08 16:53     ` Alex Williamson
  1 sibling, 2 replies; 26+ messages in thread
From: Bob Chen @ 2017-08-08 1:44 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

1. How to test the KVM exit rate?

2. The switches are separate devices of PLX Technology

# lspci -s 07:08.0 -nn
07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)

# This is one of the Root Ports in the system.
[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
          +-01.0-[01]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
          +-02.0-[02-05]--
          +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
          |                               |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
          |                               \-10.0-[09]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
          |                                            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller

3. ACS

It seemed that I had misunderstood your point? I finally found ACS
information on switches, not on GPUs.

Capabilities: [f24 v1] Access Control Services
	ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
	ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08  1:44 ` Bob Chen
@ 2017-08-08  8:06   ` Bob Chen
  2017-08-08 16:53   ` Alex Williamson
  1 sibling, 0 replies; 26+ messages in thread
From: Bob Chen @ 2017-08-08 8:06 UTC (permalink / raw)
To: Alex Williamson
Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Plus: 1 GB hugepages neither improved bandwidth nor latency. Results
remained the same.

2017-08-08 9:44 GMT+08:00 Bob Chen <a175818323@gmail.com>:

> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08  1:44                     ` Bob Chen
  2017-08-08  8:06                       ` Bob Chen
@ 2017-08-08 16:53                       ` Alex Williamson
  1 sibling, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-08 16:53 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 8 Aug 2017 09:44:56 +0800
Bob Chen <a175818323@gmail.com> wrote:

> 1. How to test the KVM exit rate?

You can use tracing: http://www.linux-kvm.org/page/Tracing

> 2. The switches are separate devices of PLX Technology
>
> # lspci -s 07:08.0 -nn
> 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port
> PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)
>
> # This is one of the Root Ports in the system.
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
>           +-01.0-[01]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
>           +-02.0-[02-05]--
>           +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                               |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>           |                               \-10.0-[09]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                                            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>
> 3. ACS
>
> It seemed that I had misunderstood your point? I finally found ACS
> information on switches, not on GPUs.
>
> Capabilities: [f24 v1] Access Control Services
>   ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
>   ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Yes, NVIDIA uses the same PLX PEX 8747 on the switches on the cards I
have access to.  Unfortunately the endpoints in my case do not support
ATS, so the endpoint cannot generate a pre-translated address that
would take advantage of the DT capability on the switch port if we were
to enable it.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
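Building on the tracing pointer above, a minimal host-side sketch of comparing KVM exit rates between the two VM topologies. The `perf` invocation in the comment is the standard way to sample the `kvm:kvm_exit` tracepoint; the counter values below are placeholders for the arithmetic, not measured data from this thread:

```shell
# On the host, sample host-wide KVM exits over a fixed window, once per
# VM topology, e.g.:
#   perf stat -e 'kvm:kvm_exit' -a sleep 10
# If QEMU/KVM is really out of the data path, the per-second rate should
# be similar for both topologies. Placeholder counter samples:
exits_before=120000
exits_after=1800000
interval=10
echo "kvm exits/sec: $(( (exits_after - exits_before) / interval ))"
```

A large rate difference between the direct-attach and emulated-switch configurations would point at exits (and thus QEMU) in the data path; similar rates would support the link-speed theory instead.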
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-07 15:52 ` Alex Williamson 2017-08-08 1:44 ` Bob Chen @ 2017-08-08 20:07 ` Michael S. Tsirkin 2017-08-22 7:04 ` Bob Chen 1 sibling, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2017-08-08 20:07 UTC (permalink / raw) To: Alex Williamson; +Cc: Bob Chen, Marcel Apfelbaum, 陈博, qemu-devel On Mon, Aug 07, 2017 at 09:52:24AM -0600, Alex Williamson wrote: > I wonder if it has something to do > with the link speed/width advertised on the switch port. I don't think > the endpoint can actually downshift the physical link, so lspci on the > host should probably still show the full bandwidth capability, but > maybe the driver is somehow doing rate limiting. PCIe gets a little > more complicated as we go to newer versions, so it's not quite as > simple as exposing a different bit configuration to advertise 8GT/s, > x16. Last I tried to do link matching it was deemed too complicated > for something I couldn't prove at the time had measurable value. This > might be a good way to prove that value if it makes a difference here. > I can't think why else you'd see such a performance difference, but > testing to see if the KVM exit rate is significantly different could > still be an interesting verification. It might be easiest to just dust off that patch and see whether it helps. -- MST ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-08 20:07                       ` Michael S. Tsirkin
@ 2017-08-22  7:04                         ` Bob Chen
  2017-08-22 16:56                           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-22  7:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2804 bytes --]

Hi,

I got a spec from Nvidia which illustrates how to enable GPU p2p in
virtualization environment. (See attached)

The key is to append the legacy pci capabilities list when setting up the
hypervisor, with a Nvidia customized capability config.

I added some hack in hw/vfio/pci.c and managed to implement that.

Then I found the GPU was able to recognize its peer, and the latency has
dropped. ✅

However the bandwidth didn't improve, but decreased instead. ❌

Any suggestions?

# p2pBandwidthLatencyTest in VM
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla M60, pciBusID: 0, pciDeviceID: 15, pciDomainID:0
Device: 1, Tesla M60, pciBusID: 0, pciDeviceID: 16, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 114.04   5.33
     1   5.42 113.91
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 113.93   4.13
     1   4.13 119.65
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 120.50   5.55
     1   5.55 134.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 135.45   5.03   # Even worse, used to be 10
     1   5.02 135.30
P2P=Disabled Latency Matrix (us)
   D\D     0      1
     0   5.74  15.61
     1  16.05   5.75
P2P=Enabled Latency Matrix (us)
   D\D     0      1
     0   5.47   8.23   # Improved, used to be 18
     1   8.06   5.46

2017-08-09 4:07 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:

> On Mon, Aug 07, 2017 at 09:52:24AM -0600, Alex Williamson wrote:
> > I wonder if it has something to do
> > with the link speed/width advertised on the switch port.  I don't think
> > the endpoint can actually downshift the physical link, so lspci on the
> > host should probably still show the full bandwidth capability, but
> > maybe the driver is somehow doing rate limiting.  PCIe gets a little
> > more complicated as we go to newer versions, so it's not quite as
> > simple as exposing a different bit configuration to advertise 8GT/s,
> > x16.  Last I tried to do link matching it was deemed too complicated
> > for something I couldn't prove at the time had measurable value.  This
> > might be a good way to prove that value if it makes a difference here.
> > I can't think why else you'd see such a performance difference, but
> > testing to see if the KVM exit rate is significantly different could
> > still be an interesting verification.
>
> It might be easiest to just dust off that patch and see whether it
> helps.
>
> --
> MST

[-- Attachment #2: NVIDIAGPUDirectwithPCIPass-ThroughVirtualization.pdf --]
[-- Type: application/pdf, Size: 349330 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 7:04 ` Bob Chen @ 2017-08-22 16:56 ` Alex Williamson 2017-08-22 18:06 ` Michael S. Tsirkin 0 siblings, 1 reply; 26+ messages in thread From: Alex Williamson @ 2017-08-22 16:56 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Tue, 22 Aug 2017 15:04:55 +0800 Bob Chen <a175818323@gmail.com> wrote: > Hi, > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > virtualization environment. (See attached) Neat, looks like we should implement a new QEMU vfio-pci option, something like nvidia-gpudirect-p2p-id=. I don't think I'd want to code the policy of where to enable it into QEMU or the kernel, so we'd push it up to management layers or users to decide. > The key is to append the legacy pci capabilities list when setting up the > hypervisor, with a Nvidia customized capability config. > > I added some hack in hw/vfio/pci.c and managed to implement that. > > Then I found the GPU was able to recognize its peer, and the latency has > dropped. ✅ > > However the bandwidth didn't improve, but decreased instead. ❌ > > Any suggestions? What's the VM topology? I've found that in a Q35 configuration with GPUs downstream of an emulated root port, the NVIDIA driver in the guest will downshift the physical link rate to 2.5GT/s and never increase it back to 8GT/s. I believe this is because the virtual downstream port only advertises Gen1 link speeds. If the GPUs are on the root complex (ie. pcie.0) the physical link will run at 2.5GT/s when the GPU is idle and upshift to 8GT/s under load. This also happens if the GPU is exposed in a conventional PCI topology to the VM. Another interesting data point is that an older Kepler GRID card does not have this issue, dynamically shifting the link speed under load regardless of the VM PCI/e topology, while a new M60 using the same driver experiences this problem. 
I've filed a bug with NVIDIA as this seems to be a regression, but it
appears (untested) that the hypervisor should take the approach of
exposing full, up-to-date PCIe link capabilities and report a link
status matching the downstream devices.

I'd suggest during your testing, watch lspci info for the GPU from the
host, noting the behavior of LnkSta (Link Status) to check if the
device gets stuck at 2.5GT/s in your VM configuration, and adjust the
topology until it works, likely placing the GPUs on pcie.0 for a Q35
based machine.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
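A hypothetical sketch of the placement suggested above: a Q35 machine with both GPUs assigned straight onto pcie.0 rather than behind emulated root ports. The host BDFs, guest slot addresses, and memory size are illustrative only, not taken from this thread:

```shell
qemu-system-x86_64 -machine q35,accel=kvm -m 64G -cpu host \
    -device vfio-pci,host=09:00.0,bus=pcie.0,addr=0x10 \
    -device vfio-pci,host=0a:00.0,bus=pcie.0,addr=0x11
```

With this layout there is no virtual downstream port advertising Gen1-only speeds between the guest driver and the device, which is the configuration reported here to let the physical link shift between 2.5GT/s and 8GT/s normally.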
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 16:56 ` Alex Williamson @ 2017-08-22 18:06 ` Michael S. Tsirkin 2017-08-29 10:41 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Michael S. Tsirkin @ 2017-08-22 18:06 UTC (permalink / raw) To: Alex Williamson; +Cc: Bob Chen, Marcel Apfelbaum, 陈博, qemu-devel On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote: > On Tue, 22 Aug 2017 15:04:55 +0800 > Bob Chen <a175818323@gmail.com> wrote: > > > Hi, > > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > > virtualization environment. (See attached) > > Neat, looks like we should implement a new QEMU vfio-pci option, > something like nvidia-gpudirect-p2p-id=. I don't think I'd want to > code the policy of where to enable it into QEMU or the kernel, so we'd > push it up to management layers or users to decide. > > > The key is to append the legacy pci capabilities list when setting up the > > hypervisor, with a Nvidia customized capability config. > > > > I added some hack in hw/vfio/pci.c and managed to implement that. > > > > Then I found the GPU was able to recognize its peer, and the latency has > > dropped. ✅ > > > > However the bandwidth didn't improve, but decreased instead. ❌ > > > > Any suggestions? > > What's the VM topology? I've found that in a Q35 configuration with > GPUs downstream of an emulated root port, the NVIDIA driver in the > guest will downshift the physical link rate to 2.5GT/s and never > increase it back to 8GT/s. I believe this is because the virtual > downstream port only advertises Gen1 link speeds. Fixing that would be nice, and it's great that you now actually have a reproducer that can be used to test it properly. Exposing higher link speeds is a bit of work since there are now all kind of corner cases to cover as guests may play with link speeds and we must pretend we change it accordingly. 
An especially interesting question is what to do with the assigned device when guest tries to play with port link speed. It's kind of similar to AER in that respect. I guess we can just ignore it for starters. > If the GPUs are on > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s > when the GPU is idle and upshift to 8GT/s under load. This also > happens if the GPU is exposed in a conventional PCI topology to the > VM. Another interesting data point is that an older Kepler GRID card > does not have this issue, dynamically shifting the link speed under > load regardless of the VM PCI/e topology, while a new M60 using the > same driver experiences this problem. I've filed a bug with NVIDIA as > this seems to be a regression, but it appears (untested) that the > hypervisor should take the approach of exposing full, up-to-date PCIe > link capabilities and report a link status matching the downstream > devices. > I'd suggest during your testing, watch lspci info for the GPU from the > host, noting the behavior of LnkSta (Link Status) to check if the > devices gets stuck at 2.5GT/s in your VM configuration and adjust the > topology until it works, likely placing the GPUs on pcie.0 for a Q35 > based machine. Thanks, > > Alex ^ permalink raw reply [flat|nested] 26+ messages in thread
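To make "exposing a different bit configuration" concrete, here is a small sketch of the two PCIe Link Capabilities fields involved: per the PCIe spec, bits 3:0 carry the Max Link Speed code (1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s) and bits 9:4 the Maximum Link Width. This is illustrative only, not QEMU code:

```python
# Encode/decode the Max Link Speed and Maximum Link Width fields of the
# PCIe Link Capabilities register (lower 10 bits only).
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}

def lnkcap(max_speed_code, max_width):
    return (max_width << 4) | max_speed_code

def decode(reg):
    return SPEEDS[reg & 0xF], (reg >> 4) & 0x3F

gen1_x1 = lnkcap(1, 1)     # what a Gen1-only virtual port advertises
gen3_x16 = lnkcap(3, 16)   # what the physical GPU link supports
print(hex(gen1_x1), decode(gen1_x1))    # 0x11 ('2.5GT/s', 1)
print(hex(gen3_x16), decode(gen3_x16))  # 0x103 ('8GT/s', 16)
```

An emulated downstream port advertising only speed code 1 here is consistent with the Gen1-only behavior Alex describes; the corner cases MST mentions come from the guest writing the matching Link Control 2 target-speed field and expecting Link Status to follow.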
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-22 18:06 ` Michael S. Tsirkin @ 2017-08-29 10:41 ` Bob Chen 2017-08-29 14:13 ` Alex Williamson 0 siblings, 1 reply; 26+ messages in thread From: Bob Chen @ 2017-08-29 10:41 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Alex Williamson, Marcel Apfelbaum, 陈博, qemu-devel The topology is already having all GPUs directly attached to root bus 0. In this situation you can't see the LnkSta attribute in any capabilities. The other way of using emulated switch would somehow show this attribute, at 8 GT/s, although the real bandwidth is low as usual. 2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>: > On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote: > > On Tue, 22 Aug 2017 15:04:55 +0800 > > Bob Chen <a175818323@gmail.com> wrote: > > > > > Hi, > > > > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in > > > virtualization environment. (See attached) > > > > Neat, looks like we should implement a new QEMU vfio-pci option, > > something like nvidia-gpudirect-p2p-id=. I don't think I'd want to > > code the policy of where to enable it into QEMU or the kernel, so we'd > > push it up to management layers or users to decide. > > > > > The key is to append the legacy pci capabilities list when setting up > the > > > hypervisor, with a Nvidia customized capability config. > > > > > > I added some hack in hw/vfio/pci.c and managed to implement that. > > > > > > Then I found the GPU was able to recognize its peer, and the latency > has > > > dropped. ✅ > > > > > > However the bandwidth didn't improve, but decreased instead. ❌ > > > > > > Any suggestions? > > > > What's the VM topology? I've found that in a Q35 configuration with > > GPUs downstream of an emulated root port, the NVIDIA driver in the > > guest will downshift the physical link rate to 2.5GT/s and never > > increase it back to 8GT/s. 
I believe this is because the virtual > > downstream port only advertises Gen1 link speeds. > > > Fixing that would be nice, and it's great that you now actually have a > reproducer that can be used to test it properly. > > Exposing higher link speeds is a bit of work since there are now all > kind of corner cases to cover as guests may play with link speeds and we > must pretend we change it accordingly. An especially interesting > question is what to do with the assigned device when guest tries to play > with port link speed. It's kind of similar to AER in that respect. > > I guess we can just ignore it for starters. > > > If the GPUs are on > > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s > > when the GPU is idle and upshift to 8GT/s under load. This also > > happens if the GPU is exposed in a conventional PCI topology to the > > VM. Another interesting data point is that an older Kepler GRID card > > does not have this issue, dynamically shifting the link speed under > > load regardless of the VM PCI/e topology, while a new M60 using the > > same driver experiences this problem. I've filed a bug with NVIDIA as > > this seems to be a regression, but it appears (untested) that the > > hypervisor should take the approach of exposing full, up-to-date PCIe > > link capabilities and report a link status matching the downstream > > devices. > > > > I'd suggest during your testing, watch lspci info for the GPU from the > > host, noting the behavior of LnkSta (Link Status) to check if the > > devices gets stuck at 2.5GT/s in your VM configuration and adjust the > > topology until it works, likely placing the GPUs on pcie.0 for a Q35 > > based machine. Thanks, > > > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-29 10:41                             ` Bob Chen
@ 2017-08-29 14:13                               ` Alex Williamson
  2017-08-30  9:41                                 ` Bob Chen
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2017-08-29 14:13 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Tue, 29 Aug 2017 18:41:44 +0800
Bob Chen <a175818323@gmail.com> wrote:

> The topology is already having all GPUs directly attached to root bus 0. In
> this situation you can't see the LnkSta attribute in any capabilities.

Right, this is why I suggested viewing the physical device lspci info
from the host.  I haven't seen the stuck link issue with devices on the
root bus, but it may be worth double checking.  Thanks,

Alex

> The other way of using emulated switch would somehow show this attribute,
> at 8 GT/s, although the real bandwidth is low as usual.
>
> 2017-08-23 2:06 GMT+08:00 Michael S. Tsirkin <mst@redhat.com>:
>
> > On Tue, Aug 22, 2017 at 10:56:59AM -0600, Alex Williamson wrote:
> > > On Tue, 22 Aug 2017 15:04:55 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I got a spec from Nvidia which illustrates how to enable GPU p2p in
> > > > virtualization environment. (See attached)
> > >
> > > Neat, looks like we should implement a new QEMU vfio-pci option,
> > > something like nvidia-gpudirect-p2p-id=.  I don't think I'd want to
> > > code the policy of where to enable it into QEMU or the kernel, so we'd
> > > push it up to management layers or users to decide.
> > >
> > > > The key is to append the legacy pci capabilities list when setting up
> > > > the hypervisor, with a Nvidia customized capability config.
> > > >
> > > > I added some hack in hw/vfio/pci.c and managed to implement that.
> > > >
> > > > Then I found the GPU was able to recognize its peer, and the latency
> > > > has dropped. ✅
> > > >
> > > > However the bandwidth didn't improve, but decreased instead. ❌
> > > >
> > > > Any suggestions?
> > >
> > > What's the VM topology?  I've found that in a Q35 configuration with
> > > GPUs downstream of an emulated root port, the NVIDIA driver in the
> > > guest will downshift the physical link rate to 2.5GT/s and never
> > > increase it back to 8GT/s.  I believe this is because the virtual
> > > downstream port only advertises Gen1 link speeds.
> >
> > Fixing that would be nice, and it's great that you now actually have a
> > reproducer that can be used to test it properly.
> >
> > Exposing higher link speeds is a bit of work since there are now all
> > kind of corner cases to cover as guests may play with link speeds and we
> > must pretend we change it accordingly.  An especially interesting
> > question is what to do with the assigned device when guest tries to play
> > with port link speed.  It's kind of similar to AER in that respect.
> >
> > I guess we can just ignore it for starters.
> >
> > > If the GPUs are on
> > > the root complex (ie. pcie.0) the physical link will run at 2.5GT/s
> > > when the GPU is idle and upshift to 8GT/s under load.  This also
> > > happens if the GPU is exposed in a conventional PCI topology to the
> > > VM.  Another interesting data point is that an older Kepler GRID card
> > > does not have this issue, dynamically shifting the link speed under
> > > load regardless of the VM PCI/e topology, while a new M60 using the
> > > same driver experiences this problem.  I've filed a bug with NVIDIA as
> > > this seems to be a regression, but it appears (untested) that the
> > > hypervisor should take the approach of exposing full, up-to-date PCIe
> > > link capabilities and report a link status matching the downstream
> > > devices.
> > >
> > > I'd suggest during your testing, watch lspci info for the GPU from the
> > > host, noting the behavior of LnkSta (Link Status) to check if the
> > > device gets stuck at 2.5GT/s in your VM configuration and adjust the
> > > topology until it works, likely placing the GPUs on pcie.0 for a Q35
> > > based machine.  Thanks,
> > >
> > > Alex
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-29 14:13                               ` Alex Williamson
@ 2017-08-30  9:41                                 ` Bob Chen
  2017-08-30 16:43                                   ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-30  9:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

I think I have observed what you said...

The link speed on host remained 8GT/s until I finished running
p2pBandwidthLatencyTest for the first time. Then it became 2.5GT/s...

# lspci -s 09:00.0 -vvv
09:00.0 3D controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
	Subsystem: NVIDIA Corporation Device 115e
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 7
	NUMA node: 0
	Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 3b800000000 (64-bit, prefetchable) [size=8G]
	Region 3: Memory at 3ba00000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] #19
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-30 9:41 ` Bob Chen @ 2017-08-30 16:43 ` Alex Williamson 2017-09-01 9:58 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Alex Williamson @ 2017-08-30 16:43 UTC (permalink / raw) To: Bob Chen Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel On Wed, 30 Aug 2017 17:41:20 +0800 Bob Chen <a175818323@gmail.com> wrote: > I think I have observed what you said... > > The link speed on host remained 8GT/s until I finished running > p2pBandwidthLatencyTest > for the first time. Then it became 2.5GT/s... > > > # lspci -s 09:00.0 -vvv ... > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- So long as the device renegotiates to 8GT/s under load rather than getting stuck at 2.5GT/s, I think this is the expected behavior. This is a power saving measure by the driver. Thanks, Alex ^ permalink raw reply [flat|nested] 26+ messages in thread
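To automate the LnkSta watching described above, here is a small parser sketch. The sample text is trimmed from the Tesla M60 lspci output quoted earlier in the thread, with the 2.5GT/s status line representing a hypothetical downshifted reading; on a real host, feed it the output of `lspci -s <bdf> -vvv` instead:

```python
import re

# Trimmed sample of `lspci -vvv` output; the LnkSta speed here models a
# downshifted link, not a value measured in this thread.
SAMPLE = """\
LnkCap: Port #16, Speed 8GT/s, Width x16, ASPM not supported
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+
"""

def link_speeds(lspci_text):
    """Return (capable_gts, current_gts) parsed from lspci -vvv output."""
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+)GT/s", lspci_text)
    sta = re.search(r"LnkSta:.*?Speed ([\d.]+)GT/s", lspci_text)
    return float(cap.group(1)), float(sta.group(1))

cap, sta = link_speeds(SAMPLE)
print(f"capable {cap} GT/s, running {sta} GT/s, downshifted: {sta < cap}")
```

Run in a loop (or under `watch -n1`) while the guest executes the bandwidth test: a link that never returns to 8GT/s under load reproduces the stuck-link regression, while one that sits at 2.5GT/s idle and upshifts to 8GT/s under load is just the driver's power saving, as noted above.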
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 2017-08-30 16:43 ` Alex Williamson @ 2017-09-01 9:58 ` Bob Chen 2017-11-30 8:06 ` Bob Chen 0 siblings, 1 reply; 26+ messages in thread From: Bob Chen @ 2017-09-01 9:58 UTC (permalink / raw) To: Alex Williamson Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel More updates: 1. This behavior was found not only on M60, but also on TITAN 1080Ti or Xp. 2. When not setting up the p2p compatibility, i.e. run the original qemu with GPUs attached to the root pcie bus, the LnkSta on host always remains at 8 GT/s. Don't know why the new p2p change would cause the GPU driver in guest to re-negotiate its speed. I think it has gone beyond the community's responsibility to debug this tricky issue. So I have contacted nvidia for technical support, and they are expected to send me a reply in next few weeks. Will keep you guys updated. Bob 2017-08-31 0:43 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>: > On Wed, 30 Aug 2017 17:41:20 +0800 > Bob Chen <a175818323@gmail.com> wrote: > > > I think I have observed what you said... > > > > The link speed on host remained 8GT/s until I finished running > > p2pBandwidthLatencyTest > > for the first time. Then it became 2.5GT/s... > > > > > > # lspci -s 09:00.0 -vvv > ... > > LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- > > So long as the device renegotiates to 8GT/s under load rather than > getting stuck at 2.5GT/s, I think this is the expected behavior. This > is a power saving measure by the driver. Thanks, > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-09-01  9:58                                     ` Bob Chen
@ 2017-11-30  8:06                                       ` Bob Chen
  0 siblings, 0 replies; 26+ messages in thread
From: Bob Chen @ 2017-11-30  8:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Hi,

After 3 months of work and investigation, and tedious mail discussions
with Nvidia, I think some progress has been made on GPUDirect (p2p) in a
virtual environment.

The only remaining issue is the low bidirectional bandwidth between two
sibling GPUs under the same PCIe switch. We expanded the tests to run on
even more GPU cards, so the results now look conclusive.

P40 is OK, and its hardware topology on the host is:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[03]----00.0  LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
            +-02.0-[04]----00.0  NVIDIA Corporation GP102GL [Tesla P40]
            +-03.0-[02]----00.0  NVIDIA Corporation GP102GL [Tesla P40]

M60, not OK, low bandwidth:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[06]----00.0  LSI Logic / Symbios Logic MegaRAID SAS-3 3008 [Fury]
            +-02.0-[07-0a]----00.0-[08-0a]--+-08.0-[09]----00.0  NVIDIA Corporation GM204GL [Tesla M60]
            |                               \-10.0-[0a]----00.0  NVIDIA Corporation GM204GL [Tesla M60]

V100, not OK, low bandwidth:

\-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
            +-01.0-[01]--+-00.0  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
            |            \-00.1  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
            +-02.0-[02-05]----00.0-[03-05]--+-08.0-[04]----00.0  NVIDIA Corporation GV100 [Tesla V100 PCIe]
            |                               \-10.0-[05]----00.0  NVIDIA Corporation GV100 [Tesla V100 PCIe]

So what might be the actual effect of the PLX switch hardware on GPU
data flow, even though the switch is not visible in the guest OS?

Nvidia's tech-support engineers are not familiar with virtualization, and
asked us to consult the community first.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-01 15:01       ` Alex Williamson
  2017-08-07 13:00         ` Bob Chen
@ 2017-08-07 13:04         ` Bob Chen
  2017-08-07 16:00           ` Alex Williamson
  1 sibling, 1 reply; 26+ messages in thread
From: Bob Chen @ 2017-08-07 13:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

Besides, I checked the lspci -vvv output, no capabilities of Access
Control are seen.

2017-08-01 23:01 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:

> On Tue, 1 Aug 2017 17:35:40 +0800
> Bob Chen <a175818323@gmail.com> wrote:
>
> > 2017-08-01 13:46 GMT+08:00 Alex Williamson <alex.williamson@redhat.com>:
> >
> > > On Tue, 1 Aug 2017 13:04:46 +0800
> > > Bob Chen <a175818323@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a sketch of my hardware topology.
> > > >
> > > >          CPU0  <- QPI ->  CPU1
> > > >           |                |
> > > >   Root Port(at PCIe.0)  Root Port(at PCIe.1)
> > > >       /      \              /      \
> > >
> > > Are each of these lines above separate root ports?  ie. each root
> > > complex hosts two root ports, each with a two-port switch downstream of
> > > it?
> >
> > Not quite sure if root complex is a concept or a real physical device ...
> >
> > But according to my observation by `lspci -vt`, there are indeed 4 Root
> > Ports in the system. So the sketch might need a tiny update.
> >
> >            CPU0    <- QPI ->    CPU1
> >             |                    |
> >    Root Complex(device?)  Root Complex(device?)
> >        /        \            /        \
> >   Root Port  Root Port  Root Port  Root Port
> >       /  \                  /  \
> >    Switch     Switch     Switch     Switch
> >     /  \      /  \        /  \      /  \
> >   GPU  GPU  GPU  GPU    GPU  GPU  GPU  GPU
>
> Yes, that's what I expected.  So the numbers make sense, the immediate
> sibling GPU would share bandwidth between the root port and upstream
> switch port, any other GPU should not double-up on any single link.
> > > > > Switch Switch Switch Switch > > > > / \ / \ / \ / \ > > > > GPU GPU GPU GPU GPU GPU GPU GPU > > > > > > > > > > > > And below are the p2p bandwidth test results. > > > > > > > > Host: > > > > D\D 0 1 2 3 4 5 6 7 > > > > 0 426.91 25.32 19.72 19.72 19.69 19.68 19.75 19.66 > > > > 1 25.31 427.61 19.74 19.72 19.66 19.68 19.74 19.73 > > > > 2 19.73 19.73 429.49 25.33 19.66 19.74 19.73 19.74 > > > > 3 19.72 19.71 25.36 426.68 19.70 19.71 19.77 19.74 > > > > 4 19.72 19.72 19.73 19.75 425.75 25.33 19.72 19.71 > > > > 5 19.71 19.75 19.76 19.75 25.35 428.11 19.69 19.70 > > > > 6 19.76 19.72 19.79 19.78 19.73 19.74 425.75 25.35 > > > > 7 19.69 19.75 19.79 19.75 19.72 19.72 25.39 427.15 > > > > > > > > VM: > > > > D\D 0 1 2 3 4 5 6 7 > > > > 0 427.38 10.52 18.99 19.11 19.75 19.62 19.75 19.71 > > > > 1 10.53 426.68 19.28 19.19 19.73 19.71 19.72 19.73 > > > > 2 18.88 19.30 426.92 10.48 19.66 19.71 19.67 19.68 > > > > 3 18.93 19.18 10.45 426.94 19.69 19.72 19.67 19.72 > > > > 4 19.60 19.66 19.69 19.70 428.13 10.49 19.40 19.57 > > > > 5 19.52 19.74 19.72 19.69 10.44 426.45 19.68 19.61 > > > > 6 19.63 19.50 19.72 19.64 19.59 19.66 426.91 10.47 > > > > 7 19.69 19.75 19.70 19.69 19.66 19.74 10.45 426.23 > > > > > > Interesting test, how do you get these numbers? What are the units, > > > GB/s? > > > > > > > > > > > A p2pBandwidthLatencyTest from Nvidia CUDA sample code. Units are > > GB/s. Asynchronous read and write. Bidirectional. > > > > However, the Unidirectional test had shown a different result. Didn't > fall > > down to a half. 
> > VM:
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >    D\D     0      1      2      3      4      5      6      7
> >      0 424.07  10.02  11.33  11.30  11.09  11.05  11.06  11.10
> >      1  10.05 425.98  11.40  11.33  11.08  11.10  11.13  11.09
> >      2  11.31  11.28 423.67  10.10  11.14  11.13  11.13  11.11
> >      3  11.30  11.31  10.08 425.05  11.10  11.07  11.09  11.06
> >      4  11.16  11.17  11.21  11.17 423.67  10.08  11.25  11.28
> >      5  10.97  11.01  11.07  11.02  10.09 425.52  11.23  11.27
> >      6  11.09  11.13  11.16  11.10  11.28  11.33 422.71  10.10
> >      7  11.13  11.09  11.15  11.11  11.36  11.33  10.02 422.75
> >
> > Host:
> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
> >    D\D     0      1      2      3      4      5      6      7
> >      0 424.13  13.38  10.17  10.17  11.23  11.21  10.94  11.22
> >      1  13.38 424.06  10.18  10.19  11.20  11.19  11.19  11.14
> >      2  10.18  10.18 422.75  13.38  11.19  11.19  11.17  11.17
> >      3  10.18  10.18  13.38 425.05  11.05  11.08  11.08  11.06
> >      4  11.01  11.06  11.06  11.03 423.21  13.38  10.17  10.17
> >      5  10.91  10.91  10.89  10.92  13.38 425.52  10.18  10.18
> >      6  11.28  11.30  11.32  11.31  10.19  10.18 424.59  13.37
> >      7  11.18  11.20  11.16  11.21  10.17  10.19  13.38 424.13
>
> Looks right, a unidirectional test would create bidirectional data
> flows on the root port to upstream switch link and should be able to
> saturate that link.  With the bidirectional test, that link becomes a
> bottleneck.
>
> > > > In the VM, the bandwidth between two GPUs under the same physical
> > > > switch is obviously lower, as per the reasons you said in former
> > > > threads.
> > >
> > > Hmm, I'm not sure I can explain why the number is lower than to more
> > > remote GPUs though.  Is the test simultaneously reading and writing and
> > > therefore we overload the link to the upstream switch port?  Otherwise
> > > I'd expect the bidirectional support in PCIe to be able to handle the
> > > bandwidth.  Does the test have a read-only or write-only mode?
> >
> > > > But what confused me most is that GPUs under different switches could
> > > > achieve the same speed, as well as in the Host.
> > > > Does that mean after IOMMU address translation, data traversing has
> > > > utilized QPI bus by default? Even these two devices do not belong to
> > > > the same PCIe bus?
> > >
> > > Yes, of course.  Once the transaction is translated by the IOMMU it's
> > > just a matter of routing the resulting address, whether that's back
> > > down the I/O hierarchy under the same root complex or across the QPI
> > > link to the other root complex.  The translated address could just as
> > > easily be to RAM that lives on the other side of the QPI link.  Also, it
> > > seems like the IOMMU overhead is perhaps negligible here, unless the
> > > IOMMU is actually being used in both cases.
> >
> > Yes, the overhead of bandwidth is negligible, but the latency is not as
> > good as we expected. I assume it is IOMMU address translation to blame.
> >
> > I ran this twice with IOMMU on/off on Host, the results were the same.
> >
> > VM:
> > P2P=Enabled Latency Matrix (us)
> >    D\D     0      1      2      3      4      5      6      7
> >      0   4.53  13.44  13.60  13.60  14.37  14.51  14.55  14.49
> >      1  13.47   4.41  13.37  13.37  14.49  14.51  14.56  14.52
> >      2  13.38  13.61   4.32  13.47  14.45  14.43  14.53  14.33
> >      3  13.55  13.60  13.38   4.45  14.50  14.48  14.54  14.51
> >      4  13.85  13.72  13.71  13.81   4.47  14.61  14.58  14.47
> >      5  13.75  13.77  13.75  13.77  14.46   4.46  14.52  14.45
> >      6  13.76  13.78  13.73  13.84  14.50  14.55   4.45  14.53
> >      7  13.73  13.78  13.76  13.80  14.53  14.63  14.56   4.46
> >
> > Host:
> > P2P=Enabled Latency Matrix (us)
> >    D\D     0      1      2      3      4      5      6      7
> >      0   3.66   5.88   6.59   6.58  15.26  15.15  15.03  15.14
> >      1   5.80   3.66   6.50   6.50  15.15  15.04  15.06  15.00
> >      2   6.58   6.52   4.12   5.85  15.16  15.06  15.00  15.04
> >      3   6.80   6.81   6.71   4.12  15.12  13.08  13.75  13.31
> >      4  14.91  14.18  14.34  12.93   4.13   6.45   6.56   6.63
> >      5  15.17  14.99  15.03  14.57   5.61   3.49   6.19   6.29
> >      6  15.12  14.78  14.60  13.47   6.16   6.15   3.53   5.68
> >      7  15.00  14.65  14.82  14.28   6.16   6.15   5.44   3.56
>
> Yes, the IOMMU is not free, page table walks are occurring here.
Are > you using 1G pages for the VM? 2G? Does this platform support 1G > super pages on the IOMMU? (cat /sys/class/iommu/*/intel-iommu/cap, bit > 34 is 2MB page support, bit 35 is 1G). All modern Xeons should support > 1G so you'll want to use 1G hugepages in the VM to take advantage of > that. > > > > In the host test, is the IOMMU still enabled? The routing of PCIe > > > transactions is going to be governed by ACS, which Linux enables > > > whenever the IOMMU is enabled, not just when a device is assigned to a > > > VM. It would be interesting to see if another performance tier is > > > exposed if the IOMMU is entirely disabled, or perhaps it might better > > > expose the overhead of the IOMMU translation. It would also be > > > interesting to see the ACS settings in lspci for each downstream port > > > for each test. Thanks, > > > > > > Alex > > > > > > > > > How to display GPU's ACS settings? Like this? > > > > [420 v2] Advanced Error Reporting > > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- > > ECRC- UnsupReq- ACSViol- > > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- > > ECRC- UnsupReq- ACSViol- > > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ > > ECRC- UnsupReq- ACSViol- > > As Michael notes, this is AER, ACS is Access Control Services. It > should be another capability in lspci. Thanks, > > Alex > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】
  2017-08-07 13:04               ` Bob Chen
@ 2017-08-07 16:00                 ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2017-08-07 16:00 UTC (permalink / raw)
  To: Bob Chen
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, 陈博, qemu-devel

On Mon, 7 Aug 2017 21:04:16 +0800
Bob Chen <a175818323@gmail.com> wrote:

> Besides, I checked the lspci -vvv output, no capabilities of Access
> Control are seen.

Are these switches onboard an NVIDIA card or are they separate
components?  The examples I have on NVIDIA cards do include ACS:

 +-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               +-09.0-[45]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               +-10.0-[46]----00.0  NVIDIA Corporation GK107GL [GRID K1]
 |                               \-11.0-[47]----00.0  NVIDIA Corporation GK107GL [GRID K1]

# lspci -vvvs 43: | grep -A 2 "Access Control Services"
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

 +-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0  NVIDIA Corporation GM204GL [Tesla M60]
 |                               \-10.0-[07]----00.0  NVIDIA Corporation GM204GL [Tesla M60]

# lspci -vvvs 5: | grep -A 2 "Access Control Services"
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
--
        Capabilities: [f24 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Without ACS on the downstream switch ports, the GPUs sharing the switch
will be in the same IOMMU group and we have no ability to control
anything about the routing between downstream ports.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2017-11-30  8:06 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4E0AFA5F-44D6-4624-A99F-68A7FE52F397@meituan.com>
     [not found] ` <4b31a711-a52e-25d3-4a7c-1be8521097d9@redhat.com>
     [not found]   ` <F99BFE80-FC15-40A0-BB3E-1B53B6CF9B05@meituan.com>
2017-07-26  6:21     ` [Qemu-devel] About virtio device hotplug in Q35! 【外域邮件.谨慎查阅】 Marcel Apfelbaum
2017-07-26 15:29       ` Alex Williamson
2017-07-26 16:06         ` Michael S. Tsirkin
2017-07-26 17:32           ` Alex Williamson
2017-08-01  5:04             ` Bob Chen
2017-08-01  5:46               ` Alex Williamson
2017-08-01  9:35                 ` Bob Chen
2017-08-01 14:39                   ` Michael S. Tsirkin
2017-08-01 15:01                   ` Alex Williamson
2017-08-07 13:00                     ` Bob Chen
2017-08-07 15:52                       ` Alex Williamson
2017-08-08  1:44                         ` Bob Chen
2017-08-08  8:06                           ` Bob Chen
2017-08-08 16:53                             ` Alex Williamson
2017-08-08 20:07                               ` Michael S. Tsirkin
2017-08-22  7:04                                 ` Bob Chen
2017-08-22 16:56                                   ` Alex Williamson
2017-08-22 18:06                                     ` Michael S. Tsirkin
2017-08-29 10:41                                       ` Bob Chen
2017-08-29 14:13                                         ` Alex Williamson
2017-08-30  9:41                                           ` Bob Chen
2017-08-30 16:43                                             ` Alex Williamson
2017-09-01  9:58                                               ` Bob Chen
2017-11-30  8:06                                                 ` Bob Chen
2017-08-07 13:04                     ` Bob Chen
2017-08-07 16:00                       ` Alex Williamson