* Re: [Qemu-devel] iommu emulation
From: Peter Xu @ 2017-02-09  3:52 UTC
  To: Jintack Lim; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

(cc qemu-devel and Alex)

On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim <jintack@cs.columbia.edu> wrote:
> > Hi Peter,
> >
> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu <peterx@redhat.com> wrote:
> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
> >>> Hi Peter and Michael,
> >>
> >> Hi, Jintack,
> >>
> >>>
> >>> I would like to get some help to run a VM with the emulated iommu. I
> >>> have tried for a few days to make it work, but I couldn't.
> >>>
> >>> What I want to do eventually is to assign a network device to the
> >>> nested VM so that I can measure the performance of applications
> >>> running in the nested VM.
> >>
> >> Good to know that you are going to use [4] to do something useful. :-)
> >>
> >> However, could I ask why you want to measure the performance of
> >> application inside nested VM rather than host? That's something I am
> >> just curious about, considering that virtualization stack will
> >> definitely introduce overhead along the way, and I don't know whether
> >> that'll affect your measurement to the application.
> >
> > I have added nested virtualization support to KVM/ARM, which is under
> > review now. I found that application performance running inside the
> > nested VM is really bad both on ARM and x86, and I'm trying to figure
> > out what's the real overhead. I think one way to figure that out is to
> > see if the direct device assignment to L2 helps to reduce the overhead
> > or not.

I see. IIUC you are trying to use an assigned device to replace your
old emulated device in the L2 guest, to see whether performance drops
there as well, right? Then at least I know that you won't need nested
VT-d here (so we should not need a vIOMMU in the L2 guest).

In that case, I think we can give it a shot. The L1 guest will use
vfio-pci for that assigned device as well, and when the L2 guest's
QEMU uses the assigned device it only sets up a static mapping (it
simply maps the whole GPA space of the L2 guest). So even if you are
using a kernel driver in the L2 guest with your to-be-tested
application, we should still have a static mapping in the L1 guest's
vIOMMU, which is IMHO fine from a performance POV.

I cced Alex in case I missed anything here.

> >
> >>
> >> Another thing to mention is that (in case you don't know that), device
> >> assignment with VT-d protection would be even slower than generic VMs
> >> (without Intel IOMMU protection) if you are using generic kernel
> >> drivers in the guest, since we may need real-time DMA translation on
> >> data path.
> >>
> >
> > So, this is the comparison between using virtio and using the device
> > assignment for L1? I have tested application performance running
> > inside L1 with and without iommu, and I found that the performance is
> > better with iommu. I thought whether the device is assigned to L1 or
> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
> > I misunderstood what you said?

I failed to understand why a vIOMMU could help boost performance. :(
Could you provide your command line here so that I can try to
reproduce?

Besides, what I mentioned above was just in case you didn't know that
a vIOMMU will drag down performance in most cases.

To be more explicit, the overhead of a vIOMMU is different for
assigned devices and emulated ones.

  (1) For emulated devices, the overhead is on the translation, i.e.
      on every DMA operation: we need a real-time translation, which
      drags down performance.

  (2) For assigned devices (our case), the overhead is when we set up
      the page mappings (since we trap the setup procedure via the CM
      bit). However, once the mappings are set up, there should not be
      much performance drag on the actual data transfer (the DMA
      itself), since that is all done in the hardware IOMMU (no matter
      whether the device is assigned to the L1 or L2 guest).

Now that I know your use case (vIOMMU in the L1 guest, no vIOMMU in
the L2 guest, only assigned devices), I suspect we won't have a big
problem, per (2).
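
To make that concrete, here is a tiny standalone sketch (purely a
conceptual model, not QEMU code) of where the trapping cost lands in
the two cases: the emulated device pays a software translation on
every DMA, while the assigned device behind a caching-mode vIOMMU only
traps on map/unmap and the DMAs themselves are translated by the
hardware IOMMU:

/* Conceptual model only -- "traps" counts VM exits into QEMU,
 * "hw_ops" counts work done purely by the physical IOMMU/device. */
#include <stdio.h>

static unsigned long traps, hw_ops;

/* Emulated device: every DMA needs a software IOVA translation. */
static void emulated_dma(int n)
{
    for (int i = 0; i < n; i++) {
        traps++;    /* software page-table walk in the vIOMMU */
        hw_ops++;   /* the actual data transfer */
    }
}

/* Assigned device, caching-mode vIOMMU: map/unmap is trapped once and
 * shadowed into the host IOMMU; the DMAs themselves never exit. */
static void assigned_dma(int n)
{
    traps++;        /* map: trapped, mirrored into the hardware IOMMU */
    for (int i = 0; i < n; i++) {
        hw_ops++;   /* DMA translated by hardware, no exit */
    }
    traps++;        /* unmap: trapped again */
}

int main(void)
{
    emulated_dma(1000000);
    printf("emulated: traps=%lu hw_ops=%lu\n", traps, hw_ops);
    traps = hw_ops = 0;
    assigned_dma(1000000);
    printf("assigned: traps=%lu hw_ops=%lu\n", traps, hw_ops);
    return 0;
}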

> >
> >>>
> >>> First, I am having trouble to boot a VM with the emulated iommu. I
> >>> have posted my problem to the qemu user mailing list[1],
> >>
> >> Here I would suggest that you cc qemu-devel as well next time:
> >>
> >>   qemu-devel@nongnu.org
> >>
> >> Since I guess not all people are registered to qemu-discuss, at least
> >> I am not in that loop. Imho cc qemu-devel could let the question
> >> spread to more people, and it'll get a higher chance to be answered.
> >
> > Thanks. I'll cc qemu-devel next time.
> >
> >>
> >>> but to put it
> >>> in a nutshell, I'd like to know the setting I can reuse to boot a VM
> >>> with the emulated iommu. (e.g. how to create a VM with q35 chipset
> >>> and/or libvirt xml if you use virsh).
> >>
> >> IIUC you are looking for device assignment for the nested VM case. So,
> >> firstly, you may need my tree to run this (see below). Then, maybe you
> >> can try to boot a L1 guest with assigned device (under VT-d
> >> protection), with command:
> >>
> >> $qemu -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> >>       -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> >>       -device vfio-pci,host=$HOST_PCI_ADDR \
> >>       $YOUR_IMAGE_PATH
> >>
> >
> > Thanks! I'll try this right away.
> >
> >> Here $HOST_PCI_ADDR should be something like 05:00.0, which is the
> >> host PCI address of the device to be assigned to guest.
> >>
> >> (If you go over the cover letter in [4], you'll see similar command
> >>  line there, though with some more devices assigned, and with traces)
> >>
> >> If you are playing with nested VM, you'll also need a L2 guest, which
> >> will be run inside the L1 guest. It'll require similar command line,
> >> but I would suggest you first try a L2 guest without intel-iommu
> >> device. Frankly speaking I haven't played with that yet, so just let
> >> me know if you got any problem, which is possible. :-)
> >>
> 
> I was able to boot L2 guest without assigning a network device
> successfully. (host iommu was on, L1 iommu was on, and the network
> device was assigned to L1)
> 
> Then, I unbound the network device in L1 and bound it to vfio-pci.
> When I try to run L2 with the following command, I got an assertion.
> 
> # ./qemu-system-x86_64 -M q35,accel=kvm \
> -m 8G \
> -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> -device vfio-pci,host=00:03.0,id=net0
> 
> qemu-system-x86_64: hw/pci/pcie.c:686: pcie_add_capability: Assertion
> `prev >= 0x100' failed.
> Aborted (core dumped)
> 
> Thoughts?

I don't know whether it has anything to do with how vfio-pci works;
anyway, I cced Alex and the list in case there is a quick answer.

I'll try to reproduce this nested case and update when I get anything.

Thanks!

> 
> >
> > Ok. I'll let you know!
> >
> >>>
> >>> I'm using QEMU 2.8.0, kernel 4.6.0-rc5, libvirt 3.0.0, and this is my
> >>> libvirt xml [2], which gives me DMAR error during the VM boot[3].
> >>>
> >>> I also wonder if the VM can successfully assign a device (i.e. network
> >>> device in my case) to the nested VM if I use this patch series from
> >>> you. [4]
> >>
> >> Yes, for your nested device assignment requirement you may need to use
> >> the tree posted in [4], rather than any other qemu versions. [4] is
> >> still during review (which Alex should have mentioned in the other
> >> thread), so you may need to build it on your own to get
> >> qemu-system-x86_64 binary, which is located at:
> >>
> >>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v7
> >>
> >> (this link is in [4] as well)
> >>
> >
> > Thanks a lot.
> >
> >>>
> >>> I mostly work on ARM architecture, especially nested virtualization on
> >>> ARM, and I'm trying to become accustomed to x86 environment :)
> >>
> >> Hope you'll quickly get used to it. :-)
> >>
> >> Regards,
> >>
> >> -- peterx
> >>
> 

-- peterx


* Re: [Qemu-devel] iommu emulation
From: Jintack Lim @ 2017-02-09 13:01 UTC
  To: Peter Xu; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

On Wed, Feb 8, 2017 at 10:52 PM, Peter Xu <peterx@redhat.com> wrote:
> (cc qemu-devel and Alex)
>
> On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
>> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim <jintack@cs.columbia.edu> wrote:
>> > Hi Peter,
>> >
>> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu <peterx@redhat.com> wrote:
>> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
>> >>> Hi Peter and Michael,
>> >>
>> >> Hi, Jintack,
>> >>
>> >>>
>> >>> I would like to get some help to run a VM with the emulated iommu. I
>> >>> have tried for a few days to make it work, but I couldn't.
>> >>>
>> >>> What I want to do eventually is to assign a network device to the
>> >>> nested VM so that I can measure the performance of applications
>> >>> running in the nested VM.
>> >>
>> >> Good to know that you are going to use [4] to do something useful. :-)
>> >>
>> >> However, could I ask why you want to measure the performance of
>> >> application inside nested VM rather than host? That's something I am
>> >> just curious about, considering that virtualization stack will
>> >> definitely introduce overhead along the way, and I don't know whether
>> >> that'll affect your measurement to the application.
>> >
>> > I have added nested virtualization support to KVM/ARM, which is under
>> > review now. I found that application performance running inside the
>> > nested VM is really bad both on ARM and x86, and I'm trying to figure
>> > out what's the real overhead. I think one way to figure that out is to
>> > see if the direct device assignment to L2 helps to reduce the overhead
>> > or not.
>
> I see. IIUC you are trying to use an assigned device to replace your
> old emulated device in L2 guest to see whether performance will drop
> as well, right? Then at least I can know that you won't need a nested
> VT-d here (so we should not need a vIOMMU in L2 guest).

That's right.

>
> In that case, I think we can give it a shot, considering that L1 guest
> will use vfio-pci for that assigned device as well, and when L2 guest
> QEMU uses this assigned device, it'll use a static mapping (just to
> map the whole GPA for L2 guest) there, so even if you are using a
> kernel driver in L2 guest with your to-be-tested application, we
> should still be having a static mapping in vIOMMU in L1 guest, which
> is IMHO fine from performance POV.
>
> I cced Alex in case I missed anything here.
>
>> >
>> >>
>> >> Another thing to mention is that (in case you don't know that), device
>> >> assignment with VT-d protection would be even slower than generic VMs
>> >> (without Intel IOMMU protection) if you are using generic kernel
>> >> drivers in the guest, since we may need real-time DMA translation on
>> >> data path.
>> >>
>> >
>> > So, this is the comparison between using virtio and using the device
>> > assignment for L1? I have tested application performance running
>> > inside L1 with and without iommu, and I found that the performance is
>> > better with iommu. I thought whether the device is assigned to L1 or
>> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
>> > I misunderstood what you said?
>
> I failed to understand why an vIOMMU could help boost performance. :(
> Could you provide your command line here so that I can try to
> reproduce?

Sure. This is the command line to launch the L1 VM:

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
-m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
-drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
-smp 4,sockets=4,cores=1,threads=1 \
-device vfio-pci,host=08:00.0,id=net0

And this is for the L2 VM:

./qemu-system-x86_64 -M q35,accel=kvm \
-m 8G \
-drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
-device vfio-pci,host=00:03.0,id=net0

>
> Besides, what I mentioned above is just in case you don't know that
> vIOMMU will drag down the performance in most cases.
>
> I think here to be more explicit, the overhead of vIOMMU is different
> for assigned devices and emulated ones.
>
>   (1) For emulated devices, the overhead is when we do the
>       translation, or say when we do the DMA operation. We need
>       real-time translation which should drag down the performance.
>
>   (2) For assigned devices (our case), the overhead is when we setup
>       the pages (since we are trapping the setup procedures via CM
>       bit). However, after it's setup, we should have no much
>       performance drag when we really do the data transfer (during
>       DMA) since that'll all be done in the hardware IOMMU (no matter
>       whether the device is assigned to L1/L2 guest).
>
> Now, after I know your use case now (use vIOMMU in L1 guest, don't use
> vIOMMU in L2 guest, only use assigned devices), I suspect we would
> have no big problem according to (2).
>
>> >
>> >>>
>> >>> First, I am having trouble to boot a VM with the emulated iommu. I
>> >>> have posted my problem to the qemu user mailing list[1],
>> >>
>> >> Here I would suggest that you cc qemu-devel as well next time:
>> >>
>> >>   qemu-devel@nongnu.org
>> >>
>> >> Since I guess not all people are registered to qemu-discuss, at least
>> >> I am not in that loop. Imho cc qemu-devel could let the question
>> >> spread to more people, and it'll get a higher chance to be answered.
>> >
>> > Thanks. I'll cc qemu-devel next time.
>> >
>> >>
>> >>> but to put it
>> >>> in a nutshell, I'd like to know the setting I can reuse to boot a VM
>> >>> with the emulated iommu. (e.g. how to create a VM with q35 chipset
>> >>> and/or libvirt xml if you use virsh).
>> >>
>> >> IIUC you are looking for device assignment for the nested VM case. So,
>> >> firstly, you may need my tree to run this (see below). Then, maybe you
>> >> can try to boot a L1 guest with assigned device (under VT-d
>> >> protection), with command:
>> >>
>> >> $qemu -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>> >>       -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>> >>       -device vfio-pci,host=$HOST_PCI_ADDR \
>> >>       $YOUR_IMAGE_PATH
>> >>
>> >
>> > Thanks! I'll try this right away.
>> >
>> >> Here $HOST_PCI_ADDR should be something like 05:00.0, which is the
>> >> host PCI address of the device to be assigned to guest.
>> >>
>> >> (If you go over the cover letter in [4], you'll see similar command
>> >>  line there, though with some more devices assigned, and with traces)
>> >>
>> >> If you are playing with nested VM, you'll also need a L2 guest, which
>> >> will be run inside the L1 guest. It'll require similar command line,
>> >> but I would suggest you first try a L2 guest without intel-iommu
>> >> device. Frankly speaking I haven't played with that yet, so just let
>> >> me know if you got any problem, which is possible. :-)
>> >>
>>
>> I was able to boot L2 guest without assigning a network device
>> successfully. (host iommu was on, L1 iommu was on, and the network
>> device was assigned to L1)
>>
>> Then, I unbound the network device in L1 and bound it to vfio-pci.
>> When I try to run L2 with the following command, I got an assertion.
>>
>> # ./qemu-system-x86_64 -M q35,accel=kvm \
>> -m 8G \
>> -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
>> -device vfio-pci,host=00:03.0,id=net0
>>
>> qemu-system-x86_64: hw/pci/pcie.c:686: pcie_add_capability: Assertion
>> `prev >= 0x100' failed.
>> Aborted (core dumped)
>>
>> Thoughts?
>
> I don't know whether it'll has anything to do with how vfio-pci works,
> anyway I cced Alex and the list in case there is quick answer.
>
> I'll reproduce this nested case and update when I got anything.

Thanks!

>
> Thanks!
>
>>
>> >
>> > Ok. I'll let you know!
>> >
>> >>>
>> >>> I'm using QEMU 2.8.0, kernel 4.6.0-rc5, libvirt 3.0.0, and this is my
>> >>> libvirt xml [2], which gives me DMAR error during the VM boot[3].
>> >>>
>> >>> I also wonder if the VM can successfully assign a device (i.e. network
>> >>> device in my case) to the nested VM if I use this patch series from
>> >>> you. [4]
>> >>
>> >> Yes, for your nested device assignment requirement you may need to use
>> >> the tree posted in [4], rather than any other qemu versions. [4] is
>> >> still during review (which Alex should have mentioned in the other
>> >> thread), so you may need to build it on your own to get
>> >> qemu-system-x86_64 binary, which is located at:
>> >>
>> >>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v7
>> >>
>> >> (this link is in [4] as well)
>> >>
>> >
>> > Thanks a lot.
>> >
>> >>>
>> >>> I mostly work on ARM architecture, especially nested virtualization on
>> >>> ARM, and I'm trying to become accustomed to x86 environment :)
>> >>
>> >> Hope you'll quickly get used to it. :-)
>> >>
>> >> Regards,
>> >>
>> >> -- peterx
>> >>
>>
>
> -- peterx
>


* Re: [Qemu-devel] iommu emulation
From: Peter Xu @ 2017-02-14  7:35 UTC
  To: Jintack Lim; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

On Thu, Feb 09, 2017 at 08:01:14AM -0500, Jintack Lim wrote:
> On Wed, Feb 8, 2017 at 10:52 PM, Peter Xu <peterx@redhat.com> wrote:
> > (cc qemu-devel and Alex)
> >
> > On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
> >> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim <jintack@cs.columbia.edu> wrote:
> >> > Hi Peter,
> >> >
> >> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu <peterx@redhat.com> wrote:
> >> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
> >> >>> Hi Peter and Michael,
> >> >>
> >> >> Hi, Jintack,
> >> >>
> >> >>>
> >> >>> I would like to get some help to run a VM with the emulated iommu. I
> >> >>> have tried for a few days to make it work, but I couldn't.
> >> >>>
> >> >>> What I want to do eventually is to assign a network device to the
> >> >>> nested VM so that I can measure the performance of applications
> >> >>> running in the nested VM.
> >> >>
> >> >> Good to know that you are going to use [4] to do something useful. :-)
> >> >>
> >> >> However, could I ask why you want to measure the performance of
> >> >> application inside nested VM rather than host? That's something I am
> >> >> just curious about, considering that virtualization stack will
> >> >> definitely introduce overhead along the way, and I don't know whether
> >> >> that'll affect your measurement to the application.
> >> >
> >> > I have added nested virtualization support to KVM/ARM, which is under
> >> > review now. I found that application performance running inside the
> >> > nested VM is really bad both on ARM and x86, and I'm trying to figure
> >> > out what's the real overhead. I think one way to figure that out is to
> >> > see if the direct device assignment to L2 helps to reduce the overhead
> >> > or not.
> >
> > I see. IIUC you are trying to use an assigned device to replace your
> > old emulated device in L2 guest to see whether performance will drop
> > as well, right? Then at least I can know that you won't need a nested
> > VT-d here (so we should not need a vIOMMU in L2 guest).
> 
> That's right.
> 
> >
> > In that case, I think we can give it a shot, considering that L1 guest
> > will use vfio-pci for that assigned device as well, and when L2 guest
> > QEMU uses this assigned device, it'll use a static mapping (just to
> > map the whole GPA for L2 guest) there, so even if you are using a
> > kernel driver in L2 guest with your to-be-tested application, we
> > should still be having a static mapping in vIOMMU in L1 guest, which
> > is IMHO fine from performance POV.
> >
> > I cced Alex in case I missed anything here.
> >
> >> >
> >> >>
> >> >> Another thing to mention is that (in case you don't know that), device
> >> >> assignment with VT-d protection would be even slower than generic VMs
> >> >> (without Intel IOMMU protection) if you are using generic kernel
> >> >> drivers in the guest, since we may need real-time DMA translation on
> >> >> data path.
> >> >>
> >> >
> >> > So, this is the comparison between using virtio and using the device
> >> > assignment for L1? I have tested application performance running
> >> > inside L1 with and without iommu, and I found that the performance is
> >> > better with iommu.

Here IIUC you mean that "the L1 guest with a vIOMMU performs better
than without a vIOMMU", while ...

> >> > I thought whether the device is assigned to L1 or
> >> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
> >> > I misunderstood what you said?
> >
> > I failed to understand why an vIOMMU could help boost performance. :(
> > Could you provide your command line here so that I can try to
> > reproduce?
> 
> Sure. This is the command line to launch L1 VM
> 
> qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> -smp 4,sockets=4,cores=1,threads=1 \
> -device vfio-pci,host=08:00.0,id=net0
> 
> And this is for L2 VM.
> 
> ./qemu-system-x86_64 -M q35,accel=kvm \
> -m 8G \
> -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> -device vfio-pci,host=00:03.0,id=net0

... here these look like the command lines for the L1/L2 guests,
rather than for the L1 guest with/without a vIOMMU?

> 
> >
> > Besides, what I mentioned above is just in case you don't know that
> > vIOMMU will drag down the performance in most cases.
> >
> > I think here to be more explicit, the overhead of vIOMMU is different
> > for assigned devices and emulated ones.
> >
> >   (1) For emulated devices, the overhead is when we do the
> >       translation, or say when we do the DMA operation. We need
> >       real-time translation which should drag down the performance.
> >
> >   (2) For assigned devices (our case), the overhead is when we setup
> >       the pages (since we are trapping the setup procedures via CM
> >       bit). However, after it's setup, we should have no much
> >       performance drag when we really do the data transfer (during
> >       DMA) since that'll all be done in the hardware IOMMU (no matter
> >       whether the device is assigned to L1/L2 guest).
> >
> > Now, after I know your use case now (use vIOMMU in L1 guest, don't use
> > vIOMMU in L2 guest, only use assigned devices), I suspect we would
> > have no big problem according to (2).
> >
> >> >
> >> >>>
> >> >>> First, I am having trouble to boot a VM with the emulated iommu. I
> >> >>> have posted my problem to the qemu user mailing list[1],
> >> >>
> >> >> Here I would suggest that you cc qemu-devel as well next time:
> >> >>
> >> >>   qemu-devel@nongnu.org
> >> >>
> >> >> Since I guess not all people are registered to qemu-discuss, at least
> >> >> I am not in that loop. Imho cc qemu-devel could let the question
> >> >> spread to more people, and it'll get a higher chance to be answered.
> >> >
> >> > Thanks. I'll cc qemu-devel next time.
> >> >
> >> >>
> >> >>> but to put it
> >> >>> in a nutshell, I'd like to know the setting I can reuse to boot a VM
> >> >>> with the emulated iommu. (e.g. how to create a VM with q35 chipset
> >> >>> and/or libvirt xml if you use virsh).
> >> >>
> >> >> IIUC you are looking for device assignment for the nested VM case. So,
> >> >> firstly, you may need my tree to run this (see below). Then, maybe you
> >> >> can try to boot a L1 guest with assigned device (under VT-d
> >> >> protection), with command:
> >> >>
> >> >> $qemu -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> >> >>       -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> >> >>       -device vfio-pci,host=$HOST_PCI_ADDR \
> >> >>       $YOUR_IMAGE_PATH
> >> >>
> >> >
> >> > Thanks! I'll try this right away.
> >> >
> >> >> Here $HOST_PCI_ADDR should be something like 05:00.0, which is the
> >> >> host PCI address of the device to be assigned to guest.
> >> >>
> >> >> (If you go over the cover letter in [4], you'll see similar command
> >> >>  line there, though with some more devices assigned, and with traces)
> >> >>
> >> >> If you are playing with nested VM, you'll also need a L2 guest, which
> >> >> will be run inside the L1 guest. It'll require similar command line,
> >> >> but I would suggest you first try a L2 guest without intel-iommu
> >> >> device. Frankly speaking I haven't played with that yet, so just let
> >> >> me know if you got any problem, which is possible. :-)
> >> >>
> >>
> >> I was able to boot L2 guest without assigning a network device
> >> successfully. (host iommu was on, L1 iommu was on, and the network
> >> device was assigned to L1)
> >>
> >> Then, I unbound the network device in L1 and bound it to vfio-pci.
> >> When I try to run L2 with the following command, I got an assertion.
> >>
> >> # ./qemu-system-x86_64 -M q35,accel=kvm \
> >> -m 8G \
> >> -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> >> -device vfio-pci,host=00:03.0,id=net0
> >>
> >> qemu-system-x86_64: hw/pci/pcie.c:686: pcie_add_capability: Assertion
> >> `prev >= 0x100' failed.
> >> Aborted (core dumped)
> >>
> >> Thoughts?
> >
> > I don't know whether it'll has anything to do with how vfio-pci works,
> > anyway I cced Alex and the list in case there is quick answer.
> >
> > I'll reproduce this nested case and update when I got anything.
> 
> Thanks!

I tried to reproduce this issue with the following 10g network card:

00:03.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

In my case, both the L1 and L2 guests can boot with the assigned
device. I also did a quick netperf TCP_STREAM test; the results are
(in case you are interested):

   L1 guest: 1.12Gbps
   L2 guest: 8.26Gbps

First of all, just to confirm: you were using the same qemu binary in
both the host and the L1 guest, right?

Then, I *think* the assertion you encountered can only fail if prev ==
0 there, but I'm still not quite sure why that was happening. Btw,
could you paste the "lspci -vvv -s 00:03.0" output from your L1 guest?

Thanks,

-- peterx


* Re: [Qemu-devel] iommu emulation
From: Jintack Lim @ 2017-02-14 12:50 UTC
  To: Peter Xu; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

On Tue, Feb 14, 2017 at 2:35 AM, Peter Xu <peterx@redhat.com> wrote:

> On Thu, Feb 09, 2017 at 08:01:14AM -0500, Jintack Lim wrote:
> > On Wed, Feb 8, 2017 at 10:52 PM, Peter Xu <peterx@redhat.com> wrote:
> > > (cc qemu-devel and Alex)
> > >
> > > On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
> > >> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim <jintack@cs.columbia.edu>
> wrote:
> > >> > Hi Peter,
> > >> >
> > >> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu <peterx@redhat.com>
> wrote:
> > >> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
> > >> >>> Hi Peter and Michael,
> > >> >>
> > >> >> Hi, Jintack,
> > >> >>
> > >> >>>
> > >> >>> I would like to get some help to run a VM with the emulated
> iommu. I
> > >> >>> have tried for a few days to make it work, but I couldn't.
> > >> >>>
> > >> >>> What I want to do eventually is to assign a network device to the
> > >> >>> nested VM so that I can measure the performance of applications
> > >> >>> running in the nested VM.
> > >> >>
> > >> >> Good to know that you are going to use [4] to do something useful.
> :-)
> > >> >>
> > >> >> However, could I ask why you want to measure the performance of
> > >> >> application inside nested VM rather than host? That's something I
> am
> > >> >> just curious about, considering that virtualization stack will
> > >> >> definitely introduce overhead along the way, and I don't know
> whether
> > >> >> that'll affect your measurement to the application.
> > >> >
> > >> > I have added nested virtualization support to KVM/ARM, which is
> under
> > >> > review now. I found that application performance running inside the
> > >> > nested VM is really bad both on ARM and x86, and I'm trying to
> figure
> > >> > out what's the real overhead. I think one way to figure that out is
> to
> > >> > see if the direct device assignment to L2 helps to reduce the
> overhead
> > >> > or not.
> > >
> > > I see. IIUC you are trying to use an assigned device to replace your
> > > old emulated device in L2 guest to see whether performance will drop
> > > as well, right? Then at least I can know that you won't need a nested
> > > VT-d here (so we should not need a vIOMMU in L2 guest).
> >
> > That's right.
> >
> > >
> > > In that case, I think we can give it a shot, considering that L1 guest
> > > will use vfio-pci for that assigned device as well, and when L2 guest
> > > QEMU uses this assigned device, it'll use a static mapping (just to
> > > map the whole GPA for L2 guest) there, so even if you are using a
> > > kernel driver in L2 guest with your to-be-tested application, we
> > > should still be having a static mapping in vIOMMU in L1 guest, which
> > > is IMHO fine from performance POV.
> > >
> > > I cced Alex in case I missed anything here.
> > >
> > >> >
> > >> >>
> > >> >> Another thing to mention is that (in case you don't know that),
> device
> > >> >> assignment with VT-d protection would be even slower than generic
> VMs
> > >> >> (without Intel IOMMU protection) if you are using generic kernel
> > >> >> drivers in the guest, since we may need real-time DMA translation
> on
> > >> >> data path.
> > >> >>
> > >> >
> > >> > So, this is the comparison between using virtio and using the device
> > >> > assignment for L1? I have tested application performance running
> > >> > inside L1 with and without iommu, and I found that the performance
> is
> > >> > better with iommu.
>
> Here iiuc you mean that "L1 guest with vIOMMU performs better than
> when without vIOMMU", while ...
>

Ah, I think I wrote the second sentence wrong. What I really meant is
that I compared the performance of virtio against direct device
assignment for L1.


>
> > >> > I thought whether the device is assigned to L1 or
> > >> > L2, the DMA translation is done by iommu, which is pretty fast?
> Maybe
> > >> > I misunderstood what you said?
> > >
> > > I failed to understand why an vIOMMU could help boost performance. :(
> > > Could you provide your command line here so that I can try to
> > > reproduce?
> >
> > Sure. This is the command line to launch L1 VM
> >
> > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > -smp 4,sockets=4,cores=1,threads=1 \
> > -device vfio-pci,host=08:00.0,id=net0
> >
> > And this is for L2 VM.
> >
> > ./qemu-system-x86_64 -M q35,accel=kvm \
> > -m 8G \
> > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > -device vfio-pci,host=00:03.0,id=net0
>
> ... here looks like these are command lines for L1/L2 guest, rather
> than L1 guest with/without vIOMMU?
>

That's right. I thought you were asking about the command lines for
the L1/L2 guests :(.
I think I caused the confusion, and as I said above, I didn't mean to
talk about the performance of the L1 guest with/without a vIOMMU.
We can move on!


>
> >
> > >
> > > Besides, what I mentioned above is just in case you don't know that
> > > vIOMMU will drag down the performance in most cases.
> > >
> > > I think here to be more explicit, the overhead of vIOMMU is different
> > > for assigned devices and emulated ones.
> > >
> > >   (1) For emulated devices, the overhead is when we do the
> > >       translation, or say when we do the DMA operation. We need
> > >       real-time translation which should drag down the performance.
> > >
> > >   (2) For assigned devices (our case), the overhead is when we setup
> > >       the pages (since we are trapping the setup procedures via CM
> > >       bit). However, after it's setup, we should have no much
> > >       performance drag when we really do the data transfer (during
> > >       DMA) since that'll all be done in the hardware IOMMU (no matter
> > >       whether the device is assigned to L1/L2 guest).
> > >
> > > Now, after I know your use case now (use vIOMMU in L1 guest, don't use
> > > vIOMMU in L2 guest, only use assigned devices), I suspect we would
> > > have no big problem according to (2).
> > >
> > >> >
> > >> >>>
> > >> >>> First, I am having trouble to boot a VM with the emulated iommu. I
> > >> >>> have posted my problem to the qemu user mailing list[1],
> > >> >>
> > >> >> Here I would suggest that you cc qemu-devel as well next time:
> > >> >>
> > >> >>   qemu-devel@nongnu.org
> > >> >>
> > >> >> Since I guess not all people are registered to qemu-discuss, at
> least
> > >> >> I am not in that loop. Imho cc qemu-devel could let the question
> > >> >> spread to more people, and it'll get a higher chance to be
> answered.
> > >> >
> > >> > Thanks. I'll cc qemu-devel next time.
> > >> >
> > >> >>
> > >> >>> but to put it
> > >> >>> in a nutshell, I'd like to know the setting I can reuse to boot a
> VM
> > >> >>> with the emulated iommu. (e.g. how to create a VM with q35 chipset
> > >> >>> and/or libvirt xml if you use virsh).
> > >> >>
> > >> >> IIUC you are looking for device assignment for the nested VM case.
> So,
> > >> >> firstly, you may need my tree to run this (see below). Then, maybe
> you
> > >> >> can try to boot a L1 guest with assigned device (under VT-d
> > >> >> protection), with command:
> > >> >>
> > >> >> $qemu -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> > >> >>       -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > >> >>       -device vfio-pci,host=$HOST_PCI_ADDR \
> > >> >>       $YOUR_IMAGE_PATH
> > >> >>
> > >> >
> > >> > Thanks! I'll try this right away.
> > >> >
> > >> >> Here $HOST_PCI_ADDR should be something like 05:00.0, which is the
> > >> >> host PCI address of the device to be assigned to guest.
> > >> >>
> > >> >> (If you go over the cover letter in [4], you'll see similar command
> > >> >>  line there, though with some more devices assigned, and with
> traces)
> > >> >>
> > >> >> If you are playing with nested VM, you'll also need a L2 guest,
> which
> > >> >> will be run inside the L1 guest. It'll require similar command
> line,
> > >> >> but I would suggest you first try a L2 guest without intel-iommu
> > >> >> device. Frankly speaking I haven't played with that yet, so just
> let
> > >> >> me know if you got any problem, which is possible. :-)
> > >> >>
> > >>
> > >> I was able to boot L2 guest without assigning a network device
> > >> successfully. (host iommu was on, L1 iommu was on, and the network
> > >> device was assigned to L1)
> > >>
> > >> Then, I unbound the network device in L1 and bound it to vfio-pci.
> > >> When I try to run L2 with the following command, I got an assertion.
> > >>
> > >> # ./qemu-system-x86_64 -M q35,accel=kvm \
> > >> -m 8G \
> > >> -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > >> -device vfio-pci,host=00:03.0,id=net0
> > >>
> > >> qemu-system-x86_64: hw/pci/pcie.c:686: pcie_add_capability: Assertion
> > >> `prev >= 0x100' failed.
> > >> Aborted (core dumped)
> > >>
> > >> Thoughts?
> > >
> > > I don't know whether it'll has anything to do with how vfio-pci works,
> > > anyway I cced Alex and the list in case there is quick answer.
> > >
> > > I'll reproduce this nested case and update when I got anything.
> >
> > Thanks!
>
> I tried to reproduce this issue with the following 10g network card:
>
> 00:03.0 Ethernet controller: Intel Corporation Ethernet Controller
> 10-Gigabit X540-AT2 (rev 01)
>
> In my case, both L1/L2 guests can boot with the assigned device. I
> also did a quick netperf TCP STREAM test, the result is (in case you
> are interested):
>
>    L1 guest: 1.12Gbps
>    L2 guest: 8.26Gbps
>
> First of all, just to confirm that you were using the same qemu binary
> in both host and L1 guest, right?
>

Right. I'm using your branch.


>
> Then, I *think* above assertion you encountered would fail only if
> prev == 0 here, but I still don't quite sure why was that happening.
> Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> guest?
>

Sure. This is from my L1 guest.

root@guest0:~# lspci -vvv -s 00:03.0
00:03.0 Network controller: Mellanox Technologies MT27500 Family
[ConnectX-3]
Subsystem: Mellanox Technologies Device 0050
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 23
Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
Expansion ROM at fea00000 [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX354A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: MCX354A-FCBT
[EC] Engineering changes: A4
[SN] Serial number: MT1346X00791
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 105 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
Capabilities: [100 v0] #00
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-15-5b-80
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: mlx4_core



> Thanks,
>
> -- peterx
>
>


* Re: [Qemu-devel] iommu emulation
From: Peter Xu @ 2017-02-15  2:52 UTC
  To: Jintack Lim; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:

[...]

> > > >> > I misunderstood what you said?
> > > >
> > > > I failed to understand why an vIOMMU could help boost performance. :(
> > > > Could you provide your command line here so that I can try to
> > > > reproduce?
> > >
> > > Sure. This is the command line to launch L1 VM
> > >
> > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > > -smp 4,sockets=4,cores=1,threads=1 \
> > > -device vfio-pci,host=08:00.0,id=net0
> > >
> > > And this is for L2 VM.
> > >
> > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > -m 8G \
> > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > -device vfio-pci,host=00:03.0,id=net0
> >
> > ... here looks like these are command lines for L1/L2 guest, rather
> > than L1 guest with/without vIOMMU?
> >
> 
> That's right. I thought you were asking about command lines for L1/L2 guest
> :(.
> I think I made the confusion, and as I said above, I didn't mean to talk
> about the performance of L1 guest with/without vIOMMO.
> We can move on!

I see. Sure! :-)

[...]

> >
> > Then, I *think* above assertion you encountered would fail only if
> > prev == 0 here, but I still don't quite sure why was that happening.
> > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > guest?
> >
> 
> Sure. This is from my L1 guest.

Hmm... I think I found the problem...

> 
> root@guest0:~# lspci -vvv -s 00:03.0
> 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> [ConnectX-3]
> Subsystem: Mellanox Technologies Device 0050
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 23
> Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> Expansion ROM at fea00000 [disabled] [size=1M]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [48] Vital Product Data
> Product Name: CX354A - ConnectX-3 QSFP
> Read-only fields:
> [PN] Part number: MCX354A-FCBT
> [EC] Engineering changes: A4
> [SN] Serial number: MT1346X00791
> [V0] Vendor specific: PCIe Gen3 x8
> [RV] Reserved: checksum good, 0 byte(s) reserved
> Read/write fields:
> [V1] Vendor specific: N/A
> [YA] Asset tag: N/A
> [RW] Read-write area: 105 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 253 byte(s) free
> [RW] Read-write area: 252 byte(s) free
> End
> Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> Vector table: BAR=0 offset=0007c000
> PBA: BAR=0 offset=0007d000
> Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0
> ExtTag- RBE+
> DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 256 bytes, MaxReadReq 4096 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> Supported
> DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> Capabilities: [100 v0] #00

Here the head of the ecap chain has cap_id==0. When we boot the L2
guest with the same device, we first copy this cap_id==0 cap; then,
when adding the 2nd ecap, we hit a problem since
pcie_find_capability_list() will think there is no cap at all
(cap_id==0 is skipped).
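
To illustrate the failure mode, here is a standalone toy model of the
chain walk (my own illustration, not the real hw/pci/pcie.c code; the
DSN capability at offset 0x148 is taken from your lspci output above):
once the header at 0x100 carries cap_id 0, the walk treats the whole
chain as empty, prev never reaches 0x100, and appending the next
capability trips the "prev >= 0x100" assertion just like in your log:

/* Toy model of the extended capability chain -- illustration only. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CFG_SPACE_SIZE 0x100    /* extended caps start here */
#define ECAP_HDR(id, ver, next) \
    ((uint32_t)(id) | ((uint32_t)(ver) << 16) | ((uint32_t)(next) << 20))

static uint32_t cfg[4096 / 4];  /* config space, dword-indexed */

static uint32_t hdr(uint16_t off) { return cfg[off / 4]; }

/* Find the last capability in the chain; a cap_id of 0 at the head is
 * taken to mean "no capabilities at all", so prev stays 0. */
static uint16_t find_last_cap(void)
{
    uint16_t prev = 0;

    if ((hdr(CFG_SPACE_SIZE) & 0xffff) == 0) {
        return prev;
    }
    for (uint16_t next = CFG_SPACE_SIZE; next; next = hdr(next) >> 20) {
        prev = next;
    }
    return prev;
}

static void add_capability(uint16_t cap_id, uint8_t ver, uint16_t offset)
{
    if (offset > CFG_SPACE_SIZE) {
        uint16_t prev = find_last_cap();
        assert(prev >= CFG_SPACE_SIZE);          /* <-- what fires here */
        cfg[prev / 4] |= (uint32_t)offset << 20; /* link prev -> new cap */
    }
    cfg[offset / 4] = ECAP_HDR(cap_id, ver, 0);
}

int main(void)
{
    /* Copy the chain head exactly as the L1 guest exposes it: id 0. */
    cfg[CFG_SPACE_SIZE / 4] = ECAP_HDR(0x0000, 0, 0);
    /* Appending the next kept ecap (DSN, id 0x0003, at 0x148) aborts. */
    add_capability(0x0003, 1, 0x148);
    printf("not reached\n");
    return 0;
}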

Do you want to try this "hacky patch" to see whether it works for you?

------8<-------
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 
     }
 
-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }
------>8-------

I don't think it's a good solution (it just keeps 0xffff instead of
0x0 for the masked cap_id, and the L2 guest has to cope with that),
but it should work around this temporarily. I'll try to think of a
better one later and post it when it's ready.

(Alex, please leave a comment if you have any better suggestion before
 mine :)

Thanks,

-- peterx


* Re: [Qemu-devel] iommu emulation
From: Peter Xu @ 2017-02-15  3:34 UTC
  To: Jintack Lim, Alex Williamson; +Cc: mst, QEMU Devel Mailing List

On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote:

[...]

> > >
> > > Then, I *think* above assertion you encountered would fail only if
> > > prev == 0 here, but I still don't quite sure why was that happening.
> > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > guest?
> > >
> > 
> > Sure. This is from my L1 guest.
> 
> Hmm... I think I found the problem...
> 
> > 
> > root@guest0:~# lspci -vvv -s 00:03.0
> > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > [ConnectX-3]
> > Subsystem: Mellanox Technologies Device 0050
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 23
> > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > Expansion ROM at fea00000 [disabled] [size=1M]
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [48] Vital Product Data
> > Product Name: CX354A - ConnectX-3 QSFP
> > Read-only fields:
> > [PN] Part number: MCX354A-FCBT
> > [EC] Engineering changes: A4
> > [SN] Serial number: MT1346X00791
> > [V0] Vendor specific: PCIe Gen3 x8
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > Read/write fields:
> > [V1] Vendor specific: N/A
> > [YA] Asset tag: N/A
> > [RW] Read-write area: 105 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 252 byte(s) free
> > End
> > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > Vector table: BAR=0 offset=0007c000
> > PBA: BAR=0 offset=0007d000
> > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > Supported
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > Capabilities: [100 v0] #00
> 
> Here we have the head of ecap capability as cap_id==0, then when we
> boot the l2 guest with the same device, we'll first copy this
> cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> problem since pcie_find_capability_list() will thought there is no cap
> at all (cap_id==0 is skipped).
> 
> Do you want to try this "hacky patch" to see whether it works for you?
> 
> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..bacd302 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  
>      }
>  
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ------>8-------
> 
> I don't think it's a good solution (it just used 0xffff instead of 0x0
> for the masked cap_id, then l2 guest would like to co-op with it), but
> it should workaround this temporarily. I'll try to think of a better
> one later and post when proper.
> 
> (Alex, please leave comment if you have any better suggestion before
>  mine :)

Alex, would something like the below work to fix the issue that
Jintack encountered?

(note: this code is not meant to compile, it only shows what I mean...)

------8<-------
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..4dca631 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
      */
     config = g_memdup(pdev->config, vdev->config_size);

-    /*
-     * Extended capabilities are chained with each pointing to the next, so we
-     * can drop anything other than the head of the chain simply by modifying
-     * the previous next pointer.  For the head of the chain, we can modify the
-     * capability ID to something that cannot match a valid capability.  ID
-     * 0 is reserved for this since absence of capabilities is indicated by
-     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
-     * uses ID 0 as reserved for list management and will incorrectly match and
-     * assert if we attempt to pre-load the head of the chain with this ID.
-     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
-     * part for identifying absence of capabilities in a root complex register
-     * block.  If the ID still exists after adding capabilities, switch back to
-     * zero.  We'll mark this entire first dword as emulated for this purpose.
-     */
-    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
-                 PCI_EXT_CAP(0xFFFF, 0, 0));
-    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
-    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
-
     for (next = PCI_CONFIG_SPACE_SIZE; next;
          next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
         header = pci_get_long(config + next);
@@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
         switch (cap_id) {
         case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
         case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
+            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
+            ...
             trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
             break;
         default:
@@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)

     }

-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }
----->8-----

After all, we already rely on the assumption that 0xffff is reserved
as a cap_id, so we can just remove the "first 0xffff, then 0x0" hack,
which is IMHO error-prone and hacky.
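
Just to spell out the header encoding this relies on (the standard
PCIe extended capability header: ID in bits 0-15, version in bits
16-19, next pointer in bits 20-31): a dropped capability would keep
its slot and its next pointer but have its ID masked to 0xffff, so the
chain stays walkable. A minimal sketch (the 0x148 offset is just an
example):

#include <stdint.h>
#include <stdio.h>

static uint32_t ecap_hdr(uint16_t id, uint8_t ver, uint16_t next)
{
    return (uint32_t)id | ((uint32_t)(ver & 0xf) << 16)
                        | ((uint32_t)next << 20);
}

int main(void)
{
    /* Head at 0x100 is a dropped cap: ID masked to 0xffff, while the
     * next pointer still leads to the first kept cap (say at 0x148). */
    uint32_t masked = ecap_hdr(0xffff, 0, 0x148);
    printf("masked header: 0x%08x (id=0x%04x, next=0x%03x)\n",
           masked, masked & 0xffff, masked >> 20);
    return 0;
}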

Thanks,

-- peterx


* Re: [Qemu-devel] iommu emulation
From: Alex Williamson @ 2017-02-15 18:15 UTC
  To: Peter Xu; +Cc: Jintack Lim, mst, QEMU Devel Mailing List

On Wed, 15 Feb 2017 11:34:52 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote:
> 
> [...]
> 
> > > >
> > > > Then, I *think* above assertion you encountered would fail only if
> > > > prev == 0 here, but I still don't quite sure why was that happening.
> > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > > guest?
> > > >  
> > > 
> > > Sure. This is from my L1 guest.  
> > 
> > Hmm... I think I found the problem...
> >   
> > > 
> > > root@guest0:~# lspci -vvv -s 00:03.0
> > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > [ConnectX-3]
> > > Subsystem: Mellanox Technologies Device 0050
> > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > Stepping- SERR+ FastB2B- DisINTx+
> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > > <MAbort- >SERR- <PERR- INTx-
> > > Latency: 0, Cache Line Size: 64 bytes
> > > Interrupt: pin A routed to IRQ 23
> > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > > Expansion ROM at fea00000 [disabled] [size=1M]
> > > Capabilities: [40] Power Management version 3
> > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > Capabilities: [48] Vital Product Data
> > > Product Name: CX354A - ConnectX-3 QSFP
> > > Read-only fields:
> > > [PN] Part number: MCX354A-FCBT
> > > [EC] Engineering changes: A4
> > > [SN] Serial number: MT1346X00791
> > > [V0] Vendor specific: PCIe Gen3 x8
> > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > Read/write fields:
> > > [V1] Vendor specific: N/A
> > > [YA] Asset tag: N/A
> > > [RW] Read-write area: 105 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 252 byte(s) free
> > > End
> > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > Vector table: BAR=0 offset=0007c000
> > > PBA: BAR=0 offset=0007d000
> > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > ExtTag- RBE+
> > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > Supported
> > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > > Capabilities: [100 v0] #00  
> > 
> > Here we have the head of ecap capability as cap_id==0, then when we
> > boot the l2 guest with the same device, we'll first copy this
> > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> > problem since pcie_find_capability_list() will thought there is no cap
> > at all (cap_id==0 is skipped).
> > 
> > Do you want to try this "hacky patch" to see whether it works for you?
> > 
> > ------8<-------
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 332f41d..bacd302 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >  
> >      }
> >  
> > -    /* Cleanup chain head ID if necessary */
> > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > -    }
> > -
> >      g_free(config);
> >      return;
> >  }  
> > ------>8-------  
> > 
> > I don't think it's a good solution (it just used 0xffff instead of 0x0
> > for the masked cap_id, then l2 guest would like to co-op with it), but
> > it should workaround this temporarily. I'll try to think of a better
> > one later and post when proper.
> > 
> > (Alex, please leave comment if you have any better suggestion before
> >  mine :)  
> 
> Alex, do you like something like below to fix above issue that Jintack
> has encountered?
> 
> (note: this code is not for compile, only trying show what I mean...)
> 
> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..4dca631 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>       */
>      config = g_memdup(pdev->config, vdev->config_size);
> 
> -    /*
> -     * Extended capabilities are chained with each pointing to the next, so we
> -     * can drop anything other than the head of the chain simply by modifying
> -     * the previous next pointer.  For the head of the chain, we can modify the
> -     * capability ID to something that cannot match a valid capability.  ID
> -     * 0 is reserved for this since absence of capabilities is indicated by
> -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> -     * uses ID 0 as reserved for list management and will incorrectly match and
> -     * assert if we attempt to pre-load the head of the chain with this ID.
> -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> -     * part for identifying absence of capabilities in a root complex register
> -     * block.  If the ID still exists after adding capabilities, switch back to
> -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> -     */
> -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> -
>      for (next = PCI_CONFIG_SPACE_SIZE; next;
>           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>          header = pci_get_long(config + next);
> @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>          switch (cap_id) {
>          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
>          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> +            ...
>              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
>              break;
>          default:
> @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> 
>      }
> 
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ----->8-----  
> 
> Since after all we need the assumption that 0xffff is reserved for
> cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> which is imho error-prone and hacky.

This doesn't fix the bug, which is that pcie_add_capability() uses a
valid capability ID for its own internal tracking.  It's only doing
this to find the end of the capability chain, which we could do in a
spec-compliant way by looking for a zero next pointer.  Fix that and
then vfio doesn't need to do this set-to-0xffff-then-back-to-zero
nonsense at all.  Capability ID zero is valid.  Thanks,

Alex
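
A rough sketch of the end-of-chain walk described above (the function
name is made up and this is not the actual QEMU fix), assuming the
standard helpers from hw/pci:

    /* Find the last extended capability by following next pointers and
     * stopping at next == 0, instead of treating cap_id == 0 as the
     * list terminator.  An all-zero dword at 0x100 still means "no
     * extended capabilities at all". */
    static uint16_t find_last_ecap(PCIDevice *dev)
    {
        uint16_t next = PCI_CONFIG_SPACE_SIZE;          /* 0x100 */
        uint32_t header = pci_get_long(dev->config + next);

        if (!header) {
            return 0;                       /* empty chain */
        }
        while (PCI_EXT_CAP_NEXT(header)) {
            next = PCI_EXT_CAP_NEXT(header);
            header = pci_get_long(dev->config + next);
        }
        return next;                        /* offset of the chain tail */
    }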

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-15  2:52               ` Peter Xu
  2017-02-15  3:34                 ` Peter Xu
@ 2017-02-15 22:05                 ` Jintack Lim
  2017-02-15 22:50                   ` Alex Williamson
  1 sibling, 1 reply; 19+ messages in thread
From: Jintack Lim @ 2017-02-15 22:05 UTC (permalink / raw)
  To: Peter Xu; +Cc: mst, Alex Williamson, QEMU Devel Mailing List

On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
>
> [...]
>
> > > > >> > I misunderstood what you said?
> > > > >
> > > > > I failed to understand why an vIOMMU could help boost performance.
> :(
> > > > > Could you provide your command line here so that I can try to
> > > > > reproduce?
> > > >
> > > > Sure. This is the command line to launch L1 VM
> > > >
> > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > -device vfio-pci,host=08:00.0,id=net0
> > > >
> > > > And this is for L2 VM.
> > > >
> > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > -m 8G \
> > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > -device vfio-pci,host=00:03.0,id=net0
> > >
> > > ... here looks like these are command lines for L1/L2 guest, rather
> > > than L1 guest with/without vIOMMU?
> > >
> >
> > That's right. I thought you were asking about command lines for L1/L2
> guest
> > :(.
> > I think I made the confusion, and as I said above, I didn't mean to talk
> > about the performance of L1 guest with/without vIOMMO.
> > We can move on!
>
> I see. Sure! :-)
>
> [...]
>
> > >
> > > Then, I *think* above assertion you encountered would fail only if
> > > prev == 0 here, but I still don't quite sure why was that happening.
> > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > guest?
> > >
> >
> > Sure. This is from my L1 guest.
>
> Hmm... I think I found the problem...
>
> >
> > root@guest0:~# lspci -vvv -s 00:03.0
> > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > [ConnectX-3]
> > Subsystem: Mellanox Technologies Device 0050
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 23
> > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > Expansion ROM at fea00000 [disabled] [size=1M]
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-
> )
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [48] Vital Product Data
> > Product Name: CX354A - ConnectX-3 QSFP
> > Read-only fields:
> > [PN] Part number: MCX354A-FCBT
> > [EC] Engineering changes: A4
> > [SN] Serial number: MT1346X00791
> > [V0] Vendor specific: PCIe Gen3 x8
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > Read/write fields:
> > [V1] Vendor specific: N/A
> > [YA] Asset tag: N/A
> > [RW] Read-write area: 105 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 252 byte(s) free
> > End
> > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > Vector table: BAR=0 offset=0007c000
> > PBA: BAR=0 offset=0007d000
> > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > Supported
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF
> Disabled
> > Capabilities: [100 v0] #00
>
> Here we have the head of ecap capability as cap_id==0, then when we
> boot the l2 guest with the same device, we'll first copy this
> cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> problem since pcie_find_capability_list() will thought there is no cap
> at all (cap_id==0 is skipped).
>
> Do you want to try this "hacky patch" to see whether it works for you?
>

Thanks for following this up!

I just tried this, and I got a different message this time.

qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.


Thanks,
Jintack
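
As an aside, one way to check whether the kernel reports any reset
mechanism for the device at all, independently of QEMU, is to read the
VFIO device-info flags; a minimal sketch using the stock VFIO ioctls
(group number and device address are placeholders, error handling
omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);   /* placeholder group */
        struct vfio_group_status gs = { .argsz = sizeof(gs) };
        struct vfio_device_info di = { .argsz = sizeof(di) };
        int device;

        ioctl(group, VFIO_GROUP_GET_STATUS, &gs);
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:03.0");
        ioctl(device, VFIO_DEVICE_GET_INFO, &di);

        /* This flag is roughly what QEMU bases its reset decision on */
        printf("reset supported: %s\n",
               (di.flags & VFIO_DEVICE_FLAGS_RESET) ? "yes" : "no");
        return 0;
    }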


> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..bacd302 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>
>      }
>
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ------>8-------
>
> I don't think it's a good solution (it just used 0xffff instead of 0x0
> for the masked cap_id, then l2 guest would like to co-op with it), but
> it should workaround this temporarily. I'll try to think of a better
> one later and post when proper.
>
> (Alex, please leave comment if you have any better suggestion before
>  mine :)
>
> Thanks,
>
> -- peterx
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-15 22:05                 ` Jintack Lim
@ 2017-02-15 22:50                   ` Alex Williamson
  2017-02-15 23:25                     ` Jintack Lim
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2017-02-15 22:50 UTC (permalink / raw)
  To: Jintack Lim; +Cc: Peter Xu, mst, QEMU Devel Mailing List

On Wed, 15 Feb 2017 17:05:35 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
> >
> > [...]
> >  
> > > > > >> > I misunderstood what you said?  
> > > > > >
> > > > > > I failed to understand why an vIOMMU could help boost performance.  
> > :(  
> > > > > > Could you provide your command line here so that I can try to
> > > > > > reproduce?  
> > > > >
> > > > > Sure. This is the command line to launch L1 VM
> > > > >
> > > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > > -device vfio-pci,host=08:00.0,id=net0
> > > > >
> > > > > And this is for L2 VM.
> > > > >
> > > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > > -m 8G \
> > > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > > -device vfio-pci,host=00:03.0,id=net0  
> > > >
> > > > ... here looks like these are command lines for L1/L2 guest, rather
> > > > than L1 guest with/without vIOMMU?
> > > >  
> > >
> > > That's right. I thought you were asking about command lines for L1/L2  
> > guest  
> > > :(.
> > > I think I made the confusion, and as I said above, I didn't mean to talk
> > > about the performance of L1 guest with/without vIOMMO.
> > > We can move on!  
> >
> > I see. Sure! :-)
> >
> > [...]
> >  
> > > >
> > > > Then, I *think* above assertion you encountered would fail only if
> > > > prev == 0 here, but I still don't quite sure why was that happening.
> > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > > guest?
> > > >  
> > >
> > > Sure. This is from my L1 guest.  
> >
> > Hmm... I think I found the problem...
> >  
> > >
> > > root@guest0:~# lspci -vvv -s 00:03.0
> > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > [ConnectX-3]
> > > Subsystem: Mellanox Technologies Device 0050
> > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > Stepping- SERR+ FastB2B- DisINTx+
> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > > <MAbort- >SERR- <PERR- INTx-
> > > Latency: 0, Cache Line Size: 64 bytes
> > > Interrupt: pin A routed to IRQ 23
> > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > > Expansion ROM at fea00000 [disabled] [size=1M]
> > > Capabilities: [40] Power Management version 3
> > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-  
> > )  
> > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > Capabilities: [48] Vital Product Data
> > > Product Name: CX354A - ConnectX-3 QSFP
> > > Read-only fields:
> > > [PN] Part number: MCX354A-FCBT
> > > [EC] Engineering changes: A4
> > > [SN] Serial number: MT1346X00791
> > > [V0] Vendor specific: PCIe Gen3 x8
> > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > Read/write fields:
> > > [V1] Vendor specific: N/A
> > > [YA] Asset tag: N/A
> > > [RW] Read-write area: 105 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 253 byte(s) free
> > > [RW] Read-write area: 252 byte(s) free
> > > End
> > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > Vector table: BAR=0 offset=0007c000
> > > PBA: BAR=0 offset=0007d000
> > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > ExtTag- RBE+
> > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > Supported
> > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF  
> > Disabled  
> > > Capabilities: [100 v0] #00  
> >
> > Here we have the head of ecap capability as cap_id==0, then when we
> > boot the l2 guest with the same device, we'll first copy this
> > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> > problem since pcie_find_capability_list() will thought there is no cap
> > at all (cap_id==0 is skipped).
> >
> > Do you want to try this "hacky patch" to see whether it works for you?
> >  
> 
> Thanks for following this up!
> 
> I just tried this, and I got some different message this time.
> 
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.

Possibly very true; it might affect the reliability of the device in
the l2 guest, but it shouldn't prevent the device from being assigned.
What's the reset mechanism on the physical device (lspci -vvv from the
host, please)?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-15 22:50                   ` Alex Williamson
@ 2017-02-15 23:25                     ` Jintack Lim
  2017-02-16  1:17                       ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Jintack Lim @ 2017-02-15 23:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Xu, mst, QEMU Devel Mailing List

On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson <alex.williamson@redhat.com
> wrote:

> On Wed, 15 Feb 2017 17:05:35 -0500
> Jintack Lim <jintack@cs.columbia.edu> wrote:
>
> > On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:
> >
> > > On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
> > >
> > > [...]
> > >
> > > > > > >> > I misunderstood what you said?
> > > > > > >
> > > > > > > I failed to understand why an vIOMMU could help boost
> performance.
> > > :(
> > > > > > > Could you provide your command line here so that I can try to
> > > > > > > reproduce?
> > > > > >
> > > > > > Sure. This is the command line to launch L1 VM
> > > > > >
> > > > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host
> \
> > > > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > > > -device vfio-pci,host=08:00.0,id=net0
> > > > > >
> > > > > > And this is for L2 VM.
> > > > > >
> > > > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > > > -m 8G \
> > > > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > > > -device vfio-pci,host=00:03.0,id=net0
> > > > >
> > > > > ... here looks like these are command lines for L1/L2 guest, rather
> > > > > than L1 guest with/without vIOMMU?
> > > > >
> > > >
> > > > That's right. I thought you were asking about command lines for L1/L2
> > > guest
> > > > :(.
> > > > I think I made the confusion, and as I said above, I didn't mean to
> talk
> > > > about the performance of L1 guest with/without vIOMMO.
> > > > We can move on!
> > >
> > > I see. Sure! :-)
> > >
> > > [...]
> > >
> > > > >
> > > > > Then, I *think* above assertion you encountered would fail only if
> > > > > prev == 0 here, but I still don't quite sure why was that
> happening.
> > > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in
> your L1
> > > > > guest?
> > > > >
> > > >
> > > > Sure. This is from my L1 guest.
> > >
> > > Hmm... I think I found the problem...
> > >
> > > >
> > > > root@guest0:~# lspci -vvv -s 00:03.0
> > > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > > [ConnectX-3]
> > > > Subsystem: Mellanox Technologies Device 0050
> > > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > Stepping- SERR+ FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort-
> > > > <MAbort- >SERR- <PERR- INTx-
> > > > Latency: 0, Cache Line Size: 64 bytes
> > > > Interrupt: pin A routed to IRQ 23
> > > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > > > Expansion ROM at fea00000 [disabled] [size=1M]
> > > > Capabilities: [40] Power Management version 3
> > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-
> > > )
> > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > > Capabilities: [48] Vital Product Data
> > > > Product Name: CX354A - ConnectX-3 QSFP
> > > > Read-only fields:
> > > > [PN] Part number: MCX354A-FCBT
> > > > [EC] Engineering changes: A4
> > > > [SN] Serial number: MT1346X00791
> > > > [V0] Vendor specific: PCIe Gen3 x8
> > > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > > Read/write fields:
> > > > [V1] Vendor specific: N/A
> > > > [YA] Asset tag: N/A
> > > > [RW] Read-write area: 105 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 252 byte(s) free
> > > > End
> > > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > > Vector table: BAR=0 offset=0007c000
> > > > PBA: BAR=0 offset=0007d000
> > > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint,
> MSI 00
> > > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > > ExtTag- RBE+
> > > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > > Supported
> > > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF
> > > Disabled
> > > > Capabilities: [100 v0] #00
> > >
> > > Here we have the head of ecap capability as cap_id==0, then when we
> > > boot the l2 guest with the same device, we'll first copy this
> > > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> > > problem since pcie_find_capability_list() will thought there is no cap
> > > at all (cap_id==0 is skipped).
> > >
> > > Do you want to try this "hacky patch" to see whether it works for you?
> > >
> >
> > Thanks for following this up!
> >
> > I just tried this, and I got some different message this time.
> >
> > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > reset mechanism.
> > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > reset mechanism.
>
> Possibly very true, it might affect the reliability of the device in
> the l2 guest, but shouldn't prevent it from being assigned.  What's the
> reset mechanism on the physical device (lspci -vvv from host please).
>

Thanks, Alex.
This is from the host (L0).

08:00.0 Network controller: Mellanox Technologies MT27500 Family
[ConnectX-3]
Subsystem: Mellanox Technologies Device 0050
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 31
Region 0: Memory at d9f00000 (64-bit, non-prefetchable) [disabled] [size=1M]
Region 2: Memory at d5000000 (64-bit, prefetchable) [disabled] [size=8M]
Expansion ROM at d9000000 [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX354A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: MCX354A-FCBT
[EC] Engineering changes: A4
[SN] Serial number: MT1346X00624
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 105 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s
unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt-
ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+,
EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-15-51-10
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [18c v1] #19
Kernel driver in use: vfio-pci


Thanks,
>
> Alex
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-15 23:25                     ` Jintack Lim
@ 2017-02-16  1:17                       ` Alex Williamson
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Williamson @ 2017-02-16  1:17 UTC (permalink / raw)
  To: Jintack Lim; +Cc: Peter Xu, mst, QEMU Devel Mailing List

On Wed, 15 Feb 2017 18:25:26 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson <alex.williamson@redhat.com
> > wrote:  
> 
> > On Wed, 15 Feb 2017 17:05:35 -0500
> > Jintack Lim <jintack@cs.columbia.edu> wrote:
> >  
> > > On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:
> > >  
> > > > On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
> > > >
> > > > [...]
> > > >  
> > > > > > > >> > I misunderstood what you said?  
> > > > > > > >
> > > > > > > > I failed to understand why an vIOMMU could help boost  
> > performance.  
> > > > :(  
> > > > > > > > Could you provide your command line here so that I can try to
> > > > > > > > reproduce?  
> > > > > > >
> > > > > > > Sure. This is the command line to launch L1 VM
> > > > > > >
> > > > > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host  
> > \  
> > > > > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > > > > -device vfio-pci,host=08:00.0,id=net0
> > > > > > >
> > > > > > > And this is for L2 VM.
> > > > > > >
> > > > > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > > > > -m 8G \
> > > > > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > > > > -device vfio-pci,host=00:03.0,id=net0  
> > > > > >
> > > > > > ... here looks like these are command lines for L1/L2 guest, rather
> > > > > > than L1 guest with/without vIOMMU?
> > > > > >  
> > > > >
> > > > > That's right. I thought you were asking about command lines for L1/L2  
> > > > guest  
> > > > > :(.
> > > > > I think I made the confusion, and as I said above, I didn't mean to  
> > talk  
> > > > > about the performance of L1 guest with/without vIOMMO.
> > > > > We can move on!  
> > > >
> > > > I see. Sure! :-)
> > > >
> > > > [...]
> > > >  
> > > > > >
> > > > > > Then, I *think* above assertion you encountered would fail only if
> > > > > > prev == 0 here, but I still don't quite sure why was that  
> > happening.  
> > > > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in  
> > your L1  
> > > > > > guest?
> > > > > >  
> > > > >
> > > > > Sure. This is from my L1 guest.  
> > > >
> > > > Hmm... I think I found the problem...
> > > >  
> > > > >
> > > > > root@guest0:~# lspci -vvv -s 00:03.0
> > > > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > > > [ConnectX-3]
> > > > > Subsystem: Mellanox Technologies Device 0050
> > > > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > > Stepping- SERR+ FastB2B- DisINTx+
> > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-  
> > <TAbort-  
> > > > > <MAbort- >SERR- <PERR- INTx-
> > > > > Latency: 0, Cache Line Size: 64 bytes
> > > > > Interrupt: pin A routed to IRQ 23
> > > > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > > > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > > > > Expansion ROM at fea00000 [disabled] [size=1M]
> > > > > Capabilities: [40] Power Management version 3
> > > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA  
> > PME(D0-,D1-,D2-,D3hot-,D3cold-  
> > > > )  
> > > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > > > Capabilities: [48] Vital Product Data
> > > > > Product Name: CX354A - ConnectX-3 QSFP
> > > > > Read-only fields:
> > > > > [PN] Part number: MCX354A-FCBT
> > > > > [EC] Engineering changes: A4
> > > > > [SN] Serial number: MT1346X00791
> > > > > [V0] Vendor specific: PCIe Gen3 x8
> > > > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > > > Read/write fields:
> > > > > [V1] Vendor specific: N/A
> > > > > [YA] Asset tag: N/A
> > > > > [RW] Read-write area: 105 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 253 byte(s) free
> > > > > [RW] Read-write area: 252 byte(s) free
> > > > > End
> > > > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > > > Vector table: BAR=0 offset=0007c000
> > > > > PBA: BAR=0 offset=0007d000
> > > > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint,  
> > MSI 00  
> > > > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > > > ExtTag- RBE+
> > > > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > > > Supported
> > > > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF  
> > > > Disabled  
> > > > > Capabilities: [100 v0] #00  
> > > >
> > > > Here we have the head of ecap capability as cap_id==0, then when we
> > > > boot the l2 guest with the same device, we'll first copy this
> > > > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> > > > problem since pcie_find_capability_list() will thought there is no cap
> > > > at all (cap_id==0 is skipped).
> > > >
> > > > Do you want to try this "hacky patch" to see whether it works for you?
> > > >  
> > >
> > > Thanks for following this up!
> > >
> > > I just tried this, and I got some different message this time.
> > >
> > > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > > reset mechanism.
> > > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> > > reset mechanism.  
> >
> > Possibly very true, it might affect the reliability of the device in
> > the l2 guest, but shouldn't prevent it from being assigned.  What's the
> > reset mechanism on the physical device (lspci -vvv from host please).
> >  
> 
> Thanks, Alex.
> This is from the host (L0).
> 
> 08:00.0 Network controller: Mellanox Technologies MT27500 Family
> [ConnectX-3]
> Subsystem: Mellanox Technologies Device 0050
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 31
> Region 0: Memory at d9f00000 (64-bit, non-prefetchable) [disabled] [size=1M]
> Region 2: Memory at d5000000 (64-bit, prefetchable) [disabled] [size=8M]
> Expansion ROM at d9000000 [disabled] [size=1M]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Does not support reset on D3->D0 transition.

> Capabilities: [60] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-

Does not support PCIe FLR.

No AF capability.  Looks right to me; the only mechanism available to
the host is a bus reset, which isn't available to the VM.  If you were
to configure it downstream of a root port, the VM might think it could
reset the device, but I'm pretty sure it cannot.  Thanks,

Alex
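
For reference, the checks being read off the lspci output above map
roughly onto the probes the Linux PCI core makes when looking for a
function-level reset; a simplified sketch (not the actual kernel code,
and it ignores device-specific quirks and the bus/slot reset paths):

    #include <linux/pci.h>

    static bool has_function_reset(struct pci_dev *dev)
    {
        u32 devcap;
        u16 pmcsr;
        u8 af_cap;
        int pos;

        /* PCIe FLR: DevCap must advertise FLReset+ */
        pci_read_config_dword(dev, dev->pcie_cap + PCI_EXP_DEVCAP, &devcap);
        if (devcap & PCI_EXP_DEVCAP_FLR)
            return true;

        /* Advanced Features FLR */
        pos = pci_find_capability(dev, PCI_CAP_ID_AF);
        if (pos) {
            pci_read_config_byte(dev, pos + PCI_AF_CAP, &af_cap);
            if (af_cap & PCI_AF_CAP_FLR)
                return true;
        }

        /* PM reset: only useful when NoSoftRst is clear, i.e. the
         * device actually resets on the D3hot->D0 transition */
        pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
        if (!(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET))
            return true;

        return false;   /* nothing left but a (secondary) bus reset */
    }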

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-15 18:15                   ` Alex Williamson
@ 2017-02-16  2:28                     ` Peter Xu
  2017-02-16  2:47                       ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-02-16  2:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jintack Lim, mst, QEMU Devel Mailing List

On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:

[...]

> > Alex, do you like something like below to fix above issue that Jintack
> > has encountered?
> > 
> > (note: this code is not for compile, only trying show what I mean...)
> > 
> > ------8<-------
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 332f41d..4dca631 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >       */
> >      config = g_memdup(pdev->config, vdev->config_size);
> > 
> > -    /*
> > -     * Extended capabilities are chained with each pointing to the next, so we
> > -     * can drop anything other than the head of the chain simply by modifying
> > -     * the previous next pointer.  For the head of the chain, we can modify the
> > -     * capability ID to something that cannot match a valid capability.  ID
> > -     * 0 is reserved for this since absence of capabilities is indicated by
> > -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> > -     * uses ID 0 as reserved for list management and will incorrectly match and
> > -     * assert if we attempt to pre-load the head of the chain with this ID.
> > -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > -     * part for identifying absence of capabilities in a root complex register
> > -     * block.  If the ID still exists after adding capabilities, switch back to
> > -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> > -     */
> > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > -
> >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> >          header = pci_get_long(config + next);
> > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >          switch (cap_id) {
> >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> > +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> > +            ...
> >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
> >              break;
> >          default:
> > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > 
> >      }
> > 
> > -    /* Cleanup chain head ID if necessary */
> > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > -    }
> > -
> >      g_free(config);
> >      return;
> >  }
> > ----->8-----  
> > 
> > Since after all we need the assumption that 0xffff is reserved for
> > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > which is imho error-prone and hacky.
> 
> This doesn't fix the bug, which is that pcie_add_capability() uses a
> valid capability ID for it's own internal tracking.  It's only doing
> this to find the end of the capability chain, which we could do in a
> spec complaint way by looking for a zero next pointer.  Fix that and
> then vfio doesn't need to do this set to 0xffff then back to zero
> nonsense at all.  Capability ID zero is valid.  Thanks,

Yeah, I see Michael's fix for the capability list stuff. However, IMHO
these are two different issues? Put another way, even with that patch
we would still need this hack (first 0x0, then 0xffff), right? It looks
like that patch doesn't solve the problem if the first pcie ecap is
masked at 0x100.

Please correct me if I missed anything. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-16  2:28                     ` Peter Xu
@ 2017-02-16  2:47                       ` Alex Williamson
  2017-02-21 10:33                         ` Jintack Lim
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2017-02-16  2:47 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jintack Lim, mst, QEMU Devel Mailing List

On Thu, 16 Feb 2017 10:28:39 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> 
> [...]
> 
> > > Alex, do you like something like below to fix above issue that Jintack
> > > has encountered?
> > > 
> > > (note: this code is not for compile, only trying show what I mean...)
> > > 
> > > ------8<-------
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 332f41d..4dca631 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >       */
> > >      config = g_memdup(pdev->config, vdev->config_size);
> > > 
> > > -    /*
> > > -     * Extended capabilities are chained with each pointing to the next, so we
> > > -     * can drop anything other than the head of the chain simply by modifying
> > > -     * the previous next pointer.  For the head of the chain, we can modify the
> > > -     * capability ID to something that cannot match a valid capability.  ID
> > > -     * 0 is reserved for this since absence of capabilities is indicated by
> > > -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
> > > -     * uses ID 0 as reserved for list management and will incorrectly match and
> > > -     * assert if we attempt to pre-load the head of the chain with this ID.
> > > -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > > -     * part for identifying absence of capabilities in a root complex register
> > > -     * block.  If the ID still exists after adding capabilities, switch back to
> > > -     * zero.  We'll mark this entire first dword as emulated for this purpose.
> > > -     */
> > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > > -
> > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > >          header = pci_get_long(config + next);
> > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >          switch (cap_id) {
> > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> > > +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
> > > +            ...
> > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
> > >              break;
> > >          default:
> > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > > 
> > >      }
> > > 
> > > -    /* Cleanup chain head ID if necessary */
> > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > -    }
> > > -
> > >      g_free(config);
> > >      return;
> > >  }  
> > > ----->8-----    
> > > 
> > > Since after all we need the assumption that 0xffff is reserved for
> > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > > which is imho error-prone and hacky.  
> > 
> > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > valid capability ID for it's own internal tracking.  It's only doing
> > this to find the end of the capability chain, which we could do in a
> > spec complaint way by looking for a zero next pointer.  Fix that and
> > then vfio doesn't need to do this set to 0xffff then back to zero
> > nonsense at all.  Capability ID zero is valid.  Thanks,  
> 
> Yeah I see Michael's fix on the capability list stuff. However, imho
> these are two different issues? Or say, even if with that patch, we
> should still need this hack (first 0x0, then 0xffff) right? Since
> looks like that patch didn't solve the problem if the first pcie ecap
> is masked at 0x100.

I thought the problem was that QEMU in the host exposes a device with a
capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
capability ID of 0 because that's how it finds the end of the chain.
Therefore if we make QEMU not use capability ID 0 for internal
purposes, things work.  vfio using 0xffff and swapping back to 0x0
becomes unnecessary, but doesn't hurt anything.  Thanks,

Alex
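
To make the two head-ID conventions concrete, here is the same
extended-capability header dword encoded with QEMU's PCI_EXT_CAP()
macro, once with ID 0 and once with the 0xFFFF sentinel (the 0x148 next
pointer is purely illustrative):

    /* id | ver << 16 | next << 20 */
    uint32_t head_id_zero = PCI_EXT_CAP(0x0000, 0, 0x148); /* 0x14800000 */
    uint32_t head_id_ffff = PCI_EXT_CAP(0xFFFF, 0, 0x148); /* 0x1480ffff */

Both encodings carry the same next pointer; the question in this
subthread is only whether a scanner that treats cap_id == 0 as "end of
list" can cope with the first form.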

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-16  2:47                       ` Alex Williamson
@ 2017-02-21 10:33                         ` Jintack Lim
  2017-02-23 23:04                           ` Jintack Lim
  0 siblings, 1 reply; 19+ messages in thread
From: Jintack Lim @ 2017-02-21 10:33 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Xu, mst, QEMU Devel Mailing List

On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <alex.williamson@redhat.com
> wrote:

> On Thu, 16 Feb 2017 10:28:39 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> >
> > [...]
> >
> > > > Alex, do you like something like below to fix above issue that
> Jintack
> > > > has encountered?
> > > >
> > > > (note: this code is not for compile, only trying show what I mean...)
> > > >
> > > > ------8<-------
> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > index 332f41d..4dca631 100644
> > > > --- a/hw/vfio/pci.c
> > > > +++ b/hw/vfio/pci.c
> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >       */
> > > >      config = g_memdup(pdev->config, vdev->config_size);
> > > >
> > > > -    /*
> > > > -     * Extended capabilities are chained with each pointing to the
> next, so we
> > > > -     * can drop anything other than the head of the chain simply by
> modifying
> > > > -     * the previous next pointer.  For the head of the chain, we
> can modify the
> > > > -     * capability ID to something that cannot match a valid
> capability.  ID
> > > > -     * 0 is reserved for this since absence of capabilities is
> indicated by
> > > > -     * 0 for the ID, version, AND next pointer.  However,
> pcie_add_capability()
> > > > -     * uses ID 0 as reserved for list management and will
> incorrectly match and
> > > > -     * assert if we attempt to pre-load the head of the chain with
> this ID.
> > > > -     * Use ID 0xFFFF temporarily since it is also seems to be
> reserved in
> > > > -     * part for identifying absence of capabilities in a root
> complex register
> > > > -     * block.  If the ID still exists after adding capabilities,
> switch back to
> > > > -     * zero.  We'll mark this entire first dword as emulated for
> this purpose.
> > > > -     */
> > > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
> > > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -    pci_set_long(vdev->emulated_config_bits +
> PCI_CONFIG_SPACE_SIZE, ~0);
> > > > -
> > > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
> > > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > > >          header = pci_get_long(config + next);
> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >          switch (cap_id) {
> > > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
> OVMF */
> > > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
> virtualization */
> > > > +            /* keep this ecap header (4 bytes), but mask cap_id to
> 0xffff */
> > > > +            ...
> > > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
> cap_id, next);
> > > >              break;
> > > >          default:
> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >
> > > >      }
> > > >
> > > > -    /* Cleanup chain head ID if necessary */
> > > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
> 0xFFFF) {
> > > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -    }
> > > > -
> > > >      g_free(config);
> > > >      return;
> > > >  }
> > > > ----->8-----
> > > >
> > > > Since after all we need the assumption that 0xffff is reserved for
> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
> > > > which is imho error-prone and hacky.
> > >
> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > > valid capability ID for it's own internal tracking.  It's only doing
> > > this to find the end of the capability chain, which we could do in a
> > > spec complaint way by looking for a zero next pointer.  Fix that and
> > > then vfio doesn't need to do this set to 0xffff then back to zero
> > > nonsense at all.  Capability ID zero is valid.  Thanks,
> >
> > Yeah I see Michael's fix on the capability list stuff. However, imho
> > these are two different issues? Or say, even if with that patch, we
> > should still need this hack (first 0x0, then 0xffff) right? Since
> > looks like that patch didn't solve the problem if the first pcie ecap
> > is masked at 0x100.
>
> I thought the problem was that QEMU in the host exposes a device with a
> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
> capability ID of 0 because that's how it finds the end of the chain.
> Therefore if we make QEMU not use capability ID 0 for internal
> purposes, things work.  vfio using 0xffff and swapping back to 0x0
> becomes unnecessary, but doesn't hurt anything.  Thanks,
>

I've applied Peter's hack and Michael's patch below, but still can't use
the assigned device in L2.
 commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
    Author: Michael S. Tsirkin <mst@redhat.com>
    Date:   Wed Feb 15 22:37:45 2017 +0200

    pci/pcie: don't assume cap id 0 is reserved

I was able to boot L2 with the following QEMU warnings:
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.

but then the network device that I was trying to assign to L2 doesn't
show up in L2.
This is from the L2 dmesg, and it looks like the device is not
initialized.

[    5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    5.891563] mlx4_core: Initializing 0000:00:03.0
[    5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23
[    6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported command
interface revision 0
[    6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000)
[    6.926490] mlx4_core 0000:00:03.0: This driver version supports only
revisions 2 to 3
[    6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting
[    6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting.

This is the full kernel log from L2.
https://paste.ubuntu.com/24039462/

L0, L1 and L2 are using the same kernel, so I think they are using the same
device driver.
This is the L0/L1 kernel log about the network device.

--- From L0 ---
[    8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    8.175543] mlx4_core: Initializing 0000:08:00.0
[   14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device
supports 8.0GT/s
[   14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device
supports x8
[   14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
2014)
[   14.722295] mlx4_en 0000:08:00.0: Activating port:2
[   14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings
[   14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings
[   14.747826] mlx4_en: 0000:08:00.0: Port 2:   frag:0 - size:1522 prefix:0
stride:1536
[   14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port
[   14.764036] mlx4_en 0000:08:00.0: registered PHC clock

--- From L1 ---
[    3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[    3.791089] mlx4_core: Initializing 0000:00:03.0
[    9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW
capabilities
[    9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
2014)
[    9.204503] mlx4_en 0000:00:03.0: Activating port:2
[    9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings
[    9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings
[    9.214131] mlx4_en: 0000:00:03.0: Port 2:   frag:0 - size:1522 prefix:0
stride:1536
[    9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port
[    9.216377] mlx4_en 0000:00:03.0: registered PHC clock
[    9.261518] mlx4_en: eth1: Link Up
[    9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1

Any thoughts?


> Alex
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-21 10:33                         ` Jintack Lim
@ 2017-02-23 23:04                           ` Jintack Lim
  2017-03-02 22:20                             ` Bandan Das
  0 siblings, 1 reply; 19+ messages in thread
From: Jintack Lim @ 2017-02-23 23:04 UTC (permalink / raw)
  To: QEMU Devel Mailing List
  Cc: Peter Xu, Michael S. Tsirkin, Bandan Das, Alex Williamson

[cc Bandan]

On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
wrote:

>
>
> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> alex.williamson@redhat.com> wrote:
>
>> On Thu, 16 Feb 2017 10:28:39 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
>> >
>> > [...]
>> >
>> > > > Alex, do you like something like below to fix above issue that
>> > > > Jintack has encountered?
>> > > >
>> > > > (note: this code is not for compile, only trying show what I mean...)
>> > > >
>> > > > ------8<-------
>> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> > > > index 332f41d..4dca631 100644
>> > > > --- a/hw/vfio/pci.c
>> > > > +++ b/hw/vfio/pci.c
>> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>> > > >       */
>> > > >      config = g_memdup(pdev->config, vdev->config_size);
>> > > >
>> > > > -    /*
>> > > > -     * Extended capabilities are chained with each pointing to the next, so we
>> > > > -     * can drop anything other than the head of the chain simply by modifying
>> > > > -     * the previous next pointer.  For the head of the chain, we can modify the
>> > > > -     * capability ID to something that cannot match a valid capability.  ID
>> > > > -     * 0 is reserved for this since absence of capabilities is indicated by
>> > > > -     * 0 for the ID, version, AND next pointer.  However, pcie_add_capability()
>> > > > -     * uses ID 0 as reserved for list management and will incorrectly match and
>> > > > -     * assert if we attempt to pre-load the head of the chain with this ID.
>> > > > -     * Use ID 0xFFFF temporarily since it is also seems to be reserved in
>> > > > -     * part for identifying absence of capabilities in a root complex register
>> > > > -     * block.  If the ID still exists after adding capabilities, switch back to
>> > > > -     * zero.  We'll mark this entire first dword as emulated for this purpose.
>> > > > -     */
>> > > > -    pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
>> > > > -                 PCI_EXT_CAP(0xFFFF, 0, 0));
>> > > > -    pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -    pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
>> > > > -
>> > > >      for (next = PCI_CONFIG_SPACE_SIZE; next;
>> > > >           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>> > > >          header = pci_get_long(config + next);
>> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>> > > >          switch (cap_id) {
>> > > >          case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
>> > > >          case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
>> > > > +            /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
>> > > > +            ...
>> > > >              trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
>> > > >              break;
>> > > >          default:
>> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>> > > >
>> > > >      }
>> > > >
>> > > > -    /* Cleanup chain head ID if necessary */
>> > > > -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>> > > > -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -    }
>> > > > -
>> > > >      g_free(config);
>> > > >      return;
>> > > >  }
>> > > > ----->8-----
>> > > >
>> > > > Since, after all, we need the assumption that 0xffff is reserved for
>> > > > cap_id, we can then just remove the "first 0xffff then 0x0" hack,
>> > > > which is imho error-prone and hacky.
>> > >
>> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
>> > > valid capability ID for its own internal tracking.  It's only doing
>> > > this to find the end of the capability chain, which we could do in a
>> > > spec-compliant way by looking for a zero next pointer.  Fix that and
>> > > then vfio doesn't need to do this set to 0xffff then back to zero
>> > > nonsense at all.  Capability ID zero is valid.  Thanks,
>> >
>> > Yeah, I see Michael's fix for the capability list stuff. However, IMHO
>> > these are two different issues? Put another way, even with that patch,
>> > we would still need this hack (first 0x0, then 0xffff), right? It looks
>> > like that patch didn't solve the problem if the first PCIe ecap is
>> > masked at 0x100.
>>
>> I thought the problem was that QEMU in the host exposes a device with a
>> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
>> capability ID of 0 because that's how it finds the end of the chain.
>> Therefore if we make QEMU not use capability ID 0 for internal
>> purposes, things work.  vfio using 0xffff and swapping back to 0x0
>> becomes unnecessary, but doesn't hurt anything.  Thanks,
>>
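
For reference, the spec-compliant end-of-chain check Alex describes amounts
to following the ecap next pointers until they reach zero, instead of
treating cap ID 0 as a terminator. Below is a minimal sketch of that idea
only (not Michael's actual patch); the helper name pcie_find_ecap_tail() is
made up, and the macros are the same ones used in the diff above:

/* QEMU-internal headers; names taken from the tree, adjust as needed. */
#include "hw/pci/pci.h"        /* pci_get_long(), PCI_CONFIG_SPACE_SIZE */
#include "hw/pci/pcie_regs.h"  /* PCI_EXT_CAP_NEXT() */

/*
 * Return the offset of the last extended capability, or 0 if the device
 * has none.  Assumes a well-formed (non-looping) chain.  Capability ID 0
 * is a valid ID, so only a zero next pointer terminates the walk.
 */
static uint16_t pcie_find_ecap_tail(const uint8_t *config)
{
    uint16_t next = PCI_CONFIG_SPACE_SIZE;  /* 0x100: first ecap offset */
    uint16_t prev = 0;

    while (next) {
        uint32_t header = pci_get_long(config + next);

        /* All-zero header at 0x100 (ID, version, next) means no ecaps. */
        if (next == PCI_CONFIG_SPACE_SIZE && header == 0) {
            return 0;
        }

        prev = next;
        next = PCI_EXT_CAP_NEXT(header);    /* 0 ends the chain */
    }

    return prev;
}
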
>
> I've applied Peter's hack and Michael's patch below, but still can't use
> the assigned device in L2.
>  commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
>     Author: Michael S. Tsirkin <mst@redhat.com>
>     Date:   Wed Feb 15 22:37:45 2017 +0200
>
>     pci/pcie: don't assume cap id 0 is reserved
>
> I was able to boot L2 with the following QEMU warnings,
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
>
> but then I don't see the network device I was trying to assign inside L2.
> This is from the L2 dmesg, and it looks like the device is not initialized.
>
> [    5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    5.891563] mlx4_core: Initializing 0000:00:03.0
> [    5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23
> [    6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported
> command interface revision 0
> [    6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000)
> [    6.926490] mlx4_core 0000:00:03.0: This driver version supports only
> revisions 2 to 3
> [    6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting
> [    6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting.
>
> This is the full kernel log from L2.
> https://paste.ubuntu.com/24039462/
>
> L0, L1 and L2 are using the same kernel, so I think they are using the
> same device driver.
> This is the L0/L1 kernel log about the network device.
>
> --- From L0 ---
> [    8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    8.175543] mlx4_core: Initializing 0000:08:00.0
> [   14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device
> supports 8.0GT/s
> [   14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device
> supports x8
> [   14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
> 2014)
> [   14.722295] mlx4_en 0000:08:00.0: Activating port:2
> [   14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings
> [   14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings
> [   14.747826] mlx4_en: 0000:08:00.0: Port 2:   frag:0 - size:1522
> prefix:0 stride:1536
> [   14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port
> [   14.764036] mlx4_en 0000:08:00.0: registered PHC clock
>
> --- From L1 ---
> [    3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    3.791089] mlx4_core: Initializing 0000:00:03.0
> [    9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW
> capabilities
> [    9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb
> 2014)
> [    9.204503] mlx4_en 0000:00:03.0: Activating port:2
> [    9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings
> [    9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings
> [    9.214131] mlx4_en: 0000:00:03.0: Port 2:   frag:0 - size:1522
> prefix:0 stride:1536
> [    9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port
> [    9.216377] mlx4_en 0000:00:03.0: registered PHC clock
> [    9.261518] mlx4_en: eth1: Link Up
> [    9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1
>
> Any thoughts?
>

I've tried another network device on a different machine. It has an "Intel
Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
controller. I hit the same network device initialization failure in L2. I
think I'm missing something, since I heard from Bandan that he had no
problem assigning a device to L2 with ixgbe.

This is the error message from dmesg in L2.

[    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
version 4.2.1-k
[    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
[    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
[    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12

I checked that L2 indeed had that device.
root@guest0:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
Network Connection (rev 01)

I'm describing the steps I took, so if you notice something wrong, PLEASE
let me know.

1. [L0] Check the device with lspci. Result is [1]
2. [L0] Unbind from the original driver and bind to the vfio-pci driver,
following [2][3]
3. [L0] Start L1 with this script. [4]
4. [L1] L1 is able to use the network device.
5. [L1] Unbind from the original driver and bind to the vfio-pci driver,
the same as step 2.
6. [L1] Start L2 with this script. [5]
7. [L2] Got the init failure error message above.

[1] https://paste.ubuntu.com/24055745/
[2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
[3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
[4] https://paste.ubuntu.com/24055715/
[5] https://paste.ubuntu.com/24055720/

Thanks,
Jintack


>
>
>> Alex
>>
>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-02-23 23:04                           ` Jintack Lim
@ 2017-03-02 22:20                             ` Bandan Das
  2017-03-02 23:36                               ` Jintack Lim
  2017-03-03  3:43                               ` Peter Xu
  0 siblings, 2 replies; 19+ messages in thread
From: Bandan Das @ 2017-03-02 22:20 UTC (permalink / raw)
  To: Jintack Lim
  Cc: QEMU Devel Mailing List, Alex Williamson, Peter Xu,
	 Michael S. Tsirkin <mst@redhat.com>

Jintack Lim <jintack@cs.columbia.edu> writes:

> [cc Bandan]
>
> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
> wrote:
>
>>
>>
>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>> alex.williamson@redhat.com> wrote:
...
>>
>
> I've tried another network device on a different machine. It has "Intel
> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> controller. I got the same problem of getting the network device
> initialization failure in L2. I think I'm missing something since I heard
> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>
> This is the error message from dmesg in L2.
>
> [    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> version 4.2.1-k
> [    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> [    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> [    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
>
> I checked that L2 indeed had that device.
> root@guest0:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> Controller
> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> Network Connection (rev 01)

Jintack, any progress with this?

I am testing on an X540-AT2 and I see a different behavior. It appears
config succeeds, but the driver keeps resetting the device due to a Tx
hang:

[ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
resetting adapter
[ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
timeout
[ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter

This may be device-specific, but I think the actual behavior you see is
also dependent on the ixgbe driver in the guest. Are you on a recent
kernel? Also, can you point me to the hack (by Peter) that you have
mentioned above?

Thanks,
Bandan

> I'm describing steps I took, so if you notice something wrong, PLEASE let
> me know.
>
> 1. [L0] Check the device with lspci. Result is [1]
> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
> following [2][3]
> 3. [L0] Start L1 with this script. [4]
> 4. [L1] L1 is able to use the network device.
> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
> the step 2.
> 6. [L1] Start L2 with this script. [5]
> 7. [L2] Got the init failure error message above.
>
> [1] https://paste.ubuntu.com/24055745/
> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
> [4] https://paste.ubuntu.com/24055715/
> [5] https://paste.ubuntu.com/24055720/
>
> Thanks,
> Jintack
>
>
>>
>>
>>> Alex
>>>
>>>
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-03-02 22:20                             ` Bandan Das
@ 2017-03-02 23:36                               ` Jintack Lim
  2017-03-03  3:43                               ` Peter Xu
  1 sibling, 0 replies; 19+ messages in thread
From: Jintack Lim @ 2017-03-02 23:36 UTC (permalink / raw)
  To: Bandan Das
  Cc: QEMU Devel Mailing List, Alex Williamson, Peter Xu,
	 Michael S. Tsirkin <mst@redhat.com>

On Thu, Mar 2, 2017 at 5:20 PM, Bandan Das <bsd@redhat.com> wrote:
> Jintack Lim <jintack@cs.columbia.edu> writes:
>
>> [cc Bandan]
>>
>> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
>> wrote:
>>
>>>
>>>
>>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>>> alex.williamson@redhat.com> wrote:
> ...
>>>
>>
>> I've tried another network device on a different machine. It has "Intel
>> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
>> controller. I got the same problem of getting the network device
>> initialization failure in L2. I think I'm missing something since I heard
>> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>>
>> This is the error message from dmesg in L2.
>>
>> [    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
>> version 4.2.1-k
>> [    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
>> [    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
>> [    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
>>
>> I checked that L2 indeed had that device.
>> root@guest0:~# lspci
>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
>> Controller
>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
>> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
>> Network Connection (rev 01)
>
> Jintack, any progress with this ?

Not much, unfortunately.

>
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:

Thanks for your effort!

>
> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
>
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

I was using 4.6.0-rc on the machine with the Mellanox device, and
4.10.0-rc on the machine with the Intel device. L0, L1 and L2 had the
same kernel version.

This is the initial hack from Peter,
------8<-------
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)

     }

-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }
------>8-------

and I believe this is the commit that was merged into the QEMU repo.

commit d0d1cd70d10639273e2a23870e7e7d80b2bc4e21
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Wed Feb 22 13:19:58 2017 -0700

    vfio/pci: Improve extended capability comments, skip masked caps
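
As far as I understand it, the idea behind that commit follows Peter's
earlier suggestion: keep the 4-byte ecap header in place (so the chain and
the offsets of later capabilities stay valid) but overwrite the capability
ID with 0xffff so drivers skip it. A rough sketch of that idea only -- the
helper name vfio_mask_ecap() is made up, and this is not the code that was
actually merged:

/* Sketch, intended to live next to vfio_add_ext_cap() in hw/vfio/pci.c. */
static void vfio_mask_ecap(VFIOPCIDevice *vdev, uint16_t offset)
{
    PCIDevice *pdev = &vdev->pdev;
    uint32_t header = pci_get_long(pdev->config + offset);
    uint16_t next = PCI_EXT_CAP_NEXT(header);
    uint8_t version = (header >> 16) & 0xf;   /* ecap version field */

    /* 0xffff cannot match a real capability ID, so guest drivers skip it,
     * while the preserved next pointer keeps the chain walkable. */
    pci_set_long(pdev->config + offset, PCI_EXT_CAP(0xffff, version, next));

    /* Make the dword emulated and read-only, as in the hunk Peter dropped. */
    pci_set_long(pdev->wmask + offset, 0);
    pci_set_long(vdev->emulated_config_bits + offset, ~0);
}
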


Thanks,
Jintack

>
> Thanks,
> Bandan
>
>> I'm describing steps I took, so if you notice something wrong, PLEASE let
>> me know.
>>
>> 1. [L0] Check the device with lspci. Result is [1]
>> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
>> following [2][3]
>> 3. [L0] Start L1 with this script. [4]
>> 4. [L1] L1 is able to use the network device.
>> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
>> the step 2.
>> 6. [L1] Start L2 with this script. [5]
>> 7. [L2] Got the init failure error message above.
>>
>> [1] https://paste.ubuntu.com/24055745/
>> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
>> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
>> [4] https://paste.ubuntu.com/24055715/
>> [5] https://paste.ubuntu.com/24055720/
>>
>> Thanks,
>> Jintack
>>
>>
>>>
>>>
>>>> Alex
>>>>
>>>>
>>>
>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-03-02 22:20                             ` Bandan Das
  2017-03-02 23:36                               ` Jintack Lim
@ 2017-03-03  3:43                               ` Peter Xu
  2017-03-03  7:45                                 ` Bandan Das
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Xu @ 2017-03-03  3:43 UTC (permalink / raw)
  To: Bandan Das; +Cc: Jintack Lim, QEMU Devel Mailing List, Alex Williamson

On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
> Jintack Lim <jintack@cs.columbia.edu> writes:
> 
> > [cc Bandan]
> >
> > On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jintack@cs.columbia.edu>
> > wrote:
> >
> >>
> >>
> >> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> >> alex.williamson@redhat.com> wrote:
> ...
> >>
> >
> > I've tried another network device on a different machine. It has "Intel
> > Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> > controller. I got the same problem of getting the network device
> > initialization failure in L2. I think I'm missing something since I heard
> > from Bandan that he had no problem to assign a device to L2 with ixgbe.
> >
> > This is the error message from dmesg in L2.
> >
> > [    3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> > version 4.2.1-k
> > [    3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> > [    3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> > [    3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
> >
> > I checked that L2 indeed had that device.
> > root@guest0:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> > Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> > Network Connection (rev 01)
> 
> Jintack, any progress with this ?
> 
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:
> 
> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
> 
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

Hi, Bandan,

Are you using the vtd vfio v7 series or another branch?

If it's the upstream one... just a note that we need to make sure
"-device intel-iommu" is the first device specified on the QEMU command
line (it needs to come before "-device vfio-pci,..."). Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] iommu emulation
  2017-03-03  3:43                               ` Peter Xu
@ 2017-03-03  7:45                                 ` Bandan Das
  0 siblings, 0 replies; 19+ messages in thread
From: Bandan Das @ 2017-03-03  7:45 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jintack Lim, QEMU Devel Mailing List, Alex Williamson

Peter Xu <peterx@redhat.com> writes:

> On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
>> Jintack Lim <jintack@cs.columbia.edu> writes:
>> 
>> > [cc Bandan]
>> >
...
>> 
>> Jintack, any progress with this ?
>> 
>> I am testing on a X540-AT2 and I see a different behavior. It appears
>> config succeeds but the driver keeps resetting the device due to a Tx
>> hang:
>> 
>> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
>> resetting adapter
>> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
>> timeout
>> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
>> 
>> This may be device specific but I think the actual behavior you see is
>> also dependent on the ixgbe driver in the guest. Are you on a recent
>> kernel ? Also, can you point me to the hack (by Peter) that you have
>> mentioned above ?
>
> Hi, Bandan,
>
> Are you using the vtd vfio v7 series or another branch?

Thanks for the tip. Jintack pointed me to your repo and I am using v7
from there.

> If it's the upstream one... just a note that we need to make sure
> "-device intel-iommu" be the first device specified in QEMU command
> line parameters (need to before "-device vfio-pci,..."). Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-03-03  7:45 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHyh4xiVKjd+D=qaizUZ02O8xLYhpoVKOqC9cR0ZWWyLq9HtbQ@mail.gmail.com>
     [not found] ` <20170208031216.GA5151@pxdev.xzpeter.org>
     [not found]   ` <CAHyh4xg7NVPjXu3c+xGWNzQqwLgFqFJTPo4SgN-X+FNuHjGihQ@mail.gmail.com>
     [not found]     ` <CAHyh4xhOPmfLoU_fvtbBF1Wqbzji9q6rp_bRN38qfnwvhQq+9A@mail.gmail.com>
2017-02-09  3:52       ` [Qemu-devel] iommu emulation Peter Xu
2017-02-09 13:01         ` Jintack Lim
2017-02-14  7:35           ` Peter Xu
2017-02-14 12:50             ` Jintack Lim
2017-02-15  2:52               ` Peter Xu
2017-02-15  3:34                 ` Peter Xu
2017-02-15 18:15                   ` Alex Williamson
2017-02-16  2:28                     ` Peter Xu
2017-02-16  2:47                       ` Alex Williamson
2017-02-21 10:33                         ` Jintack Lim
2017-02-23 23:04                           ` Jintack Lim
2017-03-02 22:20                             ` Bandan Das
2017-03-02 23:36                               ` Jintack Lim
2017-03-03  3:43                               ` Peter Xu
2017-03-03  7:45                                 ` Bandan Das
2017-02-15 22:05                 ` Jintack Lim
2017-02-15 22:50                   ` Alex Williamson
2017-02-15 23:25                     ` Jintack Lim
2017-02-16  1:17                       ` Alex Williamson
