All of lore.kernel.org
 help / color / mirror / Atom feed
* Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
@ 2019-03-11  6:42 Leo Yan
  2019-03-11  6:57 ` Leo Yan
  2019-03-11  8:23 ` Auger Eric
  0 siblings, 2 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-11  6:42 UTC (permalink / raw)
  To: kvmarm, eric.auger; +Cc: Daniel Thompson

Hi all,

I am trying to enable PCI-e device pass-through with KVM.  Since the
Juno-r2 board has a PCI-e bus, my first attempt is to use vfio to pass
through the network card on the PCI-e bus.

According to the Juno-r2 board TRM [1], there is a CoreLink MMU-401
(SMMU) between the PCI-e devices and the CCI bus; IIUC, the PCI-e
device and the SMMU can be used by vfio for address isolation, so from
the hardware perspective this is sufficient to support pass-through
mode.

I followed Eric's blog [2] for 'VFIO-PCI driver binding', so I
executed the below commands on the Juno-r2 board:

  echo vfio-pci > /sys/bus/pci/devices/0000\:08\:00.0/driver_override
  echo 0000:08:00.0 > /sys/bus/pci/drivers/sky2/unbind
  echo 0000:08:00.0 > /sys/bus/pci/drivers_probe

But the last command, which probes the vfio driver, reports a failure
as below:

[   21.553889] sky2 0000:08:00.0 enp8s0: disabling interface
[   21.616720] vfio-pci: probe of 0000:08:00.0 failed with error -22

I looked into the code: although 'dev->bus->iommu_ops' points to the
data structure 'arm_smmu_ops', 'dev->iommu_group' is NULL, so the
probe function returns failure with the below flow:

  vfio_pci_probe()
    `-> vfio_iommu_group_get()
          `-> iommu_group_get()
                `-> return NULL;

Alternatively, if I enable the kconfig CONFIG_VFIO_NOIOMMU and set the
global variable 'noiommu' to true, the probe function still returns an
error; since iommu_present(dev->bus) finds 'arm_smmu_ops', the code
runs into the below logic:

vfio_iommu_group_get()
{
	group = iommu_group_get(dev);

#ifdef CONFIG_VFIO_NOIOMMU

	/*
	 * With noiommu enabled, an IOMMU group will be created for a device
	 * that doesn't already have one and doesn't have an iommu_ops on their
	 * bus.  We set iommudata simply to be able to identify these groups
	 * as special use and for reclamation later.
	 */
	if (group || !noiommu || iommu_present(dev->bus))
		return group;    ==> return 'group' and 'group' is NULL

	[...]
}

So neither using the SMMU nor the kernel config CONFIG_VFIO_NOIOMMU
can bind the vfio driver to the network card device on the Juno-r2
board.

P.s. I also checked sysfs and found the device doesn't have an
'iommu_group' node:

# ls /sys/bus/pci/devices/0000\:08\:00.0/iommu_group
ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/iommu_group': No
such file or directory
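
(Side note: whether the PCI-e SMMU driver has probed at all on the
host can be checked with something like the commands below; the exact
device names depend on the DT, so treat this as a sketch:)

  dmesg | grep -i smmu
  ls /sys/class/iommu/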

Could you give some suggestions so that I can proceed?  Any comment
is much appreciated.

Thanks,
Leo Yan

[1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0515f/DDI0515F_juno_arm_development_platform_soc_trm.pdf
[2] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11  6:42 Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2 Leo Yan
@ 2019-03-11  6:57 ` Leo Yan
  2019-03-11  8:23 ` Auger Eric
  1 sibling, 0 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-11  6:57 UTC (permalink / raw)
  To: kvmarm, eric.auger; +Cc: Daniel Thompson

On Mon, Mar 11, 2019 at 02:42:48PM +0800, Leo Yan wrote:
> Hi all,
> 
> I am trying to enable PCI-e device pass-through mode with KVM, since
> Juno-r2 board has PCI-e bus so I firstly try to use vfio to
> passthrough the network card on PCI-e bus.

Sorry for spamming; I just want to add the Linux kernel version info.

I am using Linux mainline kernel 5.0-rc7 with the latest commit:

commit a215ce8f0e00c2d707080236f1aafec337371043 (origin/master, origin/HEAD)
Merge: 2d28e01dca1a cffaaf0c8162
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Mar 1 09:13:04 2019 -0800

    Merge tag 'iommu-fix-v5.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

    Pull IOMMU fix from Joerg Roedel:
     "One important fix for a memory corruption issue in the Intel VT-d
      driver that triggers on hardware with deep PCI hierarchies"

    * tag 'iommu-fix-v5.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
      iommu/dmar: Fix buffer overflow during PCI bus notification

[...]

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11  6:42 Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2 Leo Yan
  2019-03-11  6:57 ` Leo Yan
@ 2019-03-11  8:23 ` Auger Eric
  2019-03-11  9:39   ` Leo Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Auger Eric @ 2019-03-11  8:23 UTC (permalink / raw)
  To: Leo Yan, kvmarm; +Cc: Daniel Thompson

Hi Leo,

On 3/11/19 7:42 AM, Leo Yan wrote:
> Hi all,
> 
> I am trying to enable PCI-e device pass-through mode with KVM, since
> Juno-r2 board has PCI-e bus so I firstly try to use vfio to
> passthrough the network card on PCI-e bus.
> 
> According to Juno-r2 board TRM [1], there has a CoreLink MMU-401 (SMMU)
> between PCI-e devices and CCI bus; IIUC, PCI-e device and the SMMU can
> be used for vfio for address isolation and from hardware pespective it
> is sufficient for support pass-through mode.
> 
> I followed Eric's blog [2] for 'VFIO-PCI driver binding', so I
> executed blow commands on Juno-r2 board:
> 
>   echo vfio-pci > /sys/bus/pci/devices/0000\:08\:00.0/driver_override
>   echo 0000:08:00.0 > /sys/bus/pci/drivers/sky2/unbind
>   echo 0000:08:00.0 > /sys/bus/pci/drivers_probe
> 
> But at the last command for vifo probing, it reports failure as below:
> 
> [   21.553889] sky2 0000:08:00.0 enp8s0: disabling interface
> [   21.616720] vfio-pci: probe of 0000:08:00.0 failed with error -22
> 
> I looked into for the code, though 'dev->bus->iommu_ops' points to the
> data structure 'arm_smmu_ops', but 'dev->iommu_group' is NULL thus the
> probe function returns failure with below flow:
> 
>   vfio_pci_probe()
>     `-> vfio_iommu_group_get()
>           `-> iommu_group_get()
>                 `-> return NULL;
> 
> Alternatively, if enable the kconfig CONFIG_VFIO_NOIOMMU & set global
> variable 'noiommu' = true, the probe function still returns error; since
> the function iommu_present(dev->bus) return back 'arm_smmu_ops' so you
> could see the code will run into below logic:
> 
> vfio_iommu_group_get()
> {
> 	group = iommu_group_get(dev);
> 
> #ifdef CONFIG_VFIO_NOIOMMU
> 
> 	/*
> 	 * With noiommu enabled, an IOMMU group will be created for a device
> 	 * that doesn't already have one and doesn't have an iommu_ops on their
> 	 * bus.  We set iommudata simply to be able to identify these groups
> 	 * as special use and for reclamation later.
> 	 */
> 	if (group || !noiommu || iommu_present(dev->bus))
> 		return group;    ==> return 'group' and 'group' is NULL
> 
> 	[...]
> }
> 
> So either using SMMU or with kernel config CONFIG_VFIO_NOIOMMU, both cannot
> bind vifo driver for network card device on Juno-r2 board.
> 
> P.s. I also checked the sysfs node and found it doesn't contain node
> 'iommu_group':
> 
> # ls /sys/bus/pci/devices/0000\:08\:00.0/iommu_group
> ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/iommu_group': No
> such file or directory

please can you give the output of the following command:
find /sys/kernel/iommu_groups/

when booting your host without noiommu=true

At first sight I would say you have trouble with your iommu groups.

Thanks

Eric
> 
> Could you give some suggestions for this so that I can proceed?  Very
> appreciate for any comment.
> 
> Thanks,
> Leo Yan
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0515f/DDI0515F_juno_arm_development_platform_soc_trm.pdf
> [2] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11  8:23 ` Auger Eric
@ 2019-03-11  9:39   ` Leo Yan
  2019-03-11  9:47     ` Auger Eric
  0 siblings, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-11  9:39 UTC (permalink / raw)
  To: Auger Eric; +Cc: Daniel Thompson, kvmarm

Hi Auger,

On Mon, Mar 11, 2019 at 09:23:20AM +0100, Auger Eric wrote:

[...]

> > P.s. I also checked the sysfs node and found it doesn't contain node
> > 'iommu_group':
> > 
> > # ls /sys/bus/pci/devices/0000\:08\:00.0/iommu_group
> > ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/iommu_group': No
> > such file or directory
> 
> please can you give the output of the following command:
> find /sys/kernel/iommu_groups/

I get the below result on the Juno board:

root@debian:~# find /sys/kernel/iommu_groups/
/sys/kernel/iommu_groups/
/sys/kernel/iommu_groups/1
/sys/kernel/iommu_groups/1/devices
/sys/kernel/iommu_groups/1/devices/20070000.etr
/sys/kernel/iommu_groups/1/type
/sys/kernel/iommu_groups/1/reserved_regions
/sys/kernel/iommu_groups/0
/sys/kernel/iommu_groups/0/devices
/sys/kernel/iommu_groups/0/devices/7ffb0000.ohci
/sys/kernel/iommu_groups/0/devices/7ffc0000.ehci
/sys/kernel/iommu_groups/0/type
/sys/kernel/iommu_groups/0/reserved_regions

So the 'iommu_group' is not created for the PCI-e devices, right?

Will debug into the DT binding and related code and keep this thread
posted.

> when booting your host without noiommu=true
> 
> At first sight I would say you have trouble with your iommu groups.

Thanks a lot for guidance.
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11  9:39   ` Leo Yan
@ 2019-03-11  9:47     ` Auger Eric
  2019-03-11 14:35       ` Leo Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Auger Eric @ 2019-03-11  9:47 UTC (permalink / raw)
  To: Leo Yan; +Cc: Daniel Thompson, kvmarm

Hi Leo,

On 3/11/19 10:39 AM, Leo Yan wrote:
> Hi Auger,
> 
> On Mon, Mar 11, 2019 at 09:23:20AM +0100, Auger Eric wrote:
> 
> [...]
> 
>>> P.s. I also checked the sysfs node and found it doesn't contain node
>>> 'iommu_group':
>>>
>>> # ls /sys/bus/pci/devices/0000\:08\:00.0/iommu_group
>>> ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/iommu_group': No
>>> such file or directory
>>
>> please can you give the output of the following command:
>> find /sys/kernel/iommu_groups/
> 
> I get below result on Juno board:
> 
> root@debian:~# find /sys/kernel/iommu_groups/
> /sys/kernel/iommu_groups/
> /sys/kernel/iommu_groups/1
> /sys/kernel/iommu_groups/1/devices
> /sys/kernel/iommu_groups/1/devices/20070000.etr
> /sys/kernel/iommu_groups/1/type
> /sys/kernel/iommu_groups/1/reserved_regions
> /sys/kernel/iommu_groups/0
> /sys/kernel/iommu_groups/0/devices
> /sys/kernel/iommu_groups/0/devices/7ffb0000.ohci
> /sys/kernel/iommu_groups/0/devices/7ffc0000.ehci
> /sys/kernel/iommu_groups/0/type
> /sys/kernel/iommu_groups/0/reserved_regions
> 
> So the 'iommu_groups' is not created for pci-e devices, right?

Yes that's correct.
> 
> Will debug into the DT binding and related code and keep this thread posted.
OK

Thanks

Eric
> 
>> when booting your host without noiommu=true
>>
>> At first sight I would say you have trouble with your iommu groups.
> 
> Thanks a lot for guidance.
> Leo Yan
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11  9:47     ` Auger Eric
@ 2019-03-11 14:35       ` Leo Yan
  2019-03-13  8:00         ` Leo Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-11 14:35 UTC (permalink / raw)
  To: Auger Eric, Mark Rutland; +Cc: Daniel Thompson, kvmarm

Hi Eric,

[ + Mark Rutland ]

On Mon, Mar 11, 2019 at 10:47:22AM +0100, Auger Eric wrote:

[...]

> >>> P.s. I also checked the sysfs node and found it doesn't contain node
> >>> 'iommu_group':
> >>>
> >>> # ls /sys/bus/pci/devices/0000\:08\:00.0/iommu_group
> >>> ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/iommu_group': No
> >>> such file or directory
> >>
> >> please can you give the output of the following command:
> >> find /sys/kernel/iommu_groups/
> > 
> > I get below result on Juno board:
> > 
> > root@debian:~# find /sys/kernel/iommu_groups/
> > /sys/kernel/iommu_groups/
> > /sys/kernel/iommu_groups/1
> > /sys/kernel/iommu_groups/1/devices
> > /sys/kernel/iommu_groups/1/devices/20070000.etr
> > /sys/kernel/iommu_groups/1/type
> > /sys/kernel/iommu_groups/1/reserved_regions
> > /sys/kernel/iommu_groups/0
> > /sys/kernel/iommu_groups/0/devices
> > /sys/kernel/iommu_groups/0/devices/7ffb0000.ohci
> > /sys/kernel/iommu_groups/0/devices/7ffc0000.ehci
> > /sys/kernel/iommu_groups/0/type
> > /sys/kernel/iommu_groups/0/reserved_regions
> > 
> > So the 'iommu_groups' is not created for pci-e devices, right?
> 
> Yes that's correct.

The missing 'iommu_group' for Juno's PCI-e devices is caused by the DT
binding: in the dts file [1], the 'smmu_pcie' node is set to
'status = "disabled"', so the SMMU is never enabled for the PCI-e
devices.

@Mark Rutland, could you confirm whether this should be fixed in
juno-base.dtsi or in juno-r1.dts/juno-r2.dts?
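
For reference, after enabling that node in the DT (as a local
workaround) and rebooting, the PCI-e devices get IOMMU groups again; a
minimal sanity check looks like the below (paths are illustrative):

  find /sys/kernel/iommu_groups/ -maxdepth 2
  readlink /sys/bus/pci/devices/0000:08:00.0/iommu_group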

> > Will debug into the DT binding and related code and keep this thread posted.
> OK

So now I have made some progress and can see the network card is
passed through to the guest OS, though the card reports errors.
Below are the detailed steps and info:

- Bind devices in the same IOMMU group to vfio driver:

  echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
  echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id

  echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
  echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id

- Enable 'allow_unsafe_interrupts=1' for the vfio_iommu_type1 module
  (see the note after the guest log below);

- Use qemu to launch guest OS:

  qemu-system-aarch64 \
        -cpu host -M virt,accel=kvm -m 4096 -nographic \
        -kernel /root/virt/Image -append root=/dev/vda2 \
        -net none -device vfio-pci,host=08:00.0 \
        -drive if=virtio,file=/root/virt/qemu/debian.img \
        -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk ip=dhcp'

- Host log:

[  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)

- Below is the guest log; it shows the driver has been registered, but
  it reports a PCI hardware error and a timeout waiting for the
  interrupt.

  Is this caused by very 'slow' forwarded interrupt handling?  The Juno
  board uses GICv2 (I think it has the GICv2m extension).

[...]

[    1.024483] sky2 0000:00:01.0 eth0: enabling interface
[    1.026822] sky2 0000:00:01.0: error interrupt status=0x80000000
[    1.029155] sky2 0000:00:01.0: PCI hardware error (0x1010)
[    4.000699] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
[    4.026116] Sending DHCP requests .
[    4.026201] sky2 0000:00:01.0: error interrupt status=0x80000000
[    4.030043] sky2 0000:00:01.0: PCI hardware error (0x1010)
[    6.546111] ..
[   14.118106] ------------[ cut here ]------------
[   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
[   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2b4/0x2c0
[   14.127082] Modules linked in:
[   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc8-00061-ga98f9a047756-dirty #
[   14.132800] Hardware name: linux,dummy-virt (DT)
[   14.135082] pstate: 60000005 (nZCv daif -PAN -UAO)
[   14.137459] pc : dev_watchdog+0x2b4/0x2c0
[   14.139457] lr : dev_watchdog+0x2b4/0x2c0
[   14.141351] sp : ffff000010003d70
[   14.142924] x29: ffff000010003d70 x28: ffff0000112f60c0
[   14.145433] x27: 0000000000000140 x26: ffff8000fa6eb3b8
[   14.147936] x25: 00000000ffffffff x24: ffff8000fa7a7c80
[   14.150428] x23: ffff8000fa6eb39c x22: ffff8000fa6eafb8
[   14.152934] x21: ffff8000fa6eb000 x20: ffff0000112f7000
[   14.155437] x19: 0000000000000000 x18: ffffffffffffffff
[   14.157929] x17: 0000000000000000 x16: 0000000000000000
[   14.160432] x15: ffff0000112fd6c8 x14: ffff000090003a97
[   14.162927] x13: ffff000010003aa5 x12: ffff000011315878
[   14.165428] x11: ffff000011315000 x10: 0000000005f5e0ff
[   14.167935] x9 : 00000000ffffffd0 x8 : 64656d6974203020
[   14.170430] x7 : 6575657571207469 x6 : 00000000000000e3
[   14.172935] x5 : 0000000000000000 x4 : 0000000000000000
[   14.175443] x3 : 00000000ffffffff x2 : ffff0000113158a8
[   14.177938] x1 : f2db9128b1f08600 x0 : 0000000000000000
[   14.180443] Call trace:
[   14.181625]  dev_watchdog+0x2b4/0x2c0
[   14.183377]  call_timer_fn+0x20/0x78
[   14.185078]  expire_timers+0xa4/0xb0
[   14.186777]  run_timer_softirq+0xa0/0x190
[   14.188687]  __do_softirq+0x108/0x234
[   14.190428]  irq_exit+0xcc/0xd8
[   14.191941]  __handle_domain_irq+0x60/0xb8
[   14.193877]  gic_handle_irq+0x58/0xb0
[   14.195630]  el1_irq+0xb0/0x128
[   14.197132]  arch_cpu_idle+0x10/0x18
[   14.198835]  do_idle+0x1cc/0x288
[   14.200389]  cpu_startup_entry+0x24/0x28
[   14.202251]  rest_init+0xd4/0xe0
[   14.203804]  arch_call_rest_init+0xc/0x14
[   14.205702]  start_kernel+0x3d8/0x404
[   14.207449] ---[ end trace 65449acd5c054609 ]---
[   14.209630] sky2 0000:00:01.0 eth0: tx timeout
[   14.211655] sky2 0000:00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
[   17.906956] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
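
Note on the 'allow_unsafe_interrupts' step above: this is a module
parameter of vfio_iommu_type1; a minimal sketch for setting it (either
form should work, assuming the parameter is exposed as writable):

  modprobe vfio_iommu_type1 allow_unsafe_interrupts=1

  # or, if the module is already loaded:
  echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts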

Thanks,
Leo Yan

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/arm/juno-base.dtsi?h=v5.0#n46

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-11 14:35       ` Leo Yan
@ 2019-03-13  8:00         ` Leo Yan
  2019-03-13 10:01           ` Leo Yan
  2019-03-13 10:01           ` Auger Eric
  0 siblings, 2 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-13  8:00 UTC (permalink / raw)
  To: Auger Eric, Mark Rutland; +Cc: Daniel Thompson, kvmarm

Hi Eric & all,

On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:

[...]

> So now I made some progress and can see the networking card is
> pass-through to guest OS, though the networking card reports errors
> now.  Below is detailed steps and info:
> 
> - Bind devices in the same IOMMU group to vfio driver:
> 
>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
> 
>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
> 
> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
> 
> - Use qemu to launch guest OS:
> 
>   qemu-system-aarch64 \
>         -cpu host -M virt,accel=kvm -m 4096 -nographic \
>         -kernel /root/virt/Image -append root=/dev/vda2 \
>         -net none -device vfio-pci,host=08:00.0 \
>         -drive if=virtio,file=/root/virt/qemu/debian.img \
>         -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk ip=dhcp'
> 
> - Host log:
> 
> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
> 
> - Below is guest log, from log though the driver has been registered but
>   it reports PCI hardware failure and the timeout for the interrupt.
> 
>   So is this caused by very 'slow' forward interrupt handling?  Juno
>   board uses GICv2 (I think it has GICv2m extension).
> 
> [...]
> 
> [    1.024483] sky2 0000:00:01.0 eth0: enabling interface
> [    1.026822] sky2 0000:00:01.0: error interrupt status=0x80000000
> [    1.029155] sky2 0000:00:01.0: PCI hardware error (0x1010)
> [    4.000699] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
> [    4.026116] Sending DHCP requests .
> [    4.026201] sky2 0000:00:01.0: error interrupt status=0x80000000
> [    4.030043] sky2 0000:00:01.0: PCI hardware error (0x1010)
> [    6.546111] ..
> [   14.118106] ------------[ cut here ]------------
> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2b4/0x2c0
> [   14.127082] Modules linked in:
> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc8-00061-ga98f9a047756-dirty #
> [   14.132800] Hardware name: linux,dummy-virt (DT)
> [   14.135082] pstate: 60000005 (nZCv daif -PAN -UAO)
> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
> [   14.141351] sp : ffff000010003d70
> [   14.142924] x29: ffff000010003d70 x28: ffff0000112f60c0
> [   14.145433] x27: 0000000000000140 x26: ffff8000fa6eb3b8
> [   14.147936] x25: 00000000ffffffff x24: ffff8000fa7a7c80
> [   14.150428] x23: ffff8000fa6eb39c x22: ffff8000fa6eafb8
> [   14.152934] x21: ffff8000fa6eb000 x20: ffff0000112f7000
> [   14.155437] x19: 0000000000000000 x18: ffffffffffffffff
> [   14.157929] x17: 0000000000000000 x16: 0000000000000000
> [   14.160432] x15: ffff0000112fd6c8 x14: ffff000090003a97
> [   14.162927] x13: ffff000010003aa5 x12: ffff000011315878
> [   14.165428] x11: ffff000011315000 x10: 0000000005f5e0ff
> [   14.167935] x9 : 00000000ffffffd0 x8 : 64656d6974203020
> [   14.170430] x7 : 6575657571207469 x6 : 00000000000000e3
> [   14.172935] x5 : 0000000000000000 x4 : 0000000000000000
> [   14.175443] x3 : 00000000ffffffff x2 : ffff0000113158a8
> [   14.177938] x1 : f2db9128b1f08600 x0 : 0000000000000000
> [   14.180443] Call trace:
> [   14.181625]  dev_watchdog+0x2b4/0x2c0
> [   14.183377]  call_timer_fn+0x20/0x78
> [   14.185078]  expire_timers+0xa4/0xb0
> [   14.186777]  run_timer_softirq+0xa0/0x190
> [   14.188687]  __do_softirq+0x108/0x234
> [   14.190428]  irq_exit+0xcc/0xd8
> [   14.191941]  __handle_domain_irq+0x60/0xb8
> [   14.193877]  gic_handle_irq+0x58/0xb0
> [   14.195630]  el1_irq+0xb0/0x128
> [   14.197132]  arch_cpu_idle+0x10/0x18
> [   14.198835]  do_idle+0x1cc/0x288
> [   14.200389]  cpu_startup_entry+0x24/0x28
> [   14.202251]  rest_init+0xd4/0xe0
> [   14.203804]  arch_call_rest_init+0xc/0x14
> [   14.205702]  start_kernel+0x3d8/0x404
> [   14.207449] ---[ end trace 65449acd5c054609 ]---
> [   14.209630] sky2 0000:00:01.0 eth0: tx timeout
> [   14.211655] sky2 0000:00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
> [   17.906956] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both

I am stuck: the network card cannot receive interrupts in the guest
OS.  I took time to look into the code and added some debug prints to
help me understand the detailed flow; below are two main questions I
am confused about and need some guidance on:

- The first question is about MSI usage in the network card driver;
  reviewing the sky2 network card driver [1], it has a function
  sky2_test_msi() which is used to decide whether MSI can be used.

  The interesting thing is that this function first requests an irq,
  triggers a test interrupt, and relies on the interrupt handler
  reading back a register before it decides whether MSI is available.

  This works well for the host OS, but when we pass the device through
  to the guest OS, KVM doesn't prepare the interrupt for the sky2
  driver at this point (no injection or forwarding), so the interrupt
  handler is never invoked.  In the end the driver does not set the
  flag 'hw->flags |= SKY2_HW_USE_MSI', which results in the guest OS
  not using MSI and falling back to INTx mode.

  My first impression is that if we pass the device through to the
  guest OS in KVM, the PCI-e device can directly use MSI; I tweaked
  the code a bit to check the status value after the timeout, so both
  the host OS and the guest OS set the MSI flag.

  I want to confirm: is the recommended mode for a passed-through
  PCI-e device to use MSI in both the host OS and the guest OS?  Or is
  it fine for the host OS to use MSI and the guest OS to use INTx
  mode?

- The second question is about GICv2m.  If I understand correctly,
  when passing a PCI-e device through to the guest OS, we should
  create the below data path for the PCI-e device:
                                                            +--------+
                                                         -> | Memory |
    +-----------+    +------------------+    +-------+  /   +--------+
    | Net card  | -> | PCI-e controller | -> | IOMMU | -
    +-----------+    +------------------+    +-------+  \   +--------+
                                                         -> | MSI    |
                                                            | frame  |
                                                            +--------+

  Since the master is now the network card/PCI-e controller and not
  the CPU, there are no two stages of memory access (VA->IPA->PA).
  In this case, do we configure the IOMMU (SMMU) for the guest OS's
  address translation before switching from host to guest?  Or does
  the SMMU also have two-stage memory mapping?

  Another thing that confuses me: I can see the MSI frame is mapped
  to the GIC's physical address in the host OS, so the PCI-e device
  can send messages correctly to the MSI frame.  But for the guest OS,
  the MSI frame is mapped to an IPA memory region, and this region is
  used to emulate the GICv2 MSI frame rather than the hardware MSI
  frame; so will any access from the PCI-e device to this region trap
  to the hypervisor on the CPU side, so that the KVM hypervisor can
  help emulate (and inject) the interrupt for the guest OS?

  Essentially, I want to check the expected behaviour of the GICv2
  MSI frame when we pass a PCI-e device through to the guest OS and
  the PCI-e device has one static MSI frame assigned to it.

I will continue to look into the code and post updates here.  Thanks
a lot for any comments and suggestions!
Leo Yan

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/marvell/sky2.c#n4859

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13  8:00         ` Leo Yan
@ 2019-03-13 10:01           ` Leo Yan
  2019-03-13 10:16             ` Auger Eric
  2019-03-13 10:01           ` Auger Eric
  1 sibling, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-13 10:01 UTC (permalink / raw)
  To: Auger Eric, Mark Rutland; +Cc: Daniel Thompson, kvmarm

On Wed, Mar 13, 2019 at 04:00:48PM +0800, Leo Yan wrote:

[...]

> - The second question is for GICv2m.  If I understand correctly, when
>   passthrough PCI-e device to guest OS, in the guest OS we should
>   create below data path for PCI-e devices:
>                                                             +--------+
>                                                          -> | Memory |
>     +-----------+    +------------------+    +-------+  /   +--------+
>     | Net card  | -> | PCI-e controller | -> | IOMMU | -
>     +-----------+    +------------------+    +-------+  \   +--------+
>                                                          -> | MSI    |
>                                                             | frame  |
>                                                             +--------+
> 
>   Since now the master is network card/PCI-e controller but not CPU,
>   thus there have no 2 stages for memory accessing (VA->IPA->PA).  In
>   this case, if we configure IOMMU (SMMU) for guest OS for address
>   translation before switch from host to guest, right?  Or SMMU also
>   have two stages memory mapping?
> 
>   Another thing confuses me is I can see the MSI frame is mapped to
>   GIC's physical address in host OS, thus the PCI-e device can send
>   message correctly to msi frame.  But for guest OS, the MSI frame is
>   mapped to one IPA memory region, and this region is use to emulate
>   GICv2 msi frame rather than the hardware msi frame; thus will any
>   access from PCI-e to this region will trap to hypervisor in CPU
>   side so KVM hyperviso can help emulate (and inject) the interrupt
>   for guest OS?
> 
>   Essentially, I want to check what's the expected behaviour for GICv2
>   msi frame working mode when we want to passthrough one PCI-e device
>   to guest OS and the PCI-e device has one static msi frame for it.

From the blog [1], below is the explanation for my question about
mapping the IOVA to the hardware MSI address.  But I searched for the
flag VFIO_DMA_FLAG_MSI_RESERVED_IOVA and it isn't found in the
mainline kernel; I might be missing something here, so I want to check
whether the related patches have been merged into the mainline kernel.

'We reuse the VFIO DMA MAP ioctl to pass this reserved IOVA region. A
new flag (VFIO_DMA_FLAG_MSI_RESERVED_IOVA ) is introduced to
differentiate such reserved IOVA from RAM IOVA. Then the base/size of
the window is passed to the IOMMU driver though a new function
introduced in the IOMMU API. 

The IOVA allocation within the supplied reserved IOVA window is
performed on-demand, when the MSI controller composes/writes the MSI
message in the PCIe device. Also the IOMMU mapping between the newly
allocated IOVA and the backdoor address page is done at that time. The
MSI controller uses a new function introduced in the IOMMU API to
allocate the IOVA and create an IOMMU mapping.
 
So there are adaptations needed at VFIO, IOMMU and MSI controller
level. The extension of the IOMMU API still is under discussion. Also
changes at MSI controller level need to be consolidated.'

P.s. I also tried both tools, qemu and kvmtool; neither can deliver
interrupts for the network card in the guest OS.

Thanks,
Leo Yan

[1] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13  8:00         ` Leo Yan
  2019-03-13 10:01           ` Leo Yan
@ 2019-03-13 10:01           ` Auger Eric
  2019-03-13 10:24             ` Auger Eric
  2019-03-13 11:35             ` Leo Yan
  1 sibling, 2 replies; 20+ messages in thread
From: Auger Eric @ 2019-03-13 10:01 UTC (permalink / raw)
  To: Leo Yan, Mark Rutland; +Cc: Daniel Thompson, Robin Murphy, kvmarm

Hi Leo,

+ Robin

On 3/13/19 9:00 AM, Leo Yan wrote:
> Hi Eric & all,
> 
> On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:
> 
> [...]
> 
>> So now I made some progress and can see the networking card is
>> pass-through to guest OS, though the networking card reports errors
>> now.  Below is detailed steps and info:
>>
>> - Bind devices in the same IOMMU group to vfio driver:
>>
>>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
>>
>> - Use qemu to launch guest OS:
>>
>>   qemu-system-aarch64 \
>>         -cpu host -M virt,accel=kvm -m 4096 -nographic \
>>         -kernel /root/virt/Image -append root=/dev/vda2 \
>>         -net none -device vfio-pci,host=08:00.0 \
>>         -drive if=virtio,file=/root/virt/qemu/debian.img \
>>         -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk ip=dhcp'
>>
>> - Host log:
>>
>> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
>>
>> - Below is guest log, from log though the driver has been registered but
>>   it reports PCI hardware failure and the timeout for the interrupt.
>>
>>   So is this caused by very 'slow' forward interrupt handling?  Juno
>>   board uses GICv2 (I think it has GICv2m extension).
>>
>> [...]
>>
>> [    1.024483] sky2 0000:00:01.0 eth0: enabling interface
>> [    1.026822] sky2 0000:00:01.0: error interrupt status=0x80000000
>> [    1.029155] sky2 0000:00:01.0: PCI hardware error (0x1010)
>> [    4.000699] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
>> [    4.026116] Sending DHCP requests .
>> [    4.026201] sky2 0000:00:01.0: error interrupt status=0x80000000
>> [    4.030043] sky2 0000:00:01.0: PCI hardware error (0x1010)
>> [    6.546111] ..
>> [   14.118106] ------------[ cut here ]------------
>> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
>> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2b4/0x2c0
>> [   14.127082] Modules linked in:
>> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc8-00061-ga98f9a047756-dirty #
>> [   14.132800] Hardware name: linux,dummy-virt (DT)
>> [   14.135082] pstate: 60000005 (nZCv daif -PAN -UAO)
>> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
>> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
>> [   14.141351] sp : ffff000010003d70
>> [   14.142924] x29: ffff000010003d70 x28: ffff0000112f60c0
>> [   14.145433] x27: 0000000000000140 x26: ffff8000fa6eb3b8
>> [   14.147936] x25: 00000000ffffffff x24: ffff8000fa7a7c80
>> [   14.150428] x23: ffff8000fa6eb39c x22: ffff8000fa6eafb8
>> [   14.152934] x21: ffff8000fa6eb000 x20: ffff0000112f7000
>> [   14.155437] x19: 0000000000000000 x18: ffffffffffffffff
>> [   14.157929] x17: 0000000000000000 x16: 0000000000000000
>> [   14.160432] x15: ffff0000112fd6c8 x14: ffff000090003a97
>> [   14.162927] x13: ffff000010003aa5 x12: ffff000011315878
>> [   14.165428] x11: ffff000011315000 x10: 0000000005f5e0ff
>> [   14.167935] x9 : 00000000ffffffd0 x8 : 64656d6974203020
>> [   14.170430] x7 : 6575657571207469 x6 : 00000000000000e3
>> [   14.172935] x5 : 0000000000000000 x4 : 0000000000000000
>> [   14.175443] x3 : 00000000ffffffff x2 : ffff0000113158a8
>> [   14.177938] x1 : f2db9128b1f08600 x0 : 0000000000000000
>> [   14.180443] Call trace:
>> [   14.181625]  dev_watchdog+0x2b4/0x2c0
>> [   14.183377]  call_timer_fn+0x20/0x78
>> [   14.185078]  expire_timers+0xa4/0xb0
>> [   14.186777]  run_timer_softirq+0xa0/0x190
>> [   14.188687]  __do_softirq+0x108/0x234
>> [   14.190428]  irq_exit+0xcc/0xd8
>> [   14.191941]  __handle_domain_irq+0x60/0xb8
>> [   14.193877]  gic_handle_irq+0x58/0xb0
>> [   14.195630]  el1_irq+0xb0/0x128
>> [   14.197132]  arch_cpu_idle+0x10/0x18
>> [   14.198835]  do_idle+0x1cc/0x288
>> [   14.200389]  cpu_startup_entry+0x24/0x28
>> [   14.202251]  rest_init+0xd4/0xe0
>> [   14.203804]  arch_call_rest_init+0xc/0x14
>> [   14.205702]  start_kernel+0x3d8/0x404
>> [   14.207449] ---[ end trace 65449acd5c054609 ]---
>> [   14.209630] sky2 0000:00:01.0 eth0: tx timeout
>> [   14.211655] sky2 0000:00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
>> [   17.906956] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
> 
> I am stucking at the network card cannot receive interrupts in guest
> OS.  So took time to look into the code and added some printed info to
> help me to understand the detailed flow, below are two main questions
> I am confused with them and need some guidance:
> 
> - The first question is about the msi usage in network card driver;
>   when review the sky2 network card driver [1], it has function
>   sky2_test_msi() which is used to decide if can use msi or not.
> 
>   The interesting thing is this function will firstly request irq for
>   the interrupt and based on the interrupt handler to read back
>   register and then can make decision if msi is avalible or not.
> 
>   This can work well for host OS, but if we want to passthrough this
>   device to guest OS, since the KVM doesn't prepare the interrupt for
>   sky2 drivers (no injection or forwarding) thus at this point the
>   interrupt handle will not be invorked.  At the end the driver will
>   not set flag 'hw->flags |= SKY2_HW_USE_MSI' and this results to not
>   use msi in guest OS and rollback to INTx mode.
> 
>   My first impression is if we passthrough the devices to guest OS in
>   KVM, the PCI-e device can directly use msi;  I tweaked a bit for the
>   code to check status value after timeout, so both host OS and guest
>   OS can set the flag for msi.
> 
>   I want to confirm, if this is the recommended mode for
>   passthrough PCI-e device to use msi both in host OS and geust OS?
>   Or it's will be fine for host OS using msi and guest OS using
>   INTx mode?

If the NIC supports MSIs they logically are used. This can easily be
checked on the host by issuing "cat /proc/interrupts | grep vfio". Can
you check whether the guest received any interrupt? I remember that
Robin said in the past that on Juno, the MSI doorbell was in the PCI
host bridge window and transactions towards the doorbell possibly
could not reach it since they were considered as peer-to-peer. Using
GICv2M should not bring any performance issues. I tested that in the
past with a Seattle board.
> 
> - The second question is for GICv2m.  If I understand correctly, when
>   passthrough PCI-e device to guest OS, in the guest OS we should
>   create below data path for PCI-e devices:
>                                                             +--------+
>                                                          -> | Memory |
>     +-----------+    +------------------+    +-------+  /   +--------+
>     | Net card  | -> | PCI-e controller | -> | IOMMU | -
>     +-----------+    +------------------+    +-------+  \   +--------+
>                                                          -> | MSI    |
>                                                             | frame  |
>                                                             +--------+
> 
>   Since now the master is network card/PCI-e controller but not CPU,
>   thus there have no 2 stages for memory accessing (VA->IPA->PA).  In
>   this case, if we configure IOMMU (SMMU) for guest OS for address
>   translation before switch from host to guest, right?  Or SMMU also
>   have two stages memory mapping?

In your use case you don't have any virtual IOMMU. So the guest
programs the assigned device with guest physical addresses, and the
virtualizer uses the physical IOMMU to translate these GPAs into the
host physical addresses backing the guest RAM and the MSI frame. A
single stage of the physical IOMMU is used (stage 1).
> 
>   Another thing confuses me is I can see the MSI frame is mapped to
>   GIC's physical address in host OS, thus the PCI-e device can send
>   message correctly to msi frame.  But for guest OS, the MSI frame is
>   mapped to one IPA memory region, and this region is use to emulate
>   GICv2 msi frame rather than the hardware msi frame; thus will any
>   access from PCI-e to this region will trap to hypervisor in CPU
>   side so KVM hyperviso can help emulate (and inject) the interrupt
>   for guest OS?

When the device sends an MSI, it uses a host-allocated IOVA for the
physical MSI doorbell. This gets translated by the physical IOMMU and
reaches the physical doorbell. The physical GICv2m triggers the
associated physical SPI -> kvm irqfd -> virtual IRQ.
With GICv2M we have direct GSI mapping in the guest.
> 
>   Essentially, I want to check what's the expected behaviour for GICv2
>   msi frame working mode when we want to passthrough one PCI-e device
>   to guest OS and the PCI-e device has one static msi frame for it.

Your config was tested in the past with Seattle (not with a sky2 NIC
though). Adding Robin for the potential peer-to-peer concern.

Thanks

Eric
> 
> I will continue to look into the code and post at here.  Thanks a lot
> for any comment and suggestion!
> Leo Yan
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/marvell/sky2.c#n4859
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13 10:01           ` Leo Yan
@ 2019-03-13 10:16             ` Auger Eric
  0 siblings, 0 replies; 20+ messages in thread
From: Auger Eric @ 2019-03-13 10:16 UTC (permalink / raw)
  To: Leo Yan, Mark Rutland; +Cc: Daniel Thompson, kvmarm

Hi Leo,

On 3/13/19 11:01 AM, Leo Yan wrote:
> On Wed, Mar 13, 2019 at 04:00:48PM +0800, Leo Yan wrote:
> 
> [...]
> 
>> - The second question is for GICv2m.  If I understand correctly, when
>>   passthrough PCI-e device to guest OS, in the guest OS we should
>>   create below data path for PCI-e devices:
>>                                                             +--------+
>>                                                          -> | Memory |
>>     +-----------+    +------------------+    +-------+  /   +--------+
>>     | Net card  | -> | PCI-e controller | -> | IOMMU | -
>>     +-----------+    +------------------+    +-------+  \   +--------+
>>                                                          -> | MSI    |
>>                                                             | frame  |
>>                                                             +--------+
>>
>>   Since now the master is network card/PCI-e controller but not CPU,
>>   thus there have no 2 stages for memory accessing (VA->IPA->PA).  In
>>   this case, if we configure IOMMU (SMMU) for guest OS for address
>>   translation before switch from host to guest, right?  Or SMMU also
>>   have two stages memory mapping?
>>
>>   Another thing confuses me is I can see the MSI frame is mapped to
>>   GIC's physical address in host OS, thus the PCI-e device can send
>>   message correctly to msi frame.  But for guest OS, the MSI frame is
>>   mapped to one IPA memory region, and this region is use to emulate
>>   GICv2 msi frame rather than the hardware msi frame; thus will any
>>   access from PCI-e to this region will trap to hypervisor in CPU
>>   side so KVM hyperviso can help emulate (and inject) the interrupt
>>   for guest OS?
>>
>>   Essentially, I want to check what's the expected behaviour for GICv2
>>   msi frame working mode when we want to passthrough one PCI-e device
>>   to guest OS and the PCI-e device has one static msi frame for it.
> 
> From the blog [1], it has below explanation for my question for mapping
> IOVA and hardware msi address.  But I searched the flag
> VFIO_DMA_FLAG_MSI_RESERVED_IOVA which isn't found in mainline kernel;
> I might miss something for this, want to check if related patches have
> been merged in the mainline kernel?

Yes, all the mechanics for passthrough/MSI on ARM are upstream. The
blog page is outdated. The kernel allocates IOVAs for the MSI
doorbells arbitrarily within this region:

#define MSI_IOVA_BASE                   0x8000000
#define MSI_IOVA_LENGTH                 0x100000

and userspace is not involved anymore in passing a usable reserved IOVA
region.
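
That window is also visible from sysfs on the host, through the
group's reserved_regions file (the group number and exact output below
are illustrative):

  cat /sys/kernel/iommu_groups/<group>/reserved_regions
  0x0000000008000000 0x00000000080fffff msi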

Thanks

Eric
> 
> 'We reuse the VFIO DMA MAP ioctl to pass this reserved IOVA region. A
> new flag (VFIO_DMA_FLAG_MSI_RESERVED_IOVA ) is introduced to
> differentiate such reserved IOVA from RAM IOVA. Then the base/size of
> the window is passed to the IOMMU driver though a new function
> introduced in the IOMMU API. 
> 
> The IOVA allocation within the supplied reserved IOVA window is
> performed on-demand, when the MSI controller composes/writes the MSI
> message in the PCIe device. Also the IOMMU mapping between the newly
> allocated IOVA and the backdoor address page is done at that time. The
> MSI controller uses a new function introduced in the IOMMU API to
> allocate the IOVA and create an IOMMU mapping.
>  
> So there are adaptations needed at VFIO, IOMMU and MSI controller
> level. The extension of the IOMMU API still is under discussion. Also
> changes at MSI controller level need to be consolidated.'
> 
> P.s. I also tried two tools qemu/kvmtool, both cannot pass interrupt
> for network card in guest OS.
> 
> Thanks,
> Leo Yan
> 
> [1] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13 10:01           ` Auger Eric
@ 2019-03-13 10:24             ` Auger Eric
  2019-03-13 11:52               ` Leo Yan
  2019-03-15  9:37               ` Leo Yan
  2019-03-13 11:35             ` Leo Yan
  1 sibling, 2 replies; 20+ messages in thread
From: Auger Eric @ 2019-03-13 10:24 UTC (permalink / raw)
  To: Leo Yan, Mark Rutland; +Cc: Daniel Thompson, Robin Murphy, kvmarm

Hi,

On 3/13/19 11:01 AM, Auger Eric wrote:
> Hi Leo,
> 
> + Robin
> 
> On 3/13/19 9:00 AM, Leo Yan wrote:
>> Hi Eric & all,
>>
>> On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:
>>
>> [...]
>>
>>> So now I made some progress and can see the networking card is
>>> pass-through to guest OS, though the networking card reports errors
>>> now.  Below is detailed steps and info:
>>>
>>> - Bind devices in the same IOMMU group to vfio driver:
>>>
>>>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>>>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
>>>
>>>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>>>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
>>>
>>> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
>>>
>>> - Use qemu to launch guest OS:
>>>
>>>   qemu-system-aarch64 \
>>>         -cpu host -M virt,accel=kvm -m 4096 -nographic \
>>>         -kernel /root/virt/Image -append root=/dev/vda2 \
>>>         -net none -device vfio-pci,host=08:00.0 \
>>>         -drive if=virtio,file=/root/virt/qemu/debian.img \
>>>         -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk ip=dhcp'
>>>
>>> - Host log:
>>>
>>> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
>>>
>>> - Below is guest log, from log though the driver has been registered but
>>>   it reports PCI hardware failure and the timeout for the interrupt.
>>>
>>>   So is this caused by very 'slow' forward interrupt handling?  Juno
>>>   board uses GICv2 (I think it has GICv2m extension).
>>>
>>> [...]
>>>
>>> [    1.024483] sky2 0000:00:01.0 eth0: enabling interface
>>> [    1.026822] sky2 0000:00:01.0: error interrupt status=0x80000000
>>> [    1.029155] sky2 0000:00:01.0: PCI hardware error (0x1010)
>>> [    4.000699] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
>>> [    4.026116] Sending DHCP requests .
>>> [    4.026201] sky2 0000:00:01.0: error interrupt status=0x80000000
>>> [    4.030043] sky2 0000:00:01.0: PCI hardware error (0x1010)
>>> [    6.546111] ..
>>> [   14.118106] ------------[ cut here ]------------
>>> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
>>> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2b4/0x2c0
>>> [   14.127082] Modules linked in:
>>> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc8-00061-ga98f9a047756-dirty #
>>> [   14.132800] Hardware name: linux,dummy-virt (DT)
>>> [   14.135082] pstate: 60000005 (nZCv daif -PAN -UAO)
>>> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
>>> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
>>> [   14.141351] sp : ffff000010003d70
>>> [   14.142924] x29: ffff000010003d70 x28: ffff0000112f60c0
>>> [   14.145433] x27: 0000000000000140 x26: ffff8000fa6eb3b8
>>> [   14.147936] x25: 00000000ffffffff x24: ffff8000fa7a7c80
>>> [   14.150428] x23: ffff8000fa6eb39c x22: ffff8000fa6eafb8
>>> [   14.152934] x21: ffff8000fa6eb000 x20: ffff0000112f7000
>>> [   14.155437] x19: 0000000000000000 x18: ffffffffffffffff
>>> [   14.157929] x17: 0000000000000000 x16: 0000000000000000
>>> [   14.160432] x15: ffff0000112fd6c8 x14: ffff000090003a97
>>> [   14.162927] x13: ffff000010003aa5 x12: ffff000011315878
>>> [   14.165428] x11: ffff000011315000 x10: 0000000005f5e0ff
>>> [   14.167935] x9 : 00000000ffffffd0 x8 : 64656d6974203020
>>> [   14.170430] x7 : 6575657571207469 x6 : 00000000000000e3
>>> [   14.172935] x5 : 0000000000000000 x4 : 0000000000000000
>>> [   14.175443] x3 : 00000000ffffffff x2 : ffff0000113158a8
>>> [   14.177938] x1 : f2db9128b1f08600 x0 : 0000000000000000
>>> [   14.180443] Call trace:
>>> [   14.181625]  dev_watchdog+0x2b4/0x2c0
>>> [   14.183377]  call_timer_fn+0x20/0x78
>>> [   14.185078]  expire_timers+0xa4/0xb0
>>> [   14.186777]  run_timer_softirq+0xa0/0x190
>>> [   14.188687]  __do_softirq+0x108/0x234
>>> [   14.190428]  irq_exit+0xcc/0xd8
>>> [   14.191941]  __handle_domain_irq+0x60/0xb8
>>> [   14.193877]  gic_handle_irq+0x58/0xb0
>>> [   14.195630]  el1_irq+0xb0/0x128
>>> [   14.197132]  arch_cpu_idle+0x10/0x18
>>> [   14.198835]  do_idle+0x1cc/0x288
>>> [   14.200389]  cpu_startup_entry+0x24/0x28
>>> [   14.202251]  rest_init+0xd4/0xe0
>>> [   14.203804]  arch_call_rest_init+0xc/0x14
>>> [   14.205702]  start_kernel+0x3d8/0x404
>>> [   14.207449] ---[ end trace 65449acd5c054609 ]---
>>> [   14.209630] sky2 0000:00:01.0 eth0: tx timeout
>>> [   14.211655] sky2 0000:00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
>>> [   17.906956] sky2 0000:00:01.0 eth0: Link is up at 1000 Mbps, full duplex, flow control both
>>
>> I am stucking at the network card cannot receive interrupts in guest
>> OS.  So took time to look into the code and added some printed info to
>> help me to understand the detailed flow, below are two main questions
>> I am confused with them and need some guidance:
>>
>> - The first question is about the msi usage in network card driver;
>>   when review the sky2 network card driver [1], it has function
>>   sky2_test_msi() which is used to decide if can use msi or not.
>>
>>   The interesting thing is this function will firstly request irq for
>>   the interrupt and based on the interrupt handler to read back
>>   register and then can make decision if msi is avalible or not.
>>
>>   This can work well for host OS, but if we want to passthrough this
>>   device to guest OS, since the KVM doesn't prepare the interrupt for
>>   sky2 drivers (no injection or forwarding) thus at this point the
>>   interrupt handle will not be invorked.  At the end the driver will
>>   not set flag 'hw->flags |= SKY2_HW_USE_MSI' and this results to not
>>   use msi in guest OS and rollback to INTx mode.
>>
>>   My first impression is if we passthrough the devices to guest OS in
>>   KVM, the PCI-e device can directly use msi;  I tweaked a bit for the
>>   code to check status value after timeout, so both host OS and guest
>>   OS can set the flag for msi.
>>
>>   I want to confirm, if this is the recommended mode for
>>   passthrough PCI-e device to use msi both in host OS and geust OS?
>>   Or it's will be fine for host OS using msi and guest OS using
>>   INTx mode?
> 
> If the NIC supports MSIs they logically are used. This can be easily
> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> check whether the guest received any interrupt? I remember that Robin
> said in the past that on Juno, the MSI doorbell was in the PCI host
> bridge window and possibly transactions towards the doorbell could not
> reach it since considered as peer to peer.

I found Robin's explanation again. It was not related to the MSI IOVA
being within the PCI host bridge window, but to the guest RAM GPA
colliding with the host PCI config space?

"MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x40000000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them."
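
A quick way to check where the host PCI config space and windows
actually sit on Juno (and whether they overlap the guest RAM base at
0x40000000 that QEMU uses by default) is something like the below; the
labels vary with the host controller driver:

  grep -i pci /proc/iomem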

Thanks

Eric


> Using GICv2M should not bring any performance issues. I tested that
> in the past with a Seattle board.
>>
>> - The second question is for GICv2m.  If I understand correctly, when
>>   passthrough PCI-e device to guest OS, in the guest OS we should
>>   create below data path for PCI-e devices:
>>                                                             +--------+
>>                                                          -> | Memory |
>>     +-----------+    +------------------+    +-------+  /   +--------+
>>     | Net card  | -> | PCI-e controller | -> | IOMMU | -
>>     +-----------+    +------------------+    +-------+  \   +--------+
>>                                                          -> | MSI    |
>>                                                             | frame  |
>>                                                             +--------+
>>
>>   Since now the master is network card/PCI-e controller but not CPU,
>>   thus there have no 2 stages for memory accessing (VA->IPA->PA).  In
>>   this case, if we configure IOMMU (SMMU) for guest OS for address
>>   translation before switch from host to guest, right?  Or SMMU also
>>   have two stages memory mapping?
> 
> in your use case you don't have any virtual IOMMU. So the guest programs
> the assigned device with guest physical device and the virtualizer uses
> the physical IOMMU to translate this GPA into host physical address
> backing the guest RAM and the MSI frame. A single stage of the physical
> IOMMU is used (stage1).
>>
>>   Another thing confuses me is I can see the MSI frame is mapped to
>>   GIC's physical address in host OS, thus the PCI-e device can send
>>   message correctly to msi frame.  But for guest OS, the MSI frame is
>>   mapped to one IPA memory region, and this region is use to emulate
>>   GICv2 msi frame rather than the hardware msi frame; thus will any
>>   access from PCI-e to this region will trap to hypervisor in CPU
>>   side so KVM hyperviso can help emulate (and inject) the interrupt
>>   for guest OS?
> 
> when the device sends an MSI it uses a host allocated IOVA for the
> physical MSI doorbell. This gets translated by the physical IOMMU,
> reaches the physical doorbell. The physical GICv2m triggers the
> associated physical SPI -> kvm irqfd -> virtual IRQ
> With GICv2M we have direct GSI mapping on guest.
>>
>>   Essentially, I want to check what's the expected behaviour for GICv2
>>   msi frame working mode when we want to passthrough one PCI-e device
>>   to guest OS and the PCI-e device has one static msi frame for it.
> 
> Your config was tested in the past with Seattle (not with sky2 NIC
> though). Adding Robin for the peer to peer potential concern.
> 
> Thanks
> 
> Eric
>>
>> I will continue to look into the code and post at here.  Thanks a lot
>> for any comment and suggestion!
>> Leo Yan
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/marvell/sky2.c#n4859
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13 10:01           ` Auger Eric
  2019-03-13 10:24             ` Auger Eric
@ 2019-03-13 11:35             ` Leo Yan
  1 sibling, 0 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-13 11:35 UTC (permalink / raw)
  To: Auger Eric; +Cc: Daniel Thompson, Robin Murphy, kvmarm

Hi Eric,

On Wed, Mar 13, 2019 at 11:01:33AM +0100, Auger Eric wrote:

[...]

> >   I want to confirm, if this is the recommended mode for
> >   passthrough PCI-e device to use msi both in host OS and geust OS?
> >   Or it's will be fine for host OS using msi and guest OS using
> >   INTx mode?
> 
> If the NIC supports MSIs they logically are used. This can be easily
> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> check whether the guest received any interrupt? I remember that Robin
> said in the past that on Juno, the MSI doorbell was in the PCI host
> bridge window and possibly transactions towards the doorbell could not
> reach it since considered as peer to peer. Using GICv2M should not bring
> any performance issue. I tested that in the past with Seattle board.

I can see the below info on the host after launching KVM:

root@debian:~# cat /proc/interrupts | grep vfio
 46:          0          0          0          0          0          0       MSI 4194304 Edge      vfio-msi[0](0000:08:00.0)

And below are the interrupts in the guest:

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
  3:        506        400        281        403        298        330     GIC-0  27 Level     arch_timer
  5:        768          0          0          0          0          0     GIC-0 101 Edge      virtio0
  6:        246          0          0          0          0          0     GIC-0 102 Edge      virtio1
  7:          2          0          0          0          0          0     GIC-0 103 Edge      virtio2
  8:        210          0          0          0          0          0     GIC-0  97 Level     ttyS0
 13:          0          0          0          0          0          0       MSI   0 Edge      eth1
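
(For completeness, whether the guest driver really ended up with MSI
enabled can also be checked from inside the guest with lspci; the BDF
below is the guest's view and is only illustrative:)

  lspci -vv -s 00:01.0 | grep -i msi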

> > - The second question is for GICv2m.  If I understand correctly, when
> >   passthrough PCI-e device to guest OS, in the guest OS we should
> >   create below data path for PCI-e devices:
> >                                                             +--------+
> >                                                          -> | Memory |
> >     +-----------+    +------------------+    +-------+  /   +--------+
> >     | Net card  | -> | PCI-e controller | -> | IOMMU | -
> >     +-----------+    +------------------+    +-------+  \   +--------+
> >                                                          -> | MSI    |
> >                                                             | frame  |
> >                                                             +--------+
> > 
> >   Since now the master is network card/PCI-e controller but not CPU,
> >   thus there have no 2 stages for memory accessing (VA->IPA->PA).  In
> >   this case, if we configure IOMMU (SMMU) for guest OS for address
> >   translation before switch from host to guest, right?  Or SMMU also
> >   have two stages memory mapping?
> 
> in your use case you don't have any virtual IOMMU. So the guest programs
> the assigned device with guest physical device and the virtualizer uses
> the physical IOMMU to translate this GPA into host physical address
> backing the guest RAM and the MSI frame. A single stage of the physical
> IOMMU is used (stage1).

Thanks a lot for the explanation.

> >   Another thing confuses me is I can see the MSI frame is mapped to
> >   GIC's physical address in host OS, thus the PCI-e device can send
> >   message correctly to msi frame.  But for guest OS, the MSI frame is
> >   mapped to one IPA memory region, and this region is use to emulate
> >   GICv2 msi frame rather than the hardware msi frame; thus will any
> >   access from PCI-e to this region will trap to hypervisor in CPU
> >   side so KVM hyperviso can help emulate (and inject) the interrupt
> >   for guest OS?
> 
> when the device sends an MSI it uses a host allocated IOVA for the
> physical MSI doorbell. This gets translated by the physical IOMMU,
> reaches the physical doorbell. The physical GICv2m triggers the
> associated physical SPI -> kvm irqfd -> virtual IRQ
> With GICv2M we have direct GSI mapping on guest.

Just to confirm: in the flow you describe, the virtual IRQ is injected
by qemu (or kvmtool) every time, but it does not need to interfere with
the IRQ's deactivation, right?
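
To make my understanding concrete, the userspace plumbing I have in
mind is roughly the sketch below.  This is only a sketch against the
vfio/kvm uapi headers; the file descriptors, the GSI number and the
error handling are placeholders for illustration, not code taken from
kvmtool or qemu:

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

/*
 * Ask vfio-pci to signal an eventfd when the physical interrupt fires,
 * then hand the same eventfd to KVM so the kernel injects the virtual
 * IRQ (GSI) directly, without a round trip through userspace.
 */
static int wire_irqfd(int device_fd, int vm_fd, unsigned int gsi)
{
	int efd = eventfd(0, 0);
	struct {
		struct vfio_irq_set set;
		int fd;
	} trigger;
	struct kvm_irqfd irqfd;

	memset(&trigger, 0, sizeof(trigger));
	trigger.set.argsz = sizeof(trigger);
	trigger.set.flags = VFIO_IRQ_SET_DATA_EVENTFD |
			    VFIO_IRQ_SET_ACTION_TRIGGER;
	trigger.set.index = VFIO_PCI_MSI_IRQ_INDEX;	/* or the INTx index */
	trigger.set.start = 0;
	trigger.set.count = 1;
	trigger.fd = efd;
	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &trigger) < 0)
		return -1;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = efd;
	irqfd.gsi = gsi;
	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

(For level-triggered INTx there is additionally an unmask/resample
path, e.g. a resamplefd, which is where the deactivation side is
handled.)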

> >   Essentially, I want to check what's the expected behaviour for GICv2
> >   msi frame working mode when we want to passthrough one PCI-e device
> >   to guest OS and the PCI-e device has one static msi frame for it.
> 
> Your config was tested in the past with Seattle (not with sky2 NIC
> though). Adding Robin for the peer to peer potential concern.

I very much appreciate your help.

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13 10:24             ` Auger Eric
@ 2019-03-13 11:52               ` Leo Yan
  2019-03-15  9:37               ` Leo Yan
  1 sibling, 0 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-13 11:52 UTC (permalink / raw)
  To: Auger Eric; +Cc: Daniel Thompson, Robin Murphy, kvmarm

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]

> >> I am stucking at the network card cannot receive interrupts in guest
> >> OS.  So took time to look into the code and added some printed info to
> >> help me to understand the detailed flow, below are two main questions
> >> I am confused with them and need some guidance:
> >>
> >> - The first question is about the msi usage in network card driver;
> >>   when review the sky2 network card driver [1], it has function
> >>   sky2_test_msi() which is used to decide if can use msi or not.
> >>
> >>   The interesting thing is this function will firstly request irq for
> >>   the interrupt and based on the interrupt handler to read back
> >>   register and then can make decision if msi is avalible or not.
> >>
> >>   This can work well for host OS, but if we want to passthrough this
> >>   device to guest OS, since the KVM doesn't prepare the interrupt for
> >>   sky2 drivers (no injection or forwarding) thus at this point the
> >>   interrupt handle will not be invorked.  At the end the driver will
> >>   not set flag 'hw->flags |= SKY2_HW_USE_MSI' and this results to not
> >>   use msi in guest OS and rollback to INTx mode.
> >>
> >>   My first impression is if we passthrough the devices to guest OS in
> >>   KVM, the PCI-e device can directly use msi;  I tweaked a bit for the
> >>   code to check status value after timeout, so both host OS and guest
> >>   OS can set the flag for msi.
> >>
> >>   I want to confirm, if this is the recommended mode for
> >>   passthrough PCI-e device to use msi both in host OS and geust OS?
> >>   Or it's will be fine for host OS using msi and guest OS using
> >>   INTx mode?
> > 
> > If the NIC supports MSIs they logically are used. This can be easily
> > checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> > check whether the guest received any interrupt? I remember that Robin
> > said in the past that on Juno, the MSI doorbell was in the PCI host
> > bridge window and possibly transactions towards the doorbell could not
> > reach it since considered as peer to peer.
> 
> I found back Robin's explanation. It was not related to MSI IOVA being
> within the PCI host bridge window but RAM GPA colliding with host PCI
> config space?
> 
> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
> typically have a programmable address, so could be anywhere. In the more
> general category of "special hardware addresses", QEMU's default ARM
> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
> platform, that happens to be where PCI config space starts; as Juno's
> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
> the PCI bus to a guest (all of it, given the lack of ACS), the root
> complex just sees the guest's attempts to DMA to "memory" as the device
> attempting to access config space and aborts them."

Thanks a lot for the info, Eric.

It seems to me this issue can be bypassed by using a memory address
other than 0x40000000 for the guest RAM IPA, which would avoid colliding
with the host PCI config space.

Robin, just curious: have you tried changing the guest memory start
address to bypass this issue?  Or tried kvmtool on the Juno-r2 board?
(e.g. kvmtool uses 0x40000000 for the AXI bus and 0x80000000 for RAM,
so we could shrink these regions a bit and avoid touching the
0x40000000 region.)

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-13 10:24             ` Auger Eric
  2019-03-13 11:52               ` Leo Yan
@ 2019-03-15  9:37               ` Leo Yan
  2019-03-15 11:03                 ` Auger Eric
  1 sibling, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-15  9:37 UTC (permalink / raw)
  To: Auger Eric; +Cc: Daniel Thompson, Robin Murphy, kvmarm

Hi Eric, Robin,

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]

> > If the NIC supports MSIs they logically are used. This can be easily
> > checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> > check whether the guest received any interrupt? I remember that Robin
> > said in the past that on Juno, the MSI doorbell was in the PCI host
> > bridge window and possibly transactions towards the doorbell could not
> > reach it since considered as peer to peer.
> 
> I found back Robin's explanation. It was not related to MSI IOVA being
> within the PCI host bridge window but RAM GPA colliding with host PCI
> config space?
> 
> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
> typically have a programmable address, so could be anywhere. In the more
> general category of "special hardware addresses", QEMU's default ARM
> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
> platform, that happens to be where PCI config space starts; as Juno's
> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
> the PCI bus to a guest (all of it, given the lack of ACS), the root
> complex just sees the guest's attempts to DMA to "memory" as the device
> attempting to access config space and aborts them."

Below is some follow-up investigation from my side:

Firstly, I must admit I don't fully understand the paragraph above;
based on the description I am wondering whether INTx mode can be used
and whether that is enough to avoid this hardware pitfall.

But when I wanted to roll back to INTx mode I found kvmtool has an
issue supporting it, which is why I wrote the patch [1] to fix it.
Alternatively, we can also set the NIC driver module parameter
'sky2.disable_msi=1' to disable MSI completely and use only INTx mode.

Anyway, I finally got INTx mode enabled and I can see the interrupt is
registered successfully on both host and guest:

Host side:

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 41:          0          0          0          0          0          0     GICv2  54 Level     arm-pmu
 42:          0          0          0          0          0          0     GICv2  58 Level     arm-pmu
 43:          0          0          0          0          0          0     GICv2  62 Level     arm-pmu
 45:        772          0          0          0          0          0     GICv2 171 Level     vfio-intx(0000:08:00.0)

Guest side:

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 12:          0          0          0          0          0          0     GIC-0  96 Level     eth1

So you can see the host receives the interrupts, but these interrupts
were mostly triggered before binding the vfio-pci driver.  After
launching KVM it seems only a very small number of interrupts are
triggered on the host, and the guest kernel also receives the virtual
interrupts; e.g. if I run 'dhclient eth1' in the guest OS, the command
stalls for a long time (> 1 minute) before returning, and in that time
both the host OS and the guest OS receive 5~6 interrupts.  Based on
this, I guess the interrupt forwarding flow is working.  But it seems
the data packets are not really sent out; I used wireshark to capture
packets and cannot find any packet output from the NIC.

I did another test: shrinking the memory space/io/bus regions to below
0x40000000, so as to avoid putting the guest memory IPA at 0x40000000.
But this doesn't work.

@Robin, could you help explain the hardware issue and review whether my
approaches are feasible on the Juno board?  Thanks a lot for any
suggestions.

I will dig more into the memory mapping and post here.

Thanks,
Leo Yan

[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035055.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-15  9:37               ` Leo Yan
@ 2019-03-15 11:03                 ` Auger Eric
  2019-03-15 12:54                   ` Robin Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Auger Eric @ 2019-03-15 11:03 UTC (permalink / raw)
  To: Leo Yan; +Cc: Daniel Thompson, Robin Murphy, kvmarm

Hi Leo,

+ Jean-Philippe

On 3/15/19 10:37 AM, Leo Yan wrote:
> Hi Eric, Robin,
> 
> On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
> 
> [...]
> 
>>> If the NIC supports MSIs they logically are used. This can be easily
>>> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
>>> check whether the guest received any interrupt? I remember that Robin
>>> said in the past that on Juno, the MSI doorbell was in the PCI host
>>> bridge window and possibly transactions towards the doorbell could not
>>> reach it since considered as peer to peer.
>>
>> I found back Robin's explanation. It was not related to MSI IOVA being
>> within the PCI host bridge window but RAM GPA colliding with host PCI
>> config space?
>>
>> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
>> typically have a programmable address, so could be anywhere. In the more
>> general category of "special hardware addresses", QEMU's default ARM
>> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
>> platform, that happens to be where PCI config space starts; as Juno's
>> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
>> the PCI bus to a guest (all of it, given the lack of ACS), the root
>> complex just sees the guest's attempts to DMA to "memory" as the device
>> attempting to access config space and aborts them."
> 
> Below is some following investigation at my side:
> 
> Firstly, must admit that I don't understand well for up paragraph; so
> based on the description I am wandering if can use INTx mode and if
> it's lucky to avoid this hardware pitfall.

The problem above is that during the assignment process, the virtualizer
maps the whole guest RAM through the IOMMU (+ the MSI doorbell on ARM)
to allow the device, which is programmed with GPAs, to access the whole
guest RAM. Unfortunately, if the device emits a DMA request with IOVA
0x40000000, this IOVA is interpreted by the Juno RC as a transaction
towards the PCIe config space. So this DMA request will not go beyond
the RC, will never reach the IOMMU and will never reach the guest RAM.
So globally the device is not able to reach part of the guest RAM.
That's how I interpret the above statement. I don't know the details of
the collision, as I don't have access to this HW. I don't know either
whether this problem still exists on the r2 HW.
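
To make this concrete, the mapping the virtualizer sets up through VFIO
is roughly the sketch below (only a sketch; the container fd, the
userspace buffer and the size are placeholders).  The important point
is that the IOVA side of the mapping is the guest physical address, so
with QEMU's default map the device ends up doing DMA to IOVA
0x40000000, which the Juno RC claims for config space:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Map a chunk of guest RAM through the IOMMU for the assigned device:
 * iova is the guest physical address the device will emit, vaddr is
 * the host userspace mapping that backs this guest RAM.
 */
static int map_guest_ram(int container_fd, void *vaddr,
			 unsigned long long guest_pa,
			 unsigned long long size)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (unsigned long long)vaddr;
	map.iova  = guest_pa;	/* e.g. 0x40000000 with QEMU's default map */
	map.size  = size;

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}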
> 
> But when I want to rollback to use INTx mode I found there have issue
> for kvmtool to support INTx mode, so this is why I wrote the patch [1]
> to fix the issue.  Alternatively, we also can set the NIC driver
> module parameter 'sky2.disable_msi=1' thus can totally disable msi and
> only use INTx mode.
> 
> Anyway, finally I can get INTx mode enabled and I can see the
> interrupt will be registered successfully on both host and guest:
> 
> Host side:
> 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
>  41:          0          0          0          0          0          0     GICv2  54 Level     arm-pmu
>  42:          0          0          0          0          0          0     GICv2  58 Level     arm-pmu
>  43:          0          0          0          0          0          0     GICv2  62 Level     arm-pmu
>  45:        772          0          0          0          0          0     GICv2 171 Level     vfio-intx(0000:08:00.0)
> 
> Guest side:
> 
> # cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
>  12:          0          0          0          0          0          0     GIC-0  96 Level     eth1
> 
> So you could see the host can receive the interrupts, but these
> interrupts are mainly triggered before binding vfio-pci driver.  But
> seems now after launch kvm I can see there have very small mount
> interrupts are triggered in host and the guest kernel also can receive
> the virtual interrupts, e.g. if use 'dhclient eth1' command in guest
> OS, this command stalls for long time (> 1 minute) after return back,
> I can see both the host OS and guest OS can receive 5~6 interrupts.
> Based on this, I guess the flow for interrupts forwarding has been
> enabled.  But seems the data packet will not really output and I use
> wireshark to capture packets, but cannot find any packet output from
> the NIC.
> 
> I did another testing is to shrink the memory space/io/bus region to
> less than 0x40000000, so this can avoid to put guest memory IPA into
> 0x40000000.  But this doesn't work.

What is worth trying is to move the base address of the guest RAM. I
think there was some recent work on this in kvmtool. Adding
Jean-Philippe to the loop.

Thanks

Eric
> 
> @Robin, could you help explain for the hardware issue and review my
> methods are feasible on Juno board?  Thanks a lot for suggestions.
> 
> I will dig more for the memory mapping and post at here.
> 
> Thanks,
> Leo Yan
> 
> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035055.html
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-15 11:03                 ` Auger Eric
@ 2019-03-15 12:54                   ` Robin Murphy
  2019-03-16  4:56                     ` Leo Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Robin Murphy @ 2019-03-15 12:54 UTC (permalink / raw)
  To: Auger Eric, Leo Yan; +Cc: Daniel Thompson, kvmarm

Hi Leo,

Sorry for the delay - I'm on holiday this week, but since I've made the 
mistake of glancing at my inbox I should probably save you from wasting 
any more time...

On 2019-03-15 11:03 am, Auger Eric wrote:
> Hi Leo,
> 
> + Jean-Philippe
> 
> On 3/15/19 10:37 AM, Leo Yan wrote:
>> Hi Eric, Robin,
>>
>> On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
>>
>> [...]
>>
>>>> If the NIC supports MSIs they logically are used. This can be easily
>>>> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
>>>> check whether the guest received any interrupt? I remember that Robin
>>>> said in the past that on Juno, the MSI doorbell was in the PCI host
>>>> bridge window and possibly transactions towards the doorbell could not
>>>> reach it since considered as peer to peer.
>>>
>>> I found back Robin's explanation. It was not related to MSI IOVA being
>>> within the PCI host bridge window but RAM GPA colliding with host PCI
>>> config space?
>>>
>>> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
>>> typically have a programmable address, so could be anywhere. In the more
>>> general category of "special hardware addresses", QEMU's default ARM
>>> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
>>> platform, that happens to be where PCI config space starts; as Juno's
>>> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
>>> the PCI bus to a guest (all of it, given the lack of ACS), the root
>>> complex just sees the guest's attempts to DMA to "memory" as the device
>>> attempting to access config space and aborts them."
>>
>> Below is some following investigation at my side:
>>
>> Firstly, must admit that I don't understand well for up paragraph; so
>> based on the description I am wandering if can use INTx mode and if
>> it's lucky to avoid this hardware pitfall.
> 
> The problem above is that during the assignment process, the virtualizer
> maps the whole guest RAM though the IOMMU (+ the MSI doorbell on ARM) to
> allow the device, programmed in GPA to access the whole guest RAM.
> Unfortunately if the device emits a DMA request with 0x40000000 IOVA
> address, this IOVA is interpreted by the Juno RC as a transaction
> towards the PCIe config space. So this DMA request will not go beyond
> the RC, will never reach the IOMMU and will never reach the guest RAM.
> So globally the device is not able to reach part of the guest RAM.
> That's how I interpret the above statement. Then I don't know the
> details of the collision, I don't have access to this HW. I don't know
> either if this problem still exists on the r2 HW.

The short answer is that if you want PCI passthrough to work on Juno, 
the guest memory map has to look like a Juno.

The PCIe root complex uses an internal lookup table to generate 
appropriate AXI attributes for outgoing PCIe transactions; unfortunately 
this has no notion of 'default' attributes, so addresses *must* match 
one of the programmed windows in order to be valid. From memory, EDK2 
sets up a 2GB window covering the lower DRAM bank, an 8GB window 
covering the upper DRAM bank, and a 1MB (or thereabouts) window covering 
the GICv2m region with Device attributes. Any PCIe transactions to 
addresses not within one of those windows will be aborted by the RC 
without ever going out to the AXI side where the SMMU lies (and I think 
anything matching the config space or I/O space windows or a region 
claimed by a BAR will be aborted even earlier as a peer-to-peer attempt 
regardless of the AXI Translation Table setup).
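
As a rough illustration of that lookup table (a sketch only; the GICv2m
window base and the exact sizes below are assumptions from the Juno
memory map rather than values dumped from the RC):

/*
 * Sketch of the outbound window layout described above.  Any outgoing
 * PCIe address that does not fall in one of these windows is aborted
 * by the root complex before it reaches the SMMU.
 */
struct rc_window {
	unsigned long long base;
	unsigned long long size;
	const char *attributes;
};

static const struct rc_window juno_rc_windows[] = {
	{ 0x2c1c0000ULL,  0x00100000ULL,  "Device (GICv2m region)"   },
	{ 0x80000000ULL,  0x80000000ULL,  "Normal (lower DRAM bank)" },
	{ 0x880000000ULL, 0x200000000ULL, "Normal (upper DRAM bank)" },
};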

You could potentially modify the firmware to change the window 
configuration, but the alignment restrictions make it awkward. I've only 
ever tested passthrough on Juno using kvmtool, which IIRC already has 
guest RAM in an appropriate place (and is trivially easy to hack if not) 
- I don't remember if I ever actually tried guest MSI with that.

Robin.

>> But when I want to rollback to use INTx mode I found there have issue
>> for kvmtool to support INTx mode, so this is why I wrote the patch [1]
>> to fix the issue.  Alternatively, we also can set the NIC driver
>> module parameter 'sky2.disable_msi=1' thus can totally disable msi and
>> only use INTx mode.
>>
>> Anyway, finally I can get INTx mode enabled and I can see the
>> interrupt will be registered successfully on both host and guest:
>>
>> Host side:
>>
>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
>>   41:          0          0          0          0          0          0     GICv2  54 Level     arm-pmu
>>   42:          0          0          0          0          0          0     GICv2  58 Level     arm-pmu
>>   43:          0          0          0          0          0          0     GICv2  62 Level     arm-pmu
>>   45:        772          0          0          0          0          0     GICv2 171 Level     vfio-intx(0000:08:00.0)
>>
>> Guest side:
>>
>> # cat /proc/interrupts
>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
>>   12:          0          0          0          0          0          0     GIC-0  96 Level     eth1
>>
>> So you could see the host can receive the interrupts, but these
>> interrupts are mainly triggered before binding vfio-pci driver.  But
>> seems now after launch kvm I can see there have very small mount
>> interrupts are triggered in host and the guest kernel also can receive
>> the virtual interrupts, e.g. if use 'dhclient eth1' command in guest
>> OS, this command stalls for long time (> 1 minute) after return back,
>> I can see both the host OS and guest OS can receive 5~6 interrupts.
>> Based on this, I guess the flow for interrupts forwarding has been
>> enabled.  But seems the data packet will not really output and I use
>> wireshark to capture packets, but cannot find any packet output from
>> the NIC.
>>
>> I did another testing is to shrink the memory space/io/bus region to
>> less than 0x40000000, so this can avoid to put guest memory IPA into
>> 0x40000000.  But this doesn't work.
> 
> What is worth to try is to move the base address of the guest RAM. I
> think there were some recent works on this on kvmtool. Adding
> Jean-Philippe in the loop.
> 
> Thanks
> 
> Eric
>>
>> @Robin, could you help explain for the hardware issue and review my
>> methods are feasible on Juno board?  Thanks a lot for suggestions.
>>
>> I will dig more for the memory mapping and post at here.
>>
>> Thanks,
>> Leo Yan
>>
>> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035055.html
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-15 12:54                   ` Robin Murphy
@ 2019-03-16  4:56                     ` Leo Yan
  2019-03-18 12:25                       ` Robin Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-16  4:56 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Daniel Thompson, kvmarm

Hi Robin,

On Fri, Mar 15, 2019 at 12:54:10PM +0000, Robin Murphy wrote:
> Hi Leo,
> 
> Sorry for the delay - I'm on holiday this week, but since I've made the
> mistake of glancing at my inbox I should probably save you from wasting any
> more time...

Sorry for disturbing you during your holiday, and I appreciate your
help.  There is no rush to reply.

> On 2019-03-15 11:03 am, Auger Eric wrote:
> > Hi Leo,
> > 
> > + Jean-Philippe
> > 
> > On 3/15/19 10:37 AM, Leo Yan wrote:
> > > Hi Eric, Robin,
> > > 
> > > On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
> > > 
> > > [...]
> > > 
> > > > > If the NIC supports MSIs they logically are used. This can be easily
> > > > > checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
> > > > > check whether the guest received any interrupt? I remember that Robin
> > > > > said in the past that on Juno, the MSI doorbell was in the PCI host
> > > > > bridge window and possibly transactions towards the doorbell could not
> > > > > reach it since considered as peer to peer.
> > > > 
> > > > I found back Robin's explanation. It was not related to MSI IOVA being
> > > > within the PCI host bridge window but RAM GPA colliding with host PCI
> > > > config space?
> > > > 
> > > > "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
> > > > typically have a programmable address, so could be anywhere. In the more
> > > > general category of "special hardware addresses", QEMU's default ARM
> > > > guest memory map puts RAM starting at 0x40000000; on the ARM Juno
> > > > platform, that happens to be where PCI config space starts; as Juno's
> > > > PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
> > > > the PCI bus to a guest (all of it, given the lack of ACS), the root
> > > > complex just sees the guest's attempts to DMA to "memory" as the device
> > > > attempting to access config space and aborts them."
> > > 
> > > Below is some following investigation at my side:
> > > 
> > > Firstly, must admit that I don't understand well for up paragraph; so
> > > based on the description I am wandering if can use INTx mode and if
> > > it's lucky to avoid this hardware pitfall.
> > 
> > The problem above is that during the assignment process, the virtualizer
> > maps the whole guest RAM though the IOMMU (+ the MSI doorbell on ARM) to
> > allow the device, programmed in GPA to access the whole guest RAM.
> > Unfortunately if the device emits a DMA request with 0x40000000 IOVA
> > address, this IOVA is interpreted by the Juno RC as a transaction
> > towards the PCIe config space. So this DMA request will not go beyond
> > the RC, will never reach the IOMMU and will never reach the guest RAM.
> > So globally the device is not able to reach part of the guest RAM.
> > That's how I interpret the above statement. Then I don't know the
> > details of the collision, I don't have access to this HW. I don't know
> > either if this problem still exists on the r2 HW.

Thanks a lot for rephrasing, Eric :)

> The short answer is that if you want PCI passthrough to work on Juno, the
> guest memory map has to look like a Juno.
> 
> The PCIe root complex uses an internal lookup table to generate appropriate
> AXI attributes for outgoing PCIe transactions; unfortunately this has no
> notion of 'default' attributes, so addresses *must* match one of the
> programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
> window covering the lower DRAM bank, an 8GB window covering the upper DRAM
> bank, and a 1MB (or thereabouts) window covering the GICv2m region with
> Device attributes.

I checked the kernel memblock info, and it gives the result below:

root@debian:~# cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000feffffff
   1: 0x0000000880000000..0x00000009ffffffff

So I think the lower 2GB DRAM window is [0x8000_0000..0xfeff_ffff]
and the upper DRAM window is [0x8_8000_0000..0x9_ffff_ffff].

BTW, I am now using U-Boot rather than UEFI, so I am not sure whether
U-Boot has programmed the memory windows for PCIe.  Could you point me
to which registers are set by UEFI, so I can also check the related
configuration in U-Boot?

> Any PCIe transactions to addresses not within one of
> those windows will be aborted by the RC without ever going out to the AXI
> side where the SMMU lies (and I think anything matching the config space or
> I/O space windows or a region claimed by a BAR will be aborted even earlier
> as a peer-to-peer attempt regardless of the AXI Translation Table setup).
> 
> You could potentially modify the firmware to change the window
> configuration, but the alignment restrictions make it awkward. I've only
> ever tested passthrough on Juno using kvmtool, which IIRC already has guest
> RAM in an appropriate place (and is trivially easy to hack if not) - I don't
> remember if I ever actually tried guest MSI with that.

I made several attempts with kvmtool to tweak the memory regions, but
with no luck.  Since the host uses [0x8000_0000..0xfeff_ffff] as the
first valid memory window for PCIe, I tried to move all of the
memory/io regions into this window with the changes below, but again
with no luck:

diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
index b9d486d..43f78b1 100644
--- a/arm/include/arm-common/kvm-arch.h
+++ b/arm/include/arm-common/kvm-arch.h
@@ -7,10 +7,10 @@

 #include "arm-common/gic.h"

-#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
-#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
-#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
-#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
+#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
+#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
+#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
+#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)

Anyway, I really appreciate the suggestions; they are sufficient for me
to dig further into the memory-related details (e.g. PCIe
configuration, IOMMU, etc.) and I will keep you posted if I make any
progress.

Thanks,
Leo Yan

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-16  4:56                     ` Leo Yan
@ 2019-03-18 12:25                       ` Robin Murphy
  2019-03-19  1:33                         ` Leo Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Robin Murphy @ 2019-03-18 12:25 UTC (permalink / raw)
  To: Leo Yan; +Cc: Daniel Thompson, kvmarm

On 16/03/2019 04:56, Leo Yan wrote:
> Hi Robin,
> 
> On Fri, Mar 15, 2019 at 12:54:10PM +0000, Robin Murphy wrote:
>> Hi Leo,
>>
>> Sorry for the delay - I'm on holiday this week, but since I've made the
>> mistake of glancing at my inbox I should probably save you from wasting any
>> more time...
> 
> Sorry for disturbing you in holiday and appreciate your help.  It's no
> rush to reply.
> 
>> On 2019-03-15 11:03 am, Auger Eric wrote:
>>> Hi Leo,
>>>
>>> + Jean-Philippe
>>>
>>> On 3/15/19 10:37 AM, Leo Yan wrote:
>>>> Hi Eric, Robin,
>>>>
>>>> On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:
>>>>
>>>> [...]
>>>>
>>>>>> If the NIC supports MSIs they logically are used. This can be easily
>>>>>> checked on host by issuing "cat /proc/interrupts | grep vfio". Can you
>>>>>> check whether the guest received any interrupt? I remember that Robin
>>>>>> said in the past that on Juno, the MSI doorbell was in the PCI host
>>>>>> bridge window and possibly transactions towards the doorbell could not
>>>>>> reach it since considered as peer to peer.
>>>>>
>>>>> I found back Robin's explanation. It was not related to MSI IOVA being
>>>>> within the PCI host bridge window but RAM GPA colliding with host PCI
>>>>> config space?
>>>>>
>>>>> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
>>>>> typically have a programmable address, so could be anywhere. In the more
>>>>> general category of "special hardware addresses", QEMU's default ARM
>>>>> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
>>>>> platform, that happens to be where PCI config space starts; as Juno's
>>>>> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
>>>>> the PCI bus to a guest (all of it, given the lack of ACS), the root
>>>>> complex just sees the guest's attempts to DMA to "memory" as the device
>>>>> attempting to access config space and aborts them."
>>>>
>>>> Below is some following investigation at my side:
>>>>
>>>> Firstly, must admit that I don't understand well for up paragraph; so
>>>> based on the description I am wandering if can use INTx mode and if
>>>> it's lucky to avoid this hardware pitfall.
>>>
>>> The problem above is that during the assignment process, the virtualizer
>>> maps the whole guest RAM though the IOMMU (+ the MSI doorbell on ARM) to
>>> allow the device, programmed in GPA to access the whole guest RAM.
>>> Unfortunately if the device emits a DMA request with 0x40000000 IOVA
>>> address, this IOVA is interpreted by the Juno RC as a transaction
>>> towards the PCIe config space. So this DMA request will not go beyond
>>> the RC, will never reach the IOMMU and will never reach the guest RAM.
>>> So globally the device is not able to reach part of the guest RAM.
>>> That's how I interpret the above statement. Then I don't know the
>>> details of the collision, I don't have access to this HW. I don't know
>>> either if this problem still exists on the r2 HW.
> 
> Thanks a lot for rephrasing, Eric :)
> 
>> The short answer is that if you want PCI passthrough to work on Juno, the
>> guest memory map has to look like a Juno.
>>
>> The PCIe root complex uses an internal lookup table to generate appropriate
>> AXI attributes for outgoing PCIe transactions; unfortunately this has no
>> notion of 'default' attributes, so addresses *must* match one of the
>> programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
>> window covering the lower DRAM bank, an 8GB window covering the upper DRAM
>> bank, and a 1MB (or thereabouts) window covering the GICv2m region with
>> Device attributes.
> 
> I checked kernel memory blocks info, it gives out below result:
> 
> root@debian:~# cat /sys/kernel/debug/memblock/memory
>     0: 0x0000000080000000..0x00000000feffffff
>     1: 0x0000000880000000..0x00000009ffffffff
> 
> So I think the lower 2GB DRAM window is: [0x8000_0000..0xfeff_ffff]
> and the high DRAM window is [0x8_8000_0000..0x9_ffff_ffff].
> 
> BTW, now I am using uboot rather than UEFI, so not sure if uboot has
> programmed memory windows for PCIe.  Could you help give a point for
> which registers should be set in UEFI thus I also can check related
> configurations in uboot?

U-Boot does the same thing[1] - you can confirm that by checking 
whether PCIe works at all on the host ;)

>> Any PCIe transactions to addresses not within one of
>> those windows will be aborted by the RC without ever going out to the AXI
>> side where the SMMU lies (and I think anything matching the config space or
>> I/O space windows or a region claimed by a BAR will be aborted even earlier
>> as a peer-to-peer attempt regardless of the AXI Translation Table setup).
>>
>> You could potentially modify the firmware to change the window
>> configuration, but the alignment restrictions make it awkward. I've only
>> ever tested passthrough on Juno using kvmtool, which IIRC already has guest
>> RAM in an appropriate place (and is trivially easy to hack if not) - I don't
>> remember if I ever actually tried guest MSI with that.
> 
> I did several tries with kvmtool to tweak memory regions but it's no
> lucky.  Since the host uses [0x8000_0000..0xfeff_ffff] as the first
> valid memory window for PCIe, thus I tried to change all memory/io
> regions into this window with below changes but it's no lucky:
> 
> diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> index b9d486d..43f78b1 100644
> --- a/arm/include/arm-common/kvm-arch.h
> +++ b/arm/include/arm-common/kvm-arch.h
> @@ -7,10 +7,10 @@
> 
>   #include "arm-common/gic.h"
> 
> -#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
> -#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
> -#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
> -#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
> +#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
> +#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
> +#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
> +#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)
> 
> Anyway, very appreciate for the suggestions; it's sufficent for me to
> dig more for memory related information (e.g. PCIe configurations,
> IOMMU, etc) and will keep posted if I make any progress.

None of those should need to change (all the MMIO emulation stuff is 
irrelevant to PCIe DMA anyway) - provided you don't give the guest more 
than 2GB of RAM, passthrough with legacy INTx ought to work 
out-of-the-box. For MSIs to get through, you'll further need to change 
the host kernel to place its software MSI region[2] within any of the 
host bridge windows as well.

Robin.

[1] 
http://git.denx.de/?p=u-boot.git;a=blob;f=board/armltd/vexpress64/pcie.c;h=0608a5a88b941cdd362e9f231250a981aebab357;hb=HEAD#l95
[2] MSI_IOVA_BASE in drivers/iommu/arm-smmu.c
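
For reference, the software MSI region in question looks roughly like
the snippet below (approximate mainline code); moving it would mean
picking a base inside one of the host bridge windows described earlier,
e.g. somewhere in the lower DRAM window, though that particular choice
is an untested assumption:

/* drivers/iommu/arm-smmu.c (approximate) */
#define MSI_IOVA_BASE			0x8000000
#define MSI_IOVA_LENGTH			0x100000

static void arm_smmu_get_resv_regions(struct device *dev,
				      struct list_head *head)
{
	struct iommu_resv_region *region;
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

	/* This IOVA range is reserved for MSI mapping by the VFIO layer. */
	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI);
	if (!region)
		return;

	list_add_tail(&region->list, head);

	iommu_dma_get_resv_regions(dev, head);
}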

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-18 12:25                       ` Robin Murphy
@ 2019-03-19  1:33                         ` Leo Yan
  2019-03-20  8:42                           ` Leo Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Leo Yan @ 2019-03-19  1:33 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Daniel Thompson, kvmarm

Hi Robin,

On Mon, Mar 18, 2019 at 12:25:33PM +0000, Robin Murphy wrote:

[...]

> > diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> > index b9d486d..43f78b1 100644
> > --- a/arm/include/arm-common/kvm-arch.h
> > +++ b/arm/include/arm-common/kvm-arch.h
> > @@ -7,10 +7,10 @@
> > 
> >   #include "arm-common/gic.h"
> > 
> > -#define ARM_IOPORT_AREA                _AC(0x0000000000000000, UL)
> > -#define ARM_MMIO_AREA          _AC(0x0000000000010000, UL)
> > -#define ARM_AXI_AREA           _AC(0x0000000040000000, UL)
> > -#define ARM_MEMORY_AREA                _AC(0x0000000080000000, UL)
> > +#define ARM_IOPORT_AREA                _AC(0x0000000080000000, UL)
> > +#define ARM_MMIO_AREA          _AC(0x0000000080010000, UL)
> > +#define ARM_AXI_AREA           _AC(0x0000000088000000, UL)
> > +#define ARM_MEMORY_AREA                _AC(0x0000000090000000, UL)
> > 
> > Anyway, very appreciate for the suggestions; it's sufficent for me to
> > dig more for memory related information (e.g. PCIe configurations,
> > IOMMU, etc) and will keep posted if I make any progress.
> 
> None of those should need to change (all the MMIO emulation stuff is
> irrelevant to PCIe DMA anyway) - provided you don't give the guest more than
> 2GB of RAM, passthrough with legacy INTx ought to work out-of-the-box. For
> MSIs to get through, you'll further need to change the host kernel to place
> its software MSI region[2] within any of the host bridge windows as well.

From dumping the PCI configuration I can see that, after launching the
guest with kvmtool, the host receives the first interrupt (I checked
that vfio_intx_handler() is invoked once) and then the
PCI_COMMAND_INTX_DISABLE bit is set to disable the interrupt line.  So
in this flow it is very likely the interrupt is not forwarded properly
and the guest doesn't receive it.
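
As a side note, the INTx disable bit can be read back from userspace
through the vfio-pci config region with something like the sketch below
(just a sketch; the device fd is a placeholder and error handling is
minimal):

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/pci_regs.h>
#include <linux/vfio.h>

/*
 * Read PCI_COMMAND via the vfio-pci config space region and report
 * whether the INTx disable bit (PCI_COMMAND_INTX_DISABLE) is set.
 */
static int intx_disabled(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_CONFIG_REGION_INDEX,
	};
	unsigned short cmd;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
		return -1;
	if (pread(device_fd, &cmd, sizeof(cmd),
		  info.offset + PCI_COMMAND) != sizeof(cmd))
		return -1;

	return !!(cmd & PCI_COMMAND_INTX_DISABLE);
}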

Luckily, I found the flow below allows interrupts to be forwarded from
host to guest, once I set "sky2.disable_msi=1" on both kernel command
lines:

    host                    guest

  INTx mode               INTx mode

So far, it still does not work if I set "sky2.disable_msi=1" only on
the host kernel command line; with this config it runs through the flow
below, which cannot forward interrupts properly from host to guest:

    host                    guest

  INTx mode               msi enable
                          msi disable
                          Switch back to INTx mode

I am very happy that I can now use pure INTx mode on the Juno board to
enable the NIC, and I successfully pinged my router from the guest OS :)

I will look into the issue in the second scenario, and if I have more
time I will look into MSI mode as well (I confirmed MSI mode can work
with the host OS but fails in the guest OS).

I really appreciate the help from you & Eric!

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
  2019-03-19  1:33                         ` Leo Yan
@ 2019-03-20  8:42                           ` Leo Yan
  0 siblings, 0 replies; 20+ messages in thread
From: Leo Yan @ 2019-03-20  8:42 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Daniel Thompson, kvmarm

On Tue, Mar 19, 2019 at 09:33:58AM +0800, Leo Yan wrote:

[...]

> So far, it still cannot work well if I only set "sky2.disable_msi=1"
> for host kernel command line, with this config it runs with below flow
> and which cannot forward interrupt properly from host to guest:
> 
>     host                    guest
> 
>   INTx mode               msi enable
>                           msi disable
>                           Switch back to INTx mode

Just as a heads-up: for enabling vfio-pci with the NIC device on the
Juno board, I sent out two patch sets with the related changes.  With
these changes, INTx mode works well and the failure case above is also
handled.

  kvmtool changes:
  https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035186.html

  Juno DT binding for PCIe SMMU:
  https://archive.armlinux.org.uk/lurker/message/20190320.083105.f002c91c.en.html

@Robin, if you find an issue with the Juno DT binding in my patch and
want to send your own patch for the fix, please feel free to let me
know.  I am happy to test it and drop mine.  Thanks!

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-03-20  8:42 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-11  6:42 Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2 Leo Yan
2019-03-11  6:57 ` Leo Yan
2019-03-11  8:23 ` Auger Eric
2019-03-11  9:39   ` Leo Yan
2019-03-11  9:47     ` Auger Eric
2019-03-11 14:35       ` Leo Yan
2019-03-13  8:00         ` Leo Yan
2019-03-13 10:01           ` Leo Yan
2019-03-13 10:16             ` Auger Eric
2019-03-13 10:01           ` Auger Eric
2019-03-13 10:24             ` Auger Eric
2019-03-13 11:52               ` Leo Yan
2019-03-15  9:37               ` Leo Yan
2019-03-15 11:03                 ` Auger Eric
2019-03-15 12:54                   ` Robin Murphy
2019-03-16  4:56                     ` Leo Yan
2019-03-18 12:25                       ` Robin Murphy
2019-03-19  1:33                         ` Leo Yan
2019-03-20  8:42                           ` Leo Yan
2019-03-13 11:35             ` Leo Yan
