linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Write to srvio_numvfs triggers kernel panic
@ 2022-05-04 19:56 Volodymyr Babchuk
  2022-05-06 20:17 ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-04 19:56 UTC (permalink / raw)
  To: linux-pci


Hello,

I have encountered issue when PCI code tries to use both fields in

        union {
		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
		struct pci_dev		*physfn;	/* VF: related PF */
	};

(which are part of struct pci_dev) at the same time.

Symptoms are following:

# echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
 Unable to handle kernel paging request at virtual address 0001000200000010
 Mem abort info:
   ESR = 0x96000004
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 [0001000200000010] address between user and kernel address ranges
 Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
 CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
 Hardware name: XENVM-4.17 (DT)
 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
 pc : pcie_aspm_get_link+0x90/0xcc
 lr : pcie_aspm_get_link+0x8c/0xcc
 sp : ffff8000130d39c0
 x29: ffff8000130d39c0 x28: 00000000000001a4
 x27: 00000000ffffee4b x26: ffff80001164f560
 x25: 0000000000000000 x24: 0000000000000000
 x23: ffff80001164f660 x22: 0000000000000000
 x23: ffff80001164f660 x22: 0000000000000000
 x21: ffff000003f08000 x20: ffff800010db37d8
 x19: ffff000004b8e780 x18: ffffffffffffffff
 x17: 0000000000000000 x16: 00000000deadbeef
 x15: ffff8000930d36c7 x14: 0000000000000006
 x13: ffff8000115c2710 x12: 000000000000081c
 x11: 00000000000002b4 x10: ffff8000115c2710
 x9 : ffff8000115c2710 x8 : 00000000ffffefff
 x7 : ffff80001161a710 x6 : ffff80001161a710
 x5 : ffff00003fdad900 x4 : 0000000000000000
 x3 : 0000000000000000 x2 : 0000000000000000
 x1 : ffff000003c51c80 x0 : 0001000200000000
 Call trace:
  pcie_aspm_get_link+0x90/0xcc
  aspm_ctrl_attrs_are_visible+0x30/0xc0
  internal_create_group+0xd0/0x3cc
  internal_create_groups.part.0+0x4c/0xc0
  sysfs_create_groups+0x20/0x34
  device_add+0x2b4/0x760
  pci_device_add+0x814/0x854
  pci_iov_add_virtfn+0x240/0x2f0
  sriov_enable+0x1f8/0x474
  pci_sriov_configure_simple+0x38/0x90
  sriov_numvfs_store+0xa4/0x1a0
  dev_attr_store+0x1c/0x30
  sysfs_kf_write+0x48/0x60
  kernfs_fop_write_iter+0x118/0x1ac
  new_sync_write+0xe8/0x184
  vfs_write+0x23c/0x2a0
  ksys_write+0x68/0xf4
  __arm64_sys_write+0x20/0x2c
  el0_svc_common.constprop.0+0x78/0x1a0
  do_el0_svc+0x28/0x94
  el0_svc+0x14/0x20
  el0_sync_handler+0xa4/0x130
  el0_sync+0x180/0x1c0
Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)


Debugging showed the following:

pci_iov_add_virtfn() allocates new struct pci_dev:

	virtfn = pci_alloc_dev(bus);
and sets physfn:
	virtfn->is_virtfn = 1;
	virtfn->physfn = pci_dev_get(dev);

then we will get into sriov_init() via the following call path:

pci_device_add(virtfn, virtfn->bus);
  pci_init_capabilities(dev);
    pci_iov_init(dev);
      sriov_init(dev, pos);


sriov_init() overwrites value in the union:
	dev->sriov = iov; <<<<<---- There
	dev->is_physfn = 1;

Next, we will get into function that causes panic:

pci_device_add(virtfn, virtfn->bus);
  device_add(&dev->dev)
    sysfs_create_groups()
      internal_create_group()
        aspm_ctrl_attrs_are_visible()
          pcie_aspm_get_link()
            pci_upstream_bridge()
              pci_physfn()

which is

static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_IOV
	if (dev->is_virtfn)
		dev = dev->physfn;
#endif
	return dev;
}


as is_virtfn == 1 and dev->physfn was overwritten via write to
dev->sriov, pci_physfn() will return pointer to struct pci_sriov
allocated by sriov_init(). And then pci_upstream_bridge() will
cause panic by acessing it as if it were pointer to struct pci_dev

I encountered this issue on ARM64, Linux 5.10.41. Tried to test
on master branch, but it is quite difficult to rebase platform
code on the master. But I compared all relevant parts of PCI code
and didn't found any differences.

Looks like I am missing following, because how SR-IOV can be so broken?
But what exactly?

-- 
Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-04 19:56 Write to srvio_numvfs triggers kernel panic Volodymyr Babchuk
@ 2022-05-06 20:17 ` Bjorn Helgaas
  2022-05-07  1:34   ` Jason Gunthorpe
  2022-05-07 10:22   ` Volodymyr Babchuk
  0 siblings, 2 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-06 20:17 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe

[+cc Alex, Leon, Jason]

On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
> 
> Hello,
> 
> I have encountered issue when PCI code tries to use both fields in
> 
>         union {
> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
> 		struct pci_dev		*physfn;	/* VF: related PF */
> 	};
> 
> (which are part of struct pci_dev) at the same time.
> 
> Symptoms are following:
> 
> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> 
> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>  Unable to handle kernel paging request at virtual address 0001000200000010
>  Mem abort info:
>    ESR = 0x96000004
>    EC = 0x25: DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>  Data abort info:
>    ISV = 0, ISS = 0x00000004
>    CM = 0, WnR = 0
>  [0001000200000010] address between user and kernel address ranges
>  Internal error: Oops: 96000004 [#1] PREEMPT SMP
> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>  Hardware name: XENVM-4.17 (DT)
>  pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>  pc : pcie_aspm_get_link+0x90/0xcc
>  lr : pcie_aspm_get_link+0x8c/0xcc
>  sp : ffff8000130d39c0
>  x29: ffff8000130d39c0 x28: 00000000000001a4
>  x27: 00000000ffffee4b x26: ffff80001164f560
>  x25: 0000000000000000 x24: 0000000000000000
>  x23: ffff80001164f660 x22: 0000000000000000
>  x23: ffff80001164f660 x22: 0000000000000000
>  x21: ffff000003f08000 x20: ffff800010db37d8
>  x19: ffff000004b8e780 x18: ffffffffffffffff
>  x17: 0000000000000000 x16: 00000000deadbeef
>  x15: ffff8000930d36c7 x14: 0000000000000006
>  x13: ffff8000115c2710 x12: 000000000000081c
>  x11: 00000000000002b4 x10: ffff8000115c2710
>  x9 : ffff8000115c2710 x8 : 00000000ffffefff
>  x7 : ffff80001161a710 x6 : ffff80001161a710
>  x5 : ffff00003fdad900 x4 : 0000000000000000
>  x3 : 0000000000000000 x2 : 0000000000000000
>  x1 : ffff000003c51c80 x0 : 0001000200000000
>  Call trace:
>   pcie_aspm_get_link+0x90/0xcc
>   aspm_ctrl_attrs_are_visible+0x30/0xc0
>   internal_create_group+0xd0/0x3cc
>   internal_create_groups.part.0+0x4c/0xc0
>   sysfs_create_groups+0x20/0x34
>   device_add+0x2b4/0x760
>   pci_device_add+0x814/0x854
>   pci_iov_add_virtfn+0x240/0x2f0
>   sriov_enable+0x1f8/0x474
>   pci_sriov_configure_simple+0x38/0x90
>   sriov_numvfs_store+0xa4/0x1a0
>   dev_attr_store+0x1c/0x30
>   sysfs_kf_write+0x48/0x60
>   kernfs_fop_write_iter+0x118/0x1ac
>   new_sync_write+0xe8/0x184
>   vfs_write+0x23c/0x2a0
>   ksys_write+0x68/0xf4
>   __arm64_sys_write+0x20/0x2c
>   el0_svc_common.constprop.0+0x78/0x1a0
>   do_el0_svc+0x28/0x94
>   el0_svc+0x14/0x20
>   el0_sync_handler+0xa4/0x130
>   el0_sync+0x180/0x1c0
> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
> 
> 
> Debugging showed the following:
> 
> pci_iov_add_virtfn() allocates new struct pci_dev:
> 
> 	virtfn = pci_alloc_dev(bus);
> and sets physfn:
> 	virtfn->is_virtfn = 1;
> 	virtfn->physfn = pci_dev_get(dev);
> 
> then we will get into sriov_init() via the following call path:
> 
> pci_device_add(virtfn, virtfn->bus);
>   pci_init_capabilities(dev);
>     pci_iov_init(dev);
>       sriov_init(dev, pos);

We called pci_device_add() with the VF.  pci_iov_init() only calls
sriov_init() if it finds an SR-IOV capability on the device:

  pci_iov_init(struct pci_dev *dev)
    pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
    if (pos)
      return sriov_init(dev, pos);

So this means the VF must have an SR-IOV capability, which sounds a
little dubious.  From PCIe r6.0:

  9.3.3 SR-IOV Extended Capability

  The SR-IOV Extended Capability defined here is a PCIe extended
  capability that must be implemented in each PF that supports SR-IOV.
  This Capability is used to describe and control a PF’s SR-IOV
  Capabilities.

  For Multi-Function Devices, each PF that supports SR-IOV shall
  provide the Capability structure defined in this section.  This
  Capability structure may be present in any Function with a Type 0
  Configuration Space Header. This Capability must not be present in
  Functions with a Type 1 Configuration Space Header.

Can you supply the output of "sudo lspci -vv" for your system?

It could be that the device has an SR-IOV capability when it
shouldn't.  But even if it does, Linux could tolerate that better
than it does today.

> sriov_init() overwrites value in the union:
> 	dev->sriov = iov; <<<<<---- There
> 	dev->is_physfn = 1;
> 
> Next, we will get into function that causes panic:
> 
> pci_device_add(virtfn, virtfn->bus);
>   device_add(&dev->dev)
>     sysfs_create_groups()
>       internal_create_group()
>         aspm_ctrl_attrs_are_visible()
>           pcie_aspm_get_link()
>             pci_upstream_bridge()
>               pci_physfn()
> 
> which is
> 
> static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> {
> #ifdef CONFIG_PCI_IOV
> 	if (dev->is_virtfn)
> 		dev = dev->physfn;
> #endif
> 	return dev;
> }
> 
> 
> as is_virtfn == 1 and dev->physfn was overwritten via write to
> dev->sriov, pci_physfn() will return pointer to struct pci_sriov
> allocated by sriov_init(). And then pci_upstream_bridge() will
> cause panic by acessing it as if it were pointer to struct pci_dev
> 
> I encountered this issue on ARM64, Linux 5.10.41. Tried to test
> on master branch, but it is quite difficult to rebase platform
> code on the master. But I compared all relevant parts of PCI code
> and didn't found any differences.
> 
> Looks like I am missing following, because how SR-IOV can be so broken?
> But what exactly?
> 
> -- 
> Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-06 20:17 ` Bjorn Helgaas
@ 2022-05-07  1:34   ` Jason Gunthorpe
  2022-05-07 10:25     ` Volodymyr Babchuk
  2022-05-07 10:22   ` Volodymyr Babchuk
  1 sibling, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2022-05-07  1:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Volodymyr Babchuk, linux-pci, Alex Williamson, Leon Romanovsky

On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:

> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> >  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> >  Hardware name: XENVM-4.17 (DT)
                ^^^^^^^^^^^^^^^^^

> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious.  From PCIe r6.0:

Enabling SRIOV from within a VM is "exciting" - I would not be
surprised if there was some wonky bugs here.

Jsaon

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-06 20:17 ` Bjorn Helgaas
  2022-05-07  1:34   ` Jason Gunthorpe
@ 2022-05-07 10:22   ` Volodymyr Babchuk
  2022-05-07 15:41     ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-07 10:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe


Hello Bjorn,

Bjorn Helgaas <helgaas@kernel.org> writes:

> [+cc Alex, Leon, Jason]
>
> On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hello,
>> 
>> I have encountered issue when PCI code tries to use both fields in
>> 
>>         union {
>> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
>> 		struct pci_dev		*physfn;	/* VF: related PF */
>> 	};
>> 
>> (which are part of struct pci_dev) at the same time.
>> 
>> Symptoms are following:
>> 
>> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>> 
>> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>>  Unable to handle kernel paging request at virtual address 0001000200000010
>>  Mem abort info:
>>    ESR = 0x96000004
>>    EC = 0x25: DABT (current EL), IL = 32 bits
>>    SET = 0, FnV = 0
>>    EA = 0, S1PTW = 0
>>  Data abort info:
>>    ISV = 0, ISS = 0x00000004
>>    CM = 0, WnR = 0
>>  [0001000200000010] address between user and kernel address ranges
>>  Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>>  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>>  Hardware name: XENVM-4.17 (DT)
>>  pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>>  pc : pcie_aspm_get_link+0x90/0xcc
>>  lr : pcie_aspm_get_link+0x8c/0xcc
>>  sp : ffff8000130d39c0
>>  x29: ffff8000130d39c0 x28: 00000000000001a4
>>  x27: 00000000ffffee4b x26: ffff80001164f560
>>  x25: 0000000000000000 x24: 0000000000000000
>>  x23: ffff80001164f660 x22: 0000000000000000
>>  x23: ffff80001164f660 x22: 0000000000000000
>>  x21: ffff000003f08000 x20: ffff800010db37d8
>>  x19: ffff000004b8e780 x18: ffffffffffffffff
>>  x17: 0000000000000000 x16: 00000000deadbeef
>>  x15: ffff8000930d36c7 x14: 0000000000000006
>>  x13: ffff8000115c2710 x12: 000000000000081c
>>  x11: 00000000000002b4 x10: ffff8000115c2710
>>  x9 : ffff8000115c2710 x8 : 00000000ffffefff
>>  x7 : ffff80001161a710 x6 : ffff80001161a710
>>  x5 : ffff00003fdad900 x4 : 0000000000000000
>>  x3 : 0000000000000000 x2 : 0000000000000000
>>  x1 : ffff000003c51c80 x0 : 0001000200000000
>>  Call trace:
>>   pcie_aspm_get_link+0x90/0xcc
>>   aspm_ctrl_attrs_are_visible+0x30/0xc0
>>   internal_create_group+0xd0/0x3cc
>>   internal_create_groups.part.0+0x4c/0xc0
>>   sysfs_create_groups+0x20/0x34
>>   device_add+0x2b4/0x760
>>   pci_device_add+0x814/0x854
>>   pci_iov_add_virtfn+0x240/0x2f0
>>   sriov_enable+0x1f8/0x474
>>   pci_sriov_configure_simple+0x38/0x90
>>   sriov_numvfs_store+0xa4/0x1a0
>>   dev_attr_store+0x1c/0x30
>>   sysfs_kf_write+0x48/0x60
>>   kernfs_fop_write_iter+0x118/0x1ac
>>   new_sync_write+0xe8/0x184
>>   vfs_write+0x23c/0x2a0
>>   ksys_write+0x68/0xf4
>>   __arm64_sys_write+0x20/0x2c
>>   el0_svc_common.constprop.0+0x78/0x1a0
>>   do_el0_svc+0x28/0x94
>>   el0_svc+0x14/0x20
>>   el0_sync_handler+0xa4/0x130
>>   el0_sync+0x180/0x1c0
>> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
>> 
>> 
>> Debugging showed the following:
>> 
>> pci_iov_add_virtfn() allocates new struct pci_dev:
>> 
>> 	virtfn = pci_alloc_dev(bus);
>> and sets physfn:
>> 	virtfn->is_virtfn = 1;
>> 	virtfn->physfn = pci_dev_get(dev);
>> 
>> then we will get into sriov_init() via the following call path:
>> 
>> pci_device_add(virtfn, virtfn->bus);
>>   pci_init_capabilities(dev);
>>     pci_iov_init(dev);
>>       sriov_init(dev, pos);
>
> We called pci_device_add() with the VF.  pci_iov_init() only calls
> sriov_init() if it finds an SR-IOV capability on the device:
>
>   pci_iov_init(struct pci_dev *dev)
>     pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
>     if (pos)
>       return sriov_init(dev, pos);
>
> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious.  From PCIe r6.0:

[...]

Yes, I dived into debugging and came to the same conclusions. I'm still
investigating this, but looks like my PCIe controller (DesignWare-based)
incorrectly reads configuration space for VF. Looks like instead of
providing access VF config space, it reads PF's one.

>
> Can you supply the output of "sudo lspci -vv" for your system?

Sure:

root@spider:~# lspci -vv
00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 189
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: [disabled]
        Memory behind bridge: 30000000-301fffff [size=2M]
        Prefetchable memory behind bridge: [disabled]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
                Address: 0000000004030040  Data: 0000
                Masking: fffffffe  Pending: 00000000
        Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (ok), Width x2 (ok)
                        TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
                RootCap: CRSVisible-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn- NFERptEn- FERptEn-
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [19c v1] Lane Margining at the Receiver <?>
        Capabilities: [1bc v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
        Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [304 v1] Data Link Feature <?>
        Capabilities: [310 v1] Precision Time Measurement
                PTMCap: Requester:+ Responder:+ Root:+
                PTMClockGranularity: 16ns
                PTMControl: Enabled:- RootSelected:-
                PTMEffectiveGranularity: Unknown
        Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
        Kernel driver in use: pcieport
        Kernel modules: pci_endpoint_test

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a809
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 0
        NUMA node: 0
        Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
        Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00                                                                                                                               [8/5710]
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00004000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
        Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
        Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 2, stride: 1, Device ID: a824
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [3a4 v1] Data Link Feature <?>
        Kernel driver in use: nvme
        Kernel modules: nvme


> It could be that the device has an SR-IOV capability when it
> shouldn't.  But even if it does, Linux could tolerate that better
> than it does today.
>

Agree there. I can create simple patch that checks for is_virtfn
in sriov_init(). But what to do if it is set?

>> sriov_init() overwrites value in the union:
>> 	dev->sriov = iov; <<<<<---- There
>> 	dev->is_physfn = 1;
>> 

-- 
Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-07  1:34   ` Jason Gunthorpe
@ 2022-05-07 10:25     ` Volodymyr Babchuk
  2022-05-08 11:19       ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-07 10:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, linux-pci, Alex Williamson, Leon Romanovsky



Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
>
>> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>> >  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>> >  Hardware name: XENVM-4.17 (DT)
>                 ^^^^^^^^^^^^^^^^^
>
>> So this means the VF must have an SR-IOV capability, which sounds a
>> little dubious.  From PCIe r6.0:
>
> Enabling SRIOV from within a VM is "exciting" - I would not be
> surprised if there was some wonky bugs here.

Well, yes. But in this case, this VM has direct access to the PCIe
controller. So it should not cause any troubles. I'll try baremetal
setup, though. 


-- 
Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-07 10:22   ` Volodymyr Babchuk
@ 2022-05-07 15:41     ` Bjorn Helgaas
  2022-05-08 11:07       ` Volodymyr Babchuk
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-07 15:41 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe

On Sat, May 07, 2022 at 10:22:32AM +0000, Volodymyr Babchuk wrote:
> Bjorn Helgaas <helgaas@kernel.org> writes:
> > On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
> >> 
> >> I have encountered issue when PCI code tries to use both fields in
> >> 
> >>         union {
> >> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
> >> 		struct pci_dev		*physfn;	/* VF: related PF */
> >> 	};
> >> 
> >> (which are part of struct pci_dev) at the same time.
> >> 
> >> Symptoms are following:
> >> 
> >> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> >> 
> >> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
> >> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
> >>  Unable to handle kernel paging request at virtual address 0001000200000010

> >> Debugging showed the following:
> >> 
> >> pci_iov_add_virtfn() allocates new struct pci_dev:
> >> 
> >> 	virtfn = pci_alloc_dev(bus);
> >> and sets physfn:
> >> 	virtfn->is_virtfn = 1;
> >> 	virtfn->physfn = pci_dev_get(dev);
> >> 
> >> then we will get into sriov_init() via the following call path:
> >> 
> >> pci_device_add(virtfn, virtfn->bus);
> >>   pci_init_capabilities(dev);
> >>     pci_iov_init(dev);
> >>       sriov_init(dev, pos);
> >
> > We called pci_device_add() with the VF.  pci_iov_init() only calls
> > sriov_init() if it finds an SR-IOV capability on the device:
> >
> >   pci_iov_init(struct pci_dev *dev)
> >     pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> >     if (pos)
> >       return sriov_init(dev, pos);
> >
> > So this means the VF must have an SR-IOV capability, which sounds a
> > little dubious.  From PCIe r6.0:
> 
> [...]
> 
> Yes, I dived into debugging and came to the same conclusions. I'm still
> investigating this, but looks like my PCIe controller (DesignWare-based)
> incorrectly reads configuration space for VF. Looks like instead of
> providing access VF config space, it reads PF's one.
> 
> > Can you supply the output of "sudo lspci -vv" for your system?
> 
> Sure:
> 
> root@spider:~# lspci -vv
> 00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 189
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: [disabled]
>         Memory behind bridge: 30000000-301fffff [size=2M]
>         Prefetchable memory behind bridge: [disabled]
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>         BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
>                 Address: 0000000004030040  Data: 0000
>                 Masking: fffffffe  Pending: 00000000
>         Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0
>                         ExtTag+ RBE+
>                 DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
>                         ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (ok), Width x2 (ok)
>                         TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
>                 RootCap: CRSVisible-
>                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>                 DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
>                          AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
>                          AtomicOpsCtl: ReqEn- EgressBlck-
>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>                 RootCmd: CERptEn- NFERptEn- FERptEn-
>                 RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>                          FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>                 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
>         Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
>         Capabilities: [158 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [19c v1] Lane Margining at the Receiver <?>
>         Capabilities: [1bc v1] L1 PM Substates
>                 L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>                           PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
>                 L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
>                            T_CommonMode=0us LTR1.2_Threshold=0ns
>                 L1SubCtl2: T_PwrOn=10us
>         Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>         Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>         Capabilities: [304 v1] Data Link Feature <?>
>         Capabilities: [310 v1] Precision Time Measurement
>                 PTMCap: Requester:+ Responder:+ Root:+
>                 PTMClockGranularity: 16ns
>                 PTMControl: Enabled:- RootSelected:-
>                 PTMEffectiveGranularity: Unknown
>         Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
>         Kernel driver in use: pcieport
>         Kernel modules: pci_endpoint_test
> 
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>         Subsystem: Samsung Electronics Co Ltd Device a809
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 0
>         NUMA node: 0
>         Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
>         Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [70] Express (v2) Endpoint, MSI 00                                                                                                                               [8/5710]
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, TPHComp-, ExtTPHComp-
>                          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>                          AtomicOpsCtl: ReqEn-
>                 LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
>                 Vector table: BAR=0 offset=00004000
>                 PBA: BAR=0 offset=00003000
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>         Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>         Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [178 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>         Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>                 IOVSta: Migration-
>                 Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
>                 VF offset: 2, stride: 1, Device ID: a824
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [3a4 v1] Data Link Feature <?>
>         Kernel driver in use: nvme
>         Kernel modules: nvme

I guess this is before enabling SR-IOV on 01:00.0, so it doesn't show
the VFs themselves.

> > It could be that the device has an SR-IOV capability when it
> > shouldn't.  But even if it does, Linux could tolerate that better
> > than it does today.
> 
> Agree there. I can create simple patch that checks for is_virtfn
> in sriov_init(). But what to do if it is set?

Maybe something like this?  It makes no sense to me that a VF would
have an SR-IOV capability, but ...

If the below avoids the problem, maybe collect another "lspci -vv"
output including the VF(s).

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 952217572113..9c5184384a45 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -901,6 +901,10 @@ int pci_iov_init(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
+	/* Some devices include SR-IOV cap on VFs as well as PFs */
+	if (dev->is_virtfn)
+		return -ENODEV;
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
 	if (pos)
 		return sriov_init(dev, pos);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-07 15:41     ` Bjorn Helgaas
@ 2022-05-08 11:07       ` Volodymyr Babchuk
  2022-05-09 16:49         ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-08 11:07 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe


Hello Bjorn,

Bjorn Helgaas <helgaas@kernel.org> writes:

> On Sat, May 07, 2022 at 10:22:32AM +0000, Volodymyr Babchuk wrote:
>> Bjorn Helgaas <helgaas@kernel.org> writes:
>> > On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> I have encountered issue when PCI code tries to use both fields in
>> >> 
>> >>         union {
>> >> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
>> >> 		struct pci_dev		*physfn;	/* VF: related PF */
>> >> 	};
>> >> 
>> >> (which are part of struct pci_dev) at the same time.
>> >> 
>> >> Symptoms are following:
>> >> 
>> >> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>> >> 
>> >> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> >> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>> >>  Unable to handle kernel paging request at virtual address 0001000200000010
>
>> >> Debugging showed the following:
>> >> 
>> >> pci_iov_add_virtfn() allocates new struct pci_dev:
>> >> 
>> >> 	virtfn = pci_alloc_dev(bus);
>> >> and sets physfn:
>> >> 	virtfn->is_virtfn = 1;
>> >> 	virtfn->physfn = pci_dev_get(dev);
>> >> 
>> >> then we will get into sriov_init() via the following call path:
>> >> 
>> >> pci_device_add(virtfn, virtfn->bus);
>> >>   pci_init_capabilities(dev);
>> >>     pci_iov_init(dev);
>> >>       sriov_init(dev, pos);
>> >
>> > We called pci_device_add() with the VF.  pci_iov_init() only calls
>> > sriov_init() if it finds an SR-IOV capability on the device:
>> >
>> >   pci_iov_init(struct pci_dev *dev)
>> >     pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
>> >     if (pos)
>> >       return sriov_init(dev, pos);
>> >
>> > So this means the VF must have an SR-IOV capability, which sounds a
>> > little dubious.  From PCIe r6.0:
>> 
>> [...]
>> 
>> Yes, I dived into debugging and came to the same conclusions. I'm still
>> investigating this, but looks like my PCIe controller (DesignWare-based)
>> incorrectly reads configuration space for VF. Looks like instead of
>> providing access VF config space, it reads PF's one.
>> 
>> > Can you supply the output of "sudo lspci -vv" for your system?
>> 
>> Sure:
>> 
>> root@spider:~# lspci -vv
>> 00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
>>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>         Latency: 0
>>         Interrupt: pin A routed to IRQ 189
>>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>>         I/O behind bridge: [disabled]
>>         Memory behind bridge: 30000000-301fffff [size=2M]
>>         Prefetchable memory behind bridge: [disabled]
>>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>>         BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
>>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>         Capabilities: [40] Power Management version 3
>>                 Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
>>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>         Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
>>                 Address: 0000000004030040  Data: 0000
>>                 Masking: fffffffe  Pending: 00000000
>>         Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>>                 DevCap: MaxPayload 256 bytes, PhantFunc 0
>>                         ExtTag+ RBE+
>>                 DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
>>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>                 LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
>>                         ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
>>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                 LnkSta: Speed 5GT/s (ok), Width x2 (ok)
>>                         TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
>>                 RootCap: CRSVisible-
>>                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
>>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>>                 DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
>>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>>                          FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
>>                          AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
>>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
>>                          AtomicOpsCtl: ReqEn- EgressBlck-
>>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>>                          Compliance De-emphasis: -6dB
>>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>         Capabilities: [100 v2] Advanced Error Reporting
>>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                 AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                 HeaderLog: 00000000 00000000 00000000 00000000
>>                 RootCmd: CERptEn- NFERptEn- FERptEn-
>>                 RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>>                          FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>>                 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
>>         Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
>>         Capabilities: [158 v1] Secondary PCI Express
>>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>>                 LaneErrStat: 0
>>         Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
>>         Capabilities: [19c v1] Lane Margining at the Receiver <?>
>>         Capabilities: [1bc v1] L1 PM Substates
>>                 L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>>                           PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
>>                 L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
>>                            T_CommonMode=0us LTR1.2_Threshold=0ns
>>                 L1SubCtl2: T_PwrOn=10us
>>         Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>>         Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>>         Capabilities: [304 v1] Data Link Feature <?>
>>         Capabilities: [310 v1] Precision Time Measurement
>>                 PTMCap: Requester:+ Responder:+ Root:+
>>                 PTMClockGranularity: 16ns
>>                 PTMControl: Enabled:- RootSelected:-
>>                 PTMEffectiveGranularity: Unknown
>>         Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
>>         Kernel driver in use: pcieport
>>         Kernel modules: pci_endpoint_test
>> 
>> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>>         Subsystem: Samsung Electronics Co Ltd Device a809
>>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>         Latency: 0
>>         Interrupt: pin A routed to IRQ 0
>>         NUMA node: 0
>>         Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
>>         Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
>>         Capabilities: [40] Power Management version 3
>>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>         Capabilities: [70] Express (v2) Endpoint, MSI 00                                                                                                                               [8/5710]
>>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>>                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>>                 LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                 LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>>                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>>                          FRS-, TPHComp-, ExtTPHComp-
>>                          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>>                          AtomicOpsCtl: ReqEn-
>>                 LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>>                          Compliance De-emphasis: -6dB
>>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>         Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
>>                 Vector table: BAR=0 offset=00004000
>>                 PBA: BAR=0 offset=00003000
>>         Capabilities: [100 v2] Advanced Error Reporting
>>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>                 AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>                 HeaderLog: 00000000 00000000 00000000 00000000
>>         Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>>         Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>>                 ARICap: MFVC- ACS-, Next Function: 0
>>                 ARICtl: MFVC- ACS-, Function Group: 0
>>         Capabilities: [178 v1] Secondary PCI Express
>>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>>                 LaneErrStat: 0
>>         Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>>         Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>>         Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>>                 IOVCap: Migration-, Interrupt Message Number: 000
>>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>>                 IOVSta: Migration-
>>                 Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
>>                 VF offset: 2, stride: 1, Device ID: a824
>>                 Supported Page Size: 00000553, System Page Size: 00000001
>>                 Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>>                 VF Migration: offset: 00000000, BIR: 0
>>         Capabilities: [3a4 v1] Data Link Feature <?>
>>         Kernel driver in use: nvme
>>         Kernel modules: nvme
>
> I guess this is before enabling SR-IOV on 01:00.0, so it doesn't show
> the VFs themselves.

Yes. Because kernel crashed without your suggested patch.

>> > It could be that the device has an SR-IOV capability when it
>> > shouldn't.  But even if it does, Linux could tolerate that better
>> > than it does today.
>> 
>> Agree there. I can create simple patch that checks for is_virtfn
>> in sriov_init(). But what to do if it is set?
>
> Maybe something like this?  It makes no sense to me that a VF would
> have an SR-IOV capability, but ...
>
> If the below avoids the problem, maybe collect another "lspci -vv"
> output including the VF(s).
>

I had another crash in nvme_pci_enable(), for which I made quick
workaround. And now yeah, it looks like I have some issues with
my root complex HW:

[skipping bridge info]

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a809
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 0
        NUMA node: 0
        Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
        Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=0 offset=00004000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
        Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
        Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
                VF offset: 2, stride: 1, Device ID: a824
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [3a4 v1] Data Link Feature <?>
        Kernel driver in use: nvme
        Kernel modules: nvme

01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a809
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 0
        NUMA node: 0
        Region 0: Memory at 30018000 (64-bit, non-prefetchable) [size=32K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=0 offset=00004000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
        Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
        Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
                VF offset: 2, stride: 1, Device ID: a824
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [3a4 v1] Data Link Feature <?>
        Kernel modules: nvme

As you can see, output for func 0 and func 2 is identical, so yeah,
looks like my system reads config space for func 0 in both cases.

Now at least I know where to look. Thank you for your help.

> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 952217572113..9c5184384a45 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -901,6 +901,10 @@ int pci_iov_init(struct pci_dev *dev)
>  	if (!pci_is_pcie(dev))
>  		return -ENODEV;
>  
> +	/* Some devices include SR-IOV cap on VFs as well as PFs */
> +	if (dev->is_virtfn)
> +		return -ENODEV;
> +
>  	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
>  	if (pos)
>  		return sriov_init(dev, pos);

Thanks, this patch helped. You can have my

Tested-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

if you are going to include it in the kernel.

On other hand, I'm wondering if it is correct to have both is_virtfn and
is_physfn in the first place, as there can 4 combinations and only two
(or three?) of them are valid. Maybe it is worth to replace them with
enum?

-- 
Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-07 10:25     ` Volodymyr Babchuk
@ 2022-05-08 11:19       ` Leon Romanovsky
  2022-05-09 18:22         ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2022-05-08 11:19 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Jason Gunthorpe, Bjorn Helgaas, linux-pci, Alex Williamson

On Sat, May 07, 2022 at 10:25:16AM +0000, Volodymyr Babchuk wrote:
> 
> 
> Jason Gunthorpe <jgg@ziepe.ca> writes:
> 
> > On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
> >
> >> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> >> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> >> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> >> >  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> >> >  Hardware name: XENVM-4.17 (DT)
> >                 ^^^^^^^^^^^^^^^^^
> >
> >> So this means the VF must have an SR-IOV capability, which sounds a
> >> little dubious.  From PCIe r6.0:
> >
> > Enabling SRIOV from within a VM is "exciting" - I would not be
> > surprised if there was some wonky bugs here.
> 
> Well, yes. But in this case, this VM has direct access to the PCIe
> controller. So it should not cause any troubles. I'll try baremetal
> setup, though. 

If I remember correctly, the VM software that runs on hypervisor needs
to filter PCI capabilities even for pass-through devices.

The vfio kernel module did this for QEMU, so something similar I would
expect from XEN too.

Thanks

> 
> 
> -- 
> Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-08 11:07       ` Volodymyr Babchuk
@ 2022-05-09 16:49         ` Bjorn Helgaas
  2022-05-09 16:58           ` Alex Williamson
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-09 16:49 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe

On Sun, May 08, 2022 at 11:07:40AM +0000, Volodymyr Babchuk wrote:

> I had another crash in nvme_pci_enable(), for which I made quick
> workaround. And now yeah, it looks like I have some issues with
> my root complex HW:

Please point to the root complex issue you see.

> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>         Subsystem: Samsung Electronics Co Ltd Device a809
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 0
>         NUMA node: 0
>         Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
>         Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>                 Capabilities: [70] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, TPHComp-, ExtTPHComp-
>                          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>                          AtomicOpsCtl: ReqEn-
>                 LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
>                 Vector table: BAR=0 offset=00004000
>                 PBA: BAR=0 offset=00003000
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>         Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>         Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [178 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>         Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
>                 IOVSta: Migration-
>                 Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
>                 VF offset: 2, stride: 1, Device ID: a824
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [3a4 v1] Data Link Feature <?>
>         Kernel driver in use: nvme
>         Kernel modules: nvme
> 
> 01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>         Subsystem: Samsung Electronics Co Ltd Device a809
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 0
>         NUMA node: 0
>         Region 0: Memory at 30018000 (64-bit, non-prefetchable) [size=32K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [70] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, TPHComp-, ExtTPHComp-
>                          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>                          AtomicOpsCtl: ReqEn-
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
>                 Vector table: BAR=0 offset=00004000
>                 PBA: BAR=0 offset=00003000
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>         Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>         Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [178 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>         Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
>                 IOVSta: Migration-
>                 Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
>                 VF offset: 2, stride: 1, Device ID: a824
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [3a4 v1] Data Link Feature <?>
>         Kernel modules: nvme
> 
> As you can see, output for func 0 and func 2 is identical, so yeah,
> looks like my system reads config space for func 0 in both cases.

They are not identical:

  01:00.0 Non-Volatile memory controller
    Region 0: Memory at 30010000

  01:00.2 Non-Volatile memory controller
    Region 0: Memory at 30018000

> On other hand, I'm wondering if it is correct to have both is_virtfn and
> is_physfn in the first place, as there can 4 combinations and only two
> (or three?) of them are valid. Maybe it is worth to replace them with
> enum?

Good question.  I think there was a reason, but I can't remember it
right now.

Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-09 16:49         ` Bjorn Helgaas
@ 2022-05-09 16:58           ` Alex Williamson
  2022-05-10  6:39             ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Williamson @ 2022-05-09 16:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Volodymyr Babchuk, linux-pci, Leon Romanovsky, Jason Gunthorpe

On Mon, 9 May 2022 11:49:29 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Sun, May 08, 2022 at 11:07:40AM +0000, Volodymyr Babchuk wrote:
> > 
> > As you can see, output for func 0 and func 2 is identical, so yeah,
> > looks like my system reads config space for func 0 in both cases.  
> 
> They are not identical:
> 
>   01:00.0 Non-Volatile memory controller
>     Region 0: Memory at 30010000
> 
>   01:00.2 Non-Volatile memory controller
>     Region 0: Memory at 30018000
> 
> > On other hand, I'm wondering if it is correct to have both is_virtfn and
> > is_physfn in the first place, as there can 4 combinations and only two
> > (or three?) of them are valid. Maybe it is worth to replace them with
> > enum?  
> 
> Good question.  I think there was a reason, but I can't remember it
> right now.

AIUI, three combinations are valid:

is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
is_physfn = 1, is_virtfn = 0: An SR-IOV PF
is_physfn = 0, is_virtfn = 1: An SR-IOV VF

As implemented with bit fields this is 2 bits, which is more space
efficient than an enum.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-08 11:19       ` Leon Romanovsky
@ 2022-05-09 18:22         ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2022-05-09 18:22 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Volodymyr Babchuk, Jason Gunthorpe, Bjorn Helgaas, linux-pci,
	Alex Williamson

On Sun, May 08, 2022 at 02:19:56PM +0300, Leon Romanovsky wrote:
> On Sat, May 07, 2022 at 10:25:16AM +0000, Volodymyr Babchuk wrote:
> > Jason Gunthorpe <jgg@ziepe.ca> writes:
> > > On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
> > >
> > >> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> > >> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> > >> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> > >> >  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> > >> >  Hardware name: XENVM-4.17 (DT)
> > >                 ^^^^^^^^^^^^^^^^^
> > >
> > >> So this means the VF must have an SR-IOV capability, which sounds a
> > >> little dubious.  From PCIe r6.0:
> > >
> > > Enabling SRIOV from within a VM is "exciting" - I would not be
> > > surprised if there was some wonky bugs here.
> > 
> > Well, yes. But in this case, this VM has direct access to the PCIe
> > controller. So it should not cause any troubles. I'll try baremetal
> > setup, though. 
> 
> If I remember correctly, the VM software that runs on hypervisor needs
> to filter PCI capabilities even for pass-through devices.
> 
> The vfio kernel module did this for QEMU, so something similar I would
> expect from XEN too.

AIUI, some kind of translation from GPA to HPA needs to happen, but I'm not
sure how that would occur for the VF's enabled within the guest when the host
doesn't know about those functions.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-09 16:58           ` Alex Williamson
@ 2022-05-10  6:39             ` Christoph Hellwig
  2022-05-10 17:37               ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2022-05-10  6:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, Volodymyr Babchuk, linux-pci, Leon Romanovsky,
	Jason Gunthorpe

On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
> is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
> is_physfn = 1, is_virtfn = 0: An SR-IOV PF
> is_physfn = 0, is_virtfn = 1: An SR-IOV VF
> 
> As implemented with bit fields this is 2 bits, which is more space
> efficient than an enum.  Thanks,

A two-bit bitfield with explicit constants for the values would probably
still much eaiser to understand.

And there is some code that seems to intepret is_physfn a bit odd, e.g.:

arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-10  6:39             ` Christoph Hellwig
@ 2022-05-10 17:37               ` Bjorn Helgaas
  2022-05-12  7:18                 ` Volodymyr Babchuk
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-10 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Volodymyr Babchuk, linux-pci, Leon Romanovsky,
	Jason Gunthorpe

On Mon, May 09, 2022 at 11:39:30PM -0700, Christoph Hellwig wrote:
> On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
> > is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
> > is_physfn = 1, is_virtfn = 0: An SR-IOV PF
> > is_physfn = 0, is_virtfn = 1: An SR-IOV VF
> > 
> > As implemented with bit fields this is 2 bits, which is more space
> > efficient than an enum.  Thanks,
> 
> A two-bit bitfield with explicit constants for the values would probably
> still much eaiser to understand.
> 
> And there is some code that seems to intepret is_physfn a bit odd, e.g.:
> 
> arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
> arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);

"dev->sriov != NULL" and "dev->is_physfn" are basically the same and
many of the dev->is_physfn uses in drivers/pci would end up being
simpler if replaced with dev->sriov, e.g.,

  int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
  {
    if (!dev->is_physfn)
      return -EINVAL;
    return dev->bus->number + ((dev->devfn + dev->sriov->offset +
				dev->sriov->stride * vf_id) >> 8);
  }

would be more obvious as:

  if (dev->sriov)
    return dev->bus->number + ((dev->devfn + dev->sriov->offset +
				dev->sriov->stride * vf_id) >> 8);
  return -EINVAL;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Write to srvio_numvfs triggers kernel panic
  2022-05-10 17:37               ` Bjorn Helgaas
@ 2022-05-12  7:18                 ` Volodymyr Babchuk
  0 siblings, 0 replies; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-12  7:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, Alex Williamson, linux-pci, Leon Romanovsky,
	Jason Gunthorpe


Hello Bjorn,

Bjorn Helgaas <helgaas@kernel.org> writes:

> On Mon, May 09, 2022 at 11:39:30PM -0700, Christoph Hellwig wrote:
>> On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
>> > is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
>> > is_physfn = 1, is_virtfn = 0: An SR-IOV PF
>> > is_physfn = 0, is_virtfn = 1: An SR-IOV VF
>> > 
>> > As implemented with bit fields this is 2 bits, which is more space
>> > efficient than an enum.  Thanks,
>> 
>> A two-bit bitfield with explicit constants for the values would probably
>> still much eaiser to understand.
>> 
>> And there is some code that seems to intepret is_physfn a bit odd, e.g.:
>> 
>> arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
>> arch/powerpc/kernel/eeh_sysfs.c:        np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
>
> "dev->sriov != NULL" and "dev->is_physfn" are basically the same and
> many of the dev->is_physfn uses in drivers/pci would end up being
> simpler if replaced with dev->sriov, e.g.,
>
>   int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
>   {
>     if (!dev->is_physfn)
>       return -EINVAL;
>     return dev->bus->number + ((dev->devfn + dev->sriov->offset +
> 				dev->sriov->stride * vf_id) >> 8);
>   }
>
> would be more obvious as:
>
>   if (dev->sriov)
>     return dev->bus->number + ((dev->devfn + dev->sriov->offset +
> 				dev->sriov->stride * vf_id) >> 8);
>   return -EINVAL;

I'm not sure that this will work because dev->sriov and dev->physfn are
stored in the same union. 


-- 
Volodymyr Babchuk at EPAM

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-05-12  7:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04 19:56 Write to srvio_numvfs triggers kernel panic Volodymyr Babchuk
2022-05-06 20:17 ` Bjorn Helgaas
2022-05-07  1:34   ` Jason Gunthorpe
2022-05-07 10:25     ` Volodymyr Babchuk
2022-05-08 11:19       ` Leon Romanovsky
2022-05-09 18:22         ` Keith Busch
2022-05-07 10:22   ` Volodymyr Babchuk
2022-05-07 15:41     ` Bjorn Helgaas
2022-05-08 11:07       ` Volodymyr Babchuk
2022-05-09 16:49         ` Bjorn Helgaas
2022-05-09 16:58           ` Alex Williamson
2022-05-10  6:39             ` Christoph Hellwig
2022-05-10 17:37               ` Bjorn Helgaas
2022-05-12  7:18                 ` Volodymyr Babchuk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).