* Write to sriov_numvfs triggers kernel panic
@ 2022-05-04 19:56 Volodymyr Babchuk
2022-05-06 20:17 ` Bjorn Helgaas
0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-04 19:56 UTC (permalink / raw)
To: linux-pci
Hello,
I have encountered an issue where the PCI code tries to use both fields in
union {
struct pci_sriov *sriov; /* PF: SR-IOV info */
struct pci_dev *physfn; /* VF: related PF */
};
(which are part of struct pci_dev) at the same time.
The symptoms are as follows:
# echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
Unable to handle kernel paging request at virtual address 0001000200000010
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[0001000200000010] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
Hardware name: XENVM-4.17 (DT)
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : pcie_aspm_get_link+0x90/0xcc
lr : pcie_aspm_get_link+0x8c/0xcc
sp : ffff8000130d39c0
x29: ffff8000130d39c0 x28: 00000000000001a4
x27: 00000000ffffee4b x26: ffff80001164f560
x25: 0000000000000000 x24: 0000000000000000
x23: ffff80001164f660 x22: 0000000000000000
x21: ffff000003f08000 x20: ffff800010db37d8
x19: ffff000004b8e780 x18: ffffffffffffffff
x17: 0000000000000000 x16: 00000000deadbeef
x15: ffff8000930d36c7 x14: 0000000000000006
x13: ffff8000115c2710 x12: 000000000000081c
x11: 00000000000002b4 x10: ffff8000115c2710
x9 : ffff8000115c2710 x8 : 00000000ffffefff
x7 : ffff80001161a710 x6 : ffff80001161a710
x5 : ffff00003fdad900 x4 : 0000000000000000
x3 : 0000000000000000 x2 : 0000000000000000
x1 : ffff000003c51c80 x0 : 0001000200000000
Call trace:
pcie_aspm_get_link+0x90/0xcc
aspm_ctrl_attrs_are_visible+0x30/0xc0
internal_create_group+0xd0/0x3cc
internal_create_groups.part.0+0x4c/0xc0
sysfs_create_groups+0x20/0x34
device_add+0x2b4/0x760
pci_device_add+0x814/0x854
pci_iov_add_virtfn+0x240/0x2f0
sriov_enable+0x1f8/0x474
pci_sriov_configure_simple+0x38/0x90
sriov_numvfs_store+0xa4/0x1a0
dev_attr_store+0x1c/0x30
sysfs_kf_write+0x48/0x60
kernfs_fop_write_iter+0x118/0x1ac
new_sync_write+0xe8/0x184
vfs_write+0x23c/0x2a0
ksys_write+0x68/0xf4
__arm64_sys_write+0x20/0x2c
el0_svc_common.constprop.0+0x78/0x1a0
do_el0_svc+0x28/0x94
el0_svc+0x14/0x20
el0_sync_handler+0xa4/0x130
el0_sync+0x180/0x1c0
Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
Debugging showed the following:
pci_iov_add_virtfn() allocates a new struct pci_dev:
virtfn = pci_alloc_dev(bus);
and sets physfn:
virtfn->is_virtfn = 1;
virtfn->physfn = pci_dev_get(dev);
then we will get into sriov_init() via the following call path:
pci_device_add(virtfn, virtfn->bus);
pci_init_capabilities(dev);
pci_iov_init(dev);
sriov_init(dev, pos);
sriov_init() overwrites the value in the union:
dev->sriov = iov; <<<<<---- There
dev->is_physfn = 1;
Next, we get into the function that causes the panic:
pci_device_add(virtfn, virtfn->bus);
device_add(&dev->dev)
sysfs_create_groups()
internal_create_group()
aspm_ctrl_attrs_are_visible()
pcie_aspm_get_link()
pci_upstream_bridge()
pci_physfn()
which is
static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_IOV
if (dev->is_virtfn)
dev = dev->physfn;
#endif
return dev;
}
Since is_virtfn == 1 and dev->physfn was overwritten by the write to
dev->sriov, pci_physfn() will return a pointer to the struct pci_sriov
allocated by sriov_init(). pci_upstream_bridge() will then cause the
panic by accessing it as if it were a pointer to struct pci_dev.
I encountered this issue on ARM64, Linux 5.10.41. I tried to test
on the master branch, but it is quite difficult to rebase the platform
code onto master. However, I compared all relevant parts of the PCI code
and didn't find any differences.
It looks like I am missing something, because how can SR-IOV be this broken?
But what exactly?
--
Volodymyr Babchuk at EPAM
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Write to sriov_numvfs triggers kernel panic
2022-05-04 19:56 Write to sriov_numvfs triggers kernel panic Volodymyr Babchuk
@ 2022-05-06 20:17 ` Bjorn Helgaas
2022-05-07 1:34 ` Jason Gunthorpe
2022-05-07 10:22 ` Volodymyr Babchuk
0 siblings, 2 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-06 20:17 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe
[+cc Alex, Leon, Jason]
On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>
> Hello,
>
> I have encountered an issue where the PCI code tries to use both fields in
>
> union {
> struct pci_sriov *sriov; /* PF: SR-IOV info */
> struct pci_dev *physfn; /* VF: related PF */
> };
>
> (which are part of struct pci_dev) at the same time.
>
> The symptoms are as follows:
>
> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>
> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
> Unable to handle kernel paging request at virtual address 0001000200000010
> Mem abort info:
> ESR = 0x96000004
> EC = 0x25: DABT (current EL), IL = 32 bits
> SET = 0, FnV = 0
> EA = 0, S1PTW = 0
> Data abort info:
> ISV = 0, ISS = 0x00000004
> CM = 0, WnR = 0
> [0001000200000010] address between user and kernel address ranges
> Internal error: Oops: 96000004 [#1] PREEMPT SMP
> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> Hardware name: XENVM-4.17 (DT)
> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> pc : pcie_aspm_get_link+0x90/0xcc
> lr : pcie_aspm_get_link+0x8c/0xcc
> sp : ffff8000130d39c0
> x29: ffff8000130d39c0 x28: 00000000000001a4
> x27: 00000000ffffee4b x26: ffff80001164f560
> x25: 0000000000000000 x24: 0000000000000000
> x23: ffff80001164f660 x22: 0000000000000000
> x21: ffff000003f08000 x20: ffff800010db37d8
> x19: ffff000004b8e780 x18: ffffffffffffffff
> x17: 0000000000000000 x16: 00000000deadbeef
> x15: ffff8000930d36c7 x14: 0000000000000006
> x13: ffff8000115c2710 x12: 000000000000081c
> x11: 00000000000002b4 x10: ffff8000115c2710
> x9 : ffff8000115c2710 x8 : 00000000ffffefff
> x7 : ffff80001161a710 x6 : ffff80001161a710
> x5 : ffff00003fdad900 x4 : 0000000000000000
> x3 : 0000000000000000 x2 : 0000000000000000
> x1 : ffff000003c51c80 x0 : 0001000200000000
> Call trace:
> pcie_aspm_get_link+0x90/0xcc
> aspm_ctrl_attrs_are_visible+0x30/0xc0
> internal_create_group+0xd0/0x3cc
> internal_create_groups.part.0+0x4c/0xc0
> sysfs_create_groups+0x20/0x34
> device_add+0x2b4/0x760
> pci_device_add+0x814/0x854
> pci_iov_add_virtfn+0x240/0x2f0
> sriov_enable+0x1f8/0x474
> pci_sriov_configure_simple+0x38/0x90
> sriov_numvfs_store+0xa4/0x1a0
> dev_attr_store+0x1c/0x30
> sysfs_kf_write+0x48/0x60
> kernfs_fop_write_iter+0x118/0x1ac
> new_sync_write+0xe8/0x184
> vfs_write+0x23c/0x2a0
> ksys_write+0x68/0xf4
> __arm64_sys_write+0x20/0x2c
> el0_svc_common.constprop.0+0x78/0x1a0
> do_el0_svc+0x28/0x94
> el0_svc+0x14/0x20
> el0_sync_handler+0xa4/0x130
> el0_sync+0x180/0x1c0
> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
>
>
> Debugging showed the following:
>
> pci_iov_add_virtfn() allocates new struct pci_dev:
>
> virtfn = pci_alloc_dev(bus);
> and sets physfn:
> virtfn->is_virtfn = 1;
> virtfn->physfn = pci_dev_get(dev);
>
> then we will get into sriov_init() via the following call path:
>
> pci_device_add(virtfn, virtfn->bus);
> pci_init_capabilities(dev);
> pci_iov_init(dev);
> sriov_init(dev, pos);
We called pci_device_add() with the VF. pci_iov_init() only calls
sriov_init() if it finds an SR-IOV capability on the device:
pci_iov_init(struct pci_dev *dev)
pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
if (pos)
return sriov_init(dev, pos);
So this means the VF must have an SR-IOV capability, which sounds a
little dubious. From PCIe r6.0:
9.3.3 SR-IOV Extended Capability
The SR-IOV Extended Capability defined here is a PCIe extended
capability that must be implemented in each PF that supports SR-IOV.
This Capability is used to describe and control a PF’s SR-IOV
Capabilities.
For Multi-Function Devices, each PF that supports SR-IOV shall
provide the Capability structure defined in this section. This
Capability structure may be present in any Function with a Type 0
Configuration Space Header. This Capability must not be present in
Functions with a Type 1 Configuration Space Header.
Can you supply the output of "sudo lspci -vv" for your system?
It could be that the device has an SR-IOV capability when it
shouldn't. But even if it does, Linux could tolerate that better
than it does today.
> sriov_init() overwrites the value in the union:
> dev->sriov = iov; <<<<<---- There
> dev->is_physfn = 1;
>
> Next, we get into the function that causes the panic:
>
> pci_device_add(virtfn, virtfn->bus);
> device_add(&dev->dev)
> sysfs_create_groups()
> internal_create_group()
> aspm_ctrl_attrs_are_visible()
> pcie_aspm_get_link()
> pci_upstream_bridge()
> pci_physfn()
>
> which is
>
> static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> {
> #ifdef CONFIG_PCI_IOV
> if (dev->is_virtfn)
> dev = dev->physfn;
> #endif
> return dev;
> }
>
>
> Since is_virtfn == 1 and dev->physfn was overwritten by the write to
> dev->sriov, pci_physfn() will return a pointer to the struct pci_sriov
> allocated by sriov_init(). pci_upstream_bridge() will then cause the
> panic by accessing it as if it were a pointer to struct pci_dev.
>
> I encountered this issue on ARM64, Linux 5.10.41. I tried to test
> on the master branch, but it is quite difficult to rebase the platform
> code onto master. However, I compared all relevant parts of the PCI code
> and didn't find any differences.
>
> It looks like I am missing something, because how can SR-IOV be this broken?
> But what exactly?
>
> --
> Volodymyr Babchuk at EPAM
* Re: Write to sriov_numvfs triggers kernel panic
2022-05-06 20:17 ` Bjorn Helgaas
@ 2022-05-07 1:34 ` Jason Gunthorpe
2022-05-07 10:25 ` Volodymyr Babchuk
2022-05-07 10:22 ` Volodymyr Babchuk
1 sibling, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2022-05-07 1:34 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Volodymyr Babchuk, linux-pci, Alex Williamson, Leon Romanovsky
On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> > CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> > Hardware name: XENVM-4.17 (DT)
^^^^^^^^^^^^^^^^^
> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious. From PCIe r6.0:
Enabling SRIOV from within a VM is "exciting" - I would not be
surprised if there were some wonky bugs here.
Jason
* Re: Write to sriov_numvfs triggers kernel panic
2022-05-06 20:17 ` Bjorn Helgaas
2022-05-07 1:34 ` Jason Gunthorpe
@ 2022-05-07 10:22 ` Volodymyr Babchuk
2022-05-07 15:41 ` Bjorn Helgaas
1 sibling, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-07 10:22 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe
Hello Bjorn,
Bjorn Helgaas <helgaas@kernel.org> writes:
> [+cc Alex, Leon, Jason]
>
> On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>>
>> Hello,
>>
>> I have encountered an issue where the PCI code tries to use both fields in
>>
>> union {
>> struct pci_sriov *sriov; /* PF: SR-IOV info */
>> struct pci_dev *physfn; /* VF: related PF */
>> };
>>
>> (which are part of struct pci_dev) at the same time.
>>
>> The symptoms are as follows:
>>
>> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>>
>> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>> Unable to handle kernel paging request at virtual address 0001000200000010
>> Mem abort info:
>> ESR = 0x96000004
>> EC = 0x25: DABT (current EL), IL = 32 bits
>> SET = 0, FnV = 0
>> EA = 0, S1PTW = 0
>> Data abort info:
>> ISV = 0, ISS = 0x00000004
>> CM = 0, WnR = 0
>> [0001000200000010] address between user and kernel address ranges
>> Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>> CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>> Hardware name: XENVM-4.17 (DT)
>> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>> pc : pcie_aspm_get_link+0x90/0xcc
>> lr : pcie_aspm_get_link+0x8c/0xcc
>> sp : ffff8000130d39c0
>> x29: ffff8000130d39c0 x28: 00000000000001a4
>> x27: 00000000ffffee4b x26: ffff80001164f560
>> x25: 0000000000000000 x24: 0000000000000000
>> x23: ffff80001164f660 x22: 0000000000000000
>> x21: ffff000003f08000 x20: ffff800010db37d8
>> x19: ffff000004b8e780 x18: ffffffffffffffff
>> x17: 0000000000000000 x16: 00000000deadbeef
>> x15: ffff8000930d36c7 x14: 0000000000000006
>> x13: ffff8000115c2710 x12: 000000000000081c
>> x11: 00000000000002b4 x10: ffff8000115c2710
>> x9 : ffff8000115c2710 x8 : 00000000ffffefff
>> x7 : ffff80001161a710 x6 : ffff80001161a710
>> x5 : ffff00003fdad900 x4 : 0000000000000000
>> x3 : 0000000000000000 x2 : 0000000000000000
>> x1 : ffff000003c51c80 x0 : 0001000200000000
>> Call trace:
>> pcie_aspm_get_link+0x90/0xcc
>> aspm_ctrl_attrs_are_visible+0x30/0xc0
>> internal_create_group+0xd0/0x3cc
>> internal_create_groups.part.0+0x4c/0xc0
>> sysfs_create_groups+0x20/0x34
>> device_add+0x2b4/0x760
>> pci_device_add+0x814/0x854
>> pci_iov_add_virtfn+0x240/0x2f0
>> sriov_enable+0x1f8/0x474
>> pci_sriov_configure_simple+0x38/0x90
>> sriov_numvfs_store+0xa4/0x1a0
>> dev_attr_store+0x1c/0x30
>> sysfs_kf_write+0x48/0x60
>> kernfs_fop_write_iter+0x118/0x1ac
>> new_sync_write+0xe8/0x184
>> vfs_write+0x23c/0x2a0
>> ksys_write+0x68/0xf4
>> __arm64_sys_write+0x20/0x2c
>> el0_svc_common.constprop.0+0x78/0x1a0
>> do_el0_svc+0x28/0x94
>> el0_svc+0x14/0x20
>> el0_sync_handler+0xa4/0x130
>> el0_sync+0x180/0x1c0
>> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
>>
>>
>> Debugging showed the following:
>>
>> pci_iov_add_virtfn() allocates new struct pci_dev:
>>
>> virtfn = pci_alloc_dev(bus);
>> and sets physfn:
>> virtfn->is_virtfn = 1;
>> virtfn->physfn = pci_dev_get(dev);
>>
>> then we will get into sriov_init() via the following call path:
>>
>> pci_device_add(virtfn, virtfn->bus);
>> pci_init_capabilities(dev);
>> pci_iov_init(dev);
>> sriov_init(dev, pos);
>
> We called pci_device_add() with the VF. pci_iov_init() only calls
> sriov_init() if it finds an SR-IOV capability on the device:
>
> pci_iov_init(struct pci_dev *dev)
> pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> if (pos)
> return sriov_init(dev, pos);
>
> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious. From PCIe r6.0:
[...]
Yes, I dived into debugging and came to the same conclusions. I'm still
investigating this, but it looks like my PCIe controller (DesignWare-based)
incorrectly handles configuration space accesses for the VF: instead of
providing access to the VF config space, it appears to read the PF's.
>
> Can you supply the output of "sudo lspci -vv" for your system?
Sure:
root@spider:~# lspci -vv
00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 189
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 30000000-301fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
Address: 0000000004030040 Data: 0000
Masking: fffffffe Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x2 (ok)
TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn- NFERptEn- FERptEn-
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [19c v1] Lane Margining at the Receiver <?>
Capabilities: [1bc v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [304 v1] Data Link Feature <?>
Capabilities: [310 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport
Kernel modules: pci_endpoint_test
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a809
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 0
NUMA node: 0
Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00 [8/5710]
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00004000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: a824
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [3a4 v1] Data Link Feature <?>
Kernel driver in use: nvme
Kernel modules: nvme
> It could be that the device has an SR-IOV capability when it
> shouldn't. But even if it does, Linux could tolerate that better
> than it does today.
>
Agreed. I can create a simple patch that checks for is_virtfn
in sriov_init(). But what should it do if the flag is set?
>> sriov_init() overwrites the value in the union:
>> dev->sriov = iov; <<<<<---- There
>> dev->is_physfn = 1;
>>
--
Volodymyr Babchuk at EPAM
* Re: Write to sriov_numvfs triggers kernel panic
2022-05-07 1:34 ` Jason Gunthorpe
@ 2022-05-07 10:25 ` Volodymyr Babchuk
2022-05-08 11:19 ` Leon Romanovsky
0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-07 10:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bjorn Helgaas, linux-pci, Alex Williamson, Leon Romanovsky
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
>
>> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>> > CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>> > Hardware name: XENVM-4.17 (DT)
> ^^^^^^^^^^^^^^^^^
>
>> So this means the VF must have an SR-IOV capability, which sounds a
>> little dubious. From PCIe r6.0:
>
> Enabling SRIOV from within a VM is "exciting" - I would not be
> surprised if there were some wonky bugs here.
Well, yes. But in this case, the VM has direct access to the PCIe
controller, so it should not cause any trouble. I'll try a bare-metal
setup, though.
--
Volodymyr Babchuk at EPAM
* Re: Write to sriov_numvfs triggers kernel panic
2022-05-07 10:22 ` Volodymyr Babchuk
@ 2022-05-07 15:41 ` Bjorn Helgaas
2022-05-08 11:07 ` Volodymyr Babchuk
0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-07 15:41 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe
On Sat, May 07, 2022 at 10:22:32AM +0000, Volodymyr Babchuk wrote:
> Bjorn Helgaas <helgaas@kernel.org> writes:
> > On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
> >>
> >> I have encountered an issue where the PCI code tries to use both fields in
> >>
> >> union {
> >> struct pci_sriov *sriov; /* PF: SR-IOV info */
> >> struct pci_dev *physfn; /* VF: related PF */
> >> };
> >>
> >> (which are part of struct pci_dev) at the same time.
> >>
> >> The symptoms are as follows:
> >>
> >> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> >>
> >> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
> >> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
> >> Unable to handle kernel paging request at virtual address 0001000200000010
> >> Debugging showed the following:
> >>
> >> pci_iov_add_virtfn() allocates new struct pci_dev:
> >>
> >> virtfn = pci_alloc_dev(bus);
> >> and sets physfn:
> >> virtfn->is_virtfn = 1;
> >> virtfn->physfn = pci_dev_get(dev);
> >>
> >> then we will get into sriov_init() via the following call path:
> >>
> >> pci_device_add(virtfn, virtfn->bus);
> >> pci_init_capabilities(dev);
> >> pci_iov_init(dev);
> >> sriov_init(dev, pos);
> >
> > We called pci_device_add() with the VF. pci_iov_init() only calls
> > sriov_init() if it finds an SR-IOV capability on the device:
> >
> > pci_iov_init(struct pci_dev *dev)
> > pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> > if (pos)
> > return sriov_init(dev, pos);
> >
> > So this means the VF must have an SR-IOV capability, which sounds a
> > little dubious. From PCIe r6.0:
>
> [...]
>
> Yes, I dived into debugging and came to the same conclusions. I'm still
> investigating this, but it looks like my PCIe controller (DesignWare-based)
> incorrectly handles configuration space accesses for the VF: instead of
> providing access to the VF config space, it appears to read the PF's.
>
> > Can you supply the output of "sudo lspci -vv" for your system?
>
> Sure:
>
> root@spider:~# lspci -vv
> 00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 189
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> I/O behind bridge: [disabled]
> Memory behind bridge: 30000000-301fffff [size=2M]
> Prefetchable memory behind bridge: [disabled]
> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
> Address: 0000000004030040 Data: 0000
> Masking: fffffffe Pending: 00000000
> Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0
> ExtTag+ RBE+
> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
> ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s (ok), Width x2 (ok)
> TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
> RootCap: CRSVisible-
> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
> AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
> AtomicOpsCtl: ReqEn- EgressBlck-
> LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> RootCmd: CERptEn- NFERptEn- FERptEn-
> RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
> FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
> ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
> Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
> Capabilities: [158 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
> LaneErrStat: 0
> Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [19c v1] Lane Margining at the Receiver <?>
> Capabilities: [1bc v1] L1 PM Substates
> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
> L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> T_CommonMode=0us LTR1.2_Threshold=0ns
> L1SubCtl2: T_PwrOn=10us
> Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
> Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
> Capabilities: [304 v1] Data Link Feature <?>
> Capabilities: [310 v1] Precision Time Measurement
> PTMCap: Requester:+ Responder:+ Root:+
> PTMClockGranularity: 16ns
> PTMControl: Enabled:- RootSelected:-
> PTMEffectiveGranularity: Unknown
> Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
> Kernel driver in use: pcieport
> Kernel modules: pci_endpoint_test
>
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
> Subsystem: Samsung Electronics Co Ltd Device a809
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 0
> NUMA node: 0
> Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
> Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [70] Express (v2) Endpoint, MSI 00 [8/5710]
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS-, TPHComp-, ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> AtomicOpsCtl: ReqEn-
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
> Vector table: BAR=0 offset=00004000
> PBA: BAR=0 offset=00003000
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
> Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 0
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [178 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
> LaneErrStat: 0
> Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
> Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
> IOVCap: Migration-, Interrupt Message Number: 000
> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
> IOVSta: Migration-
> Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
> VF offset: 2, stride: 1, Device ID: a824
> Supported Page Size: 00000553, System Page Size: 00000001
> Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
> VF Migration: offset: 00000000, BIR: 0
> Capabilities: [3a4 v1] Data Link Feature <?>
> Kernel driver in use: nvme
> Kernel modules: nvme
I guess this is before enabling SR-IOV on 01:00.0, so it doesn't show
the VFs themselves.
> > It could be that the device has an SR-IOV capability when it
> > shouldn't. But even if it does, Linux could tolerate that better
> > than it does today.
>
> Agree there. I can create a simple patch that checks for is_virtfn
> in sriov_init(). But what should we do if it is set?
Maybe something like this? It makes no sense to me that a VF would
have an SR-IOV capability, but ...
If the below avoids the problem, maybe collect another "lspci -vv"
output including the VF(s).
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 952217572113..9c5184384a45 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -901,6 +901,10 @@ int pci_iov_init(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return -ENODEV;
+ /* Some devices include SR-IOV cap on VFs as well as PFs */
+ if (dev->is_virtfn)
+ return -ENODEV;
+
pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
if (pos)
return sriov_init(dev, pos);
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-07 15:41 ` Bjorn Helgaas
@ 2022-05-08 11:07 ` Volodymyr Babchuk
2022-05-09 16:49 ` Bjorn Helgaas
0 siblings, 1 reply; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-08 11:07 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe
Hello Bjorn,
Bjorn Helgaas <helgaas@kernel.org> writes:
> On Sat, May 07, 2022 at 10:22:32AM +0000, Volodymyr Babchuk wrote:
>> Bjorn Helgaas <helgaas@kernel.org> writes:
>> > On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>> >>
>> >> I have encountered issue when PCI code tries to use both fields in
>> >>
>> >> union {
>> >> struct pci_sriov *sriov; /* PF: SR-IOV info */
>> >> struct pci_dev *physfn; /* VF: related PF */
>> >> };
>> >>
>> >> (which are part of struct pci_dev) at the same time.
>> >>
>> >> Symptoms are following:
>> >>
>> >> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>> >>
>> >> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> >> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>> >> Unable to handle kernel paging request at virtual address 0001000200000010
>
>> >> Debugging showed the following:
>> >>
>> >> pci_iov_add_virtfn() allocates new struct pci_dev:
>> >>
>> >> virtfn = pci_alloc_dev(bus);
>> >> and sets physfn:
>> >> virtfn->is_virtfn = 1;
>> >> virtfn->physfn = pci_dev_get(dev);
>> >>
>> >> then we will get into sriov_init() via the following call path:
>> >>
>> >> pci_device_add(virtfn, virtfn->bus);
>> >> pci_init_capabilities(dev);
>> >> pci_iov_init(dev);
>> >> sriov_init(dev, pos);
>> >
>> > We called pci_device_add() with the VF. pci_iov_init() only calls
>> > sriov_init() if it finds an SR-IOV capability on the device:
>> >
>> > pci_iov_init(struct pci_dev *dev)
>> > pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
>> > if (pos)
>> > return sriov_init(dev, pos);
>> >
>> > So this means the VF must have an SR-IOV capability, which sounds a
>> > little dubious. From PCIe r6.0:
>>
>> [...]
>>
>> Yes, I dived into debugging and came to the same conclusions. I'm still
>> investigating this, but it looks like my PCIe controller (DesignWare-based)
>> incorrectly reads the configuration space for VFs: instead of providing
>> access to the VF config space, it appears to read the PF's.
>>
>> > Can you supply the output of "sudo lspci -vv" for your system?
>>
>> Sure:
>>
>> root@spider:~# lspci -vv
>> 00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Latency: 0
>> Interrupt: pin A routed to IRQ 189
>> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>> I/O behind bridge: [disabled]
>> Memory behind bridge: 30000000-301fffff [size=2M]
>> Prefetchable memory behind bridge: [disabled]
>> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>> BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
>> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>> Capabilities: [40] Power Management version 3
>> Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>> Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
>> Address: 0000000004030040 Data: 0000
>> Masking: fffffffe Pending: 00000000
>> Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>> DevCap: MaxPayload 256 bytes, PhantFunc 0
>> ExtTag+ RBE+
>> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
>> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>> LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
>> ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 5GT/s (ok), Width x2 (ok)
>> TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
>> RootCap: CRSVisible-
>> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
>> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>> DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
>> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>> FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
>> AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
>> AtomicOpsCtl: ReqEn- EgressBlck-
>> LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>> Compliance De-emphasis: -6dB
>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>> Capabilities: [100 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>> AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>> HeaderLog: 00000000 00000000 00000000 00000000
>> RootCmd: CERptEn- NFERptEn- FERptEn-
>> RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>> FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>> ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
>> Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
>> Capabilities: [158 v1] Secondary PCI Express
>> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>> LaneErrStat: 0
>> Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
>> Capabilities: [19c v1] Lane Margining at the Receiver <?>
>> Capabilities: [1bc v1] L1 PM Substates
>> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>> PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
>> L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
>> T_CommonMode=0us LTR1.2_Threshold=0ns
>> L1SubCtl2: T_PwrOn=10us
>> Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>> Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>> Capabilities: [304 v1] Data Link Feature <?>
>> Capabilities: [310 v1] Precision Time Measurement
>> PTMCap: Requester:+ Responder:+ Root:+
>> PTMClockGranularity: 16ns
>> PTMControl: Enabled:- RootSelected:-
>> PTMEffectiveGranularity: Unknown
>> Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
>> Kernel driver in use: pcieport
>> Kernel modules: pci_endpoint_test
>>
>> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>> Subsystem: Samsung Electronics Co Ltd Device a809
>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Latency: 0
>> Interrupt: pin A routed to IRQ 0
>> NUMA node: 0
>> Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
>> Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
>> Capabilities: [40] Power Management version 3
>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>> Capabilities: [70] Express (v2) Endpoint, MSI 00 [8/5710]
>> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>> LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>> FRS-, TPHComp-, ExtTPHComp-
>> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>> AtomicOpsCtl: ReqEn-
>> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>> Compliance De-emphasis: -6dB
>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>> Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
>> Vector table: BAR=0 offset=00004000
>> PBA: BAR=0 offset=00003000
>> Capabilities: [100 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>> HeaderLog: 00000000 00000000 00000000 00000000
>> Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>> Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>> ARICap: MFVC- ACS-, Next Function: 0
>> ARICtl: MFVC- ACS-, Function Group: 0
>> Capabilities: [178 v1] Secondary PCI Express
>> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>> LaneErrStat: 0
>> Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>> Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>> Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>> IOVCap: Migration-, Interrupt Message Number: 000
>> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>> IOVSta: Migration-
>> Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
>> VF offset: 2, stride: 1, Device ID: a824
>> Supported Page Size: 00000553, System Page Size: 00000001
>> Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>> VF Migration: offset: 00000000, BIR: 0
>> Capabilities: [3a4 v1] Data Link Feature <?>
>> Kernel driver in use: nvme
>> Kernel modules: nvme
>
> I guess this is before enabling SR-IOV on 01:00.0, so it doesn't show
> the VFs themselves.
Yes, because the kernel crashed without your suggested patch.
>> > It could be that the device has an SR-IOV capability when it
>> > shouldn't. But even if it does, Linux could tolerate that better
>> > than it does today.
>>
>> Agree there. I can create a simple patch that checks for is_virtfn
>> in sriov_init(). But what should we do if it is set?
>
> Maybe something like this? It makes no sense to me that a VF would
> have an SR-IOV capability, but ...
>
> If the below avoids the problem, maybe collect another "lspci -vv"
> output including the VF(s).
>
I had another crash in nvme_pci_enable(), for which I made a quick
workaround. And now, yeah, it looks like I have some issues with
my root complex hardware:
[skipping bridge info]
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a809
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 0
NUMA node: 0
Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
Vector table: BAR=0 offset=00004000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: a824
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [3a4 v1] Data Link Feature <?>
Kernel driver in use: nvme
Kernel modules: nvme
01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a809
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 0
NUMA node: 0
Region 0: Memory at 30018000 (64-bit, non-prefetchable) [size=32K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
Vector table: BAR=0 offset=00004000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: a824
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [3a4 v1] Data Link Feature <?>
Kernel modules: nvme
As you can see, the output for func 0 and func 2 is identical, so yeah,
it looks like my system reads the config space of func 0 in both cases.
Now at least I know where to look. Thank you for your help.
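One quick way to confirm the aliasing is to byte-compare the two functions' config spaces via sysfs. This is a hedged sketch: the BDFs below match this particular system, and read access to the sysfs config files (typically root) is assumed.

```shell
# Compare the raw config space of the PF and the suspected aliased VF.
# Byte-identical dumps mean VF config reads are being routed to the PF.
PF=/sys/bus/pci/devices/0000:01:00.0/config
VF=/sys/bus/pci/devices/0000:01:00.2/config

if [ -r "$PF" ] && [ -r "$VF" ]; then
    if cmp -s "$PF" "$VF"; then
        echo "config spaces identical: VF reads look aliased to the PF"
    else
        echo "config spaces differ: VF config space looks genuine"
    fi
else
    echo "devices not present on this system"
fi
```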
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 952217572113..9c5184384a45 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -901,6 +901,10 @@ int pci_iov_init(struct pci_dev *dev)
> if (!pci_is_pcie(dev))
> return -ENODEV;
>
> + /* Some devices include SR-IOV cap on VFs as well as PFs */
> + if (dev->is_virtfn)
> + return -ENODEV;
> +
> pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> if (pos)
> return sriov_init(dev, pos);
Thanks, this patch helped. You can have my
Tested-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
if you are going to include it in the kernel.
On the other hand, I'm wondering if it is correct to have both is_virtfn and
is_physfn in the first place, as there can be four combinations and only two
(or three?) of them are valid. Maybe it is worth replacing them with an
enum?
--
Volodymyr Babchuk at EPAM
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-07 10:25 ` Volodymyr Babchuk
@ 2022-05-08 11:19 ` Leon Romanovsky
2022-05-09 18:22 ` Keith Busch
0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2022-05-08 11:19 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: Jason Gunthorpe, Bjorn Helgaas, linux-pci, Alex Williamson
On Sat, May 07, 2022 at 10:25:16AM +0000, Volodymyr Babchuk wrote:
>
>
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
> >
> >> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> >> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> >> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> >> > CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> >> > Hardware name: XENVM-4.17 (DT)
> > ^^^^^^^^^^^^^^^^^
> >
> >> So this means the VF must have an SR-IOV capability, which sounds a
> >> little dubious. From PCIe r6.0:
> >
> > Enabling SRIOV from within a VM is "exciting" - I would not be
> > surprised if there was some wonky bugs here.
>
> Well, yes. But in this case, this VM has direct access to the PCIe
> controller, so it should not cause any trouble. I'll try a bare-metal
> setup, though.
If I remember correctly, the VM software that runs on the hypervisor needs
to filter PCI capabilities even for pass-through devices.
The vfio kernel module does this for QEMU, so I would expect something
similar from Xen too.
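For reference, such filtering works by rewriting the capability linked lists the guest sees. A small sketch of walking the extended capability list, where an SR-IOV entry could be unlinked: the header layout (16-bit ID, 4-bit version, 12-bit next pointer at offset 0x100) and the SR-IOV cap ID 0x0010 come from the PCIe spec, while the config bytes below are synthetic, not read from real hardware.

```python
import struct

ECAP_START = 0x100
PCI_EXT_CAP_ID_SRIOV = 0x0010

def walk_ext_caps(cfg: bytes):
    """Yield (offset, cap_id) for each PCIe extended capability."""
    pos, seen = ECAP_START, set()
    while pos and pos not in seen and pos + 4 <= len(cfg):
        seen.add(pos)                                # guard against loops
        hdr = struct.unpack_from("<I", cfg, pos)[0]  # 32-bit cap header
        cap_id = hdr & 0xFFFF
        nxt = (hdr >> 20) & 0xFFF
        if cap_id == 0:
            break
        yield pos, cap_id
        pos = nxt

# Synthetic 4 KiB config space:
#   AER (0x0001) at 0x100 -> SR-IOV (0x0010) at 0x148 -> end of list.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0x148 << 20) | (0x2 << 16) | 0x0001)
struct.pack_into("<I", cfg, 0x148, (0x000 << 20) | (0x1 << 16) | 0x0010)

caps = dict(walk_ext_caps(bytes(cfg)))
print(caps)
# A filtering hypervisor would rewrite the 0x100 header's next pointer
# to skip (hide) the SR-IOV entry from the guest.
```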
Thanks
>
>
> --
> Volodymyr Babchuk at EPAM
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-08 11:07 ` Volodymyr Babchuk
@ 2022-05-09 16:49 ` Bjorn Helgaas
2022-05-09 16:58 ` Alex Williamson
0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-09 16:49 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: linux-pci, Alex Williamson, Leon Romanovsky, Jason Gunthorpe
On Sun, May 08, 2022 at 11:07:40AM +0000, Volodymyr Babchuk wrote:
> I had another crash in nvme_pci_enable(), for which I made quick
> workaround. And now yeah, it looks like I have some issues with
> my root complex HW:
Please point to the root complex issue you see.
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
> Subsystem: Samsung Electronics Co Ltd Device a809
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 0
> NUMA node: 0
> Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
> Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [70] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS-, TPHComp-, ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> AtomicOpsCtl: ReqEn-
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
> Vector table: BAR=0 offset=00004000
> PBA: BAR=0 offset=00003000
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
> Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 0
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [178 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
> LaneErrStat: 0
> Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
> Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
> IOVCap: Migration-, Interrupt Message Number: 000
> IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
> IOVSta: Migration-
> Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
> VF offset: 2, stride: 1, Device ID: a824
> Supported Page Size: 00000553, System Page Size: 00000001
> Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
> VF Migration: offset: 00000000, BIR: 0
> Capabilities: [3a4 v1] Data Link Feature <?>
> Kernel driver in use: nvme
> Kernel modules: nvme
>
> 01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
> Subsystem: Samsung Electronics Co Ltd Device a809
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 0
> NUMA node: 0
> Region 0: Memory at 30018000 (64-bit, non-prefetchable) [size=32K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [70] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
> 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS-, TPHComp-, ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> AtomicOpsCtl: ReqEn-
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [b0] MSI-X: Enable- Count=64 Masked-
> Vector table: BAR=0 offset=00004000
> PBA: BAR=0 offset=00003000
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
> Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 0
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [178 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
> LaneErrStat: 0
> Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
> Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
> IOVCap: Migration-, Interrupt Message Number: 000
> IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
> IOVSta: Migration-
> Initial VFs: 32, Total VFs: 32, Number of VFs: 1, Function Dependency Link: 00
> VF offset: 2, stride: 1, Device ID: a824
> Supported Page Size: 00000553, System Page Size: 00000001
> Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
> VF Migration: offset: 00000000, BIR: 0
> Capabilities: [3a4 v1] Data Link Feature <?>
> Kernel modules: nvme
>
> As you can see, the output for func 0 and func 2 is identical, so yeah,
> it looks like my system reads config space for func 0 in both cases.
They are not identical:
01:00.0 Non-Volatile memory controller
Region 0: Memory at 30010000
01:00.2 Non-Volatile memory controller
Region 0: Memory at 30018000
> On the other hand, I'm wondering if it is correct to have both is_virtfn
> and is_physfn in the first place, as there can be 4 combinations and only
> two (or three?) of them are valid. Maybe it is worth replacing them with
> an enum?
Good question. I think there was a reason, but I can't remember it
right now.
Bjorn
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-09 16:49 ` Bjorn Helgaas
@ 2022-05-09 16:58 ` Alex Williamson
2022-05-10 6:39 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Alex Williamson @ 2022-05-09 16:58 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Volodymyr Babchuk, linux-pci, Leon Romanovsky, Jason Gunthorpe
On Mon, 9 May 2022 11:49:29 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Sun, May 08, 2022 at 11:07:40AM +0000, Volodymyr Babchuk wrote:
> >
> > As you can see, the output for func 0 and func 2 is identical, so yeah,
> > it looks like my system reads config space for func 0 in both cases.
>
> They are not identical:
>
> 01:00.0 Non-Volatile memory controller
> Region 0: Memory at 30010000
>
> 01:00.2 Non-Volatile memory controller
> Region 0: Memory at 30018000
>
> > On the other hand, I'm wondering if it is correct to have both is_virtfn
> > and is_physfn in the first place, as there can be 4 combinations and only
> > two (or three?) of them are valid. Maybe it is worth replacing them with
> > an enum?
>
> Good question. I think there was a reason, but I can't remember it
> right now.
AIUI, three combinations are valid:
is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
is_physfn = 1, is_virtfn = 0: An SR-IOV PF
is_physfn = 0, is_virtfn = 1: An SR-IOV VF
As implemented with bit fields this is 2 bits, which is more
space-efficient than an enum. Thanks,
Alex
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-08 11:19 ` Leon Romanovsky
@ 2022-05-09 18:22 ` Keith Busch
0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2022-05-09 18:22 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Volodymyr Babchuk, Jason Gunthorpe, Bjorn Helgaas, linux-pci,
Alex Williamson
On Sun, May 08, 2022 at 02:19:56PM +0300, Leon Romanovsky wrote:
> On Sat, May 07, 2022 at 10:25:16AM +0000, Volodymyr Babchuk wrote:
> > Jason Gunthorpe <jgg@ziepe.ca> writes:
> > > On Fri, May 06, 2022 at 03:17:22PM -0500, Bjorn Helgaas wrote:
> > >
> > >> > Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> > >> > nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
> > >> > pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
> > >> > CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
> > >> > Hardware name: XENVM-4.17 (DT)
> > > ^^^^^^^^^^^^^^^^^
> > >
> > >> So this means the VF must have an SR-IOV capability, which sounds a
> > >> little dubious. From PCIe r6.0:
> > >
> > > Enabling SRIOV from within a VM is "exciting" - I would not be
> > > surprised if there was some wonky bugs here.
> >
> > Well, yes. But in this case, this VM has direct access to the PCIe
> > controller, so it should not cause any trouble. I'll try a bare-metal
> > setup, though.
>
> If I remember correctly, the VM software that runs on the hypervisor needs
> to filter PCI capabilities even for pass-through devices.
>
> The vfio kernel module did this for QEMU, so I would expect something
> similar from Xen too.
AIUI, some kind of translation from GPA to HPA needs to happen, but I'm not
sure how that would occur for the VFs enabled within the guest when the host
doesn't know about those functions.
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-09 16:58 ` Alex Williamson
@ 2022-05-10 6:39 ` Christoph Hellwig
2022-05-10 17:37 ` Bjorn Helgaas
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2022-05-10 6:39 UTC (permalink / raw)
To: Alex Williamson
Cc: Bjorn Helgaas, Volodymyr Babchuk, linux-pci, Leon Romanovsky,
Jason Gunthorpe
On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
> is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
> is_physfn = 1, is_virtfn = 0: An SR-IOV PF
> is_physfn = 0, is_virtfn = 1: An SR-IOV VF
>
> As implemented with bit fields this is 2 bits, which is more space
> efficient than an enum. Thanks,
A two-bit bitfield with explicit constants for the values would probably
still be much easier to understand.
And there is some code that seems to interpret is_physfn a bit oddly, e.g.:
arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-10 6:39 ` Christoph Hellwig
@ 2022-05-10 17:37 ` Bjorn Helgaas
2022-05-12 7:18 ` Volodymyr Babchuk
0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2022-05-10 17:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Alex Williamson, Volodymyr Babchuk, linux-pci, Leon Romanovsky,
Jason Gunthorpe
On Mon, May 09, 2022 at 11:39:30PM -0700, Christoph Hellwig wrote:
> On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
> > is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
> > is_physfn = 1, is_virtfn = 0: An SR-IOV PF
> > is_physfn = 0, is_virtfn = 1: An SR-IOV VF
> >
> > As implemented with bit fields this is 2 bits, which is more space
> > efficient than an enum. Thanks,
>
> A two-bit bitfield with explicit constants for the values would probably
> still be much easier to understand.
>
> And there is some code that seems to interpret is_physfn a bit oddly, e.g.:
>
> arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
> arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
"dev->sriov != NULL" and "dev->is_physfn" are basically the same and
many of the dev->is_physfn uses in drivers/pci would end up being
simpler if replaced with dev->sriov, e.g.,
int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
{
	if (!dev->is_physfn)
		return -EINVAL;
	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
				    dev->sriov->stride * vf_id) >> 8);
}
would be more obvious as:
	if (dev->sriov)
		return dev->bus->number + ((dev->devfn + dev->sriov->offset +
					   dev->sriov->stride * vf_id) >> 8);
	return -EINVAL;
* Re: Write to srvio_numvfs triggers kernel panic
2022-05-10 17:37 ` Bjorn Helgaas
@ 2022-05-12 7:18 ` Volodymyr Babchuk
0 siblings, 0 replies; 14+ messages in thread
From: Volodymyr Babchuk @ 2022-05-12 7:18 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Christoph Hellwig, Alex Williamson, linux-pci, Leon Romanovsky,
Jason Gunthorpe
Hello Bjorn,
Bjorn Helgaas <helgaas@kernel.org> writes:
> On Mon, May 09, 2022 at 11:39:30PM -0700, Christoph Hellwig wrote:
>> On Mon, May 09, 2022 at 10:58:57AM -0600, Alex Williamson wrote:
>> > is_physfn = 0, is_virtfn = 0: A non-SR-IOV function
>> > is_physfn = 1, is_virtfn = 0: An SR-IOV PF
>> > is_physfn = 0, is_virtfn = 1: An SR-IOV VF
>> >
>> > As implemented with bit fields this is 2 bits, which is more space
>> > efficient than an enum. Thanks,
>>
>> A two-bit bitfield with explicit constants for the values would probably
>> still be much easier to understand.
>>
>> And there is some code that seems to interpret is_physfn a bit oddly, e.g.:
>>
>> arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
>> arch/powerpc/kernel/eeh_sysfs.c: np = pci_device_to_OF_node(pdev->is_physfn ? pdev : pdev->physfn);
>
> "dev->sriov != NULL" and "dev->is_physfn" are basically the same and
> many of the dev->is_physfn uses in drivers/pci would end up being
> simpler if replaced with dev->sriov, e.g.,
>
> int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
> {
> 	if (!dev->is_physfn)
> 		return -EINVAL;
> 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
> 				    dev->sriov->stride * vf_id) >> 8);
> }
>
> would be more obvious as:
>
> 	if (dev->sriov)
> 		return dev->bus->number + ((dev->devfn + dev->sriov->offset +
> 					   dev->sriov->stride * vf_id) >> 8);
> 	return -EINVAL;
I'm not sure that this will work because dev->sriov and dev->physfn are
stored in the same union.
--
Volodymyr Babchuk at EPAM
end of thread, other threads:[~2022-05-12 7:18 UTC | newest]
Thread overview: 14+ messages
2022-05-04 19:56 Write to srvio_numvfs triggers kernel panic Volodymyr Babchuk
2022-05-06 20:17 ` Bjorn Helgaas
2022-05-07 1:34 ` Jason Gunthorpe
2022-05-07 10:25 ` Volodymyr Babchuk
2022-05-08 11:19 ` Leon Romanovsky
2022-05-09 18:22 ` Keith Busch
2022-05-07 10:22 ` Volodymyr Babchuk
2022-05-07 15:41 ` Bjorn Helgaas
2022-05-08 11:07 ` Volodymyr Babchuk
2022-05-09 16:49 ` Bjorn Helgaas
2022-05-09 16:58 ` Alex Williamson
2022-05-10 6:39 ` Christoph Hellwig
2022-05-10 17:37 ` Bjorn Helgaas
2022-05-12 7:18 ` Volodymyr Babchuk