Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / Atom feed
* arm64 iommu groups issue
@ 2019-09-19  8:43 John Garry
  2019-09-19 13:25 ` Robin Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: John Garry @ 2019-09-19  8:43 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: linux-kernel, Alex Williamson, Shameer Kolothum, Linuxarm, iommu,
	Bjorn Helgaas, linux-arm-kernel

Hi all,

We have noticed a special behaviour on our arm64 D05 board when the SMMU 
is enabled with regards PCI device iommu groups.

This platform does not support ACS, yet we find that all functions for a 
PCI device are not grouped together:

root@ubuntu:/sys# dmesg | grep "Adding to iommu group"
[    7.307539] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   12.590533] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   13.688527] mlx5_core 000a:11:00.0: Adding to iommu group 2
[   14.324606] mlx5_core 000a:11:00.1: Adding to iommu group 3
[   14.937090] ehci-platform PNP0D20:00: Adding to iommu group 4
[   15.276637] pcieport 0002:f8:00.0: Adding to iommu group 5
[   15.340845] pcieport 0004:88:00.0: Adding to iommu group 6
[   15.392098] pcieport 0005:78:00.0: Adding to iommu group 7
[   15.443356] pcieport 000a:10:00.0: Adding to iommu group 8
[   15.484975] pcieport 000c:20:00.0: Adding to iommu group 9
[   15.543647] pcieport 000d:30:00.0: Adding to iommu group 10
[   15.599771] serial 0002:f9:00.0: Adding to iommu group 5
[   15.690807] serial 0002:f9:00.1: Adding to iommu group 5
[   84.322097] mlx5_core 000a:11:00.2: Adding to iommu group 8
[   84.856408] mlx5_core 000a:11:00.3: Adding to iommu group 8

root@ubuntu:/sys#  lspci -tv
lspci -tvv
-+-[000d:30]---00.0-[31]--
   +-[000c:20]---00.0-[21]----00.0  Huawei Technologies Co., Ltd.
   +-[000a:10]---00.0-[11-12]--+-00.0  Mellanox [ConnectX-5]
   |                           +-00.1  Mellanox [ConnectX-5]
   |                           +-00.2  Mellanox [ConnectX-5 VF]
   |                           \-00.3  Mellanox [ConnectX-5 VF]
   +-[0007:90]---00.0-[91]----00.0  Huawei Technologies Co., ...
   +-[0006:c0]---00.0-[c1]--
   +-[0005:78]---00.0-[79]--
   +-[0004:88]---00.0-[89]--
   +-[0002:f8]---00.0-[f9]--+-00.0  MosChip Semiconductor Technology ...
   |                        +-00.1  MosChip Semiconductor Technology ...
   |                        \-00.2  MosChip Semiconductor Technology ...
   \-[0000:00]-

For the PCI devices in question - on port 000a:10:00.0 - you will notice 
that the port and VFs (000a:11:00.2, 3) are in one group, yet the 2 PFs 
(000a:11:00.0, 000a:11:00.1) are in separate groups.

I also notice the same ordering nature on our D06 platform - the 
pcieport is added to an iommu group after PF for that port. However this 
platform supports ACS, so not such a problem.

After some checking, I find that when the pcieport driver probes, the 
associated SMMU device had not registered yet with the IOMMU framework, 
so we defer the probe for this device - in iort.c:iort_iommu_xlate(), 
when no iommu ops are available, we defer.

Yet, when the mlx5 PF devices probe, the iommu ops are available at this 
stage. So the probe continues and we get an iommu group for the device - 
but not the same group as the parent port, as it has not yet been added 
to a group. When the port eventually probes it gets a new, separate group.

This all seems to be as the built-in module init ordering is as follows: 
pcieport drv, smmu drv, mlx5 drv

I notice that if I build the mlx5 drv as a ko and insert after boot, all 
functions + pcieport are in the same group:

[   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
[   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
[   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
[   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
[   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
[   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
[   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
[   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
[   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
[   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
[   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
[  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
[  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
[  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
[  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6

This does seem like a problem for arm64 platforms which don't support 
ACS, yet enable an SMMU. Maybe also a problem even if they do support ACS.

Opinion?

Thanks,
John


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2019-09-19  8:43 arm64 iommu groups issue John Garry
@ 2019-09-19 13:25 ` Robin Murphy
  2019-09-19 14:35   ` John Garry
  0 siblings, 1 reply; 9+ messages in thread
From: Robin Murphy @ 2019-09-19 13:25 UTC (permalink / raw)
  To: John Garry, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: linux-kernel, Alex Williamson, Shameer Kolothum, Linuxarm, iommu,
	Bjorn Helgaas, linux-arm-kernel

Hi John,

On 19/09/2019 09:43, John Garry wrote:
> Hi all,
> 
> We have noticed a special behaviour on our arm64 D05 board when the SMMU 
> is enabled with regards PCI device iommu groups.
> 
> This platform does not support ACS, yet we find that all functions for a 
> PCI device are not grouped together:
> 
> root@ubuntu:/sys# dmesg | grep "Adding to iommu group"
> [    7.307539] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
> [   12.590533] hns_dsaf HISI00B2:00: Adding to iommu group 1
> [   13.688527] mlx5_core 000a:11:00.0: Adding to iommu group 2
> [   14.324606] mlx5_core 000a:11:00.1: Adding to iommu group 3
> [   14.937090] ehci-platform PNP0D20:00: Adding to iommu group 4
> [   15.276637] pcieport 0002:f8:00.0: Adding to iommu group 5
> [   15.340845] pcieport 0004:88:00.0: Adding to iommu group 6
> [   15.392098] pcieport 0005:78:00.0: Adding to iommu group 7
> [   15.443356] pcieport 000a:10:00.0: Adding to iommu group 8
> [   15.484975] pcieport 000c:20:00.0: Adding to iommu group 9
> [   15.543647] pcieport 000d:30:00.0: Adding to iommu group 10
> [   15.599771] serial 0002:f9:00.0: Adding to iommu group 5
> [   15.690807] serial 0002:f9:00.1: Adding to iommu group 5
> [   84.322097] mlx5_core 000a:11:00.2: Adding to iommu group 8
> [   84.856408] mlx5_core 000a:11:00.3: Adding to iommu group 8
> 
> root@ubuntu:/sys#  lspci -tv
> lspci -tvv
> -+-[000d:30]---00.0-[31]--
>    +-[000c:20]---00.0-[21]----00.0  Huawei Technologies Co., Ltd.
>    +-[000a:10]---00.0-[11-12]--+-00.0  Mellanox [ConnectX-5]
>    |                           +-00.1  Mellanox [ConnectX-5]
>    |                           +-00.2  Mellanox [ConnectX-5 VF]
>    |                           \-00.3  Mellanox [ConnectX-5 VF]
>    +-[0007:90]---00.0-[91]----00.0  Huawei Technologies Co., ...
>    +-[0006:c0]---00.0-[c1]--
>    +-[0005:78]---00.0-[79]--
>    +-[0004:88]---00.0-[89]--
>    +-[0002:f8]---00.0-[f9]--+-00.0  MosChip Semiconductor Technology ...
>    |                        +-00.1  MosChip Semiconductor Technology ...
>    |                        \-00.2  MosChip Semiconductor Technology ...
>    \-[0000:00]-
> 
> For the PCI devices in question - on port 000a:10:00.0 - you will notice 
> that the port and VFs (000a:11:00.2, 3) are in one group, yet the 2 PFs 
> (000a:11:00.0, 000a:11:00.1) are in separate groups.
> 
> I also notice the same ordering nature on our D06 platform - the 
> pcieport is added to an iommu group after PF for that port. However this 
> platform supports ACS, so not such a problem.
> 
> After some checking, I find that when the pcieport driver probes, the 
> associated SMMU device had not registered yet with the IOMMU framework, 
> so we defer the probe for this device - in iort.c:iort_iommu_xlate(), 
> when no iommu ops are available, we defer.
> 
> Yet, when the mlx5 PF devices probe, the iommu ops are available at this 
> stage. So the probe continues and we get an iommu group for the device - 
> but not the same group as the parent port, as it has not yet been added 
> to a group. When the port eventually probes it gets a new, separate group.
> 
> This all seems to be as the built-in module init ordering is as follows: 
> pcieport drv, smmu drv, mlx5 drv
> 
> I notice that if I build the mlx5 drv as a ko and insert after boot, all 
> functions + pcieport are in the same group:
> 
> [   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
> [   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
> [   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
> [   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
> [   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
> [   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
> [   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
> [   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
> [   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
> [   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
> [   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
> [  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
> [  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
> [  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
> [  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6
> 
> This does seem like a problem for arm64 platforms which don't support 
> ACS, yet enable an SMMU. Maybe also a problem even if they do support ACS.
> 
> Opinion?

Yeah, this is less than ideal. One way to bodge it might be to make 
pci_device_group() also walk downwards to see if any non-ACS-isolated 
children already have a group, rather than assuming that groups get 
allocated in hierarchical order, but that's far from ideal.

The underlying issue is that, for historical reasons, OF/IORT-based 
IOMMU drivers have ended up with group allocation being tied to endpoint 
driver probing via the dma_configure() mechanism (long story short, 
driver probe is the only thing which can be delayed in order to wait for 
a specific IOMMU instance to be ready). However, in the meantime, the 
IOMMU API internals have evolved sufficiently that I think there's a way 
to really put things right - I have the spark of an idea which I'll try 
to sketch out ASAP...

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2019-09-19 13:25 ` Robin Murphy
@ 2019-09-19 14:35   ` John Garry
  2019-11-04 12:18     ` John Garry
  2020-02-13 15:49     ` John Garry
  0 siblings, 2 replies; 9+ messages in thread
From: John Garry @ 2019-09-19 14:35 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: linux-kernel, Alex Williamson, Shameer Kolothum, Linuxarm, iommu,
	Bjorn Helgaas, linux-arm-kernel

On 19/09/2019 14:25, Robin Murphy wrote:
>> When the port eventually probes it gets a new, separate group.
>>
>> This all seems to be as the built-in module init ordering is as
>> follows: pcieport drv, smmu drv, mlx5 drv
>>
>> I notice that if I build the mlx5 drv as a ko and insert after boot,
>> all functions + pcieport are in the same group:
>>
>> [   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
>> [   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
>> [   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
>> [   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
>> [   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
>> [   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
>> [   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
>> [   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
>> [   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
>> [   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
>> [   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
>> [  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
>> [  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
>> [  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
>> [  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6
>>
>> This does seem like a problem for arm64 platforms which don't support
>> ACS, yet enable an SMMU. Maybe also a problem even if they do support
>> ACS.
>>
>> Opinion?
>

Hi Robin,

> Yeah, this is less than ideal.

For sure. Our production D05 boards don't ship with the SMMU enabled in 
BIOS, but it would be slightly concerning in this regard if they did.

 > One way to bodge it might be to make
> pci_device_group() also walk downwards to see if any non-ACS-isolated
> children already have a group, rather than assuming that groups get
> allocated in hierarchical order, but that's far from ideal.

Agree.

My own workaround was to hack the mentioned iort code to defer the PF 
probe if the parent port had also yet to probe.

>
> The underlying issue is that, for historical reasons, OF/IORT-based
> IOMMU drivers have ended up with group allocation being tied to endpoint
> driver probing via the dma_configure() mechanism (long story short,
> driver probe is the only thing which can be delayed in order to wait for
> a specific IOMMU instance to be ready).However, in the meantime, the
> IOMMU API internals have evolved sufficiently that I think there's a way
> to really put things right - I have the spark of an idea which I'll try
> to sketch out ASAP...
>

OK, great.

Thanks,
John

> Robin.



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2019-09-19 14:35   ` John Garry
@ 2019-11-04 12:18     ` John Garry
  2020-02-13 15:49     ` John Garry
  1 sibling, 0 replies; 9+ messages in thread
From: John Garry @ 2019-11-04 12:18 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson, Shameer Kolothum,
	Linuxarm, iommu, Bjorn Helgaas, linux-arm-kernel

+

On 19/09/2019 15:35, John Garry wrote:
> On 19/09/2019 14:25, Robin Murphy wrote:
>>> When the port eventually probes it gets a new, separate group.
>>>
>>> This all seems to be as the built-in module init ordering is as
>>> follows: pcieport drv, smmu drv, mlx5 drv
>>>
>>> I notice that if I build the mlx5 drv as a ko and insert after boot,
>>> all functions + pcieport are in the same group:
>>>
>>> [   11.530046] hisi_sas_v2_hw HISI0162:01: Adding to iommu group 0
>>> [   17.301093] hns_dsaf HISI00B2:00: Adding to iommu group 1
>>> [   18.743600] ehci-platform PNP0D20:00: Adding to iommu group 2
>>> [   20.212284] pcieport 0002:f8:00.0: Adding to iommu group 3
>>> [   20.356303] pcieport 0004:88:00.0: Adding to iommu group 4
>>> [   20.493337] pcieport 0005:78:00.0: Adding to iommu group 5
>>> [   20.702999] pcieport 000a:10:00.0: Adding to iommu group 6
>>> [   20.859183] pcieport 000c:20:00.0: Adding to iommu group 7
>>> [   20.996140] pcieport 000d:30:00.0: Adding to iommu group 8
>>> [   21.152637] serial 0002:f9:00.0: Adding to iommu group 3
>>> [   21.346991] serial 0002:f9:00.1: Adding to iommu group 3
>>> [  100.754306] mlx5_core 000a:11:00.0: Adding to iommu group 6
>>> [  101.420156] mlx5_core 000a:11:00.1: Adding to iommu group 6
>>> [  292.481714] mlx5_core 000a:11:00.2: Adding to iommu group 6
>>> [  293.281061] mlx5_core 000a:11:00.3: Adding to iommu group 6
>>>
>>> This does seem like a problem for arm64 platforms which don't support
>>> ACS, yet enable an SMMU. Maybe also a problem even if they do support
>>> ACS.
>>>
>>> Opinion?
>>
> 
> Hi Robin,
> 
>> Yeah, this is less than ideal.
> 
> For sure. Our production D05 boards don't ship with the SMMU enabled in 
> BIOS, but it would be slightly concerning in this regard if they did.
> 
>  > One way to bodge it might be to make
>> pci_device_group() also walk downwards to see if any non-ACS-isolated
>> children already have a group, rather than assuming that groups get
>> allocated in hierarchical order, but that's far from ideal.
> 
> Agree.
> 
> My own workaround was to hack the mentioned iort code to defer the PF 
> probe if the parent port had also yet to probe.
> 
>>
>> The underlying issue is that, for historical reasons, OF/IORT-based
>> IOMMU drivers have ended up with group allocation being tied to endpoint
>> driver probing via the dma_configure() mechanism (long story short,
>> driver probe is the only thing which can be delayed in order to wait for
>> a specific IOMMU instance to be ready).However, in the meantime, the
>> IOMMU API internals have evolved sufficiently that I think there's a way
>> to really put things right - I have the spark of an idea which I'll try
>> to sketch out ASAP...
>>
> 
> OK, great.
> 
> Thanks,
> John
> 
>> Robin.
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2019-09-19 14:35   ` John Garry
  2019-11-04 12:18     ` John Garry
@ 2020-02-13 15:49     ` John Garry
  2020-02-13 19:40       ` Robin Murphy
  1 sibling, 1 reply; 9+ messages in thread
From: John Garry @ 2020-02-13 15:49 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson, Shameer Kolothum,
	Linuxarm, iommu, Bjorn Helgaas, linux-arm-kernel

>>
>> The underlying issue is that, for historical reasons, OF/IORT-based
>> IOMMU drivers have ended up with group allocation being tied to endpoint
>> driver probing via the dma_configure() mechanism (long story short,
>> driver probe is the only thing which can be delayed in order to wait for
>> a specific IOMMU instance to be ready).However, in the meantime, the
>> IOMMU API internals have evolved sufficiently that I think there's a way
>> to really put things right - I have the spark of an idea which I'll try
>> to sketch out ASAP...
>>
> 
> OK, great.

Hi Robin,

I was wondering if you have had a chance to consider this problem again?

One simple idea could be to introduce a device link between the endpoint 
device and its parent bridge to ensure that they probe in order, as 
expected in pci_device_group():

Subject: [PATCH] PCI: Add device link to ensure endpoint device driver 
probes after bridge

It is required to ensure that a device driver for an endpoint will probe
after the parent port driver, so add a device link for this.

---
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 512cb4312ddd..4b832ad25b20 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2383,6 +2383,7 @@ static void pci_set_msi_domain(struct pci_dev *dev)
  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
  {
  	int ret;
+	struct device *parent;

  	pci_configure_device(dev);

@@ -2420,6 +2421,10 @@ void pci_device_add(struct pci_dev *dev, struct 
pci_bus *bus)
  	/* Set up MSI IRQ domain */
  	pci_set_msi_domain(dev);

+	parent = dev->dev.parent;
+	if (parent && parent->bus == &pci_bus_type)
+		device_link_add(&dev->dev, parent, DL_FLAG_AUTOPROBE_CONSUMER);
+
  	/* Notifier could use PCI capabilities */
  	dev->match_driver = false;
  	ret = device_add(&dev->dev);
-- 

This would work, but the problem is that if the port driver fails in 
probing - and not just for -EPROBE_DEFER - then the child device will 
never probe. This very thing happens on my dev board. However we could 
expand the device links API to cover this sort of scenario.

As for alternatives, it looks pretty difficult to me to disassociate the 
group allocation from the dma_configure path.

Let me know.

Thanks,
John

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2020-02-13 15:49     ` John Garry
@ 2020-02-13 19:40       ` Robin Murphy
  2020-02-14 14:09         ` John Garry
  0 siblings, 1 reply; 9+ messages in thread
From: Robin Murphy @ 2020-02-13 19:40 UTC (permalink / raw)
  To: John Garry, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson, Shameer Kolothum,
	Linuxarm, iommu, Bjorn Helgaas, linux-arm-kernel

On 13/02/2020 3:49 pm, John Garry wrote:
>>>
>>> The underlying issue is that, for historical reasons, OF/IORT-based
>>> IOMMU drivers have ended up with group allocation being tied to endpoint
>>> driver probing via the dma_configure() mechanism (long story short,
>>> driver probe is the only thing which can be delayed in order to wait for
>>> a specific IOMMU instance to be ready).However, in the meantime, the
>>> IOMMU API internals have evolved sufficiently that I think there's a way
>>> to really put things right - I have the spark of an idea which I'll try
>>> to sketch out ASAP...
>>>
>>
>> OK, great.
> 
> Hi Robin,
> 
> I was wondering if you have had a chance to consider this problem again?

Yeah, sorry, more things came up such that it still hasn't been P yet ;)

Lorenzo and I did get as far as discussing it a bit more and writing up 
a ticket, so it's formally on our internal radar now, at least.

> One simple idea could be to introduce a device link between the endpoint 
> device and its parent bridge to ensure that they probe in order, as 
> expected in pci_device_group():
> 
> Subject: [PATCH] PCI: Add device link to ensure endpoint device driver 
> probes after bridge
> 
> It is required to ensure that a device driver for an endpoint will probe
> after the parent port driver, so add a device link for this.
> 
> ---
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 512cb4312ddd..4b832ad25b20 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2383,6 +2383,7 @@ static void pci_set_msi_domain(struct pci_dev *dev)
>   void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>   {
>       int ret;
> +    struct device *parent;
> 
>       pci_configure_device(dev);
> 
> @@ -2420,6 +2421,10 @@ void pci_device_add(struct pci_dev *dev, struct 
> pci_bus *bus)
>       /* Set up MSI IRQ domain */
>       pci_set_msi_domain(dev);
> 
> +    parent = dev->dev.parent;
> +    if (parent && parent->bus == &pci_bus_type)
> +        device_link_add(&dev->dev, parent, DL_FLAG_AUTOPROBE_CONSUMER);
> +
>       /* Notifier could use PCI capabilities */
>       dev->match_driver = false;
>       ret = device_add(&dev->dev);
> -- 
> 
> This would work, but the problem is that if the port driver fails in 
> probing - and not just for -EPROBE_DEFER - then the child device will 
> never probe. This very thing happens on my dev board. However we could 
> expand the device links API to cover this sort of scenario.

Yes, that's an undesirable issue, but in fact I think it's mostly 
indicative that involving drivers in something which is designed to 
happen at a level below drivers is still fundamentally wrong and doomed 
to be fragile at best.

Another thought that crosses my mind is that when pci_device_group() 
walks up to the point of ACS isolation and doesn't find an existing 
group, it can still infer that everything it walked past *should* be put 
in the same group it's then eventually going to return. Unfortunately I 
can't see an obvious way for it to act on that knowledge, though, since 
recursive iommu_probe_device() is unlikely to end well.

> As for alternatives, it looks pretty difficult to me to disassociate the 
> group allocation from the dma_configure path.

Indeed it's non-trivial, but it really does need cleaning up at some point.

Having just had yet another spark, does something like the untested 
super-hack below work at all? I doubt it's a viable direction to take in 
itself, but it could be food for thought if it at least proves the concept.

Robin.

----->8-----
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index aa3ac2a03807..554cde76c766 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3841,20 +3841,20 @@ static int arm_smmu_set_bus_ops(struct iommu_ops 
*ops)
  	int err;

  #ifdef CONFIG_PCI
-	if (pci_bus_type.iommu_ops != ops) {
+	if (1) {
  		err = bus_set_iommu(&pci_bus_type, ops);
  		if (err)
  			return err;
  	}
  #endif
  #ifdef CONFIG_ARM_AMBA
-	if (amba_bustype.iommu_ops != ops) {
+	if (1) {
  		err = bus_set_iommu(&amba_bustype, ops);
  		if (err)
  			goto err_reset_pci_ops;
  	}
  #endif
-	if (platform_bus_type.iommu_ops != ops) {
+	if (1) {
  		err = bus_set_iommu(&platform_bus_type, ops);
  		if (err)
  			goto err_reset_amba_ops;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 660eea8d1d2f..b81ae2b4d4fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1542,6 +1542,14 @@ static int iommu_bus_init(struct bus_type *bus, 
const struct iommu_ops *ops)
  	return err;
  }

+static int iommu_bus_replay(struct device *dev, void *data)
+{
+	if (dev->iommu_group)
+		return 0;
+
+	return add_iommu_group(dev, data);
+}
+
  /**
   * bus_set_iommu - set iommu-callbacks for the bus
   * @bus: bus.
@@ -1564,6 +1572,9 @@ int bus_set_iommu(struct bus_type *bus, const 
struct iommu_ops *ops)
  		return 0;
  	}

+	if (bus->iommu_ops == ops)
+		return bus_for_each_dev(bus, NULL, NULL, iommu_bus_replay);
+
  	if (bus->iommu_ops != NULL)
  		return -EBUSY;


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2020-02-13 19:40       ` Robin Murphy
@ 2020-02-14 14:09         ` John Garry
  2020-02-14 18:35           ` Robin Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: John Garry @ 2020-02-14 14:09 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson,
	Shameerali Kolothum Thodi, Linuxarm, iommu, Bjorn Helgaas,
	linux-arm-kernel

>>
>> @@ -2420,6 +2421,10 @@ void pci_device_add(struct pci_dev *dev, struct
>> pci_bus *bus)
>>        /* Set up MSI IRQ domain */
>>        pci_set_msi_domain(dev);
>>
>> +    parent = dev->dev.parent;
>> +    if (parent && parent->bus == &pci_bus_type)
>> +        device_link_add(&dev->dev, parent, DL_FLAG_AUTOPROBE_CONSUMER);
>> +
>>        /* Notifier could use PCI capabilities */
>>        dev->match_driver = false;
>>        ret = device_add(&dev->dev);
>> -- 
>>
>> This would work, but the problem is that if the port driver fails in
>> probing - and not just for -EPROBE_DEFER - then the child device will
>> never probe. This very thing happens on my dev board. However we could
>> expand the device links API to cover this sort of scenario.
> 
> Yes, that's an undesirable issue, but in fact I think it's mostly
> indicative that involving drivers in something which is designed to
> happen at a level below drivers is still fundamentally wrong and doomed
> to be fragile at best.

Right, and even worse is that it relies on the port driver even existing 
at all.

All this iommu group assignment should be taken outside device driver 
probe paths.

However we could still consider device links for sync'ing the SMMU and 
each device probing.

> 
> Another thought that crosses my mind is that when pci_device_group()
> walks up to the point of ACS isolation and doesn't find an existing
> group, it can still infer that everything it walked past *should* be put
> in the same group it's then eventually going to return. Unfortunately I
> can't see an obvious way for it to act on that knowledge, though, since
> recursive iommu_probe_device() is unlikely to end well.

I'd be inclined not to change that code.

> 
>> As for alternatives, it looks pretty difficult to me to disassociate the
>> group allocation from the dma_configure path.
> 
> Indeed it's non-trivial, but it really does need cleaning up at some point.
> 
> Having just had yet another spark, does something like the untested
> super-hack below work at all? 

I tried it and it doesn't (yet) work.

So when we try 
iommu_bus_replay()->add_iommu_group()->iommu_probe_device()->arm_smmu_add_device(),
the iommu_fwspec is still NULL for that device - this is not set until 
later when the device driver is going to finally probe in 
iort_iommu_xlate()->iommu_fwspec_init(), and it's too late...

And this looks to be the reason for which current 
iommu_bus_init()->bus_for_each_device(..., add_iommu_group) fails also.

On this current code mentioned, the principle of this seems wrong to me 
- we call bus_for_each_device(..., add_iommu_group) for the first SMMU 
in the system which probes, but we attempt to add_iommu_group() for all 
devices on the bus, even though the SMMU for that device may yet to have 
probed.

Thanks,
John

I doubt it's a viable direction to take in
> itself, but it could be food for thought if it at least proves the concept.
> 
> Robin.
> 
> ----->8-----
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index aa3ac2a03807..554cde76c766 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3841,20 +3841,20 @@ static int arm_smmu_set_bus_ops(struct iommu_ops
> *ops)
>    	int err;
> 
>    #ifdef CONFIG_PCI
> -	if (pci_bus_type.iommu_ops != ops) {
> +	if (1) {
>    		err = bus_set_iommu(&pci_bus_type, ops);
>    		if (err)
>    			return err;
>    	}
>    #endif
>    #ifdef CONFIG_ARM_AMBA
> -	if (amba_bustype.iommu_ops != ops) {
> +	if (1) {
>    		err = bus_set_iommu(&amba_bustype, ops);
>    		if (err)
>    			goto err_reset_pci_ops;
>    	}
>    #endif
> -	if (platform_bus_type.iommu_ops != ops) {
> +	if (1) {
>    		err = bus_set_iommu(&platform_bus_type, ops);
>    		if (err)
>    			goto err_reset_amba_ops;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 660eea8d1d2f..b81ae2b4d4fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1542,6 +1542,14 @@ static int iommu_bus_init(struct bus_type *bus,
> const struct iommu_ops *ops)
>    	return err;
>    }
> 
> +static int iommu_bus_replay(struct device *dev, void *data)
> +{
> +	if (dev->iommu_group)
> +		return 0;
> +
> +	return add_iommu_group(dev, data);
> +}
> +
>    /**
>     * bus_set_iommu - set iommu-callbacks for the bus
>     * @bus: bus.
> @@ -1564,6 +1572,9 @@ int bus_set_iommu(struct bus_type *bus, const
> struct iommu_ops *ops)
>    		return 0;
>    	}
> 
> +	if (bus->iommu_ops == ops)
> +		return bus_for_each_dev(bus, NULL, NULL, iommu_bus_replay);
> +
>    	if (bus->iommu_ops != NULL)
>    		return -EBUSY;
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2020-02-14 14:09         ` John Garry
@ 2020-02-14 18:35           ` Robin Murphy
  2020-02-17 12:08             ` John Garry
  0 siblings, 1 reply; 9+ messages in thread
From: Robin Murphy @ 2020-02-14 18:35 UTC (permalink / raw)
  To: John Garry, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson,
	Shameerali Kolothum Thodi, Linuxarm, iommu, Bjorn Helgaas,
	linux-arm-kernel

On 14/02/2020 2:09 pm, John Garry wrote:
>>>
>>> @@ -2420,6 +2421,10 @@ void pci_device_add(struct pci_dev *dev, struct
>>> pci_bus *bus)
>>>        /* Set up MSI IRQ domain */
>>>        pci_set_msi_domain(dev);
>>>
>>> +    parent = dev->dev.parent;
>>> +    if (parent && parent->bus == &pci_bus_type)
>>> +        device_link_add(&dev->dev, parent, DL_FLAG_AUTOPROBE_CONSUMER);
>>> +
>>>        /* Notifier could use PCI capabilities */
>>>        dev->match_driver = false;
>>>        ret = device_add(&dev->dev);
>>> -- 
>>>
>>> This would work, but the problem is that if the port driver fails in
>>> probing - and not just for -EPROBE_DEFER - then the child device will
>>> never probe. This very thing happens on my dev board. However we could
>>> expand the device links API to cover this sort of scenario.
>>
>> Yes, that's an undesirable issue, but in fact I think it's mostly
>> indicative that involving drivers in something which is designed to
>> happen at a level below drivers is still fundamentally wrong and doomed
>> to be fragile at best.
> 
> Right, and even worse is that it relies on the port driver even existing 
> at all.
> 
> All this iommu group assignment should be taken outside device driver 
> probe paths.
> 
> However we could still consider device links for sync'ing the SMMU and 
> each device probing.

Yes, we should get that for DT now thanks to the of_devlink stuff, but 
cooking up some equivalent for IORT might be worthwhile.

>> Another thought that crosses my mind is that when pci_device_group()
>> walks up to the point of ACS isolation and doesn't find an existing
>> group, it can still infer that everything it walked past *should* be put
>> in the same group it's then eventually going to return. Unfortunately I
>> can't see an obvious way for it to act on that knowledge, though, since
>> recursive iommu_probe_device() is unlikely to end well.
> 
> I'd be inclined not to change that code.
> 
>>
>>> As for alternatives, it looks pretty difficult to me to disassociate the
>>> group allocation from the dma_configure path.
>>
>> Indeed it's non-trivial, but it really does need cleaning up at some 
>> point.
>>
>> Having just had yet another spark, does something like the untested
>> super-hack below work at all? 
> 
> I tried it and it doesn't (yet) work.

Bleh - further reinforcement of the "ideas after 6PM are bad ideas" rule...

> So when we try 
> iommu_bus_replay()->add_iommu_group()->iommu_probe_device()->arm_smmu_add_device(), 
> 
> the iommu_fwspec is still NULL for that device - this is not set until 
> later when the device driver is going to finally probe in 
> iort_iommu_xlate()->iommu_fwspec_init(), and it's too late...
> 
> And this looks to be the reason for which current 
> iommu_bus_init()->bus_for_each_device(..., add_iommu_group) fails also.

Of course, just adding a 'correct' add_device replay without the 
of_xlate process doesn't help at all. No wonder this looked suspiciously 
simpler than where the first idea left off...

(on reflection, the core of this idea seems to be recycling the existing 
iommu_bus_init walk rather than building up a separate "waiting list", 
while forgetting that that wasn't the difficult part of the original 
idea anyway)

> On this current code mentioned, the principle of this seems wrong to me 
> - we call bus_for_each_device(..., add_iommu_group) for the first SMMU 
> in the system which probes, but we attempt to add_iommu_group() for all 
> devices on the bus, even though the SMMU for that device may yet to have 
> probed.

Yes, iommu_bus_init() is one of the places still holding a 
deeply-ingrained assumption that the ops go live for all IOMMU instances 
at once, which is what warranted the further replay in 
of_iommu_configure() originally. Moving that out of 
of_platform_device_create() to support probe deferral is where the 
trouble really started.

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: arm64 iommu groups issue
  2020-02-14 18:35           ` Robin Murphy
@ 2020-02-17 12:08             ` John Garry
  0 siblings, 0 replies; 9+ messages in thread
From: John Garry @ 2020-02-17 12:08 UTC (permalink / raw)
  To: Robin Murphy, Marc Zyngier, Will Deacon, Lorenzo Pieralisi,
	Sudeep Holla, Guohanjun (Hanjun Guo)
  Cc: Saravana Kannan, linux-kernel, Alex Williamson,
	Shameerali Kolothum Thodi, Linuxarm, iommu, Bjorn Helgaas,
	linux-arm-kernel

>>
>> Right, and even worse is that it relies on the port driver even 
>> existing at all.
>>
>> All this iommu group assignment should be taken outside device driver 
>> probe paths.
>>
>> However we could still consider device links for sync'ing the SMMU and 
>> each device probing.
> 
> Yes, we should get that for DT now thanks to the of_devlink stuff, but 
> cooking up some equivalent for IORT might be worthwhile.

It doesn't solve this problem, but at least we could remove the 
iommu_ops check in iort_iommu_xlate().

We would need to carve out a path from pci_device_add() or even 
device_add() to solve all cases.

> 
>>> Another thought that crosses my mind is that when pci_device_group()
>>> walks up to the point of ACS isolation and doesn't find an existing
>>> group, it can still infer that everything it walked past *should* be put
>>> in the same group it's then eventually going to return. Unfortunately I
>>> can't see an obvious way for it to act on that knowledge, though, since
>>> recursive iommu_probe_device() is unlikely to end well.
>>

[...]

>> And this looks to be the reason for which current 
>> iommu_bus_init()->bus_for_each_device(..., add_iommu_group) fails also.
> 
> Of course, just adding a 'correct' add_device replay without the 
> of_xlate process doesn't help at all. No wonder this looked suspiciously 
> simpler than where the first idea left off...
> 
> (on reflection, the core of this idea seems to be recycling the existing 
> iommu_bus_init walk rather than building up a separate "waiting list", 
> while forgetting that that wasn't the difficult part of the original 
> idea anyway)

We could still use a bus walk to add the group per iommu, but we would 
need an additional check to ensure the device is associated with the IOMMU.

> 
>> On this current code mentioned, the principle of this seems wrong to 
>> me - we call bus_for_each_device(..., add_iommu_group) for the first 
>> SMMU in the system which probes, but we attempt to add_iommu_group() 
>> for all devices on the bus, even though the SMMU for that device may 
>> yet to have probed.
> 
> Yes, iommu_bus_init() is one of the places still holding a 
> deeply-ingrained assumption that the ops go live for all IOMMU instances 
> at once, which is what warranted the further replay in 
> of_iommu_configure() originally. Moving that out of 
> of_platform_device_create() to support probe deferral is where the 
> trouble really started.

I'm not too familiar with the history here, but could this be reverted 
now with the introduction of of_devlink stuff?

Cheers,
John

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19  8:43 arm64 iommu groups issue John Garry
2019-09-19 13:25 ` Robin Murphy
2019-09-19 14:35   ` John Garry
2019-11-04 12:18     ` John Garry
2020-02-13 15:49     ` John Garry
2020-02-13 19:40       ` Robin Murphy
2020-02-14 14:09         ` John Garry
2020-02-14 18:35           ` Robin Murphy
2020-02-17 12:08             ` John Garry

Linux-ARM-Kernel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/0 linux-arm-kernel/git/0.git
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/1 linux-arm-kernel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-arm-kernel linux-arm-kernel/ https://lore.kernel.org/linux-arm-kernel \
		linux-arm-kernel@lists.infradead.org
	public-inbox-index linux-arm-kernel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-arm-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git