* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
From: Joshua Kinard @ 2017-02-12  4:09 UTC
  To: Bjorn Helgaas
  Cc: Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On 02/09/2017 16:29, Bjorn Helgaas wrote:

[snip]

>>>> However, IP27 is completely different in this regard.  Instead of 
>>>> using ioremapped addresses for I/O, IP27 has a dedicated address 
>>>> range, 0x92xxxxxxxxxxxxxx, that is used for all I/O access.  Since 
>>>> this is uncached physical address space, the generic MIPS PCI code 
>>>> will not probe it correctly and thus, the original behavior of 
>>>> PCI_PROBE_ONLY needs to be restored only for the IP27 platform to 
>>>> bypass this logic and have working PCI, at least for the IO6/IO6G 
>>>> board that houses the base devices, until a better solution is found.
>>> 
>>> It sounds like there's something different about how ioremap() works on
>>>  these platforms and PCI probing is tripping over that.  I'd really like
>>>  to understand more about this difference to see if we can converge that
>>>  instead of adding back the PCI_PROBE_ONLY usage.
>> 
>> I'd need to go and dig around in the IP27 headers again for this machine 
>> to see what ioremap() is actually doing, but I *think* it returns uncached
>> physical addresses in most instances because of a special feature of the
>> CPU, the R10000-family, which makes uncached access very fast.  I think
>> only the IP27 platform uses this capability.  Other R1x0-based systems
>> don't (although, I might be wrong about IP27's successor in IP35).
> 
> ioremap() must return a CPU virtual address.  It can take advantage of 
> arch-specific features, special address ranges that are uncached, special 
> identity mappings, or whatever, but from the caller's point of view, the 
> return value must be usable as a virtual address without any special 
> handling.

Apparently, MIPS's implementation of ioremap does not guarantee that a virtual
address is always returned.  Quoting from arch/mips/include/asm/io.h around
line #236:

/*
 * ioremap     -   map bus memory into CPU space
 * @offset:    bus address of the memory
 * @size:      size of the resource to map
 *
 * ioremap performs a platform specific sequence of operations to
 * make bus memory CPU accessible via the readb/readw/readl/writeb/
 * writew/writel functions and the other mmio helpers. The returned
 * address is not guaranteed to be usable directly as a virtual
 * address.
 */

It looks like the qla1280.c driver calls pci_ioremap_bar(), which calls
ioremap_nocache().  On MIPS, that's just an alias (as far as I can tell) for
ioremap(), so it has the same limitation: the returned address may not be
usable as a CPU virtual address.
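
To illustrate, here's a very rough sketch of the 64-bit path (this is not the
literal arch/mips code, just the effect; the constant is IP27's uncached I/O
base from the 0x92... range discussed earlier):

static inline void __iomem *ioremap_sketch(phys_addr_t offset, unsigned long size)
{
        /*
         * On a 64-bit platform like IP27, the MIPS ioremap path can end up
         * simply adding the platform's uncached base to the physical address
         * and handing that back.  The result works as a cookie for the mmio
         * helpers, but it is not an ordinary kernel virtual address.
         */
        const u64 uncac_base = 0x9200000000000000ULL;   /* IP27's IO_BASE */

        return (void __iomem *)(unsigned long)(uncac_base + offset);
}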


> This is from ip27-dmesg-working_pci-20170208.txt:
> 
>   PCI host bridge to bus 0002:00
>   pci_bus 0002:00: root bus resource [mem 0x920000000f200000-0x920000000f9fffff]
>   pci_bus 0002:00: root bus resource [io  0x920000000fa00000-0x920000000fbfffff]
>   pci_bus 0002:00: root bus resource [bus 02-ff]
>   pci 0002:00:00.0: [1077:1020] type 00 class 0x010000
>   pci 0002:00:00.0: reg 0x10: [io  0xf200000-0xf2000ff]
>   pci 0002:00:00.0: reg 0x14: [mem 0x0f200000-0x0f200fff]
>   pci 0002:00:00.0: reg 0x30: [mem 0x00000000-0x0000ffff pref]
> 
> There's something wrong here: all the resources printed above, including the
> BARs, should be CPU physical addresses (as are all the /proc/iomem 
> addresses).  The "root bus resources" are windows through the host bridge to
> PCI.  The BAR resources should be inside those windows.
> 
> I'm sure the bridge translates between CPU physical addresses and PCI bus 
> addresses, probably by chopping off all those high-order bits.  The PCI core
> needs to know about this, and it looks like pcibios_scanbus() is trying to
> tell it by using pci_add_resource_offset().  But the PCI core isn't getting
> the message because we don't see the "(bus address [%#010llx-%#010llx])"
> being printed by pci_register_host_bridge().
> 
> A correct dmesg log would look something like this:
> 
>   PCI host bridge to bus 0002:00
>   pci_bus 0002:00: root bus resource [mem 0x920000000f200000-0x920000000f9fffff]
> (bus address [0x0f200000-0x0f9fffff])
>   pci 0002:00:00.0: reg 0x14: [mem 0x920000000f200000-0x920000000f200fff]
> 
> This would mean that for 0002:00:00.0, pdev->resource[1] contains the CPU 
> physical address 0x920000000f200000, and the host bridge translates that to 
> 0x0f200000, which is what the 32-bit BAR1 would contain.
> 
> I think "lspci" should show you the CPU addresses (0x920000000f200000) and 
> "lspci -b" should show you the actual BAR values (0x0f200000).
> 
> Your dmesg log is showing you 0x0f200000, which is probably the BAR value. 
> __pci_read_base() reads the BAR value and runs it through 
> pcibios_bus_to_resource(), which *should* be converting it to 
> 0x920000000f200000 (the reverse of the translation done by the host bridge).
> But since the PCI core doesn't know about that translation, 
> pcibios_bus_to_resource() does nothing.
> 
> So I think there's something wrong with the way you're using 
> pci_add_resource_offset().

This is basically the same conclusion I've come to.  I've found a "partial"
solution, but I'm pretty sure it's wrong.

It seems the reason IP30 works without PCI_PROBE_ONLY now is that it doesn't
need any special casing of the addresses.  IP27, on the other hand, needs all
I/O accesses to go through the 0x92xxx range, or you'll run into
hardware-enforced barriers that will panic the kernel.

My first attempt was to simply subtract NODE_IO_BASE(_n) from the address
values that ultimately get fed to pci_add_resource_offset().  That yields this
kind of output:

[    8.065146] PCI host bridge to bus 0000:00
[    8.113572] pci_bus 0000:00: root bus resource [mem 0x0b200000-0x0b9fffff]
[    8.196284] pci_bus 0000:00: root bus resource [io  0xba00000-0xbbfffff]
[    8.276934] pci_bus 0000:00: root bus resource [bus 00-ff]
[    8.343199] PCI host bridge to bus 0001:00
[    8.392144] pci_bus 0001:00: root bus resource [mem 0x0c200000-0x0c9fffff]
[    8.474899] pci_bus 0001:00: root bus resource [io  0xca00000-0xcbfffff]
[    8.555520] pci_bus 0001:00: root bus resource [bus 01-ff]
[    8.622536] pci 0001:00:00.0: can't claim BAR 0 [mem 0x00000000-0x07ffffff]: no compatible bridge window
[    8.735665] pci 0001:00:01.0: can't claim BAR 0 [mem 0x00000000-0x07ffffff]: no compatible bridge window
[    8.850128] PCI host bridge to bus 0002:00
[    8.899049] pci_bus 0002:00: root bus resource [mem 0x0f200000-0x0f9fffff]
[    8.981794] pci_bus 0002:00: root bus resource [io  0xfa00000-0xfbfffff]
[    9.062413] pci_bus 0002:00: root bus resource [bus 02-ff]
[    9.130425] pci 0002:00:00.0: can't claim BAR 0 [io  0xf200000-0xf2000ff]: no compatible bridge window
[    9.240674] pci 0002:00:00.0: can't claim BAR 6 [mem 0x00000000-0x0000ffff pref]: no compatible bridge window
[    9.360079] pci 0002:00:01.0: can't claim BAR 0 [io  0xf400000-0xf4000ff]: no compatible bridge window
[    9.472148] pci 0002:00:01.0: can't claim BAR 6 [mem 0x00000000-0x0000ffff pref]: no compatible bridge window
[    9.591555] pci 0002:00:06.0: can't claim BAR 0 [mem 0x0fa00000-0x0fafffff]: no compatible bridge window
[    9.705685] pci 0002:00:07.0: can't claim BAR 0 [mem 0x00000000-0x00001fff]: no compatible bridge window
[    9.819933] qla1280: QLA1040 found on PCI bus 0, dev 0
[    9.881644] PCI: Enabling device 0002:00:00.0 (0006 -> 0007)

Despite the above errors, probing succeeds, devices are found, and the system
boots into userland.  The error messages stem from pci_claim_resource() in
drivers/pci/setup-res.c, because it can't seem to find the parent bridge
window.  Not sure if that's good or bad.  But it does boot, and this approach
would remove one of the two #ifdef hacks in my patch.
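
For clarity, that first attempt boils down to something like this hypothetical
helper (not the actual diff; the controller fields are the usual struct
pci_controller ones and NODE_IO_BASE() is the IP27 macro mentioned above):

static void ip27_strip_node_io_base(struct pci_controller *ctrl, nasid_t nasid)
{
        u64 base = NODE_IO_BASE(nasid);         /* the node's 0x92... I/O base */

        /*
         * Make the windows look like plain physical ranges before
         * pcibios_scanbus() hands them to pci_add_resource_offset().
         */
        ctrl->mem_resource->start -= base;
        ctrl->mem_resource->end   -= base;
        ctrl->io_resource->start  -= base;
        ctrl->io_resource->end    -= base;
}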

So I next looked at the offset bit you mentioned, and after some fiddling
around in pci-bridge.c, I got this dmesg output, which matches what you say
it should look like:

[    8.095560] PCI host bridge to bus 0000:00
[    8.143949] pci_bus 0000:00: root bus resource [mem 0x920000000b200000-0x920000000b9fffff] (bus address [0x0b200000-0x0b9fffff])
[    8.283216] pci_bus 0000:00: root bus resource [io  0x920000000ba00000-0x920000000bbfffff] (bus address [0xba00000-0xbbfffff])
[    8.420442] pci_bus 0000:00: root bus resource [bus 00-ff]
[    8.486711] PCI host bridge to bus 0001:00
[    8.535656] pci_bus 0001:00: root bus resource [mem 0x920000000c200000-0x920000000c9fffff] (bus address [0x0c200000-0x0c9fffff])
[    8.674965] pci_bus 0001:00: root bus resource [io  0x920000000ca00000-0x920000000cbfffff] (bus address [0xca00000-0xcbfffff])
[    8.812144] pci_bus 0001:00: root bus resource [bus 01-ff]
[    8.879140] pci 0001:00:00.0: can't claim BAR 0 [mem 0x00000000-0x07ffffff]: no compatible bridge window
[    8.992298] pci 0001:00:01.0: can't claim BAR 0 [mem 0x00000000-0x07ffffff]: no compatible bridge window
[    9.106732] PCI host bridge to bus 0002:00
[    9.155679] pci_bus 0002:00: root bus resource [mem 0x920000000f200000-0x920000000f9fffff] (bus address [0x0f200000-0x0f9fffff])
[    9.294974] pci_bus 0002:00: root bus resource [io  0x920000000fa00000-0x920000000fbfffff] (bus address [0xfa00000-0xfbfffff])
[    9.432168] pci_bus 0002:00: root bus resource [bus 02-ff]
[    9.500118] pci 0002:00:00.0: can't claim BAR 0 [io  0xf200000-0xf2000ff]: no compatible bridge window
[    9.610326] pci 0002:00:00.0: can't claim BAR 6 [mem 0x00000000-0x0000ffff pref]: no compatible bridge window
[    9.729688] pci 0002:00:01.0: can't claim BAR 0 [io  0xf400000-0xf4000ff]: no compatible bridge window
[    9.841764] pci 0002:00:01.0: can't claim BAR 6 [mem 0x00000000-0x0000ffff pref]: no compatible bridge window
[    9.961151] pci 0002:00:06.0: can't claim BAR 0 [mem 0x0fa00000-0x0fafffff]: no compatible bridge window
[   10.075344] pci 0002:00:07.0: can't claim BAR 0 [mem 0x00000000-0x00001fff]: no compatible bridge window
[   10.189528] qla1280: QLA1040 found on PCI bus 0, dev 0
[   10.251300] PCI: Enabling device 0002:00:00.0 (0006 -> 0007)
[   10.320129] Unhandled kernel unaligned access[#1]:
[   10.376908] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc7-mipsgit-20170208 #24
[   10.471163] task: a8000000e077c880 task.stack: a8000000e0780000
[   10.542378] $ 0   : 0000000000000000 0000000000024680 0000000000000000 0000000000000000
[   10.638730] $ 4   : 000000000065e0c0 0000000000000010 a800000000450a18 240000000f20000a
[   10.735084] $ 8   : a8000000007a3978 a8000000e00c05b0 0000000000000000 a80000000145ef30
[   10.831440] $12   : ffffffff94005ce1 000000001000001e ffffffffffffff80 00000000007e0000
[   10.927796] $16   : a8000000e078f920 a800000000450a14 0000000000000000 ffffffffa4600000
[   11.024149] $20   : 240000000f20000a a8000000004509b8 a8000000006b9930 a8000000007eb1b8
[   11.120506] $24   : 0000000000940000 00000000007e0000
[   11.216860] $28   : a8000000e0780000 a8000000e078f8d0 a8000000e00c1650 a800000000024680
[   11.313216] Hi    : 0000000000000000
[   11.356156] Lo    : 0000000000002800
[   11.399117] epc   : a80000000002de44 do_ade+0x4f4/0x960
[   11.461952] ra    : a800000000024680 ret_from_exception+0x0/0x18
[   11.534204] Status: 94005ce3 KX SX UX KERNEL EXL IE
[   11.593903] Cause : 00008014 (ExcCode 05)
[   11.642080] BadVA : 240000000f20000b
[   11.685020] PrId  : 00000f14 (R14000)

And you can see that while the dmesg output is correct, the end result is not.
The issue goes back to what MIPS's ioremap() does when it encounters the
specialized uncached attribute of the R1x0k CPUs on IP27: it adds
0x9200000000000000, which is IP27's IO_BASE, to one of the addresses during
qla1280's probe.  That wraps around to 0x240000000xxxxxxx, which is not a
valid kernel address on MIPS, and leads to the unhandled kernel unaligned
access panic.

I tried implementing a custom plat_ioremap() that would subtract the IO_BASE
from the address being ioremapped by qla1280.  MIPS has a generic template for
this in arch/mips/include/asm/mach-generic/ioremap.h, but my gcc version is
recent enough to have new checks enabled that choke on casting a phys_addr_t
to void __iomem * (-Wint-to-pointer-cast), and I've given up on finding a
workaround that doesn't involve invasive changes to MIPS's ioremap core.
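
For reference, the idea was roughly this (a hypothetical sketch based on the
mach-generic template, not my actual attempt; IO_BASE here is IP27's
0x9200000000000000):

static inline void __iomem *plat_ioremap(phys_addr_t offset, unsigned long size,
                                         unsigned long flags)
{
        /*
         * If the caller already handed us an address carrying the uncached
         * IO_BASE prefix, return it as-is so the generic code doesn't add
         * IO_BASE a second time and wrap around.
         */
        if ((offset & IO_BASE) == IO_BASE)
                return (void __iomem *)(unsigned long)offset;

        return NULL;    /* let the generic ioremap path handle everything else */
}

static inline int plat_iounmap(const volatile void __iomem *addr)
{
        /* nothing to undo for the addresses we passed through above */
        return ((unsigned long)addr & IO_BASE) == IO_BASE;
}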

--J


* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
From: Bjorn Helgaas @ 2017-02-13 22:45 UTC
  To: Joshua Kinard
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

It looks like IP30 is using some code that's not upstream yet, so I'll
just point out some things that don't look right.  I'm ignoring the
IP27 stuff here in case that has different issues.

Joshua wrote:
> On 02/07/2017 13:29, Bjorn Helgaas wrote:
>> Is there any chance you can collect complete dmesg logs and
>> /proc/iomem contents from IP27 and IP30?  Maybe "lspci -vv" output,
>> too?  I'm not sure where to look to understand the ioremap() behavior.
> 
> ...
> A note about IP30, since that's part of an external patch set, the
> dmesg output is a little different, as it is using the newer
> BRIDGE/Xtalk code I just sent in as patches.  ...
> 
> Also, IP30 doesn't use the PCI_PROBE_ONLY thing anymore.  This causes
> /proc/iomem to be far more detailed than it used to be.
> 
> IP30's /proc/iomem:
>     00000000-00003fff : reserved
>     1d200000-1d9fffff : Bridge MEM
>       d080000000-d080003fff : 0001:00:03.0
>       1d204000-1d204fff : 0001:00:02.0
>       1d205000-1d205fff : 0001:00:02.1
>       1d206000-1d206fff : 0001:00:02.2
>       1d207000-1d2070ff : 0001:00:02.3
>       1d207100-1d20717f : 0001:00:01.0
>     1f200000-1f9fffff : Bridge MEM
>       f080100000-f0801fffff : 0000:00:02.0
>       f080010000-f08001ffff : 0000:00:00.0
>       f080030000-f08003ffff : 0000:00:01.0
>       f080000000-f080000fff : 0000:00:00.0
>       f080020000-f080020fff : 0000:00:01.0
>     20004000-209b8fff : reserved
>       20004000-206acb13 : Kernel code
>       206acb14-208affff : Kernel data
>     209b9000-20efffff : System RAM
>     20f00000-20ffffff : System RAM
>     21000000-9fffffff : System RAM
>     f080100000-f0801fffff : ioc3

From the dmesg you attached (ip30-dmesg-20170208.txt, PCI parts
appended below):

  pci_bus 0000:00: root bus resource [mem 0x1f200000-0x1f9fffff]
  pci_bus 0001:00: root bus resource [mem 0x1d200000-0x1d9fffff]

These are shown correctly in /proc/iomem.  It would be nice if the
/proc/iomem string identified which bridge was which (x86 uses "PCI
Bus 0000:00", "PCI Bus 0001:00", etc.) but that's not essential.

  pci 0001:00:03.0: BAR 0: assigned [mem 0x1d200000-0x1d203fff]
  ip30-bridge: 0001:00:03.0 Bar 0 with size 0x00004000 at bus 0x00000000 vma 0x000000d080000000 is Direct 64-bit.

This is funky.  We read 0x1d200000 from the BAR (PCI bus addresses are
apparently identical to CPU physical addresses).  We claimed that
space from the host bridge window, so /proc/iomem would have looked
like this at that point:

  1d200000-1d9fffff : Bridge MEM
    1d200000-1d203fff : 0001:00:03.0

But it looks like whatever this ip30-bridge thing is overwrote the
0001:00:03.0 resource, which makes /proc/iomem wrong.

All the resources under the bridge to 0000:00 are wrong, too:

  1f200000-1f9fffff : Bridge MEM
    f080100000-f0801fffff : 0000:00:02.0
    f080010000-f08001ffff : 0000:00:00.0
    f080030000-f08003ffff : 0000:00:01.0
    f080000000-f080000fff : 0000:00:00.0
    f080020000-f080020fff : 0000:00:01.0

We read values from the BARs:

  pci_bus 0000:00: root bus resource [mem 0x1f200000-0x1f9fffff]
  pci 0000:00:00.0: reg 0x14: [mem 0x00200000-0x00200fff]
  pci 0000:00:00.0: reg 0x30: [mem 0x00210000-0x0021ffff pref]
  pci 0000:00:01.0: reg 0x14: [mem 0x00400000-0x00400fff]
  pci 0000:00:01.0: reg 0x30: [mem 0x00410000-0x0041ffff pref]
  pci 0000:00:02.0: reg 0x10: [mem 0x00500000-0x005fffff]
  pci 0000:00:03.0: reg 0x10: [mem 0x00600000-0x00601fff]

These aren't zero, so somebody apparently has assigned them, but they
aren't inside the host bridge window.  Therefore the PCI core assumes
they will not work, and it reassigns space from the window:

  pci 0000:00:02.0: BAR 0: assigned [mem 0x1f200000-0x1f2fffff]
  pci 0000:00:00.0: BAR 6: assigned [mem 0x1f300000-0x1f30ffff pref]
  pci 0000:00:01.0: BAR 6: assigned [mem 0x1f310000-0x1f31ffff pref]
  pci 0000:00:00.0: BAR 1: assigned [mem 0x1f320000-0x1f320fff]
  pci 0000:00:01.0: BAR 1: assigned [mem 0x1f321000-0x1f321fff]

I don't know why we didn't assign space to 0000:00:03.0 BAR 1.

Anyway, it looks like they got inserted in /proc/iomem, then
overwritten by the ip30-bridge stuff again.

From /proc/iomem, I would expect that only the USB devices (0001:00:02
functions 0, 1, 2, 3) would work.  But apparently your system does
actually work, so there must be corresponding funkiness somewhere else
that compensates for these resources.

I'm attaching the PCI parts of your ip30-dmesg-20170208.txt and the
complete ip30-lspci-20170208.txt below for context.

  pci_bus 0000:00: root bus resource [mem 0x1f200000-0x1f9fffff]
  pci_bus 0000:00: root bus resource [io  0x1fa00000-0x1fbfffff]
  pci_bus 0000:00: root bus resource [bus 00-ff]
  pci 0000:00:00.0: [1077:1020] type 00 class 0x010000
  pci 0000:00:00.0: reg 0x10: [io  0x200000-0x2000ff]
  pci 0000:00:00.0: reg 0x14: [mem 0x00200000-0x00200fff]
  pci 0000:00:00.0: reg 0x30: [mem 0x00210000-0x0021ffff pref]
  pci 0000:00:01.0: [1077:1020] type 00 class 0x010000
  pci 0000:00:01.0: reg 0x10: [io  0x400000-0x4000ff]
  pci 0000:00:01.0: reg 0x14: [mem 0x00400000-0x00400fff]
  pci 0000:00:01.0: reg 0x30: [mem 0x00410000-0x0041ffff pref]
  pci 0000:00:02.0: [10a9:0003] type 00 class 0xff0000
  pci 0000:00:02.0: reg 0x10: [mem 0x00500000-0x005fffff]
  pci 0000:00:03.0: [10a9:0005] type 00 class 0x000000
  pci 0000:00:03.0: reg 0x10: [mem 0x00600000-0x00601fff]
  pci 0000:00:02.0: BAR 0: assigned [mem 0x1f200000-0x1f2fffff]
  pci 0000:00:00.0: BAR 6: assigned [mem 0x1f300000-0x1f30ffff pref]
  pci 0000:00:01.0: BAR 6: assigned [mem 0x1f310000-0x1f31ffff pref]
  pci 0000:00:00.0: BAR 1: assigned [mem 0x1f320000-0x1f320fff]
  pci 0000:00:01.0: BAR 1: assigned [mem 0x1f321000-0x1f321fff]
  pci 0000:00:00.0: BAR 0: assigned [io  0x1fa00000-0x1fa000ff]
  pci 0000:00:01.0: BAR 0: assigned [io  0x1fa00400-0x1fa004ff]

  ip30-bridge: 0000:00:00.0 Bar 0 with size 0x00000100 at bus 0x00000000 vma 0x000000f100000000 is Direct I/O.
  ip30-bridge: 0000:00:00.0 Bar 1 with size 0x00001000 at bus 0x00000000 vma 0x000000f080000000 is Direct 64-bit.
  ip30-bridge: 0000:00:00.0 Bar 6 with size 0x00010000 at bus 0x00010000 vma 0x000000f080010000 is Direct 64-bit.

  ip30-bridge: 0000:00:01.0 Bar 0 with size 0x00000100 at bus 0x00000100 vma 0x000000f100000100 is Direct I/O.
  ip30-bridge: 0000:00:01.0 Bar 1 with size 0x00001000 at bus 0x00020000 vma 0x000000f080020000 is Direct 64-bit.
  ip30-bridge: 0000:00:01.0 Bar 6 with size 0x00010000 at bus 0x00030000 vma 0x000000f080030000 is Direct 64-bit.

  ip30-bridge: 0000:00:02.0 Bar 0 with size 0x00100000 at bus 0x00100000 vma 0x000000f080100000 is Direct 64-bit.

  PCI host bridge to bus 0001:00
  pci_bus 0001:00: root bus resource [mem 0x1d200000-0x1d9fffff]
  pci_bus 0001:00: root bus resource [io  0x1da00000-0x1dbfffff]
  pci_bus 0001:00: root bus resource [bus 01-ff]
  pci 0001:00:01.0: [11fe:080e] type 00 class 0x078000
  pci 0001:00:01.0: reg 0x10: [mem 0x00200000-0x0020007f]
  pci 0001:00:01.0: reg 0x14: [io  0x200000-0x20007f]
  pci 0001:00:01.0: reg 0x18: [io  0x204000-0x2040ff]
  pci 0001:00:02.0: [10b9:5237] type 00 class 0x0c0310
  pci 0001:00:02.0: reg 0x10: [mem 0x00300000-0x00300fff]
  pci 0001:00:02.0: PME# supported from D0 D1 D3hot D3cold
  pci 0001:00:02.1: [10b9:5237] type 00 class 0x0c0310
  pci 0001:00:02.1: reg 0x10: [mem 0x00000000-0x00000fff]
  pci 0001:00:02.1: PME# supported from D0 D1 D3hot D3cold
  pci 0001:00:02.2: [10b9:5237] type 00 class 0x0c0310
  pci 0001:00:02.2: reg 0x10: [mem 0x00000000-0x00000fff]
  pci 0001:00:02.2: PME# supported from D0 D1 D3hot D3cold
  pci 0001:00:02.3: [10b9:5239] type 00 class 0x0c0320
  pci 0001:00:02.3: reg 0x10: [mem 0x00000000-0x000000ff]
  pci 0001:00:02.3: PME# supported from D0 D3hot D3cold
  pci 0001:00:03.0: [10a9:0009] type 00 class 0x020000
  pci 0001:00:03.0: reg 0x10: [mem 0x00400000-0x00403fff]
  pci 0001:00:03.0: BAR 0: assigned [mem 0x1d200000-0x1d203fff]
  pci 0001:00:02.0: BAR 0: assigned [mem 0x1d204000-0x1d204fff]
  pci 0001:00:02.1: BAR 0: assigned [mem 0x1d205000-0x1d205fff]
  pci 0001:00:02.2: BAR 0: assigned [mem 0x1d206000-0x1d206fff]
  pci 0001:00:01.0: BAR 2: assigned [io  0x1da00000-0x1da000ff]
  pci 0001:00:02.3: BAR 0: assigned [mem 0x1d207000-0x1d2070ff]
  pci 0001:00:01.0: BAR 0: assigned [mem 0x1d207100-0x1d20717f]
  pci 0001:00:01.0: BAR 1: assigned [io  0x1da00400-0x1da0047f]

  ip30-bridge: 0001:00:03.0 Bar 0 with size 0x00004000 at bus 0x00000000 vma 0x000000d080000000 is Direct 64-bit.


Here's the ip30-lspci-20170208.txt attachment from your mail:


0000:00:00.0 SCSI storage controller: QLogic Corp. ISP1020 Fast-wide SCSI (rev 05)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 256 bytes
        Interrupt: pin A routed to IRQ 0
        Region 0: I/O ports at f100000000 [size=257]
        Region 1: [virtual] Memory at f080000000 (32-bit, non-prefetchable) [size=4097]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at f080010000 [disabled] [size=65537]
        Kernel driver in use: qla1280
lspci: Unable to load libkmod resources: error -12

0000:00:01.0 SCSI storage controller: QLogic Corp. ISP1020 Fast-wide SCSI (rev 05)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 256 bytes
        Interrupt: pin A routed to IRQ 1
        Region 0: I/O ports at f100000100 [size=257]
        Region 1: Memory at f080020000 (32-bit, non-prefetchable) [size=4097]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at f080030000 [disabled] [size=65537]
        Kernel driver in use: qla1280

0000:00:02.0 Unassigned class [ff00]: Silicon Graphics Intl. Corp. IOC3 I/O controller (rev 01)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64
        Interrupt: pin A routed to IRQ 2
        Region 0: Memory at f080100000 (32-bit, non-prefetchable) [size=1048577]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Kernel driver in use: IOC3

0000:00:03.0 Non-VGA unclassified device: Silicon Graphics Intl. Corp. RAD Audio (rev c0)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=slow >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 255
        Region 0: Memory at 00600000 (32-bit, non-prefetchable) [size=8193]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]

0001:00:01.0 Communication controller: Comtrol Corporation Device 080e (rev 01)
        Subsystem: Comtrol Corporation Device 080e
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 0
        Region 0: Memory at 1d207100 (32-bit, non-prefetchable) [size=129]
        Region 1: I/O ports at 1da00400 [size=129]
        Region 2: I/O ports at 1da00000 [size=257]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]

0001:00:02.0 USB controller: ULi Electronics Inc. USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ULi Electronics Inc. ASRock 939Dual-SATA2 Motherboard
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 16 (20000ns max)
        Interrupt: pin B routed to IRQ 0
        Region 0: Memory at 1d204000 (32-bit, non-prefetchable) [size=4097]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

0001:00:02.1 USB controller: ULi Electronics Inc. USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ULi Electronics Inc. ASRock 939Dual-SATA2 Motherboard
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin C routed to IRQ 0
        Region 0: Memory at 1d205000 (32-bit, non-prefetchable) [disabled] [size=4097]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

0001:00:02.2 USB controller: ULi Electronics Inc. USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ULi Electronics Inc. ASRock 939Dual-SATA2 Motherboard
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin D routed to IRQ 0
        Region 0: Memory at 1d206000 (32-bit, non-prefetchable) [disabled] [size=4097]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

0001:00:02.3 USB controller: ULi Electronics Inc. USB 2.0 Controller (rev 01) (prog-if 20 [EHCI])
        Subsystem: ULi Electronics Inc. ASRock 939Dual-SATA2 Motherboard
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 0
        Region 0: Memory at 1d207000 (32-bit, non-prefetchable) [disabled] [size=257]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [disabled] [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Debug port: BAR=1 offset=0090

0001:00:03.0 Ethernet controller: Silicon Graphics Intl. Corp. AceNIC Gigabit Ethernet (rev 01)
        Subsystem: Silicon Graphics Intl. Corp. AceNIC Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64 (16000ns min), Cache Line Size: 128 bytes
        Interrupt: pin A routed to IRQ 4
        Region 0: [virtual] Memory at d080000000 (32-bit, non-prefetchable) [size=16385]
        Region 1: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 2: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 3: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 4: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Region 5: Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Expansion ROM at <unassigned> [disabled] [size=2]
        Kernel driver in use: acenic


* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
From: Joshua Kinard @ 2017-02-14  7:39 UTC
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On 02/13/2017 17:45, Bjorn Helgaas wrote:
> It looks like IP30 is using some code that's not upstream yet, so I'll
> just point out some things that don't look right.  I'm ignoring the
> IP27 stuff here in case that has different issues.

Sorry about that.  It's an external set of patches I've been maintaining for
the last decade (I'm not the original author).  Being that I do this for a
hobby, sometimes I get waylaid by other priorities.  I've spent some time
re-organizing my patch collection lately so that I can start sending these
things in.  This series and an earlier one to the Linux/MIPS list are the
beginnings of that, as time permits.  However, I've got a ways to go before
getting to the IP30 patches.  E.g., IP27 comes before IP30, so that's where my
focus has been lately.

I've put a 4.10-rc7 tree with all patches applied here (including the IOC3,
IP27, and IP30 patches):
http://dev.gentoo.org/~kumba/mips/ip30/linux-4.10_rc7-20170208.ip30/

The patchset itself, if interested:
http://dev.gentoo.org/~kumba/mips/ip30/mips-patches-4.10.0/

You can find the IP30 code in:
  arch/mips/sgi-ip30/*
  arch/mips/include/asm/mach-ip30/*

IP27's code is in these folders:
  arch/mips/sgi-ip27/*
  arch/mips/include/asm/mach-ip27/*
  arch/mips/include/asm/sn/*

PCI code:
  arch/mips/pci/pci-bridge.c
  arch/mips/pci/ops-bridge.c
  arch/mips/include/asm/pci/bridge.h


> Joshua wrote:
>> On 02/07/2017 13:29, Bjorn Helgaas wrote:
>>> Is there any chance you can collect complete dmesg logs and
>>> /proc/iomem contents from IP27 and IP30?  Maybe "lspci -vv" output,
>>> too?  I'm not sure where to look to understand the ioremap() behavior.
>>
>> ...
>> A note about IP30, since that's part of an external patch set, the
>> dmesg output is a little different, as it is using the newer
>> BRIDGE/Xtalk code I just sent in as patches.  ...
>>
>> Also, IP30 doesn't use the PCI_PROBE_ONLY thing anymore.  This causes
>> /proc/iomem to be far more detailed than it used to be.
>>
>> IP30's /proc/iomem:
>>     00000000-00003fff : reserved
>>     1d200000-1d9fffff : Bridge MEM
>>       d080000000-d080003fff : 0001:00:03.0
>>       1d204000-1d204fff : 0001:00:02.0
>>       1d205000-1d205fff : 0001:00:02.1
>>       1d206000-1d206fff : 0001:00:02.2
>>       1d207000-1d2070ff : 0001:00:02.3
>>       1d207100-1d20717f : 0001:00:01.0
>>     1f200000-1f9fffff : Bridge MEM
>>       f080100000-f0801fffff : 0000:00:02.0
>>       f080010000-f08001ffff : 0000:00:00.0
>>       f080030000-f08003ffff : 0000:00:01.0
>>       f080000000-f080000fff : 0000:00:00.0
>>       f080020000-f080020fff : 0000:00:01.0
>>     20004000-209b8fff : reserved
>>       20004000-206acb13 : Kernel code
>>       206acb14-208affff : Kernel data
>>     209b9000-20efffff : System RAM
>>     20f00000-20ffffff : System RAM
>>     21000000-9fffffff : System RAM
>>     f080100000-f0801fffff : ioc3
> 
> From the dmesg you attached (ip30-dmesg-20170208.txt, PCI parts
> appended below):
> 
>   pci_bus 0000:00: root bus resource [mem 0x1f200000-0x1f9fffff]
>   pci_bus 0001:00: root bus resource [mem 0x1d200000-0x1d9fffff]
> 
> These are shown correctly in /proc/iomem.  It would be nice if the
> /proc/iomem string identified which bridge was which (x86 uses "PCI
> Bus 0000:00", "PCI Bus 0001:00", etc.) but that's not essential.
> 
>   pci 0001:00:03.0: BAR 0: assigned [mem 0x1d200000-0x1d203fff]
>   ip30-bridge: 0001:00:03.0 Bar 0 with size 0x00004000 at bus 0x00000000 vma 0x000000d080000000 is Direct 64-bit.
> 
> This is funky.  We read 0x1d200000 from the BAR (PCI bus addresses are
> apparently identical to CPU physical addresses).  We claimed that
> space from the host bridge window, so /proc/iomem would have looked
> like this at that point:
> 
>   1d200000-1d9fffff : Bridge MEM
>     1d200000-1d203fff : 0001:00:03.0
> 
> But it looks like whatever this ip30-bridge thing is overwrote the
> 0001:00:03.0 resource, which makes /proc/iomem wrong.
> 
> All the resources under the bridge to 0000:00 are wrong, too:
> 
>   1f200000-1f9fffff : Bridge MEM
>     f080100000-f0801fffff : 0000:00:02.0
>     f080010000-f08001ffff : 0000:00:00.0
>     f080030000-f08003ffff : 0000:00:01.0
>     f080000000-f080000fff : 0000:00:00.0
>     f080020000-f080020fff : 0000:00:01.0

Actually, that's not wrong.  It took me a long time to figure this out, too,
but once I stumbled upon the BRIDGE ASIC specification on an anonymous FTP
site, via Google, by complete accident (hint), it finally made sense.

BRIDGE is what SGI calls the chip that interfaces the Crosstalk (Xtalk) bus to
a PCI bus.  Most known "XIO" (Xtalk I/O) boards contain a BRIDGE chip under a
heatsink at the edge of the board, next to a "compression connector".  Octane
can hold up to four XIO boards, with a fifth slot dedicated to an optional "PCI
Shoebox" (literally a metal box that can hold up to three PCI or PCI-X cards).
One exception to this are the two graphics boards, Impact/MGRAS and
Odyssey/VPro, which are "pure" XIO devices, as far as anyone's figured out.

Each BRIDGE chip can address up to eight different PCI devices.

Crosstalk on IP30 is managed by the Crossbow (XBOW) ASIC.  It implements a
crossbar switch that's not totally unlike a networking switch.  A single XBOW
supports up to 16 XIO widgets, numbered 0x0 to 0xf.  Widgets 0x1 to 0x7 are for
internal use or unused.  XBOW itself is widget 0x0.  HEART, Octane's system
controller, is widget 0x8.  The four standard XIO slots are numbered (I think)
as 0x9 to 0xc.  I think widget 0xe is unused.

The system board has a BRIDGE chip on it to provide the "BaseIO" devices, such
as two SCSI (QL1040B), IOC3 Ethernet, KB/Mouse, Serial/Parallel, Audio, and an
RTC.  This BaseIO BRIDGE is widget 0xf, which is what you're seeing with the
first set of addresses starting with 0x1fxxx.  The second BRIDGE that you're
seeing is the PCI Shoebox I have installed.  It's widget 0xd, and has addresses
starting with 0x1dxxx (second nibble from left is the widget), with three
random PCI cards I found in a storage bin plugged into it.  The Xtalk scan
happens in reverse order, starting at widget 0xf down to 0x0.

Both 0x1fxxx and 0x1dxxx addresses are "small window" addresses.  The HEART
system controller gives you three "windows" into Xtalk space, which uses 48-bit
addressing:

0x0000_1000_0000 - 0x0000_1fff_ffff - Small windows, x16 @ 16MB each
0x0008_0000_0000 - 0x000f_ffff_ffff - Medium windows, x16 @ 2GB each
0x0010_0000_0000 - 0x00ff_ffff_ffff - Big windows, x15 @ 64GB each

So a device on widget 0xf mapped into a small window at 0x0000_1fxx_xxxx could
also be referred to by a big window at address 0x00f0_8xxx_xxxx.  The only
difference is the big window gives you a larger address space to play around
with.  Medium windows are not currently used on the IP30 platform in Linux.
Widget 0x0 (XBOW) is only accessible via small or medium windows.
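
In code terms, the window bases work out to roughly this (hypothetical macros
that just restate the table above, not the actual IP30 headers):

/* 16 small windows of 16MB each, starting at 0x0000_1000_0000 */
#define XTALK_SWIN_BASE(widget) (0x0000000010000000ULL + ((u64)(widget) << 24))

/* 15 big windows of 64GB each; widget N starts at N * 64GB */
#define XTALK_BWIN_BASE(widget) ((u64)(widget) << 36)

So widget 0xf sits at small window 0x1f00_0000 and big window 0xf0_0000_0000,
and widget 0xd at 0x1d00_0000 and 0xd0_0000_0000.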

As such, this layout in /proc/iomem is oddly correct:

>   1f200000-1f9fffff : Bridge MEM
>     f080100000-f0801fffff : 0000:00:02.0
>     f080010000-f08001ffff : 0000:00:00.0
>     f080030000-f08003ffff : 0000:00:01.0
>     f080000000-f080000fff : 0000:00:00.0
>     f080020000-f080020fff : 0000:00:01.0

My bug is probably that I do the initial BRIDGE scan with small windows in the
main BRIDGE driver (arch/mips/pci/pci-bridge.c), and then IP30's BRIDGE glue
driver (arch/mips/sgi-ip30/ip30-bridge.c) is switching to big windows to probe
each device.  So, yes, different address ranges, but still looking at the same
device.  I'll look into fixing that.  It's partially tied to how I was kludging
the IP27 to also start doing PCI correctly within its own BRIDGE glue driver
(ip27-bridge.c), but I just learned over the weekend that there's missing
functionality I have to port over from old 2.5.x-era IA64 code.  In IP30, big
windows are available by default, but on IP27, what it calls "big windows" need
to go through some kind of translation table, and I'm trying to figure out how
to set that up.

There's also an issue with DMA on BRIDGE on IP30 that limits the system's
memory to 2GB to avoid kernel panics.  OpenBSD's figured it out -- BRIDGE's
IOMMU is busted so it can only do 31-bit DMA max.  I just haven't figured out
how to set that limit in Linux for the entire BRIDGE, not just for individual
devices.
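
For individual devices, at least, a minimal sketch would be something like this
(hypothetical; it leans on the MIPS pcibios_plat_dev_init() hook and is only
the per-device stopgap, not the BRIDGE-wide limit):

int pcibios_plat_dev_init(struct pci_dev *dev)
{
        /*
         * BRIDGE's IOMMU can only handle 31-bit DMA, so clamp every device
         * behind it.  Drivers that later raise their own mask would still
         * need extra handling, which is the part I haven't solved.
         */
        if (dma_set_mask_and_coherent(&dev->dev, DMA_BIT_MASK(31)))
                dev_warn(&dev->dev, "unable to set 31-bit DMA mask\n");

        return 0;
}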


> We read values from the BARs:
> 
>   pci_bus 0000:00: root bus resource [mem 0x1f200000-0x1f9fffff]
>   pci 0000:00:00.0: reg 0x14: [mem 0x00200000-0x00200fff]
>   pci 0000:00:00.0: reg 0x30: [mem 0x00210000-0x0021ffff pref]
>   pci 0000:00:01.0: reg 0x14: [mem 0x00400000-0x00400fff]
>   pci 0000:00:01.0: reg 0x30: [mem 0x00410000-0x0041ffff pref]
>   pci 0000:00:02.0: reg 0x10: [mem 0x00500000-0x005fffff]
>   pci 0000:00:03.0: reg 0x10: [mem 0x00600000-0x00601fff]
> 
> These aren't zero, so somebody apparently has assigned them, but they
> aren't inside the host bridge window.  Therefore the PCI core assumes
> they will not work, and it reassigns space from the window:
> 
>   pci 0000:00:02.0: BAR 0: assigned [mem 0x1f200000-0x1f2fffff]
>   pci 0000:00:00.0: BAR 6: assigned [mem 0x1f300000-0x1f30ffff pref]
>   pci 0000:00:01.0: BAR 6: assigned [mem 0x1f310000-0x1f31ffff pref]
>   pci 0000:00:00.0: BAR 1: assigned [mem 0x1f320000-0x1f320fff]
>   pci 0000:00:01.0: BAR 1: assigned [mem 0x1f321000-0x1f321fff]

Hmm, from the PCI device's point of view, 0x00200000-0x00200fff would be
correct.  The same address beginning with 0x1fxxx is the Crosstalk view of the
device.  Odds are I am not masking something off somewhere, but somehow, it
all still finds a way to work.


> I don't know why we didn't assign space to 0000:00:03.0 BAR 1.

The PCI device sitting at 0000:00:03.0 is the "RAD Audio" chip.  There's an
issue with the driver that I keep forgetting to try and fix, so I've left it
out of my recent kernel builds.  That's probably why no space is being assigned
to it.


> From /proc/iomem, I would expect that only the USB devices (0001:00:02
> functions 0, 1, 2, 3) would work.  But apparently your system does
> actually work, so there must be corresponding funkiness somewhere else
> that compensates for these resources.

USB is...interesting on this platform.  Some USB devices will work, but not
many have been tested.  If memory serves, I think UHCI was known to play
nice, but OHCI had issues -- might be the other way around.  EHCI is dependent
on what you plug in and the position of several planets in the sky.  xHCI has
endianness issues with one card that I tried.  It's hard to find PCI or PCI-X
xHCI cards, though.  Most everything now is PCIe.  And due to chassis space
restrictions, using PCIe-to-PCI-X adapters is a no-go.

PCI-to-PCI bridges and USB hubs are totally busted, because BRIDGE's PCI Type 1
configuration space isn't very well understood.  I think the BRIDGE
Specification finally cleared that up, but I have not yet tried to learn how
to actually set that space up, or found a spare USB hub to play with.

All-in-all, the entire machine is as much a work of art in its inner workings
as how it looks on the outside.  As with all art, it's up to the eye of the
beholder on whether that's a good thing or not </smirk>

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
From: Bjorn Helgaas @ 2017-02-14 14:56 UTC
  To: Joshua Kinard
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On Tue, Feb 14, 2017 at 02:39:45AM -0500, Joshua Kinard wrote:
> On 02/13/2017 17:45, Bjorn Helgaas wrote:
> > It looks like IP30 is using some code that's not upstream yet, so I'll
> > just point out some things that don't look right.  I'm ignoring the
> > IP27 stuff here in case that has different issues.
> 
> Sorry about that.  It's an external set of patches I've been maintaining for
> the last decade (I'm not the original author).  Being that I do this for a
> hobby, sometimes I get waylaid by other priorities.  I've spent some time
> re-organizing my patch collection lately so that I can start sending these
> things in.  This series and an earlier one to the Linux/MIPS list are the
> beginnings of that, as time permits.  However, I've got a ways to go before
> getting to the IP30 patches.  E.g., IP27 comes before IP30, so that's where my
> focus has been lately.
> 
> I've put a 4.10-rc7 tree with all patches applied here (including the IOC3,
> IP27, and IP30 patches):
> http://dev.gentoo.org/~kumba/mips/ip30/linux-4.10_rc7-20170208.ip30/
> 
> The patchset itself, if interested:
> http://dev.gentoo.org/~kumba/mips/ip30/mips-patches-4.10.0/

Thanks for the pointers.

> > All the resources under the bridge to 0000:00 are wrong, too:
> > 
> >   1f200000-1f9fffff : Bridge MEM
> >     f080100000-f0801fffff : 0000:00:02.0
> >     f080010000-f08001ffff : 0000:00:00.0
> >     f080030000-f08003ffff : 0000:00:01.0
> >     f080000000-f080000fff : 0000:00:00.0
> >     f080020000-f080020fff : 0000:00:01.0
> 
> Actually, that's not wrong.  It took me a long time to figure this out, too,
> but once I stumbled upon the BRIDGE ASIC specification on an anonymous FTP
> site, via Google, by complete accident (hint), it finally made sense.
> 
> BRIDGE is what SGI calls the chip that interfaces the Crosstalk (Xtalk) bus to
> a PCI bus.  Most known "XIO" (Xtalk I/O) boards contain a BRIDGE chip under a
> heatsink at the edge of the board, next to a "compression connector".  Octane
> can hold up to four XIO boards, with a fifth slot dedicated to an optional "PCI
> Shoebox" (literally a metal box that can hold up to three PCI or PCI-X cards).
> One exception to this are the two graphics boards, Impact/MGRAS and
> Odyssey/VPro, which are "pure" XIO devices, as far as anyone's figured out.
> 
> Each BRIDGE chip can address up to eight different PCI devices.
> 
> Crosstalk on IP30 is managed by the Crossbow (XBOW) ASIC.  It implements a
> crossbar switch that's not totally unlike a networking switch.  A single XBOW
> supports up to 16 XIO widgets, numbered 0x0 to 0xf.  Widgets 0x1 to 0x7 are for
> internal use or unused.  XBOW itself is widget 0x0.  HEART, Octane's system
> controller, is widget 0x8.  The four standard XIO slots are numbered (I think)
> as 0x9 to 0xc.  I think widget 0xe is unused.
> 
> The system board has a BRIDGE chip on it to provide the "BaseIO" devices, such
> as two SCSI (QL1040B), IOC3 Ethernet, KB/Mouse, Serial/Parallel, Audio, and an
> RTC.  This BaseIO BRIDGE is widget 0xf, which is what you're seeing with the
> first set of addresses starting with 0x1fxxx.  The second BRIDGE that you're
> seeing is the PCI Shoebox I have installed.  It's widget 0xd, and has addresses
> starting with 0x1dxxx (second nibble from left is the widget), with three
> random PCI cards I found in a storage bin plugged into it.  The Xtalk scan
> happens in reverse order, starting at widget 0xf down to 0x0.
> 
> Both 0x1fxxx and 0x1dxxx addresses are "small window" addresses.  The HEART
> system controller gives you three "windows" into Xtalk space, which uses 48-bit
> addressing:
> 
> 0x0000_1000_0000 - 0x0000_1fff_ffff - Small windows, x16 @ 16MB each
> 0x0008_0000_0000 - 0x000f_ffff_ffff - Medium windows, x16 @ 2GB each
> 0x0010_0000_0000 - 0x00ff_ffff_ffff - Big windows, x15 @ 64GB each
> 
> So a device on widget 0xf mapped into a small window at 0x0000_1fxx_xxxx could
> also be referred to by a big window at address 0x00f0_8xxx_xxxx.  The only
> difference is the big window gives you a larger address space to play around
> with.  Medium windows are not currently used on the IP30 platform in Linux.
> Widget 0x0 (XBOW) is only accessible via small or medium windows.
> 
> As such, this layout in /proc/iomem is oddly correct:
> 
> >   1f200000-1f9fffff : Bridge MEM
> >     f080100000-f0801fffff : 0000:00:02.0
> >     f080010000-f08001ffff : 0000:00:00.0
> >     f080030000-f08003ffff : 0000:00:01.0
> >     f080000000-f080000fff : 0000:00:00.0
> >     f080020000-f080020fff : 0000:00:01.0
> 
> My bug is probably that I do the initial BRIDGE scan with small windows in the
> main BRIDGE driver (arch/mips/pci/pci-bridge.c), and then IP30's BRIDGE glue
> driver (arch/mips/sgi-ip30/ip30-bridge.c) is switching to big windows to probe
> each device.  So, yes, different address ranges, but still looking at the same
> device.  I'll look into fixing that.  It's partially tied to how I was kludging
> the IP27 to also start doing PCI correctly within its own BRIDGE glue driver
> (ip27-bridge.c), but I just learned over the weekend that there's missing
> functionality I have to port over from old 2.5.x-era IA64 code.  In IP30, big
> windows are available by default, but on IP27, what it calls "big windows" need
> to go through some kind of translation table, and I'm trying to figure out how
> to set that up.

I agree, it sounds like the problem is the switch from small windows
to big ones.  In order for the PCI core to work correctly, the host
bridge window ("1f200000-1f9fffff : Bridge MEM") must enclose the BARs
of the devices below the bridge.

If you use the big windows for the device BARs, you should use big
windows for the host bridge windows.  I think we're talking about
widget 0xf here, so the 0xf big window would be 0xf0_0000_0000 -
0xff_ffff_ffff.  This should all be set up *before* calling
pci_scan_root_bus() instead of after, as it seems to be today.

The scan would look like this:

  pci_bus 0000:00: root bus resource [mem 0xf000000000-0xffffffffff] (bus addresses [0x00000000-0xfffffffff])
  pci 0000:00:00.0: reg 0x14: [mem 0xf000200000-0xf000200fff]
  pci 0000:00:00.0: reg 0x30: [mem 0xf000210000-0xf00021ffff pref]
  pci 0000:00:01.0: reg 0x14: [mem 0xf000400000-0xf000400fff]
  pci 0000:00:01.0: reg 0x30: [mem 0xf000410000-0xf00041ffff pref]
  pci 0000:00:02.0: reg 0x10: [mem 0xf000500000-0xf0005fffff]
  pci 0000:00:03.0: reg 0x10: [mem 0xf000600000-0xf000601fff]

This would make /proc/iomem look like this:

  f000000000-ffffffffff : Bridge MEM
    f000500000-f0005fffff : 0000:00:02.0
    f000210000-f00021ffff : 0000:00:00.0
    f000410000-f00041ffff : 0000:00:01.0
    f000200000-f000200fff : 0000:00:00.0
    f000400000-f000400fff : 0000:00:01.0

This doesn't match the device BARs in your /proc/iomem, so there must
be some other transformation going on as well.

As long as you tell the PCI core about the host bridge windows you're
going to use, along with offsets that include *all* these
transformations, the core should just work, and /proc/iomem should
also make sense.  The details of small/medium/big windows, widgets,
etc., are immaterial to the core.
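
In code terms, a minimal sketch of that setup (hypothetical, using the widget
0xf big window and the usual MIPS struct pci_controller fields) would be:

static struct resource bridge_mem = {
        .name   = "Bridge MEM",
        .start  = 0xf000000000ULL,      /* CPU view: widget 0xf big window */
        .end    = 0xffffffffffULL,
        .flags  = IORESOURCE_MEM,
};

static void bridge_register_windows(struct pci_controller *ctrl)
{
        ctrl->mem_resource = &bridge_mem;
        ctrl->mem_offset   = 0xf000000000UL;    /* CPU address - PCI bus address */

        /*
         * pcibios_scanbus() then does, in effect,
         *   pci_add_resource_offset(&resources, ctrl->mem_resource,
         *                           ctrl->mem_offset);
         * before calling pci_scan_root_bus(), so the core can translate
         * between the CPU and bus views.
         */
        register_pci_controller(ctrl);
}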

Bjorn


* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
From: Joshua Kinard @ 2017-02-24  8:50 UTC
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On 02/14/2017 09:56, Bjorn Helgaas wrote:
> On Tue, Feb 14, 2017 at 02:39:45AM -0500, Joshua Kinard wrote:
>> On 02/13/2017 17:45, Bjorn Helgaas wrote:
>>> It looks like IP30 is using some code that's not upstream yet, so I'll
>>> just point out some things that don't look right.  I'm ignoring the
>>> IP27 stuff here in case that has different issues.
>>
>> Sorry about that.  It's an external set of patches I've been maintaining for
>> the last decade (I'm not the original author).  Being that I do this for a
>> hobby, sometimes I get waylaid by other priorities.  I've spent some time
>> re-organizing my patch collection lately so that I can start sending these
>> things in.  This series and an earlier one to the Linux/MIPS list are the
>> beginnings of that, as time permits.  However, I've got a ways to go before
>> getting to the IP30 patches.  E.g., IP27 comes before IP30, so that's where my
>> focus has been lately.
>>
>> I've put a 4.10-rc7 tree with all patches applied here (including the IOC3,
>> IP27, and IP30 patches):
>> http://dev.gentoo.org/~kumba/mips/ip30/linux-4.10_rc7-20170208.ip30/
>>
>> The patchset itself, if interested:
>> http://dev.gentoo.org/~kumba/mips/ip30/mips-patches-4.10.0/
> 
> Thanks for the pointers.
> 
>>> All the resources under the bridge to 0000:00 are wrong, too:
>>>
>>>   1f200000-1f9fffff : Bridge MEM
>>>     f080100000-f0801fffff : 0000:00:02.0
>>>     f080010000-f08001ffff : 0000:00:00.0
>>>     f080030000-f08003ffff : 0000:00:01.0
>>>     f080000000-f080000fff : 0000:00:00.0
>>>     f080020000-f080020fff : 0000:00:01.0
>>
>> Actually, that's not wrong.  It took me a long time to figure this out, too,
>> but once I stumbled upon the BRIDGE ASIC specification on an anonymous FTP
>> site, via Google, by complete accident (hint), it finally made sense.
>>
>> BRIDGE is what SGI calls the chip that interfaces the Crosstalk (Xtalk) bus to
>> a PCI bus.  Most known "XIO" (Xtalk I/O) boards contain a BRIDGE chip under a
>> heatsink at the edge of the board, next to a "compression connector".  Octane
>> can hold up to four XIO boards, with a fifth slot dedicated to an optional "PCI
>> Shoebox" (literally a metal box that can hold up to three PCI or PCI-X cards).
>> One exception to this are the two graphics boards, Impact/MGRAS and
>> Odyssey/VPro, which are "pure" XIO devices, as far as anyone's figured out.
>>
>> Each BRIDGE chip can address up to eight different PCI devices.
>>
>> Crosstalk on IP30 is managed by the Crossbow (XBOW) ASIC.  It implements a
>> crossbar switch that's not totally unlike a networking switch.  A single XBOW
>> supports up to 16 XIO widgets, numbered 0x0 to 0xf.  Widgets 0x1 to 0x7 are for
>> internal use or unused.  XBOW itself is widget 0x0.  HEART, Octane's system
>> controller, is widget 0x8.  The four standard XIO slots are numbered (I think)
>> as 0x9 to 0xc.  I think widget 0xe is unused.
>>
>> The system board has a BRIDGE chip on it to provide the "BaseIO" devices, such
>> as two SCSI (QL1040B), IOC3 Ethernet, KB/Mouse, Serial/Parallel, Audio, and an
>> RTC.  This BaseIO BRIDGE is widget 0xf, which is what you're seeing with the
>> first set of addresses starting with 0x1fxxx.  The second BRIDGE that you're
>> seeing is the PCI Shoebox I have installed.  It's widget 0xd, and has addresses
>> starting with 0x1dxxx (second nibble from left is the widget), with three
>> random PCI cards I found in a storage bin plugged into it.  The Xtalk scan
>> happens in reverse order, starting at widget 0xf down to 0x0.
>>
>> Both 0x1fxxx and 0x1dxxx addresses are "small window" addresses.  The HEART
>> system controller gives you three "windows" into Xtalk space, which uses 48-bit
>> addressing:
>>
>> 0x0000_1000_0000 - 0x0000_1fff_ffff - Small windows, x16 @ 16MB each
>> 0x0008_0000_0000 - 0x000f_ffff_ffff - Medium windows, x16 @ 2GB each
>> 0x0010_0000_0000 - 0x00ff_ffff_ffff - Big windows, x15 @ 64GB each
>>
>> So a device on widget 0xf mapped into a small window at 0x0000_1fxx_xxxx could
>> also be referred to by a big window at address 0x00f0_8xxx_xxxx.  The only
>> difference is the big window gives you a larger address space to play around
>> with.  Medium windows are not currently used on the IP30 platform in Linux.
>> Widget 0x0 (XBOW) is only accessible via small or medium windows.
>>
>> As such, this layout in /proc/iomem is oddly correct:
>>
>>>   1f200000-1f9fffff : Bridge MEM
>>>     f080100000-f0801fffff : 0000:00:02.0
>>>     f080010000-f08001ffff : 0000:00:00.0
>>>     f080030000-f08003ffff : 0000:00:01.0
>>>     f080000000-f080000fff : 0000:00:00.0
>>>     f080020000-f080020fff : 0000:00:01.0
>>
>> My bug is probably that I do the initial BRIDGE scan with small windows in the
>> main BRIDGE driver (arch/mips/pci/pci-bridge.c), and then IP30's BRIDGE glue
>> driver (arch/mips/sgi-ip30/ip30-bridge.c) is switching to big windows to probe
>> each device.  So, yes, different address ranges, but still looking at the same
>> device.  I'll look into fixing that.  It's partially tied to how I was kludging
>> the IP27 to also start doing PCI correctly within its own BRIDGE glue driver
>> (ip27-bridge.c), but I just learned over the weekend that there's missing
>> functionality I have to port over from old 2.5.x-era IA64 code.  In IP30, big
>> windows are available by default, but on IP27, what it calls "big windows" need
>> to go through some kind of translation table, and I'm trying to figure out how
>> to set that up.
> 
> I agree, it sounds like the problem is the switch from small windows
> to big ones.  In order for the PCI core to work correctly, the host
> bridge window ("1f200000-1f9fffff : Bridge MEM") must enclose the BARs
> of the devices below the bridge.
> 
> If you use the big windows for the device BARs, you should use big
> windows for the host bridge windows.  I think we're talking about
> widget 0xf here, so the 0xf big window would be 0xf0_0000_0000 -
> 0xff_ffff_ffff.  This should all be set up *before* calling
> pci_scan_root_bus() instead of after, as it seems to be today.

Okay, so after a week of downtime due to hardware failure on another machine, I
started looking at this again, and it looks like it's more than just getting
the windows wrong.  The existing BRIDGE code (pci-bridge.c) is limiting the PCI
window to the first five devices on BaseIO in the small window (my fault), and
the IP30 glue code (ip30-bridge.c) is re-writing the PCI BARs to access the
BRIDGE PCI Memory and PCI I/O spaces via big windows, without updating the
original mapping.  We're not teaching the Linux PCI core at all, I think, about
how to generically access PCI Memory Space and PCI I/O Space on the BRIDGE.
The fact this actually works appears to be pure luck.

Quoting from the BRIDGE docs somewhat, from the point of view of generic
Crosstalk space and not specific to IP30 or IP27, there are several views into
PCI space.  Each address is 48 bits, the size of a Crosstalk address.

    0000_0000_0000 to 0000_00ff_ffff is "Widget space".  It seems the BRIDGE
    docs use the term "widget" here to refer to the eight possible PCI devices
    addressable by a single BRIDGE ASIC.  This plus the small windows is how
    the code is currently getting things to work.  I think.

    0000_4000_0000 to 0000_7fff_ffff is BRIDGE's view into PCI Memory Space
    for direct-mapped 32-bit devices.  It is 1GB in size.  My take is this is
    normally a 4GB space?  Can the Linux PCI core be taught that only 1GB
    is usable?

    0000_8000_0000 to 0000_bfff_ffff is an alias view by BRIDGE into PCI
    Memory space for 64-bit PCI devices, immediately after the 32-bit space.
    It is also 1GB in size.

    0000_c000_0000 to 0000_ffff_ffff is 2GB adjacent to PCI Memory space, but
    is either unused or used for Crosstalk/BRIDGE DMA operations
    (the docs aren't very clear on this point or I am simply not
    understanding).

    0001_0000_0000 to 0001_ffff_ffff is BRIDGE's view into PCI I/O Space.
    It has the entire 4GB range available to use.
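
If I compose those offsets with an IP30 big window (the widget number sits in
bits 39:36, so widget 0xf's big window starts at 0x00f0_0000_0000), then I
think the interesting CPU-side bases come out to this -- my math, unverified:

    /* widget-relative BRIDGE view offsets, per the list above */
    #define BRIDGE_PCI_MEM32_OFF    0x0040000000UL  /* 1GB, 32-bit direct */
    #define BRIDGE_PCI_MEM64_OFF    0x0080000000UL  /* 1GB 64-bit alias */
    #define BRIDGE_PCI_IO_OFF       0x0100000000UL  /* 4GB PCI I/O */

    /*
     * e.g. BaseIO (widget 0xf) through its big window on IP30:
     *   PCI MEM32 at 0x00f0_4000_0000, PCI I/O at 0x00f1_0000_0000
     */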


I suspect the existing BRIDGE driver is mapping through "widget space" first to
probe the PCI device slots, then the ip30-bridge.c code is re-writing the PCI
BARs to go through the big window on IP30, but it never updates the original
BRIDGE setup (if that's even possible).  As such, as you pointed out, there's
some kind of logic in the generic PCI core that's re-assigning space from the
window and we're just getting lucky and everything still works.

I dug up some old debugging code given to me by the original author of the IP30
port and built up a set of macros that get me to the PCI Configuration Space on
Widget 0xf (BaseIO Bridge), Device #0 (the first QLA1040B SCSI controller).
All three addresses access the same PCI device and return the same config space
info.

    Small window:  0x900000001f02xxxx
    Medium window: 0x9000000f8002xxxx
    Big window:    0x900000f00002xxxx


Each additional device's configuration space is offset by 0x1000, which gives
me the following addresses (devs 4 to 7 are special-cased for IRQ trickery, so
I can't probe them):

    - Dev #0: scsi0/qla1040b: 0x90xxxxxxxxx20000
    - Dev #1: scsi1/qla1040b: 0x90xxxxxxxxx21000
    - Dev #2: io/ioc3       : 0x90xxxxxxxxx22000
    - Dev #3: audio/rad1    : 0x90xxxxxxxxx23000
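
In macro form, those break down to roughly this (my own names; IO_BASE here is
the uncached XKPHYS base, 0x9000000000000000):

    /* config space of a BaseIO (widget 0xf) device 'dev', register 'off';
     * shorthand reconstructed from the dumps above, not from any header */
    #define BASEIO_CFG_SWIN(dev, off) \
            (IO_BASE + 0x00001f020000UL + ((dev) << 12) + (off))
    #define BASEIO_CFG_MWIN(dev, off) \
            (IO_BASE + 0x000f80020000UL + ((dev) << 12) + (off))
    #define BASEIO_CFG_BWIN(dev, off) \
            (IO_BASE + 0x00f000020000UL + ((dev) << 12) + (off))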


If I probe and dump the config space data for each, I get the following:

    PCI slot 0 information:
        vendor ID :              0x1077
        device ID :              0x1020
        command :                0x0006
        status :                 0x0200
        revision :               0x05
        prog if :                0x00
        class :                  0x0100
        cache line :             0x40
        latency :                0x40
        hdr type :               0x00
        BIST :                   0x00
        region 0 :               0x00200001
        region 1 :               0x00200000
        region 2 :               0x00000000
        region 3 :               0x00000000
        region 4 :               0x00000000
        region 5 :               0x00000000
        IRQ line :               0x00
        IRQ pin :                0x01

    PCI slot 1 information:
        vendor ID :              0x1077
        device ID :              0x1020
        command :                0x0006
        status :                 0x0200
        revision :               0x05
        prog if :                0x00
        class :                  0x0100
        cache line :             0x40
        latency :                0x40
        hdr type :               0x00
        BIST :                   0x00
        region 0 :               0x00400001
        region 1 :               0x00400000
        region 2 :               0x00000000
        region 3 :               0x00000000
        region 4 :               0x00000000
        region 5 :               0x00000000
        IRQ line :               0x00
        IRQ pin :                0x01

    PCI slot 2 information:
        vendor ID :              0x10a9
        device ID :              0x0003
        command :                0x0146
        status :                 0x0280
        revision :               0x01
        prog if :                0x00
        class :                  0xff00
        cache line :             0x00
        latency :                0x28
        hdr type :               0x00
        BIST :                   0x00
        region 0 :               0x00500000
        region 1 :               0x00000000
        region 2 :               0x00000000
        region 3 :               0x00500000
        region 4 :               0x000310a9
        region 5 :               0x02800146
        IRQ line :               0x00
        IRQ pin :                0x00

    PCI slot 3 information:
        vendor ID :              0x10a9
        device ID :              0x0005
        command :                0x0006
        status :                 0x0480
        revision :               0xc0
        prog if :                0x00
        class :                  0x0000
        cache line :             0x00
        latency :                0xff
        hdr type :               0x00
        BIST :                   0x00
        region 0 :               0x00600000
        region 1 :               0x00000000
        region 2 :               0x00000000
        region 3 :               0x00000000
        region 4 :               0x00000000
        region 5 :               0x00000000
        IRQ line :               0x00
        IRQ pin :                0x00

This is where my knowledge runs short -- I'm not fluent in PCI device
programming, so I'm not sure what is really supposed to be going on with these
different BAR regions.  I dug up a copy of the PCI Spec 2.2, but it's 322 pages
and I'm not sure where I should start reading from.  Are these regions
device-specific?  E.g., do I need the QLA1040B programming manual to understand
why region 0 is 0x00200001 and region 1 is 0x00200000?

I am also not yet considering how PCI views Crosstalk space for DMA operations
-- BRIDGE has several mechanisms for that, depending on whether the PCI device
is a 32-bit or a 64-bit device.  It can do direct mapping for 32-bit or 64-bit,
or use its built-in page-mapping hardware (which is apparently to be avoided
due to numerous hardware quirks/bugs that would make the driver
overly complicated).

There's a good write-up in the OpenBSD "xbridge" driver, which handles BRIDGE
(IP27/IP30), XBRIDGE (IP35), and PIC (IP35/IA64 Altix):

http://bxr.su/OpenBSD/sys/arch/sgi/xbow/xbridge.c

It's starting to make some sense to me, but I am still uncertain how to work
with the 32-bit and 64-bit 1GB PCI memory spaces on BRIDGE as well as the 4GB
PCI I/O space from a Linux point of view.  Do they only matter when dealing
with DMA?

--J



> The scan would look like this:
> 
>   pci_bus 0000:00: root bus resource [mem 0xf000000000-0xffffffffff] (bus addresses [0x00000000-0xfffffffff])
>   pci 0000:00:00.0: reg 0x14: [mem 0xf000200000-0xf000200fff]
>   pci 0000:00:00.0: reg 0x30: [mem 0xf000210000-0xf00021ffff pref]
>   pci 0000:00:01.0: reg 0x14: [mem 0xf000400000-0xf000400fff]
>   pci 0000:00:01.0: reg 0x30: [mem 0xf000410000-0xf00041ffff pref]
>   pci 0000:00:02.0: reg 0x10: [mem 0xf000500000-0xf0005fffff]
>   pci 0000:00:03.0: reg 0x10: [mem 0xf000600000-0xf000601fff]
> 
> This would make /proc/iomem look like this:
> 
>   f000000000-ffffffffff : Bridge MEM
>     f000500000-f0005fffff : 0000:00:02.0
>     f000210000-f00021ffff : 0000:00:00.0
>     f000410000-f00041ffff : 0000:00:01.0
>     f000200000-f000200fff : 0000:00:00.0
>     f000400000-f000400fff : 0000:00:01.0
> 
> This doesn't match the device BARs in your /proc/iomem, so there must
> be some other transformation going on as well.
> 
> As long as you tell the PCI core about the host bridge windows you're
> going to use, along with offsets that include *all* these
> transformations, the core should just work, and /proc/iomem should
> also make sense.  The details of small/medium/big windows, widgets,
> etc., are immaterial to the core.
> 
> Bjorn
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
  2017-02-24  8:50                 ` Joshua Kinard
@ 2017-02-24 18:38                   ` Bjorn Helgaas
  2017-02-25  9:34                     ` Joshua Kinard
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2017-02-24 18:38 UTC (permalink / raw)
  To: Joshua Kinard
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On Fri, Feb 24, 2017 at 03:50:26AM -0500, Joshua Kinard wrote:
> On 02/14/2017 09:56, Bjorn Helgaas wrote:
> > On Tue, Feb 14, 2017 at 02:39:45AM -0500, Joshua Kinard wrote:
> >> On 02/13/2017 17:45, Bjorn Helgaas wrote:
> >>> It looks like IP30 is using some code that's not upstream yet, so I'll
> >>> just point out some things that don't look right.  I'm ignoring the
> >>> IP27 stuff here in case that has different issues.
> >>
> >> Sorry about that.  It's an external set of patches I've been maintaining for
> >> the last decade (I'm not the original author).  Being that I do this for a
> >> hobby, sometimes I get waylaid by other priorities.  I've spent some time
> >> re-organizing my patch collection lately so that I can start sending these
> >> things in.  This series and an earlier one to the Linux/MIPS list are the
> >> beginnings of that, as time permits.  However, I've got a ways to go before
> >> getting to the IP30 patches.  E.g., IP27 comes before IP30, so that's where my
> >> focus has been lately.
> >>
> >> I've put an 4.10-rc7 tree with all patches applied here (including the IOC3,
> >> IP27, and IP30 patches):
> >> http://dev.gentoo.org/~kumba/mips/ip30/linux-4.10_rc7-20170208.ip30/
> >>
> >> The patchset itself, if interested:
> >> http://dev.gentoo.org/~kumba/mips/ip30/mips-patches-4.10.0/
> > 
> > Thanks for the pointers.
> > 
> >>> All the resources under the bridge to 0000:00 are wrong, too:
> >>>
> >>>   1f200000-1f9fffff : Bridge MEM
> >>>     f080100000-f0801fffff : 0000:00:02.0
> >>>     f080010000-f08001ffff : 0000:00:00.0
> >>>     f080030000-f08003ffff : 0000:00:01.0
> >>>     f080000000-f080000fff : 0000:00:00.0
> >>>     f080020000-f080020fff : 0000:00:01.0
> >>
> >> Actually, that's not wrong.  It took me a long time to figure this out, too,
> >> but once I stumbled upon the BRIDGE ASIC specification on an anonymous FTP
> >> site, via Google, by complete accident (hint), it finally made sense.
> >>
> >> BRIDGE is what SGI calls the chip that interfaces the Crosstalk (Xtalk) bus to
> >> a PCI bus.  Most known "XIO" (Xtalk I/O) boards contain a BRIDGE chip under a
> >> heatsink at the edge of the board, next to a "compression connector".  Octane
> >> can hold up to four XIO boards, with a fifth slot dedicated to an optional "PCI
> >> Shoebox" (literally a metal box that can hold up to three PCI or PCI-X cards).
> >> One exception to this are the two graphics boards, Impact/MGRAS and
> >> Odyssey/VPro, which are "pure" XIO devices, as far as anyone's figured out.
> >>
> >> Each BRIDGE chip can address up to eight different PCI devices.
> >>
> >> Crosstalk on IP30 is managed by the Crossbow (XBOW) ASIC.  It implements a
> >> crossbar switch that's not totally unlike a networking switch.  A single XBOW
> >> supports up to 16 XIO widgets, numbered 0x0 to 0xf.  Widgets 0x1 to 0x7 are for
> >> internal use or unused.  XBOW itself is widget 0x0.  HEART, Octane's system
> >> controller, is widget 0x8.  The four standard XIO slots are numbered (I think)
> >> as 0x9 to 0xc.  I think widget 0xe is unused.
> >>
> >> The system board has a BRIDGE chip on it to provide the "BaseIO" devices, such
> >> as two SCSI (QL1040B), IOC3 Ethernet, KB/Mouse, Serial/Parallel, Audio, and an
> >> RTC.  This BaseIO BRIDGE is widget 0xf, which is what you're seeing with the
> >> first set of addresses starting with 0x1fxxx.  The second BRIDGE that you're
> >> seeing is the PCI Shoebox I have installed.  It's widget 0xd, and has addresses
> >> starting with 0x1dxxx (second nibble from left is the widget), with three
> >> random PCI cards I found in a storage bin plugged into it.  The Xtalk scan
> >> happens in reverse order, starting at widget 0xf down to 0x0.
> >>
> >> Both 0x1fxxx and 0x1dxxx addresses are "small window" addresses.  The HEART
> >> system controller gives you three "windows" into Xtalk space, which uses 48-bit
> >> addressing:
> >>
> >> 0x0000_1000_0000 - 0x0000_1fff_ffff - Small windows, x16 @ 16MB each
> >> 0x0008_0000_0000 - 0x000f_ffff_ffff - Medium windows, x16 @ 2GB each
> >> 0x0010_0000_0000 - 0x00ff_ffff_ffff - Big windows, x15 @ 64GB each
> >>
> >> So a device on widget 0xf mapped into a small window at 0x0000_1fxx_xxxx could
> >> also be referred to by a big window at address 0x00f0_8xxx_xxxx.  The only
> >> difference is the big window gives you a larger address space to play around
> >> with.  Medium windows are not currently used on the IP30 platform in Linux.
> >> Widget 0x0 (XBOW) is only accessible via small or medium windows.
> >>
> >> As such, this layout in /proc/iomem is oddly correct:
> >>
> >>>   1f200000-1f9fffff : Bridge MEM
> >>>     f080100000-f0801fffff : 0000:00:02.0
> >>>     f080010000-f08001ffff : 0000:00:00.0
> >>>     f080030000-f08003ffff : 0000:00:01.0
> >>>     f080000000-f080000fff : 0000:00:00.0
> >>>     f080020000-f080020fff : 0000:00:01.0
> >>
> >> My bug is probably that I do the initial BRIDGE scan with small windows in the
> >> main BRIDGE driver (arch/mips/pci/pci-bridge.c), and then IP30's BRIDGE glue
> >> driver (arch/mips/sgi-ip30/ip30-bridge.c) is switching to big windows to probe
> >> each device.  So, yes, different address ranges, but still looking at the same
> >> device.  I'll look into fixing that.  It's partially tied to how I was kludging
> >> the IP27 to also start doing PCI correctly within its own BRIDGE glue driver
> >> (ip27-bridge.c), but I just learned over the weekend that there's missing
> >> functionality I have to port over from old 2.5.x-era IA64 code.  In IP30, big
> >> windows are available by default, but on IP27, what it calls "big windows" need
> >> to go through some kind of translation table, and I'm trying to figure out how
> >> to set that up.
> > 
> > I agree, it sounds like the problem is the switch from small windows
> > to big ones.  In order for the PCI core to work correctly, the host
> > bridge window ("1f200000-1f9fffff : Bridge MEM") must enclose the BARs
> > of the devices below the bridge.
> > 
> > If you use the big windows for the device BARs, you should use big
> > windows for the host bridge windows.  I think we're talking about
> > widget 0xf here, so the 0xf big window would be 0xf0_0000_0000 -
> > 0xff_ffff_ffff.  This should all be set up *before* calling
> > pci_scan_root_bus() instead of after, as it seems to be today.
> 
> Okay, so after a week of downtime due to hardware failure on another machine, I
> started looking at this again, and it looks like it's more than just getting
> the windows wrong.  The existing BRIDGE code (pci-bridge.c) is limiting the PCI
> window to the first five devices on BaseIO in the small window (my fault), and
> the IP30 glue code (ip30-bridge.c) is re-writing the PCI BARs to access the
> BRIDGE PCI Memory and PCI I/O spaces via big windows, without updating the
> original mapping.  We're not teaching the Linux PCI core at all, I think, about
> how to generically access PCI Memory Space and PCI I/O Space on the BRIDGE.
> The fact this actually works appears to be pure luck.

I agree about the luck part :)

> Quoting from the BRIDGE docs somewhat, from the point of view of generic
> Crosstalk space and not specific to IP30 or IP27, there are several views into
> PCI space.  Each address is 48-bits, the size of a Crosstalk address.
> 
>     0000_0000_0000 to 0000_00ff_ffff is "Widget space".  It seems the BRIDGE
>     docs use the term "widget" here to refer to the eight possible PCI devices
>     addressable by a single BRIDGE ASIC.  This plus the small windows is how
>     the code is currently getting things to work.  I think.
> 
>     0000_4000_0000 to 0000_7fff_ffff is BRIDGE's view into PCI Memory Space
>     for direct-mapped 32-bit devices.  It is 1GB in size.  My take is this is
>     normally a 4GB space?  Can the Linux PCI core be taught that only 1GB
>     is usable?

Bridges normally have a window that contains some 32-bit PCI memory
space, because many PCI devices have 32-bit BARs that have to be
located below 4GB.

This window is usually smaller than 4GB (1GB would be typical) because
these devices likely can only generate 32-bit DMA as well, and the DMA
has to use 32-bit PCI addresses that are outside the host bridge
window.

The host bridge code, i.e., the pcibios_scanbus() path, tells the core
how big the window is.  In this case, "hose->mem_resource" contains
the CPU physical address range (and size), and "hose->mem_offset"
contains the offset between the CPU physical address and the PCI bus
address.  So if "hose->mem_resource" is only 1GB, that's all the PCI
core will use.
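
Roughly, for a 1GB window, the setup would look something like this (the
BWIN_MEM_BASE/PCI_MEM_BASE names are only for illustration):

	/* 1GB MEM window: BWIN_MEM_BASE is the CPU physical address the
	 * chosen window decodes, PCI_MEM_BASE is the PCI bus address it
	 * corresponds to */
	static struct resource bridge_mem = {
		.name	= "Bridge MEM",
		.start	= BWIN_MEM_BASE,
		.end	= BWIN_MEM_BASE + SZ_1G - 1,
		.flags	= IORESOURCE_MEM,
	};

	hose->mem_resource = &bridge_mem;
	hose->mem_offset   = BWIN_MEM_BASE - PCI_MEM_BASE;	/* CPU - bus */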

>     0000_8000_0000 to 0000_bfff_ffff is an alias view by BRIDGE into PCI
>     Memory space for 64-bit PCI devices, immediately after the 32-bit space.
>     It is also 1GB in size.
> 
>     0000_c000_0000 to 0000_ffff_ffff is 2GB adjacent to PCI Memory space, but
>     is either unused or used for Crosstalk/BRIDGE DMA operations
>     (the docs aren't very clear on this point or I am simply not
>     understanding).
> 
>     0001_0000_0000 to 0001_ffff_ffff is BRIDGE's view into PCI I/O Space.
>     It has the entire 4GB range available to use.
> 
> 
> I suspect the existing BRIDGE driver is mapping through "widget space" first to
> probe the PCI device slots, then the ip30-bridge.c code is re-writing the PCI
> BARs to go through the big window on IP30, but it never updates the original
> BRIDGE setup (if that's even possible).  As such, as you pointed out, there's
> some kind of logic in the generic PCI core that's re-assigning space from the
> window and we're just getting lucky and everything still works.
> 
> I dug up some old debugging code given to me by the original author of the IP30
> port and built up a set of macros that get me to the PCI Configuration Space on
> Widget 0xf (BaseIO Bridge), Device #0 (the first QLA1040B SCSI controller).
> All three addresses access the same PCI device and return the same config space
> info.
> 
>     Small window:  0x900000001f02xxxx
>     Medium window: 0x9000000f8002xxxx
>     Big window:    0x900000f00002xxxx
> 
> 
> Each additional device's configuration space is offset 0x1000, which gives me
> the following addresses (dev's 4 to 7 are special-cased for IRQ trickery, so
> can't probe them):
> 
>     - Dev #0: scsi0/qla1040b: 0x90xxxxxxxxx20000
>     - Dev #1: scsi1/qla1040b: 0x90xxxxxxxxx21000
>     - Dev #2: io/ioc3       : 0x90xxxxxxxxx22000
>     - Dev #3: audio/rad1    : 0x90xxxxxxxxx23000

PCIe functions have 4K of config space each, so a 0x1000 offset makes
sense.  Whether a platform can access all of it is a separate
question.

It looks like the type 0 config accessors (pci_conf0_read_config(),
pci_conf0_write_config()) are basically like ECAM, where config space
is memory-mapped.

Per spec, the ECAM window for a bridge that could have buses 00-ff
below it would be 256MB (256 buses * 32 devices/bus * 8
functions/device * 4096 bytes config space/function).
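
For reference, the offset into an ECAM-style window is just:

	offset = (bus << 20) | (slot << 15) | (fn << 12) | where;

which is why a 0x1000 stride per function looks right.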

But based on b_type0_cfg_dev[], it looks like SGI only made space for
8 devices on the root bus, each with 8 functions.  I think that's OK,
because they can control how many devices can be on the root bus.

The type 1 config accessors (pci_conf1_read_config(),
pci_conf1_write_config()) are for devices below a PCI-to-PCI bridge.
It looks like there's a single 4K memory-mapped window, and you point
it at a specific bus & device with bridge->b_pci_cfg.

The Linux type 0 accessors look like they support 4K config space per
function, while the type 1 accessors put the function number in bits
8-10, so it looks like they only support 256 bytes per function.

The OpenBSD accessors (xbridge_conf_read() and xbridge_conf_write())
look like they only support 256 bytes per function regardless of
whether it's type 0 or type 1.

Supporting 4K of space for type 0 seems like a potential Linux
problem.  Also, pci_conf1_write_config() uses b_type0_cfg_dev in one
place where it looks like it should be using b_type1_cfg.  But that's
in the IOC3 path, and I don't know if that even makes sense for
non-root bus devices -- I doubt you can put an IOC3 behind a
PCI-to-PCI bridge.

If the hardware only supports 256 bytes of config space on non-root
bus devices, that's not a disaster.  We should still be able to
enumerate them and use all the conventional PCI features and even
basic PCIe features.  But the extended config space (offsets
0x100-0xfff) would be inaccessible, and we wouldn't see any PCIe
extended capabilities (AER, VC, SR-IOV, etc., see PCI_EXT_CAP_ID_ERR
and subsequent definitions) because they live in that space.

> If I probe and dump the config space data for each, I get the following:
> 
>     PCI slot 0 information:
>         vendor ID :              0x1077
>         device ID :              0x1020
>         command :                0x0006
>         status :                 0x0200
>         revision :               0x05
>         prog if :                0x00
>         class :                  0x0100
>         cache line :             0x40
>         latency :                0x40
>         hdr type :               0x00
>         BIST :                   0x00
>         region 0 :               0x00200001
>         region 1 :               0x00200000
>         region 2 :               0x00000000
>         region 3 :               0x00000000
>         region 4 :               0x00000000
>         region 5 :               0x00000000
>         IRQ line :               0x00
>         IRQ pin :                0x01
> 
>     PCI slot 1 information:
>         vendor ID :              0x1077
>         device ID :              0x1020
>         command :                0x0006
>         status :                 0x0200
>         revision :               0x05
>         prog if :                0x00
>         class :                  0x0100
>         cache line :             0x40
>         latency :                0x40
>         hdr type :               0x00
>         BIST :                   0x00
>         region 0 :               0x00400001
>         region 1 :               0x00400000
>         region 2 :               0x00000000
>         region 3 :               0x00000000
>         region 4 :               0x00000000
>         region 5 :               0x00000000
>         IRQ line :               0x00
>         IRQ pin :                0x01
> 
>     PCI slot 2 information:
>         vendor ID :              0x10a9
>         device ID :              0x0003
>         command :                0x0146
>         status :                 0x0280
>         revision :               0x01
>         prog if :                0x00
>         class :                  0xff00
>         cache line :             0x00
>         latency :                0x28
>         hdr type :               0x00
>         BIST :                   0x00
>         region 0 :               0x00500000
>         region 1 :               0x00000000
>         region 2 :               0x00000000
>         region 3 :               0x00500000
>         region 4 :               0x000310a9
>         region 5 :               0x02800146
>         IRQ line :               0x00
>         IRQ pin :                0x00
> 
>     PCI slot 3 information:
>         vendor ID :              0x10a9
>         device ID :              0x0005
>         command :                0x0006
>         status :                 0x0480
>         revision :               0xc0
>         prog if :                0x00
>         class :                  0x0000
>         cache line :             0x00
>         latency :                0xff
>         hdr type :               0x00
>         BIST :                   0x00
>         region 0 :               0x00600000
>         region 1 :               0x00000000
>         region 2 :               0x00000000
>         region 3 :               0x00000000
>         region 4 :               0x00000000
>         region 5 :               0x00000000
>         IRQ line :               0x00
>         IRQ pin :                0x00
> 
> This is where my knowledge runs short -- I'm not fluent in PCI device
> programming, so I'm not sure what is really supposed to be going on with these
> different BAR regions.  I dug up a copy of the PCI Spec 2.2, but it's 322 pages
> and I'm not sure where I should start reading from.  Are these regions
> device-specific?  E.g., do I need the QLA1040B programming manual to understand
> why region 0 is 0x00200001 and region 1 is 0x00200000?

The regions (BARs) are definitely device-specific.  The low order bit
in some of them indicates an I/O BAR.  In the PCI r3.0 spec, sec 6.2.5
covers "Base Addresses" and tells you how to interpret the low bits.
For most of the BARs above, they're either 0b01, indicating an I/O
BAR, or 0b00000, indicating a non-prefetchable 32-bit memory BAR.

> I am also not yet considering how PCI views Crosstalk space for DMA operations
> -- BRIDGE has several mechanisms for that, depending on if the PCI device is a
> 32-bit device or a 64-bit device.  It can do direct-mapping for 32-bit or
> 64-bit, or built-in page-mapping hardware is available (but apparently to be
> avoided due to numerous hardware quirks/bugs that would make the driver
> overly-complicated).
> 
> There's a good write-up in the OpenBSD "xbridge" driver, which handles BRIDGE
> (IP27/IP30), XBRIDGE (IP35), and PIC (IP35/IA64 Altix):
> 
> http://bxr.su/OpenBSD/sys/arch/sgi/xbow/xbridge.c
> 
> It's starting to make some sense to me, but I am still uncertain how to work
> with the 32-bit and 64-bit 1GB PCI memory spaces on BRIDGE as well as the 4GB
> PCI I/O space from a Linux point of view.  Do they only matter when dealing
> with DMA?

The host bridge windows (the "root bus resource" lines in dmesg) are
only for PIO, i.e., a driver running on the CPU reading or writing
registers on the PCI device (it could be actual registers, or a frame
buffer, etc., but it resides on the device and its PCI bus address is
determined by a BAR).  The driver uses ioremap(), pci_iomap(),
pcim_iomap_regions(), etc. to map these into the kernel virtual space.

DMA is coming the other direction and the windows are irrelevant
except that the target PCI bus address must be outside all the windows
(if the DMA target address were *inside* a host bridge window, the
bridge would assume it is intended for a PCI device, e.g., for
peer-to-peer DMA).

I think the biggest problem is that you need to set up the BRIDGE
*before* calling pcibios_scan_bus().  That way the windows
("hose->mem_resource", "hose->io_resource", etc.) will be correct when
the PCI core enumerates the devices.  Note that you can have as many
windows in the "&resources" list as you need -- the current code there
only has one memory and one I/O window, but you can add more.
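
In other words, something along these lines in the pcibios_scanbus() path is
all the core needs -- extra windows are just extra list entries (the bwin_*
names below are made up):

	LIST_HEAD(resources);

	/* roughly what pcibios_scanbus() already does */
	pci_add_resource_offset(&resources, hose->mem_resource, hose->mem_offset);
	pci_add_resource_offset(&resources, hose->io_resource, hose->io_offset);

	/* additional apertures, e.g. a big-window MEM/IO range */
	pci_add_resource_offset(&resources, &bc->bwin_mem, bc->bwin_mem_offset);
	pci_add_resource_offset(&resources, &bc->bwin_io, bc->bwin_io_offset);

	/* ... then pci_scan_root_bus(NULL, busno, hose->pci_ops, hose, &resources) */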

> > The scan would look like this:
> > 
> >   pci_bus 0000:00: root bus resource [mem 0xf000000000-0xffffffffff] (bus addresses [0x00000000-0xfffffffff])
> >   pci 0000:00:00.0: reg 0x14: [mem 0xf000200000-0xf000200fff]
> >   pci 0000:00:00.0: reg 0x30: [mem 0xf000210000-0xf00021ffff pref]
> >   pci 0000:00:01.0: reg 0x14: [mem 0xf000400000-0xf000400fff]
> >   pci 0000:00:01.0: reg 0x30: [mem 0xf000410000-0xf00041ffff pref]
> >   pci 0000:00:02.0: reg 0x10: [mem 0xf000500000-0xf0005fffff]
> >   pci 0000:00:03.0: reg 0x10: [mem 0xf000600000-0xf000601fff]
> > 
> > This would make /proc/iomem look like this:
> > 
> >   f000000000-ffffffffff : Bridge MEM
> >     f000500000-f0005fffff : 0000:00:02.0
> >     f000210000-f00021ffff : 0000:00:00.0
> >     f000410000-f00041ffff : 0000:00:01.0
> >     f000200000-f000200fff : 0000:00:00.0
> >     f000400000-f000400fff : 0000:00:01.0
> > 
> > This doesn't match the device BARs in your /proc/iomem, so there must
> > be some other transformation going on as well.
> > 
> > As long as you tell the PCI core about the host bridge windows you're
> > going to use, along with offsets that include *all* these
> > transformations, the core should just work, and /proc/iomem should
> > also make sense.  The details of small/medium/big windows, widgets,
> > etc., are immaterial to the core.
> > 
> > Bjorn
> > 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
  2017-02-24 18:38                   ` Bjorn Helgaas
@ 2017-02-25  9:34                     ` Joshua Kinard
  2017-02-27 16:36                       ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Joshua Kinard @ 2017-02-25  9:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On 02/24/2017 13:38, Bjorn Helgaas wrote:
> On Fri, Feb 24, 2017 at 03:50:26AM -0500, Joshua Kinard wrote:
>> Quoting from the BRIDGE docs somewhat, from the point of view of generic
>> Crosstalk space and not specific to IP30 or IP27, there are several views into
>> PCI space.  Each address is 48-bits, the size of a Crosstalk address.
>>
>>     0000_0000_0000 to 0000_00ff_ffff is "Widget space".  It seems the BRIDGE
>>     docs use the term "widget" here to refer to the eight possible PCI devices
>>     addressable by a single BRIDGE ASIC.  This plus the small windows is how
>>     the code is currently getting things to work.  I think.
>>
>>     0000_4000_0000 to 0000_7fff_ffff is BRIDGE's view into PCI Memory Space
>>     for direct-mapped 32-bit devices.  It is 1GB in size.  My take is this is
>>     normally a 4GB space?  Can the Linux PCI core be taught that only 1GB
>>     is usable?
> 
> Bridges normally have a window that contains some 32-bit PCI memory
> space, because many PCI devices have 32-bit BARs that have to be
> located below 4GB.
> 
> This window is usually smaller than 4GB (1GB would be typical) because
> these devices likely can only generate 32-bit DMA as well, and the DMA
> has to use 32-bit PCI addresses that are outside the host bridge
> window.

Okay, this fits then.  I wasn't sure, because the verbiage in the BRIDGE docs
kinda suggests that the 1GB window for 32/64-bit memory space isn't common.
Likely, what they're getting at is the fact that the lower 1GB for 32-bit is
aliased in the next 1GB for use by 64-bit PCI devices.  It looks like the
remaining 2GB in memory space is available for DMA, and I'll have to try and
wrap my head around that at a later date.


> The host bridge code, i.e., the pcibios_scanbus() path, tells the core
> how big the window is.  In this case, "hose->mem_resource" contains
> the CPU physical address range (and size), and "host->mem_offset"
> contains the offset between the CPU physical address and the PCI bus
> address.  So if "hose->mem_resource" is only 1GB, that's all the PCI
> core will use.

So basically, this bit of code from the proposed pci-bridge.c needs additional
work:

	bc->mem.name = "Bridge MEM";
	bc->mem.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_MEM);
	bc->mem.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO - 1);
	bc->mem.flags = IORESOURCE_MEM;

	bc->io.name = "Bridge IO";
	bc->io.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO);
	bc->io.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_END - 1);
	bc->io.flags = IORESOURCE_IO;

The values I've been using for PCIBR_OFFSET_MEM and PCIBR_OFFSET_IO are
obviously wrong.  For IP30, they ultimately direct "BRIDGE MEM" to the first
five device slots via small windows, and then "Bridge IO" to the last two
slots.  This is one of those "luck" moments that happened to work on BaseIO,
but is probably why devices in the secondary PCI shoebox are hit-or-miss.

Per your comment further down that I can tell the PCI core about //multiple//
windows, probably what I want to do is hardwire access to the small window
spaces in the main BRIDGE driver, with machine-specific offsets passed via
platform_data.  Small windows, for both IP30 and IP27, are always guaranteed to
be available without any additional magic.  This window range can be used to do
the initial slot probes to query devices for their desired BAR values.

Then, in IP30-specific or IP27-specific code, we add in additional memory
windows as needed.  IP30 would simply pass along the hardwired address ranges
for BRIDGE MEM or BRIDGE IO via both medium and big window addresses.

IP27 is trickier -- we'll need some logic that attempts to use a small window
(16MB) mapping first, but if that is too small, we'll have to set up an entry
in the HUB's IOTTE buffer that maps a specific block of physical memory to
Crosstalk address space.  This is what IP27 calls "big windows", of which there
are seven entries maximum, on 512MB boundaries, per HUB chip.  However, one
IOTTE is used to handle a known hardware bug on early HUB revisions, leaving
only six entries available.

As such, the additional window information will go into the machine-specific
BRIDGE glue drivers (ip30-bridge.c or ip27-bridge.c).
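
Roughly what I'm picturing for the platform_data (names invented, nothing
final):

	struct bridge_platform_data {
		unsigned long	swin_base;	/* always-available small window */
		unsigned long	mem_base;	/* MEM aperture (medium/big window) */
		unsigned long	mem_size;
		unsigned long	mem_offset;	/* CPU physical - PCI bus address */
		unsigned long	io_base;	/* IO aperture (medium/big window) */
		unsigned long	io_size;
		unsigned long	io_offset;
	};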

Sound sane?


>>     0000_8000_0000 to 0000_bfff_ffff is an alias view by BRIDGE into PCI
>>     Memory space for 64-bit PCI devices, immediately after the 32-bit space.
>>     It is also 1GB in size.
>>
>>     0000_c000_0000 to 0000_ffff_ffff is 2GB adjacent to PCI Memory space, but
>>     is either unused or used for Crosstalk/BRIDGE DMA operations
>>     (the docs aren't very clear on this point or I am simply not
>>     understanding).
>>
>>     0001_0000_0000 to 0001_ffff_ffff is BRIDGE's view into PCI I/O Space.
>>     It has the entire 4GB range available to use.
>>
>>
>> I suspect the existing BRIDGE driver is mapping through "widget space" first to
>> probe the PCI device slots, then the ip30-bridge.c code is re-writing the PCI
>> BARs to go through the big window on IP30, but it never updates the original
>> BRIDGE setup (if that's even possible).  As such, as you pointed out, there's
>> some kind of logic in the generic PCI core that's re-assigning space from the
>> window and we're just getting lucky and everything still works.
>>
>> I dug up some old debugging code given to me by the original author of the IP30
>> port and built up a set of macros that get me to the PCI Configuration Space on
>> Widget 0xf (BaseIO Bridge), Device #0 (the first QLA1040B SCSI controller).
>> All three addresses access the same PCI device and return the same config space
>> info.
>>
>>     Small window:  0x900000001f02xxxx
>>     Medium window: 0x9000000f8002xxxx
>>     Big window:    0x900000f00002xxxx
>>
>>
>> Each additional device's configuration space is offset 0x1000, which gives me
>> the following addresses (dev's 4 to 7 are special-cased for IRQ trickery, so
>> can't probe them):
>>
>>     - Dev #0: scsi0/qla1040b: 0x90xxxxxxxxx20000
>>     - Dev #1: scsi1/qla1040b: 0x90xxxxxxxxx21000
>>     - Dev #2: io/ioc3       : 0x90xxxxxxxxx22000
>>     - Dev #3: audio/rad1    : 0x90xxxxxxxxx23000
> 
> PCIe functions have 4K of config space each, so a 0x1000 offset makes
> sense.  Whether a platform can access all of it is a separate
> question.

No need to worry about PCIe on this platform.  You're talking about 15+-year-old
hardware that was built to last.  I believe SGI only implemented up to PCI
2.1 in the BRIDGE and XBRIDGE ASICs.  It's possible the PIC ASIC might do 2.2
or 3.0.  But I don't have to worry about PIC for a good while, as that's only
found in the IP35-class of hardware (Origin 300/350/3000, Fuel, Tezro, Onyx4,
etc).  We need fully-working IP27 and IP30 first, because much of the working
logic can then be re-purposed for bringing up IP35 hardware.  Especially since
IA64 did a lot of the work already for the Altix systems, which is just IP35
with an IA64 CPU, but I digress...


> It looks like the type 0 config accessors (pci_conf0_read_config(),
> pci_conf0_write_config()) are basically like ECAM, where config space
> is memory-mapped.
> 
> Per spec, the ECAM window for a bridge that could have buses 00-ff
> below it would be 256MB (256 buses * 32 devices/bus * 8
> functions/device * 4096 bytes config space/function).
> 
> But based on b_type0_cfg_dev[], it looks like SGI only made space for
> 8 devices on the root bus, each with 8 functions.  I think that's OK,
> because they can control how many devices can be on the root bus.

Not sure what ECAM is, but yes, SGI wired each BRIDGE ASIC for a maximum of 8
PCI devices (or slots).  There can be multiple BRIDGEs in an SGI system,
though, and for several of their XIO add-in boards, you have one PCI device
hardwired into a BRIDGE ASIC, so whatever logic I cook up is going to need to
handle BRIDGEs with one device or eight.  So much for this being easy...


> The type 1 config accessors (pci_conf1_read_config(),
> pci_conf1_write_config()) are for devices below a PCI-to-PCI bridge.
> It looks like there's a single 4K memory-mapped window, and you point
> it at a specific bus & device with bridge->b_pci_cfg.

Yeah, PCI-to-PCI bridges are a known-broken item right now.  Since there
appears to only be a single Type 1 config space, if I have a PCI Shoebox with
three PCI-X slots, and I stick three USB 2.0 boards in them, I have to pick
which one can handle PCI-to-PCI bridges, like USB hubs, and program that into
this Type 1 register, right?


> The Linux type 0 accessors look like they support 4K config space per
> function, while the type 1 accessors put the function number in bits
> 8-10, so it looks like they only support 256 bytes per function.
> 
> The OpenBSD accessors (xbridge_conf_read() and xbridge_conf_write())
> look like they only support 256 bytes per function regardless of
> whether it's type 0 or type 1.

I'll have to pass this question to the OpenBSD dev who maintains the xbridge
driver.  He understands this hardware far better than I do, and there might be
a reason why they do it that way.


> Supporting 4K of space for type 0 seems like a potential Linux
> problem.  Also, pci_conf1_write_config() uses b_type0_cfg_dev in one
> place where it looks like it should be using b_type1_cfg.  But that's
> in the IOC3 path, and I don't know if that even makes sense for
> non-root bus devices -- I doubt you can put an IOC3 behind a
> PCI-to-PCI bridge.

IOC3 is... special.  You'll never see one of those behind a PCI-to-PCI bridge,
thankfully.  That device is evil enough just sitting exposed on the main PCI
bus.  Just look at the comments in mainline arch/mips/pci/ops-bridge.c for some
rather flavourful language regarding IOC3.  Basically, IOC3:

  - Claims it's a single-function device but is really multi-function.
  - Implements only half of the PCI configuration space.
  - Read/write ops to the unimplemented PCI config registers are undefined.

Living "behind" the IOC3 device is your ethernet, serial ports (RS232 and
RS422), parallel port, keyboard/mouse PS/2 ports, and the RTC chip.  The IA64
people have an IOC3 "metadriver" in drivers/sn/ioc3.c that takes care of
dealing with that monster, and one of the later patches I've yet to send in
moves that driver to drivers/misc/ioc3.c (next to ioc4.c) for use by the SGI
MIPS platforms.  But that is a ways off, as we need working Xtalk code and
working BRIDGE code before the IOC3 metadriver can be worked on.


> If the hardware only supports 256 bytes of config space on non-root
> bus devices, that's not a disaster.  We should still be able to
> enumerate them and use all the conventional PCI features and even
> basic PCIe features.  But the extended config space (offsets
> 0x100-0xfff) would be inaccessible, and we wouldn't see any PCIe
> extended capabilities (AER, VC, SR-IOV, etc., see PCI_EXT_CAP_ID_ERR
> and subsequent definitions) because they live in that space.

I do not think that the BRIDGE supports AER, VC, or SR-IOV at all, primarily
due to the age of the hardware.  Even if a PCI device that does support those
is plugged into a BRIDGE, I am not 100% certain those features would even be
usable if the BRIDGE doesn't know about them.  Or is this a case where the
BRIDGE wouldn't care and, once it programs the BARs correctly, these features
would be available to the Linux drivers?


>> If I probe and dump the config space data for each, I get the following:
>>
>>     PCI slot 0 information:
>>         vendor ID :              0x1077
>>         device ID :              0x1020
>>         command :                0x0006
>>         status :                 0x0200
>>         revision :               0x05
>>         prog if :                0x00
>>         class :                  0x0100
>>         cache line :             0x40
>>         latency :                0x40
>>         hdr type :               0x00
>>         BIST :                   0x00
>>         region 0 :               0x00200001
>>         region 1 :               0x00200000
>>         region 2 :               0x00000000
>>         region 3 :               0x00000000
>>         region 4 :               0x00000000
>>         region 5 :               0x00000000
>>         IRQ line :               0x00
>>         IRQ pin :                0x01
>>
>>     PCI slot 1 information:
>>         vendor ID :              0x1077
>>         device ID :              0x1020
>>         command :                0x0006
>>         status :                 0x0200
>>         revision :               0x05
>>         prog if :                0x00
>>         class :                  0x0100
>>         cache line :             0x40
>>         latency :                0x40
>>         hdr type :               0x00
>>         BIST :                   0x00
>>         region 0 :               0x00400001
>>         region 1 :               0x00400000
>>         region 2 :               0x00000000
>>         region 3 :               0x00000000
>>         region 4 :               0x00000000
>>         region 5 :               0x00000000
>>         IRQ line :               0x00
>>         IRQ pin :                0x01
>>
>>     PCI slot 2 information:
>>         vendor ID :              0x10a9
>>         device ID :              0x0003
>>         command :                0x0146
>>         status :                 0x0280
>>         revision :               0x01
>>         prog if :                0x00
>>         class :                  0xff00
>>         cache line :             0x00
>>         latency :                0x28
>>         hdr type :               0x00
>>         BIST :                   0x00
>>         region 0 :               0x00500000
>>         region 1 :               0x00000000
>>         region 2 :               0x00000000
>>         region 3 :               0x00500000
>>         region 4 :               0x000310a9
>>         region 5 :               0x02800146
>>         IRQ line :               0x00
>>         IRQ pin :                0x00
>>
>>     PCI slot 3 information:
>>         vendor ID :              0x10a9
>>         device ID :              0x0005
>>         command :                0x0006
>>         status :                 0x0480
>>         revision :               0xc0
>>         prog if :                0x00
>>         class :                  0x0000
>>         cache line :             0x00
>>         latency :                0xff
>>         hdr type :               0x00
>>         BIST :                   0x00
>>         region 0 :               0x00600000
>>         region 1 :               0x00000000
>>         region 2 :               0x00000000
>>         region 3 :               0x00000000
>>         region 4 :               0x00000000
>>         region 5 :               0x00000000
>>         IRQ line :               0x00
>>         IRQ pin :                0x00
>>
>> This is where my knowledge runs short -- I'm not fluent in PCI device
>> programming, so I'm not sure what is really supposed to be going on with these
>> different BAR regions.  I dug up a copy of the PCI Spec 2.2, but it's 322 pages
>> and I'm not sure where I should start reading from.  Are these regions
>> device-specific?  E.g., do I need the QLA1040B programming manual to understand
>> why region 0 is 0x00200001 and region 1 is 0x00200000?
> 
> The regions (BARs) are definitely device-specific.  The low order bit
> in some of them indicates an I/O BAR.  In the PCI r3.0 spec, sec 6.2.5
> covers "Base Addresses" and tells you how to interpret the low bits.
> For most of the BARs above, they're either 0b01, indicating an I/O
> BAR, or 0b00000, indicating a non-prefetchable 32-bit memory BAR.

Okay, I stumbled across this URL:
http://moi.vonos.net/linux/the-pci-bus/

It gives a good, plain-English breakdown of BARs and whatnot.

I remember that the OpenBSD xbridge driver states that it currently uses the
BAR mappings set up by the ARCS firmware, but the plan is to eventually move
away from those mappings and calculate its own.  I didn't quite get that, but
after reading that site, now I do.  ARCS writes ~0 to the BARs to query the
device and figure out how much memory each BAR wants, then programs in some
ranges for us.
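
So the sizing dance is basically this, if I have it right ('dev'/'bar' being
the pci_dev and the BAR's config-space offset; I believe the Linux core does
the same thing itself in __pci_read_base() during enumeration):

	u32 orig, sz, size;

	pci_read_config_dword(dev, bar, &orig);
	pci_write_config_dword(dev, bar, ~0);		/* all-ones probe */
	pci_read_config_dword(dev, bar, &sz);
	pci_write_config_dword(dev, bar, orig);		/* put it back */

	if (orig & PCI_BASE_ADDRESS_SPACE_IO)
		sz &= PCI_BASE_ADDRESS_IO_MASK;		/* strip the flag bits */
	else
		sz &= PCI_BASE_ADDRESS_MEM_MASK;
	size = ~sz + 1;					/* e.g. 0xfffff000 -> 4K */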

By tallying up the total requested size of all BAR mappings on a given device,
we can then make a determination on whether we can use a small, medium, or
large window and then program the BAR with the correct address.  Right?

I could also do my own probes (or maybe Linux already does this?), but for the
short term, using what ARCS has already set up might not be a bad idea to get
things working again.


>> I am also not yet considering how PCI views Crosstalk space for DMA operations
>> -- BRIDGE has several mechanisms for that, depending on if the PCI device is a
>> 32-bit device or a 64-bit device.  It can do direct-mapping for 32-bit or
>> 64-bit, or built-in page-mapping hardware is available (but apparently to be
>> avoided due to numerous hardware quirks/bugs that would make the driver
>> overly-complicated).
>>
>> There's a good write-up in the OpenBSD "xbridge" driver, which handles BRIDGE
>> (IP27/IP30), XBRIDGE (IP35), and PIC (IP35/IA64 Altix):
>>
>> http://bxr.su/OpenBSD/sys/arch/sgi/xbow/xbridge.c
>>
>> It's starting to make some sense to me, but I am still uncertain how to work
>> with the 32-bit and 64-bit 1GB PCI memory spaces on BRIDGE as well as the 4GB
>> PCI I/O space from a Linux point of view.  Do they only matter when dealing
>> with DMA?
> 
> The host bridge windows (the "root bus resource" lines in dmesg) are
> only for PIO, i.e., a driver running on the CPU reading or writing
> registers on the PCI device (it could be actual registers, or a frame
> buffer, etc., but it resides on the device and its PCI bus address is
> determined by a BAR).  The driver uses ioremap(), pci_iomap(),
> pcim_iomap_regions(), etc. to map these into the kernel virtual space.

So for the MIPS case, knowing that ioremap() and friends are not guaranteed to
return a workable virtual address, I need to be careful about what addresses I
program into each BAR.  E.g., on IP30, if an address starts with
0x9000000f800xxxxx, it is using medium windows to talk to a device on the
BaseIO BRIDGE (Xtalk widget 0xf).  So, knowing that MIPS' ioremap() returns,
for the R10000 CPU case, the requested address OR'ed with IO_BASE
(0x9000000000000000), I want to tell the PCI core to use 0x0000000f800xxxxx so
that ioremap() will return 0x9000000f800xxxxx?

And since I can apparently specify multiple window ranges for memory space and
I/O space, I probably want to, as stated earlier, specify all three of IP30's
known window ranges for each BRIDGE so that the Linux PCI core can walk each
resource struct and find a matching window?

If that thinking is correct, then I have some idea of how to set that up and
then see if things start working again on IP30 and then eventually, IP27.


> DMA is coming the other direction and the windows are irrelevant
> except that the target PCI bus address must be outside all the windows
> (if the DMA target address were *inside* a host bridge window, the
> bridge would assume it is intended for a PCI device, e.g., for
> peer-to-peer DMA).

DMA is the tricky one.  OpenBSD's xbridge driver implies that BRIDGE's IOMMU
has a bug of some kind in it that restricts our DMA-able address ranges to 31
bits, or 0x00000000 to 0x7fffffff.  IP27 doesn't seem to be bothered by this in
the current code, and as such, I can use all 8GB of RAM on that platform.

But IP30 has its physical memory offset by 512MB, so physical memory maps begin at
0x20000000 and run to 0x9fffffff.  This means someone playing with Linux on an
IP30 platform can only install up to 2GB of RAM right now, if they want the
machine to be stable.  Though, there are other issues going on that imply that
"stable" is a very subjective term...

In either event, how do I teach the Linux PCI core that the BRIDGE itself is
limited to 31 bits of DMA-able address space?  There seems to be plenty of
documentation on having an individual PCI device set this up by setting its DMA
mask, but I can't seem to find any wording on how to do this for an entire PCI bus.

Or is this an architecture-specific thing?  MIPS has the dma-coherence.h header
for defining system-specific DMA quirks, but that code seems heavily biased
towards PCI devices, and on IP30 at least, the Impact and Odyssey video cards
are NOT PCI devices at all, instead being native XIO devices that support DMA
operations.
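
The only idea I've had so far -- completely untested -- is to clamp each
device as it gets enumerated, from MIPS' pcibios_plat_dev_init() hook (or
wherever the per-device fixup ends up living):

	int pcibios_plat_dev_init(struct pci_dev *dev)
	{
		/* BRIDGE can only generate 31-bit DMA addresses */
		dev->dev.coherent_dma_mask = DMA_BIT_MASK(31);
		if (dev->dev.dma_mask)
			*dev->dev.dma_mask = DMA_BIT_MASK(31);

		return 0;
	}

Though a driver that later calls dma_set_mask() could presumably raise that
again, so maybe the arch DMA code has to enforce the limit as well.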


> I think the biggest problem is that you need to set up the BRIDGE
> *before* calling pcibios_scan_bus().  That way the windows
> ("hose->mem_resource", "hose->io_resource", etc.) will be correct when
> the PCI core enumerates the devices.  Note that you can have as many
> windows in the "&resources" list as you need -- the current code there
> only has one memory and one I/O window, but you can add more.

Agreed; however, I think this is actually being done to some extent already.
We're configuring BRIDGE-specific properties and writing some values to BRIDGE
registers and the per-slot registers in bridge_probe() in the generic
pci-bridge.c driver, then, at the very end, calling register_pci_controller(),
which I believe is what kicks off the PCI bus scan.

The IP30 BRIDGE glue code adds a new function hook called "pre_enable" where it
does, in ip30-bridge.c, some additional PCI device tweaking, but this code will
run after the PCI bus scan has happened.  My guess is that I want to reverse
this and have the IP30 glue code twiddle the PCI devices on a given BRIDGE
before the PCI bus scan, right?  That code uses pci_read_config_dword() and
pci_write_config_dword() -- are these functions safe to use if we haven't
already done a PCI bus scan?

Thanks!

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
  2017-02-25  9:34                     ` Joshua Kinard
@ 2017-02-27 16:36                       ` Bjorn Helgaas
  2017-02-28  0:25                         ` Joshua Kinard
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2017-02-27 16:36 UTC (permalink / raw)
  To: Joshua Kinard
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On Sat, Feb 25, 2017 at 04:34:12AM -0500, Joshua Kinard wrote:
> On 02/24/2017 13:38, Bjorn Helgaas wrote:
> > On Fri, Feb 24, 2017 at 03:50:26AM -0500, Joshua Kinard wrote:

> > The host bridge code, i.e., the pcibios_scanbus() path, tells the
> > core how big the window is.  In this case, "hose->mem_resource"
> > contains the CPU physical address range (and size), and
> > "host->mem_offset" contains the offset between the CPU physical
> > address and the PCI bus address.  So if "hose->mem_resource" is
> > only 1GB, that's all the PCI core will use.
> 
> So basically, this bit of code from the proposed pci-bridge.c needs
> additional work:
> 
> 	bc->mem.name = "Bridge MEM";
> 	bc->mem.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_MEM);
> 	bc->mem.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO - 1);
> 	bc->mem.flags = IORESOURCE_MEM;
> 
> 	bc->io.name = "Bridge IO";
> 	bc->io.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO);
> 	bc->io.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_END - 1);
> 	bc->io.flags = IORESOURCE_IO;
> 
> The values I've been using for PCIBR_OFFSET_MEM and PCIBR_OFFSET_IO
> are obviously wrong.  For IP30, they ultimately direct "BRIDGE MEM"
> to the first five device slots via small windows, and then "Bridge
> IO" to the last two slots.  This is one of those "luck" moments that
> happened to work on BaseIO, but is probably why devices in the
> secondary PCI shoebox are hit-or-miss.
> 
> Per your comment further down that I can tell the PCI core about
> //multiple// windows, probably what I want to do is hardwire access
> to the small window spaces in the main BRIDGE driver, with
> machine-specific offsets passed via platform_data.  Small windows,
> for both IP30 and IP27, are always guaranteed to be available
> without any additional magic.  This window range can be used to do
> the initial slot probes to query devices for their desired BAR
> values.
> 
> Then, in IP30-specific or IP27-specific code, we add in additional
> memory windows as needed.  IP30 would simply pass along the
> hardwired address ranges for BRIDGE MEM or BRIDGE IO via both medium
> and big window addresses.
> 
> IP27 is trickier -- we'll need some logic that attempts to use a
> small window (16MB) mapping first, but if that is too small, we'll
> have to setup an entry in the HUB's IOTTE buffer that maps a
> specific block of physical memory to Crosstalk address space.  

pcibios_scanbus() builds the &resources list of host bridge windows,
then calls pci_scan_root_bus().  If you're proposing to change the
list of host bridge windows *after* calling pci_scan_root_bus(), that
sounds a little problematic because we do not have a PCI core
interface to do that.
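
To make the ordering concrete, the host bridge side has to look roughly
like this before anything is scanned (a sketch only -- bridge_pci_ops,
sysdata, mem_offset and the window values are placeholders, and on MIPS
the equivalent list-building happens inside pcibios_scanbus()):

        #include <linux/pci.h>

        /* Sketch: every value below is a placeholder, not a real IP27/IP30 number. */
        static int bridge_scan(struct pci_ops *bridge_pci_ops, void *sysdata,
                               resource_size_t mem_offset)
        {
                static struct resource bridge_mem = {
                        .name  = "Bridge MEM",
                        .start = 0x1d000000,
                        .end   = 0x1dffffff,
                        .flags = IORESOURCE_MEM,
                };
                LIST_HEAD(resources);
                struct pci_bus *bus;

                /* Every window must be on the list *before* the scan starts. */
                pci_add_resource_offset(&resources, &bridge_mem, mem_offset);
                /* ...add the I/O window and any extra MEM windows here... */

                bus = pci_scan_root_bus(NULL, 0, bridge_pci_ops, sysdata, &resources);
                return bus ? 0 : -ENODEV;
        }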

> This is what IP27 calls "big windows", of which there are seven
> entries maximum, on 512MB boundaries, per HUB chip.  However, only
> six entries are available because one IOTTE is used to handle a
> known hardware bug on early HUB revisions, leaving only six left.
> 
> As such, the additional window information will go into the
> machine-specific BRIDGE glue drivers (ip30-bridge.c or
> ip27-bridge.c).
> 
> Sound sane?

Do you need to allocate the big windows based on what PCI devices you
find?  If so, that is going to be a hard problem.

> > PCIe functions have 4K of config space each, so a 0x1000 offset
> > makes sense.  Whether a platform can access all of it is a
> > separate question.
> 
> No need to worry about PCIe on this platform.  You're talking about
> 15+-year old hardware that was built to last.  I believe SGI only
> implemented up to PCI 2.1 in the BRIDGE and XBRIDGE ASICs.  It's
> possible the PIC ASIC might do 2.2 or 3.0.  But I don't have to
> worry about PIC for a good while, as that's only found in the
> IP35-class of hardware (Origin 300/350/3000, Fuel, Tezro, Onyx4,
> etc).  We need fully-working IP27 and IP30 first, because much of
> the working logic can then be re-purposed for bringing up IP35
> hardware.  Especially since IA64 did a lot of the work already for
> the Altix systems, which is just IP35 with an IA64 CPU, but I
> digress...

OK.  If you only plug in PCI devices, there should be no problem.  If
you did plug in PCIe devices (via a PCI-to-PCIe bridge, for example),
the core should still enumerate them correctly and they should be
functional, though some PCIe-only features like AER won't work.

> > The type 1 config accessors (pci_conf1_read_config(),
> > pci_conf1_write_config()) are for devices below a PCI-to-PCI
> > bridge.  It looks like there's a single 4K memory-mapped window,
> > and you point it at a specific bus & device with
> > bridge->b_pci_cfg.
> 
> Yeah, PCI-to-PCI bridges are a known-broken item right now.  Since
> there appears to only be a single Type 1 config space, if I have a
> PCI Shoebox with three PCI-X slots, and I stick three USB 2.0 boards
> in them, I have to pick which one can handle PCI-to-PCI bridges,
> like USB hubs, and program that into this Type 1 register, right?

I don't think USB hubs are relevant here because they are completely
in the USB domain.  The USB 2.0 board is a USB host controller with a
PCI interface on it.  The USB hub is on the USB side and is invisible
to the PCI core.

The way I read the BSD code, it looks like you should be able to use
the Type 1 config interface to generate accesses to any PCI bus, as
long as you serialize them, e.g., with a lock around the use of
b_type1_cfg.
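
Something along these lines, in other words (a sketch; the bridge_t layout
and the (busno << 16) | (slot << 11) encoding are my reading of the existing
bridge.h/ops-bridge.c, so treat them as assumptions):

        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(bridge_type1_lock);

        /* Sketch: serialize every type 1 access through the one shared window.
         * Bus-error handling (get_dbe()) is omitted for brevity. */
        static u32 bridge_type1_readl(bridge_t *bridge, int busno, int slot,
                                      int fn, int reg)
        {
                unsigned long flags;
                volatile u32 *addr;
                u32 val;

                spin_lock_irqsave(&bridge_type1_lock, flags);
                /* Aim the type 1 window at the target bus/device... */
                bridge->b_pci_cfg = (busno << 16) | (slot << 11);
                /* ...then read the chosen function/register through it. */
                addr = (volatile u32 *)&bridge->b_type1_cfg.c[(fn << 8) | (reg & ~3)];
                val = *addr;
                spin_unlock_irqrestore(&bridge_type1_lock, flags);

                return val;
        }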

> > The Linux type 0 accessors look like they support 4K config space
> > per function, while the type 1 accessors put the function number
> > in bits 8-10, so it looks like they only support 256 bytes per
> > function.
> > 
> > The OpenBSD accessors (xbridge_conf_read() and
> > xbridge_conf_write()) look like they only support 256 bytes per
> > function regardless of whether it's type 0 or type 1.
> 
> I'll have to pass this question to the OpenBSD dev who maintains the
> xbridge driver.  He understands this hardware far better than I do,
> and there might be a reason why they do it that way.

If you only support PCI devices, 256 bytes of config space is enough.
The 4K config space is only supported for PCI-X Mode 2 and PCIe
devices.  Even those PCI-X Mode 2 and PCIe devices should be
functional with only 256 bytes.

> > Supporting 4K of space for type 0 seems like a potential Linux
> > problem.  Also, pci_conf1_write_config() uses b_type0_cfg_dev in
> > one place where it looks like it should be using b_type1_cfg.

I didn't word this well.  I was trying to point out things that look
like bugs in pci_conf1_write_config() (the use of b_type0_cfg_dev) and
maybe pci_conf0_read_config() (the fact that it allows 4K config space
but the hardware probably only supports 256 bytes).

> > If the hardware only supports 256 bytes of config space on
> > non-root bus devices, that's not a disaster.  We should still be
> > able to enumerate them and use all the conventional PCI features
> > and even basic PCIe features.  But the extended config space
> > (offsets 0x100-0xfff) would be inaccessible, and we wouldn't see
> > any PCIe extended capabilities (AER, VC, SR-IOV, etc., see
> > PCI_EXT_CAP_ID_ERR and subsequent definitions) because they live
> > in that space.
> 
> I do not think that the BRIDGE supports AER, VC, or SR-IOV at all,
> primarily due to age of the hardware.  Even if a PCI device that
> does support those is plugged into a BRIDGE, I am not 100% certain
> those features would even be usable if the BRIDGE doesn't know about
> them.  Or is this a case of where the BRIDGE wouldn't care and once
> it programs the BARs correctly, these features would be available to
> the Linux drivers?

These are mostly PCIe features and I don't think you should worry
about them, at least for now.

> I remember from the OpenBSD xbridge driver, it states that it
> currently uses the BAR mappings setup by the ARCS firmware, but the
> plan is to eventually move away from those mappings and calculate
> its own.  I didn't quite get that, but after reading that site, now
> I do.  ARCS is writing ~0 to the BARs to query the device to figure
> out how much memory each BAR wants, then programs in some ranges for
> us.
> 
> By tallying up the total requested size of all BAR mappings on a
> given device, we can then make a determination on whether we can use
> a small, medium, or large window and then program the BAR with the
> correct address.  Right?

This is the problem I alluded to above: Linux does not currently
support anything like this.  We assume the Linux host bridge driver
knows the window sizes at the beginning of time (it may learn them
from firmware or it may have hard-coded knowledge of the address map).

Linux does enumerate the devices and tally up all the BAR sizes (see
pci_bus_assign_resources()), but it does not have a way to change the
host bridge windows based on how much BAR space we need.

The common thing is to use whatever host bridge windows and device BAR
values were set up by firmware.  If those are all sensible, Linux
won't change anything.

> > The host bridge windows (the "root bus resource" lines in dmesg)
> > are only for PIO, i.e., a driver running on the CPU reading or
> > writing registers on the PCI device (it could be actual registers,
> > or a frame buffer, etc., but it resides on the device and its PCI
> > bus address is determined by a BAR).  The driver uses ioremap(),
> > pci_iomap(), pcim_iomap_regions(), etc. to map these into the
> > kernel virtual space.
> 
> So for the MIPS case, knowing that ioremap() and friends is not
> guaranteed to return a workable virtual address, I need to be
> careful of what addresses I program into each BAR.  E.g., given that
> on IP30, if a physical address starts with 0x9000000f800xxxxx, it is
> using medium windows to talk to a device on the BaseIO BRIDGE (Xtalk
> widget 0xf).  As such, knowing MIPS' ioremap() returns, for the
> R10000 CPU case, the requested address OR'ed with IO_BASE
> (0x9000000000000000), I want to tell the PCI core to use
> 0x0000000f800xxxxx so that ioremap() will return back
> 0x9000000f800xxxxx?

If ioremap() doesn't return a virtual address, or at least something
that can be used like a virtual address, I think that's fundamentally
broken.  All drivers assume they should ioremap() a BAR resource,
i.e., the CPU physical address that maps to a PCI BAR value, and use
the result as a virtual address.
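
That assumption is baked into essentially every driver, e.g. (CTRL_REG is a
made-up register offset):

        void __iomem *regs;

        regs = pci_ioremap_bar(pdev, 0);        /* map BAR 0 of this pci_dev */
        if (!regs)
                return -ENOMEM;

        writel(0x1, regs + CTRL_REG);           /* CTRL_REG is a made-up offset */
        /* ... */
        iounmap(regs);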

Documentation/DMA-API-HOWTO.txt has a picture that may help clarify
the different address spaces.

> And since I can apparently specify multiple window ranges for memory
> space and I/O space, I probably want to, as stated earlier, specify
> all three of IP30's known window ranges for each BRIDGE so that the
> Linux PCI core can walk each resource struct and find a matching
> window?

If all three windows are disjoint, you can specify them all.  If a
range in the small window is aliased and can also be reached via a
medium or large window, you should not specify both.

> > DMA is coming the other direction and the windows are irrelevant
> > except that the target PCI bus address must be outside all the
> > windows (if the DMA target address were *inside* a host bridge
> > window, the bridge would assume it is intended for a PCI device,
> > e.g., for peer-to-peer DMA).
> 
> DMA is the tricky one.  OpenBSD's xbridge driver implies that
> BRIDGE's IOMMU has a bug of some kind in it that restricts our
> DMA-able address ranges to 31 bits, or 0x00000000 to 0x7fffffff.
> IP27 doesn't seem to be bothered by this in the current code, and as
> such, I can use all 8GB of RAM on that platform.
> 
> But IP30 has its physical memory offset 512MB, so physical memory
> maps begin at 0x20000000 and run to 0x9fffffff.  This means someone
> playing with Linux on an IP30 platform can only install up to 2GB of
> RAM right now, if they want the machine to be stable.  Though, there
> are other issues going on that imply that "stable" is a very
> subjective term...
> 
> In either event, how do I teach the Linux PCI core that the BRIDGE
> itself is limited to 31bits of DMA-able address space?  There seems
> to be plenty of documentation on having an individual PCI device set
> this up by setting its DMA mask, but I can't seem to find any
> wording on how to do this for an entire PCI bus.
> 
> Or is this an architecture-specific thing?  MIPS has the
> dma-coherence.h header for defining system-specific DMA quirks, but
> that code seems heavily biased towards PCI devices, and on IP30 at
> least, the Impact and Odyssey video cards are NOT PCI devices at
> all, instead being native XIO devices that support DMA operations.

The PCI core isn't involved much with DMA, so I don't know this off
the top of my head.
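
One possibility -- purely a sketch, and whether clamping the masks from the
MIPS per-device hook is enough for BRIDGE is an assumption (a driver that
later calls dma_set_mask() with a wider mask would still need to be caught
on the arch side):

        #include <linux/dma-mapping.h>
        #include <linux/pci.h>

        /* Assumed hook: called by the MIPS PCI code when a device is enabled. */
        int pcibios_plat_dev_init(struct pci_dev *dev)
        {
                /* Assumption: BRIDGE's IOMMU only reaches 31 bits of DMA space. */
                dev->dev.coherent_dma_mask = DMA_BIT_MASK(31);
                if (dev->dev.dma_mask)
                        *dev->dev.dma_mask = DMA_BIT_MASK(31);

                return 0;
        }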

> > I think the biggest problem is that you need to set up the BRIDGE
> > *before* calling pcibios_scan_bus().  That way the windows
> > ("hose->mem_resource", "hose->io_resource", etc.) will be correct
> > when the PCI core enumerates the devices.  Note that you can have
> > as many windows in the "&resources" list as you need -- the
> > current code there only has one memory and one I/O window, but you
> > can add more.
> 
> Agreed, however, I think this is actually being done to some extent
> already.  We're configuring BRIDGE-specific properties and writing
> some values to BRIDGE registers and the per-slot registers in
> bridge_probe() in the generic pci-bridge.c driver, then, at the very
> end, calling register_pci_controller(), which I believe is what
> kicks off the PCI bus scan.
> 
> The IP30 BRIDGE glue code adds a new function hook called
> "pre_enable" where it does, in ip30-bridge.c, some additional PCI
> device tweaking, but this code will run after the PCI bus scan has
> happened.  My guess is that I want to reverse this and have the IP30
> glue code twiddle the PCI devices on a given BRIDGE before the PCI
> bus scan, right?  

Theoretically you should not need to do any PCI device tweaking: PCI
devices are basically arch-independent and shouldn't need
arch-specific tweaks.  All the arch stuff should be encapsulated in
the host bridge driver, i.e., the code that sets up the window list
and calls pci_scan_root_bus().  In the MIPS case, this code is kind of
spread out and doesn't really look like a single "driver".

The ideal thing would be if you can set up the bridge to always use
large windows.  Then the PCI core will enumerate the devices and set
up their resources automatically.

I assume the small/medium windows exist because there's not enough
address space to always use the large windows.  I don't have any good
suggestions for that -- we don't have support for adjusting the window
sizes based on what devices we find.

> That code uses pci_read_config_dword() and
> pci_write_config_dword() -- are these functions safe to use if we
> haven't already done a PCI bus scan?

You can't use pci_read_config_dword() before we scan the bus because
it requires a "struct pci_dev *", and those are created during the bus
scan.

Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
  2017-02-27 16:36                       ` Bjorn Helgaas
@ 2017-02-28  0:25                         ` Joshua Kinard
  2017-03-01 15:39                           ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Joshua Kinard @ 2017-02-28  0:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On 02/27/2017 11:36, Bjorn Helgaas wrote:
> On Sat, Feb 25, 2017 at 04:34:12AM -0500, Joshua Kinard wrote:
>> On 02/24/2017 13:38, Bjorn Helgaas wrote:
>>> On Fri, Feb 24, 2017 at 03:50:26AM -0500, Joshua Kinard wrote:
> 
>>> The host bridge code, i.e., the pcibios_scanbus() path, tells the
>>> core how big the window is.  In this case, "hose->mem_resource"
>>> contains the CPU physical address range (and size), and
>>> "host->mem_offset" contains the offset between the CPU physical
>>> address and the PCI bus address.  So if "hose->mem_resource" is
>>> only 1GB, that's all the PCI core will use.
>>
>> So basically, this bit of code from the proposed pci-bridge.c needs
>> additional work:
>>
>> 	bc->mem.name = "Bridge MEM";
>> 	bc->mem.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_MEM);
>> 	bc->mem.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO - 1);
>> 	bc->mem.flags = IORESOURCE_MEM;
>>
>> 	bc->io.name = "Bridge IO";
>> 	bc->io.start = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_IO);
>> 	bc->io.end = (NODE_SWIN_BASE(nasid, widget_id) + PCIBR_OFFSET_END - 1);
>> 	bc->io.flags = IORESOURCE_IO;
>>
>> The values I've been using for PCIBR_OFFSET_MEM and PCIBR_OFFSET_IO
>> are obviously wrong.  For IP30, they ultimately direct "BRIDGE MEM"
>> to the first five device slots via small windows, and then "Bridge
>> IO" to the last two slots.  This is one of those "luck" moments that
>> happened to work on BaseIO, but is probably why devices in the
>> secondary PCI shoebox are hit-or-miss.
>>
>> Per your comment further down that I can tell the PCI core about
>> //multiple// windows, probably what I want to do is hardwire access
>> to the small window spaces in the main BRIDGE driver, with
>> machine-specific offsets passed via platform_data.  Small windows,
>> for both IP30 and IP27, are always guaranteed to be available
>> without any additional magic.  This window range can be used to do
>> the initial slot probes to query devices for their desired BAR
>> values.
>>
>> Then, in IP30-specific or IP27-specific code, we add in additional
>> memory windows as needed.  IP30 would simply pass along the
>> hardwired address ranges for BRIDGE MEM or BRIDGE IO via both medium
>> and big window addresses.
>>
>> IP27 is trickier -- we'll need some logic that attempts to use a
>> small window (16MB) mapping first, but if that is too small, we'll
>> have to setup an entry in the HUB's IOTTE buffer that maps a
>> specific block of physical memory to Crosstalk address space.  
> 
> pcibios_scanbus() builds the &resources list of host bridge windows,
> then calls pci_scan_root_bus().  If you're proposing to change the
> list of host bridge windows *after* calling pci_scan_root_bus(), that
> sounds a little problematic because we do not have a PCI core
> interface to do that.

I believe I can determine the size of the windows on BRIDGE before hitting the
generic PCI code.  The bridge_probe() function calls register_pci_controller()
at the very bottom, even in the current mainline file (pci-ip27.c), after it's
twiddled a number of BRIDGE register bits.  I'll first have to try to understand
the 'pre_enable' code from the IP30 patch that changes the window mappings in
ip30-bridge.c, then find a way to move that code so it executes before any of
the generic PCI code runs, since it contains some of the logic to size out the
window mappings.

I'll see about getting it to work on IP30 (Octane) first, because all three
window spaces are available without any special magic.  That way, I'll know
what can be done in the generic BRIDGE driver and what needs to happen in
platform-specific files.  That'll make figuring out the IP27 logic a lot
easier... I hope.


>> This is what IP27 calls "big windows", of which there are seven
>> entries maximum, on 512MB boundaries, per HUB chip.  However, only
>> six entries are available because one IOTTE is used to handle a
>> known hardware bug on early HUB revisions, leaving only six left.
>>
>> As such, the additional window information will go into the
>> machine-specific BRIDGE glue drivers (ip30-bridge.c or
>> ip27-bridge.c).
>>
>> Sound sane?
> 
> Do you need to allocate the big windows based on what PCI devices you
> find?  If so, that is going to be a hard problem.

I believe so.  The IOTTE stuff in the HUB ASIC on IP27 appears to behave like
an extremely simplified TLB: once you work out how much space a given PCI
device needs, you can calculate the matching Crosstalk address and write that
to an unused IOTTE slot.  The mapping doesn't have to be an exact fit -- each
IOTTE entry is aligned on a 512MB boundary, so once a mapping has been set up,
if another PCI device (or BAR?) needs a Crosstalk address within an
already-mapped boundary, you just point it at the existing IOTTE entry.
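
Roughly this, in pseudocode terms (the structures and helper names here are
made up; the only hard rule I'm leaning on is the 512MB granularity):

        #include <linux/types.h>

        #define BWIN_SIZE       (512UL << 20)   /* 512MB big-window granularity */
        #define BWIN_ALIGN(x)   ((x) & ~(BWIN_SIZE - 1))

        /* Made-up per-HUB state: one slot per usable IOTTE entry. */
        struct hub_bwin {
                unsigned long   xtalk_base;     /* Crosstalk address it maps */
                bool            in_use;
        };

        static unsigned long hub_bwin_map(struct hub_bwin *wins, int nwins,
                                          unsigned long xtalk_addr)
        {
                unsigned long base = BWIN_ALIGN(xtalk_addr);
                int i, unused = -1;

                /* Reuse an entry that already covers this 512MB-aligned block. */
                for (i = 0; i < nwins; i++) {
                        if (wins[i].in_use && wins[i].xtalk_base == base)
                                return base;
                        if (!wins[i].in_use && unused < 0)
                                unused = i;
                }
                if (unused < 0)
                        return 0;       /* out of big windows */

                wins[unused].xtalk_base = base;
                wins[unused].in_use = true;
                /* ...program the matching HUB IOTTE register here... */

                return base;
        }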

That's my understanding, anyway, after looking at really old IA64 code in 2.5.70
that used virtually the same hardware when SGI was bringing up the Altix
platform.  Specifically, hub_piomap_alloc() on Line 117 in
Linux-2.5.70/arch/ia64/sn/io/io.c:
https://git.linux-mips.org/cgit/ralf/linux.git/tree/arch/ia64/sn/io/io.c?h=linux-2.5.70

So I just need to make sure all of this happens during the initial
bridge_probe() function, so that everything is ready before calling
register_pci_controller().  We'll find out!


>>> PCIe functions have 4K of config space each, so a 0x1000 offset
>>> makes sense.  Whether a platform can access all of it is a
>>> separate question.
>>
>> No need to worry about PCIe on this platform.  You're talking about
>> 15+-year old hardware that was built to last.  I believe SGI only
>> implemented up to PCI 2.1 in the BRIDGE and XBRIDGE ASICs.  It's
>> possible the PIC ASIC might do 2.2 or 3.0.  But I don't have to
>> worry about PIC for a good while, as that's only found in the
>> IP35-class of hardware (Origin 300/350/3000, Fuel, Tezro, Onyx4,
>> etc).  We need fully-working IP27 and IP30 first, because much of
>> the working logic can then be re-purposed for bringing up IP35
>> hardware.  Especially since IA64 did a lot of the work already for
>> the Altix systems, which is just IP35 with an IA64 CPU, but I
>> digress...
> 
> OK.  If you only plug in PCI devices, there should be no problem.  If
> you did plug in PCIe devices (via a PCI-to-PCIe bridge, for example),
> the core should still enumerate them correctly and they should be
> functional, though some PCIe-only features like AER won't work.

There might be physical limitations on using any kind of PCIe bridge device.
Getting a PCI device into an SGI Octane (IP30) or Origin 2000/Onyx2 (IP27) that
isn't either the BaseIO on IP30 or IO6 on IP27 requires an XIO Shoehorn or PCI
Shoebox.  There's a little bit of leeway on how wide a PCI card you can
stuff into either one, but it's not much.  I'll have to measure a shoehorn if I
can find one of my spares in a bin someplace.  The Origin 200 tower machine
might let you get away with it...but that's a corner case in my book and not
really worth worrying about.  It's hard to find one of those completely
intact these days, let alone with a working power supply that doesn't double as
a fireworks display.


>>> The type 1 config accessors (pci_conf1_read_config(),
>>> pci_conf1_write_config()) are for devices below a PCI-to-PCI
>>> bridge.  It looks like there's a single 4K memory-mapped window,
>>> and you point it at a specific bus & device with
>>> bridge->b_pci_cfg.
>>
>> Yeah, PCI-to-PCI bridges are a known-broken item right now.  Since
>> there appears to only be a single Type 1 config space, if I have a
>> PCI Shoebox with three PCI-X slots, and I stick three USB 2.0 boards
>> in them, I have to pick which one can handle PCI-to-PCI bridges,
>> like USB hubs, and program that into this Type 1 register, right?
> 
> I don't think USB hubs are relevant here because they are completely
> in the USB domain.  The USB 2.0 board is a USB host controller with a
> PCI interface on it.  The USB hub is on the USB side and is invisible
> to the PCI core.
> 
> The way I read the BSD code, it looks like you should be able to use
> the Type 1 config interface to generate accesses to any PCI bus, as
> long as you serialize them, e.g., with a lock around the use of
> b_type1_cfg.

Huh, well, I've never actually tried a USB Hub device on my Octane.  If I can
fix PCI up properly and get a 2.0 card to actually work, I'll try it out before
looking too deeply into the Type 1 stuff.


>>> The Linux type 0 accessors look like they support 4K config space
>>> per function, while the type 1 accessors put the function number
>>> in bits 8-10, so it looks like they only support 256 bytes per
>>> function.
>>>
>>> The OpenBSD accessors (xbridge_conf_read() and
>>> xbridge_conf_write()) look like they only support 256 bytes per
>>> function regardless of whether it's type 0 or type 1.
>>
>> I'll have to pass this question to the OpenBSD dev who maintains the
>> xbridge driver.  He understands this hardware far better than I do,
>> and there might be a reason why they do it that way.
> 
> If you only support PCI devices, 256 bytes of config space is enough.
> The 4K config space is only supported for PCI-X Mode 2 and PCIe
> devices.  Even those PCI-X Mode 2 and PCIe devices should be
> functional with only 256 bytes.

It's unknown why SGI left that much space between the registers on the BRIDGE.
Could've been future-proofing, or maybe there really is a PCI or PCI-X device
out there, currently IRIX-only, that needed it all.  If I can
get things to work better than they are now, I've got a bit of a collection of
PCI devices to test things out with.  Might expose any wrong assumptions.


>>> Supporting 4K of space for type 0 seems like a potential Linux
>>> problem.  Also, pci_conf1_write_config() uses b_type0_cfg_dev in
>>> one place where it looks like it should be using b_type1_cfg.
> 
> I didn't word this well.  I was trying to point out things that look
> like bugs in pci_conf1_write_config() (the use of b_type0_cfg_dev) and
> maybe pci_conf0_read_config() (the fact that it allows 4K config space
> but the hardware probably only supports 256 bytes).

If you're referring to those uses in ops-bridge.c, that file probably needs
some additional TLC.  The patchset I've got fixed it up a little, but looking
at it made my head hurt (and not just because of the IOC3 voodoo in it), so I
kinda passed on spending too much time on it.  I'll probably have to double
back on it in the future after fixing the other areas, so I'll keep this in mind.


>>> If the hardware only supports 256 bytes of config space on
>>> non-root bus devices, that's not a disaster.  We should still be
>>> able to enumerate them and use all the conventional PCI features
>>> and even basic PCIe features.  But the extended config space
>>> (offsets 0x100-0xfff) would be inaccessible, and we wouldn't see
>>> any PCIe extended capabilities (AER, VC, SR-IOV, etc., see
>>> PCI_EXT_CAP_ID_ERR and subsequent definitions) because they live
>>> in that space.
>>
>> I do not think that the BRIDGE supports AER, VC, or SR-IOV at all,
>> primarily due to age of the hardware.  Even if a PCI device that
>> does support those is plugged into a BRIDGE, I am not 100% certain
>> those features would even be usable if the BRIDGE doesn't know about
>> them.  Or is this a case of where the BRIDGE wouldn't care and once
>> it programs the BARs correctly, these features would be available to
>> the Linux drivers?
> 
> These are mostly PCIe features and I don't think you should worry
> about them, at least for now.

Good to know then.  Hopefully never :)


>> I remember from the OpenBSD xbridge driver, it states that it
>> currently uses the BAR mappings setup by the ARCS firmware, but the
>> plan is to eventually move away from those mappings and calculate
>> its own.  I didn't quite get that, but after reading that site, now
>> I do.  ARCS is writing ~0 to the BARs to query the device to figure
>> out how much memory each BAR wants, then programs in some ranges for
>> us.
>>
>> By tallying up the total requested size of all BAR mappings on a
>> given device, we can then make a determination on whether we can use
>> a small, medium, or large window and then program the BAR with the
>> correct address.  Right?
> 
> This is the problem I alluded to above: Linux does not currently
> support anything like this.  We assume the Linux host bridge driver
> knows the window sizes at the beginning of time (it may learn them
> from firmware or it may have hard-coded knowledge of the address map).
> 
> Linux does enumerate the devices and tally up all the BAR sizes (see
> pci_bus_assign_resources()), but it does not have a way to change the
> host bridge windows based on how much BAR space we need.
> 
> The common thing is to use whatever host bridge windows and device BAR
> values were set up by firmware.  If those are all sensible, Linux
> won't change anything.

Well, as I stated earlier, it looks like the bridge_probe() function in the
new driver will have to do some probing of its own.  One of the patches I have
in a different series does a minimal probe to read each device slot on BRIDGE
to get the vendor ID and device ID to see if the slot is actually populated.
This is because not all slots are "populated"; for example, Octane wires one of
the PCI interrupt pins on slot 6 on BaseIO to its power button.

So the logic used to read the vendor/dev IDs can probably be extended to poke
the BARs via the ~0 trick OR read out the pre-defined mappings from ARCS
firmware, then use that info to size out the BRIDGE windows into Crosstalk
space and then set up our IORESOURCE_MEM and IORESOURCE_IO structs with the
right information to pass into register_pci_controller().
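
Something like this for the presence check, extended with the BAR pokes later
(bridge_slot_cfg_readl() is a stand-in for whatever raw type 0 config read the
probe ends up using):

        #include <linux/pci_regs.h>
        #include <linux/types.h>

        /* Returns true if something answers in this BRIDGE slot.
         * bridge_slot_cfg_readl() is a hypothetical raw config accessor. */
        static bool bridge_slot_present(bridge_t *bridge, int slot)
        {
                u32 id = bridge_slot_cfg_readl(bridge, slot, PCI_VENDOR_ID);

                /* An empty (or wired-off) slot reads back as all-ones. */
                return (id & 0xffff) != 0xffff;
        }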

I dunno, I'll probably try reading the ARCS mappings idea first, as that seems
easier.  I won't have a lot of time to play with things in March, so getting
something to work, even if it is sub-optimal, is better than nothing working at
all.


>>> The host bridge windows (the "root bus resource" lines in dmesg)
>>> are only for PIO, i.e., a driver running on the CPU reading or
>>> writing registers on the PCI device (it could be actual registers,
>>> or a frame buffer, etc., but it resides on the device and its PCI
>>> bus address is determined by a BAR).  The driver uses ioremap(),
>>> pci_iomap(), pcim_iomap_regions(), etc. to map these into the
>>> kernel virtual space.
>>
>> So for the MIPS case, knowing that ioremap() and friends is not
>> guaranteed to return a workable virtual address, I need to be
>> careful of what addresses I program into each BAR.  E.g., given that
>> on IP30, if a physical address starts with 0x9000000f800xxxxx, it is
>> using medium windows to talk to a device on the BaseIO BRIDGE (Xtalk
>> widget 0xf).  As such, knowing MIPS' ioremap() returns, for the
>> R10000 CPU case, the requested address OR'ed with IO_BASE
>> (0x9000000000000000), I want to tell the PCI core to use
>> 0x0000000f800xxxxx so that ioremap() will return back
>> 0x9000000f800xxxxx?
> 
> If ioremap() doesn't return a virtual address, or at least something
> that can be used like a virtual address, I think that's fundamentally
> broken.  All drivers assume they should ioremap() a BAR resource,
> i.e., the CPU physical address that maps to a PCI BAR value, and use
> the result as a virtual address.

This is an issue that the core Linux/MIPS people will have to work out.  Ralf
probably knows the precise history on why ioremap*() and related functions
aren't guaranteed to be usable as virtual addresses.  I suspect it's tied to
the other issues with IP27, as that was one of the first MIPS platforms that
Linux ran on in the early 2000s.


>> And since I can apparently specify multiple window ranges for memory
>> space and I/O space, I probably want to, as stated earlier, specify
>> all three of IP30's known window ranges for each BRIDGE so that the
>> Linux PCI core can walk each resource struct and find a matching
>> window?
> 
> If all three windows are disjoint, you can specify them all.  If a
> range in the small window is aliased and can also be reached via a
> medium or large window, you should not specify both.

Each window, at least on Octane, lives at distinct addresses in Crosstalk
space.  A specific address within each window can point to the same device on a
subordinate PCI bus, but I think that will be transparent to the Linux PCI
core.  My thinking is, if I set up IORESOURCE_MEM structs for each window for
each BRIDGE widget, then the PCI code will check each struct to see which one
contains the address range that the PCI device's BAR wants.

E.g., using widget 0xd as an example for readability, if I know my three
windows in Crosstalk space are:
    Small:  0x0000_1d00_0000 - 0x0000_1dff_ffff
    Medium: 0x000e_8000_0000 - 0x000e_ffff_ffff
    Large:  0x00d0_0000_0000 - 0x00df_ffff_ffff

And for that specific BRIDGE, if I pass those three ranges as IORESOURCE_MEM
structs, then shouldn't the PCI core, if it is told that a specific device's
BAR wants range 1d200000 to 1d203fff, select the small window mapping?  E.g.,
is it going to try to find the smallest possible window that fits first?  Or
will it instead try to match using the large window?

I guess I'll find out when I can get some time to actually test the idea out...
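
Until then, for widget 0xd the three windows would look roughly like this as
resources (sketch; the offsets between Crosstalk and PCI bus space are exactly
the part I still need to work out):

        static struct resource bridge_d_small_mem = {
                .name  = "Bridge MEM (small)",
                .start = 0x00001d000000UL,
                .end   = 0x00001dffffffUL,
                .flags = IORESOURCE_MEM,
        };

        static struct resource bridge_d_medium_mem = {
                .name  = "Bridge MEM (medium)",
                .start = 0x000e80000000UL,
                .end   = 0x000effffffffUL,
                .flags = IORESOURCE_MEM,
        };

        static struct resource bridge_d_large_mem = {
                .name  = "Bridge MEM (large)",
                .start = 0x00d000000000UL,
                .end   = 0x00dfffffffffUL,
                .flags = IORESOURCE_MEM,
        };

All three (plus the I/O window) would then go onto the resource list before
register_pci_controller() is called.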


[snip]

>>> I think the biggest problem is that you need to set up the BRIDGE
>>> *before* calling pcibios_scan_bus().  That way the windows
>>> ("hose->mem_resource", "hose->io_resource", etc.) will be correct
>>> when the PCI core enumerates the devices.  Note that you can have
>>> as many windows in the "&resources" list as you need -- the
>>> current code there only has one memory and one I/O window, but you
>>> can add more.
>>
>> Agreed, however, I think this is actually being done to some extent
>> already.  We're configuring BRIDGE-specific properties and writing
>> some values to BRIDGE registers and the per-slot registers in
>> bridge_probe() in the generic pci-bridge.c driver, then, at the very
>> end, calling register_pci_controller(), which I believe is what
>> kicks off the PCI bus scan.
>>
>> The IP30 BRIDGE glue code adds a new function hook called
>> "pre_enable" where it does, in ip30-bridge.c, some additional PCI
>> device tweaking, but this code will run after the PCI bus scan has
>> happened.  My guess is that I want to reverse this and have the IP30
>> glue code twiddle the PCI devices on a given BRIDGE before the PCI
>> bus scan, right?  
> 
> Theoretically you should not need to do any PCI device tweaking: PCI
> devices are basically arch-independent and shouldn't need
> arch-specific tweaks.  All the arch stuff should be encapsulated in
> the host bridge driver, i.e., the code that sets up the window list
> and calls pci_scan_root_bus().  In the MIPS case, this code is kind of
> spread out and doesn't really look like a single "driver".

In an ideal world, this would be correct.  However, owning SGI equipment must
have an effect on the curvature of local spacetime, and things seem to operate
a bit differently.  The IOC3 device is a perfect example of this phenomenon.


> The ideal thing would be if you can set up the bridge to always use
> large windows.  Then the PCI core will enumerate the devices and set
> up their resources automatically.

I'll keep this in mind when I can start testing.  I'll initially try feeding
three IORESOURCE_MEM structs to the PCI core to describe the three windows in
order from smallest to largest and see what it does, then try reversing the
order to see whether it still finds the right window, and whether that actually
leads to correct probing of the PCI devices.


> I assume the small/medium windows exist because there's not enough
> address space to always use the large windows.  I don't have any good
> suggestions for that -- we don't have support for adjusting the window
> sizes based on what devices we find.

IMHO, "Use a bigger hammer" seems most appropriate when it comes to this hardware.


>> That code uses pci_read_config_dword() and
>> pci_write_config_dword() -- are these functions safe to use if we
>> haven't already done a PCI bus scan?
> 
> You can't use pci_read_config_dword() before we scan the bus because
> it requires a "struct pci_dev *", and those are created during the bus
> scan.

Okay, that's good to know then.  There are other accessors that are
MIPS-specific that I think I can use.  I don't think I specifically need a
'struct pci_dev *' at that point anyway.


-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case
  2017-02-28  0:25                         ` Joshua Kinard
@ 2017-03-01 15:39                           ` Bjorn Helgaas
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2017-03-01 15:39 UTC (permalink / raw)
  To: Joshua Kinard
  Cc: Bjorn Helgaas, Ralf Baechle, James Hogan, Lorenzo Pieralisi,
	Thomas Bogendoerfer, Linux/MIPS, linux-pci

On Mon, Feb 27, 2017 at 07:25:58PM -0500, Joshua Kinard wrote:
> On 02/27/2017 11:36, Bjorn Helgaas wrote:
> > On Sat, Feb 25, 2017 at 04:34:12AM -0500, Joshua Kinard wrote:
> >> On 02/24/2017 13:38, Bjorn Helgaas wrote:
> >>> On Fri, Feb 24, 2017 at 03:50:26AM -0500, Joshua Kinard wrote:

> So the logic used to read the vendor/dev IDs can probably be
> extended to poke the BARs via the ~0 trick OR read out the
> pre-defined mappings from ARCS firmware, then use that info to size
> out the BRIDGE windows into Crosstalk space and then set up our
> IORESOURCE_MEM and IORESOURCE_IO structs with the right information
> to pass into register_pci_controller().

As you probably know, you can determine the size of a BAR regardless of
whether it has been assigned.  So it doesn't matter whether ARCS has
already assigned anything.  See __pci_read_base().
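
The decode itself is simple enough to illustrate (this is only an illustration
of what __pci_read_base() does, not its actual code; before the scan the same
reads and writes would have to go through whatever raw accessor bridge_probe()
uses):

        /* Illustration of the sizing dance __pci_read_base() performs. */
        static u32 bar_size(struct pci_dev *dev, int pos)
        {
                u32 orig, sz, mask;

                pci_read_config_dword(dev, pos, &orig);
                pci_write_config_dword(dev, pos, 0xffffffff);
                pci_read_config_dword(dev, pos, &sz);
                pci_write_config_dword(dev, pos, orig);         /* restore */

                mask = (orig & PCI_BASE_ADDRESS_SPACE_IO) ?
                        (u32)PCI_BASE_ADDRESS_IO_MASK :
                        (u32)PCI_BASE_ADDRESS_MEM_MASK;
                sz &= mask;
                if (!sz)                /* BAR not implemented */
                        return 0;

                return sz & ~(sz - 1);  /* lowest writable bit = decode size */
        }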

> I dunno, I'll probably try reading the ARCS mappings idea first, as
> that seems easier.  I won't have a lot of time to play with things
> in March, so getting something to work, even if it is sub-optimal,
> is better than nothing working at all.

The simplest possible thing is to enable some pre-defined set of
windows and accept that this may lead to configuration restrictions,
e.g., a device with large BARs may work only in certain slots.

> Each window, at least on Octane, lives at distinct addresses in
> Crosstalk space.  A specific address within each window can point to
> the same device on a subordinate PCI bus, but I think that will be
> transparent to the Linux PCI core.  My thinking is, if I setup
> IORESOURCE_MEM structs for each window for each BRIDGE widget, then
> the PCI code will check each struct to see which one contains the
> address range that the PCI device's BAR wants.
> 
> E.g., Using widget 0xd as an example for readability, if I know my
> three windows in Crosstalk space are:
>     Small:  0x0000_1d00_0000 - 0x0000_1dff_ffff
>     Medium: 0x000e_8000_0000 - 0x000e_ffff_ffff
>     Large:  0x00d0_0000_0000 - 0x00df_ffff_ffff
> 
> And for that specific BRIDGE, if I pass those three ranges as
> IORESOURCE_MEM structs, then shouldn't the PCI core, if it is told
> that a specific devices BAR wants range 1d200000 to 1d203fff, select
> the small window mapping?  E,g., is it going to try to find the
> smallest possible window to fit first?  Or will it instead try to
> match using the large window?

No, when matching a BAR value to a host bridge window, the PCI core
will definitely not look for the smallest (or largest) window that
contains the BAR.  

The core reads the BAR value (a bus address), then searches the host
bridge windows for one that maps to a region that contains the bus
address.  This happens in pcibios_bus_to_resource(), which is called
by __pci_read_base().

The search order is undefined because the core assumes the windows do
not overlap in PCI bus address space.  It can't deal with two windows
that map to the same PCI space because, when we allocate space while
assigning a BAR, we search for available CPU address space, not for
available PCI bus address space.

If window A and window B both map to PCI bus address X, we may assign
space from window A to PCI BAR 1 and space from window B to PCI BAR 2.
Then we have two BARs at the same PCI bus address, which will cause
conflicts.

I don't think we currently check in the PCI core for host bridge
windows that map to overlapping PCI bus space.  Maybe we should, and
just reject anything that overlaps.
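
If you want to be defensive in your own bridge driver in the meantime, the
check is cheap to do before registering the windows (sketch; the win struct
here is just a made-up pairing of each prospective window with its CPU-to-bus
offset):

        #include <linux/ioport.h>
        #include <linux/types.h>

        struct win {
                struct resource         res;    /* CPU physical range */
                resource_size_t         offset; /* CPU address - PCI bus address */
        };

        /* Sketch: reject window sets whose PCI bus ranges overlap. */
        static bool windows_overlap(const struct win *w, int n)
        {
                int i, j;

                for (i = 0; i < n; i++) {
                        for (j = i + 1; j < n; j++) {
                                /* Compare in PCI bus address space, not CPU space. */
                                resource_size_t a0 = w[i].res.start - w[i].offset;
                                resource_size_t a1 = w[i].res.end   - w[i].offset;
                                resource_size_t b0 = w[j].res.start - w[j].offset;
                                resource_size_t b1 = w[j].res.end   - w[j].offset;

                                if (a0 <= b1 && b0 <= a1)
                                        return true;
                        }
                }
                return false;
        }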

Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-03-01 15:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20170207061356.8270-1-kumba@gentoo.org>
     [not found] ` <20170207061356.8270-13-kumba@gentoo.org>
     [not found]   ` <CAErSpo6yKAE1_c1eZJapnjD1g0pocyOxed3_Eumdp_026uhDuA@mail.gmail.com>
     [not found]     ` <eafc94c6-1931-e2ce-7e03-d84d8e181e81@gentoo.org>
     [not found]       ` <CAErSpo4LsrPCtdZwp6CyT0jKhXLt3j=fGSiFjpRRTPUjFoKHtQ@mail.gmail.com>
2017-02-12  4:09         ` [PATCH 12/12] MIPS: PCI: Fix IP27 for the PCI_PROBE_ONLY case Joshua Kinard
2017-02-13 22:45           ` Bjorn Helgaas
2017-02-14  7:39             ` Joshua Kinard
2017-02-14 14:56               ` Bjorn Helgaas
2017-02-24  8:50                 ` Joshua Kinard
2017-02-24 18:38                   ` Bjorn Helgaas
2017-02-25  9:34                     ` Joshua Kinard
2017-02-27 16:36                       ` Bjorn Helgaas
2017-02-28  0:25                         ` Joshua Kinard
2017-03-01 15:39                           ` Bjorn Helgaas
