* Re: Question about supporting AMD eGPU hot plug case [not found] ` <8d7e2d7b7982d8d78c0ecaa74b9d40ace4db8453.camel@yadro.com> @ 2021-03-04 19:49 ` Andrey Grodzovsky 2021-03-05 16:08 ` Sergei Miroshnichenko 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-04 19:49 UTC (permalink / raw) To: Sergei Miroshnichenko Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci [-- Attachment #1: Type: text/plain, Size: 3049 bytes --] + linux-pci On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: > On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: >> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: >>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: >>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: >>>>> ... >>>> Are you saying that even without hot-plugging, while both nvme >>>> and >>>> AMD >>>> card are present >>>> right from boot, you still get BARs moving and MMIO ranges >>>> reassigned >>>> for NVME BARs >>>> just because amdgpu driver will start resize of AMD card BARs and >>>> this >>>> will trigger NVMEs BARs move to >>>> allow AMD card BARs to cover full range of VIDEO RAM ? >>> Yes. Unconditionally, because it is unknown beforehand if NVMe's >>> BAR >>> movement will help. In this particular case BAR movement is not >>> needed, >>> but is done anyway. >>> >>> BARs are not moved one by one, but the kernel releases all the >>> releasable ones, and then recalculates a new BAR layout to fit them >>> all. Kernel's algorithm is different from BIOS's, so NVME has >>> appeared >>> at a new place. >>> >>> This is triggered by following: >>> - at boot, if BIOS had assigned not every BAR; >>> - during pci_resize_resource(); >>> - during pci_rescan_bus() -- after a pciehp event or a manual via >>> sysfs. >> >> By manual via sysfs you mean something like this - 'echo 1 > >> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > >> /sys/bus/pci/rescan ' ? 
I am looking into how to most reliably trigger >> PCI >> code to call my callbacks even without having an external PCI cage for >> GPU >> (will take me some time to get it). > > Yeah, this is our way to go when a device can't be physically removed > or unpowered remotely. With just a bit shorter path: > > sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove' > sudo sh -c 'echo 1 > /sys/bus/pci/rescan' > > Or, just the second command (rescan) is enough: a BAR movement attempt > will be triggered even if there were no changes in PCI topology. > > Serge > Hi Sergei, Here is a link to the initial implementation on top of your tree (movable_bars_v9.1) - https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=05d6abceed650181bb7fe0a49884a26e378b908e I am able to pass one re-scan cycle and can use the card afterwards (see log1.log). But, according to your prints, only BAR 5, which is the registers BAR, was updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> 0xfc100000), while I am interested in testing a BAR 0 (graphics RAM) move, since that is where most of the complexity is. Is there a way to hack your code to force this? When testing with two graphics cards and triggering a rescan, a hard hang of the system happens during rescan_prepare of the second card when stopping the HW (see log2.log) - I don't understand why this would happen, as each of them passes fine when tested standalone, and there should be no interdependence between them as far as I know. Do you have any idea?
Andrey [-- Attachment #2: log2.log --] [-- Type: text/x-log, Size: 1253 bytes --] Dual Vega + Polaris 11 - hang [ 102.022609 < 12.522871>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_prepare - start [ 102.301619 < 0.279010>] [drm] psp command (0x2) failed and response status is (0x117) [ 102.301702 < 0.000083>] [drm] free PSP TMR buffer [ 102.334903 < 0.033201>] amdgpu 0000:0b:00.0: BAR 5: releasing [mem 0xfcb00000-0xfcb7ffff] [ 102.334925 < 0.000022>] amdgpu 0000:0b:00.0: BAR 2: releasing [mem 0xf0000000-0xf01fffff 64bit pref] [ 102.334941 < 0.000016>] amdgpu 0000:0b:00.0: BAR 0: releasing [mem 0xe0000000-0xefffffff 64bit pref] [ 102.334955 < 0.000014>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_prepare - end [ 102.354480 < 0.019525>] amdgpu 0000:0c:00.0: amdgpu: amdgpu_device_rescan_prepare - start [ 102.355059 < 0.000579>] amdgpu: last message was failed ret is 65535 [ 102.355073 < 0.000014>] amdgpu: failed to send message 261 ret is 65535 [ 102.355087 < 0.000014>] amdgpu: last message was failed ret is 65535 [ 102.355097 < 0.000010>] amdgpu: failed to send message 261 ret is 65535 [ 102.355110 < 0.000013>] amdgpu: [-- Attachment #3: log1.log --] [-- Type: text/x-log, Size: 17571 bytes --] (reverse-i-search)`cat': ^Ct /sys/kernel/debug/dri/0/amdgpu_gpu_recover root@andrey-test:~# echo 1 > /sys/bus/pci/rescan [ 126.923571 < 16.141524>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_prepare - start [ 127.210952 < 0.287381>] [drm] psp command (0x2) failed and response status is (0x117) [ 127.211037 < 0.000085>] [drm] free PSP TMR buffer [ 127.244310 < 0.033273>] amdgpu 0000:0b:00.0: BAR 5: releasing [mem 0xfcc00000-0xfcc7ffff] [ 127.244333 < 0.000023>] amdgpu 0000:0b:00.0: BAR 2: releasing [mem 0xf0000000-0xf01fffff 64bit pref] [ 127.244348 < 0.000015>] amdgpu 0000:0b:00.0: BAR 0: releasing [mem 0xe0000000-0xefffffff 64bit pref] [ 127.244363 < 0.000015>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_prepare - end [ 127.244932 < 0.000569>] 
pcieport 0000:03:07.0: resource 13 [io 0xe000-0xefff] released [ 127.244949 < 0.000017>] pcieport 0000:02:00.2: resource 13 [io 0xe000-0xefff] released [ 127.244964 < 0.000015>] pcieport 0000:00:01.3: resource 13 [io 0xe000-0xefff] released [ 127.244978 < 0.000014>] pcieport 0000:0a:00.0: resource 13 [io 0xd000-0xdfff] released [ 127.244992 < 0.000014>] pcieport 0000:09:00.0: resource 13 [io 0xd000-0xdfff] released [ 127.245005 < 0.000013>] pcieport 0000:00:03.1: resource 13 [io 0xd000-0xdfff] released [ 127.245019 < 0.000014>] pcieport 0000:00:01.1: resource 14 [mem 0xfcf00000-0xfcffffff] released [ 127.245034 < 0.000015>] pcieport 0000:00:01.3: resource 14 [mem 0xfc900000-0xfcbfffff] released [ 127.245047 < 0.000013>] pcieport 0000:00:03.1: resource 14 [mem 0xfcc00000-0xfcdfffff] released [ 127.245062 < 0.000015>] pcieport 0000:00:07.1: resource 14 [mem 0xfc600000-0xfc8fffff] released [ 127.245076 < 0.000014>] pcieport 0000:00:08.1: resource 14 [mem 0xfce00000-0xfcefffff] released [ 127.245091 < 0.000015>] pcieport 0000:00:03.1: resource 15 [mem 0xe0000000-0xf01fffff 64bit pref] released [ 127.245311 < 0.000220>] pcieport 0000:00:01.3: BAR 13: assigned [io 0xe000-0xefff] [ 127.245335 < 0.000024>] pcieport 0000:00:07.1: BAR 14: assigned [mem 0xfc400000-0xfc6fffff] [ 127.245359 < 0.000024>] pcieport 0000:00:01.3: BAR 14: assigned [mem 0xfc900000-0xfcbfffff] [ 127.245383 < 0.000024>] pcieport 0000:00:08.1: BAR 14: assigned [mem 0xfce00000-0xfcefffff] [ 127.245470 < 0.000087>] pcieport 0000:00:03.1: BAR 15: assigned [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.245492 < 0.000022>] pcieport 0000:00:01.1: BAR 14: assigned [mem 0xfc000000-0xfc0fffff] [ 127.245512 < 0.000020>] pcieport 0000:00:03.1: BAR 14: assigned [mem 0xfc100000-0xfc2fffff] [ 127.245534 < 0.000022>] pcieport 0000:00:03.1: BAR 13: assigned [io 0x1000-0x1fff] [ 127.245613 < 0.000079>] nvme 0000:01:00.0: BAR 0: assigned [mem 0xfc000000-0xfc003fff 64bit] [ 127.245639 < 0.000026>] nvme 0000:01:00.0: BAR 0 
updated: 0xfcf00000 -> 0xfc000000 [ 127.245755 < 0.000116>] pcieport 0000:02:00.2: BAR 13: assigned [io 0xe000-0xefff] [ 127.245771 < 0.000016>] pcieport 0000:02:00.2: BAR 14: assigned [mem 0xfc900000-0xfcafffff] [ 127.245785 < 0.000014>] ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff pref] [ 127.245798 < 0.000013>] ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000-0xfcb9ffff] [ 127.245812 < 0.000014>] xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000-0xfcba7fff 64bit] [ 127.245939 < 0.000127>] pcieport 0000:03:07.0: BAR 13: assigned [io 0xe000-0xefff] [ 127.245955 < 0.000016>] pcieport 0000:03:07.0: BAR 14: assigned [mem 0xfc900000-0xfc9fffff] [ 127.245971 < 0.000016>] pcieport 0000:03:04.0: BAR 14: assigned [mem 0xfca00000-0xfcafffff] [ 127.246040 < 0.000069>] xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000-0xfca07fff 64bit] [ 127.246126 < 0.000086>] igb 0000:07:00.0: BAR 2: assigned fixed [io 0xe000-0xe01f] [ 127.246138 < 0.000012>] igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff] [ 127.246151 < 0.000013>] igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff] [ 127.246278 < 0.000127>] pcieport 0000:09:00.0: BAR 15: assigned [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.246295 < 0.000017>] pcieport 0000:09:00.0: BAR 14: assigned [mem 0xfc100000-0xfc1fffff] [ 127.246311 < 0.000016>] pcieport 0000:09:00.0: BAR 0: assigned [mem 0xfc200000-0xfc203fff] [ 127.246327 < 0.000016>] pcieport 0000:09:00.0: BAR 0 updated: 0xfcd00000 -> 0xfc200000 [ 127.246340 < 0.000013>] pcieport 0000:09:00.0: BAR 13: assigned [io 0x1000-0x1fff] [ 127.246583 < 0.000243>] pcieport 0000:0a:00.0: BAR 15: assigned [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.246619 < 0.000036>] pcieport 0000:0a:00.0: BAR 14: assigned [mem 0xfc100000-0xfc1fffff] [ 127.246648 < 0.000029>] pcieport 0000:0a:00.0: BAR 13: assigned [io 0x1000-0x1fff] [ 127.246951 < 0.000303>] amdgpu 0000:0b:00.0: BAR 0: assigned [mem 
0xe0000000-0xefffffff 64bit pref] [ 127.247000 < 0.000049>] amdgpu 0000:0b:00.0: BAR 2: assigned [mem 0xf0000000-0xf01fffff 64bit pref] [ 127.247040 < 0.000040>] amdgpu 0000:0b:00.0: BAR 5: assigned [mem 0xfc100000-0xfc17ffff] [ 127.247066 < 0.000026>] amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> 0xfc100000 [ 127.247099 < 0.000033>] pci 0000:0b:00.1: BAR 0: assigned [mem 0xfc180000-0xfc183fff] [ 127.247124 < 0.000025>] pci 0000:0b:00.1: BAR 0 updated: 0xfcca0000 -> 0xfc180000 [ 127.247145 < 0.000021>] amdgpu 0000:0b:00.0: BAR 4: assigned [io 0x1000-0x10ff] [ 127.247169 < 0.000024>] amdgpu 0000:0b:00.0: BAR 4 updated: 0xd000 -> 0x1000 [ 127.247365 < 0.000196>] xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000-0xfc6fffff 64bit] [ 127.247410 < 0.000045>] pci 0000:0c:00.2: BAR 2: assigned [mem 0xfc400000-0xfc4fffff] [ 127.247434 < 0.000024>] pci 0000:0c:00.2: BAR 2 updated: 0xfc700000 -> 0xfc400000 [ 127.247457 < 0.000023>] pci 0000:0c:00.2: BAR 5: assigned [mem 0xfc500000-0xfc501fff] [ 127.247480 < 0.000023>] pci 0000:0c:00.2: BAR 5 updated: 0xfc800000 -> 0xfc500000 [ 127.247590 < 0.000110>] ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000-0xfce08fff] [ 127.247632 < 0.000042>] pci 0000:0d:00.3: BAR 0: assigned [mem 0xfce00000-0xfce07fff] [ 127.247676 < 0.000044>] pci_bus 0000:00: resource 4 [io 0x0000-0x03af window] [ 127.247695 < 0.000019>] pci_bus 0000:00: resource 5 [io 0x03e0-0x0cf7 window] [ 127.247714 < 0.000019>] pci_bus 0000:00: resource 6 [io 0x03b0-0x03df window] [ 127.247742 < 0.000028>] pci_bus 0000:00: resource 7 [io 0x0d00-0xefff window] [ 127.247759 < 0.000017>] pci_bus 0000:00: resource 8 [mem 0x000a0000-0x000bffff window] [ 127.247779 < 0.000020>] pci_bus 0000:00: resource 9 [mem 0x000c0000-0x000dffff window] [ 127.247808 < 0.000029>] pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff window] [ 127.247827 < 0.000019>] pci_bus 0000:00: resource 11 [mem 0xfee00000-0xffffffff window] [ 127.247848 < 0.000021>] pci_bus 0000:01: 
resource 1 [mem 0xfc000000-0xfc0fffff] [ 127.247877 < 0.000029>] pci_bus 0000:02: resource 0 [io 0xe000-0xefff] [ 127.247893 < 0.000016>] pci_bus 0000:02: resource 1 [mem 0xfc900000-0xfcbfffff] [ 127.247913 < 0.000020>] pci_bus 0000:03: resource 0 [io 0xe000-0xefff] [ 127.247940 < 0.000027>] pci_bus 0000:03: resource 1 [mem 0xfc900000-0xfcafffff] [ 127.247958 < 0.000018>] pci_bus 0000:05: resource 1 [mem 0xfca00000-0xfcafffff] [ 127.247977 < 0.000019>] pci_bus 0000:07: resource 0 [io 0xe000-0xefff] [ 127.247995 < 0.000018>] pci_bus 0000:07: resource 1 [mem 0xfc900000-0xfc9fffff] [ 127.248015 < 0.000020>] pci_bus 0000:09: resource 0 [io 0x1000-0x1fff] [ 127.248031 < 0.000016>] pci_bus 0000:09: resource 1 [mem 0xfc100000-0xfc2fffff] [ 127.248051 < 0.000020>] pci_bus 0000:09: resource 2 [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248078 < 0.000027>] pci_bus 0000:0a: resource 0 [io 0x1000-0x1fff] [ 127.248095 < 0.000017>] pci_bus 0000:0a: resource 1 [mem 0xfc100000-0xfc1fffff] [ 127.248114 < 0.000019>] pci_bus 0000:0a: resource 2 [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248143 < 0.000029>] pci_bus 0000:0b: resource 0 [io 0x1000-0x1fff] [ 127.248160 < 0.000017>] pci_bus 0000:0b: resource 1 [mem 0xfc100000-0xfc1fffff] [ 127.248179 < 0.000019>] pci_bus 0000:0b: resource 2 [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248209 < 0.000030>] pci_bus 0000:0c: resource 1 [mem 0xfc400000-0xfc6fffff] [ 127.248227 < 0.000018>] pci_bus 0000:0d: resource 1 [mem 0xfce00000-0xfcefffff] [ 127.248259 < 0.000032>] pcieport 0000:00:01.1: PCI bridge to [bus 01] [ 127.248282 < 0.000023>] pcieport 0000:00:01.1: bridge window [mem 0xfc000000-0xfc0fffff] [ 127.248308 < 0.000026>] pcieport 0000:03:00.0: PCI bridge to [bus 04] [ 127.248339 < 0.000031>] pcieport 0000:03:04.0: PCI bridge to [bus 05] [ 127.248360 < 0.000021>] pcieport 0000:03:04.0: bridge window [mem 0xfca00000-0xfcafffff] [ 127.248388 < 0.000028>] pcieport 0000:03:06.0: PCI bridge to [bus 06] [ 127.248418 < 0.000030>] pcieport 
0000:03:07.0: PCI bridge to [bus 07] [ 127.248436 < 0.000018>] pcieport 0000:03:07.0: bridge window [io 0xe000-0xefff] [ 127.248464 < 0.000028>] pcieport 0000:03:07.0: bridge window [mem 0xfc900000-0xfc9fffff] [ 127.248492 < 0.000028>] pcieport 0000:03:09.0: PCI bridge to [bus 08] [ 127.248523 < 0.000031>] pcieport 0000:02:00.2: PCI bridge to [bus 03-08] [ 127.248540 < 0.000017>] pcieport 0000:02:00.2: bridge window [io 0xe000-0xefff] [ 127.248562 < 0.000022>] pcieport 0000:02:00.2: bridge window [mem 0xfc900000-0xfcafffff] [ 127.248595 < 0.000033>] pcieport 0000:00:01.3: PCI bridge to [bus 02-08] [ 127.248612 < 0.000017>] pcieport 0000:00:01.3: bridge window [io 0xe000-0xefff] [ 127.248633 < 0.000021>] pcieport 0000:00:01.3: bridge window [mem 0xfc900000-0xfcbfffff] [ 127.248660 < 0.000027>] pcieport 0000:0a:00.0: PCI bridge to [bus 0b] [ 127.248677 < 0.000017>] pcieport 0000:0a:00.0: bridge window [io 0x1000-0x1fff] [ 127.248699 < 0.000022>] pcieport 0000:0a:00.0: bridge window [mem 0xfc100000-0xfc1fffff] [ 127.248720 < 0.000021>] pcieport 0000:0a:00.0: bridge window [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248745 < 0.000025>] pcieport 0000:09:00.0: PCI bridge to [bus 0a-0b] [ 127.248763 < 0.000018>] pcieport 0000:09:00.0: bridge window [io 0x1000-0x1fff] [ 127.248785 < 0.000022>] pcieport 0000:09:00.0: bridge window [mem 0xfc100000-0xfc1fffff] [ 127.248806 < 0.000021>] pcieport 0000:09:00.0: bridge window [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248832 < 0.000026>] pcieport 0000:00:03.1: PCI bridge to [bus 09-0b] [ 127.248849 < 0.000017>] pcieport 0000:00:03.1: bridge window [io 0x1000-0x1fff] [ 127.248869 < 0.000020>] pcieport 0000:00:03.1: bridge window [mem 0xfc100000-0xfc2fffff] [ 127.248889 < 0.000020>] pcieport 0000:00:03.1: bridge window [mem 0xe0000000-0xf7ffffff 64bit pref] [ 127.248917 < 0.000028>] pcieport 0000:00:07.1: PCI bridge to [bus 0c] [ 127.248936 < 0.000019>] pcieport 0000:00:07.1: bridge window [mem 0xfc400000-0xfc6fffff] [ 
127.248962 < 0.000026>] pcieport 0000:00:08.1: PCI bridge to [bus 0d] [ 127.248981 < 0.000019>] pcieport 0000:00:08.1: bridge window [mem 0xfce00000-0xfcefffff] [ 127.249893 < 0.000912>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_done - start [ 127.250694 < 0.000801>] amdgpu 0000:0b:00.0: amdgpu: GPU reset succeeded, trying to resume [ 127.250916 < 0.000222>] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000). [ 127.251063 < 0.000147>] [drm] VRAM is lost due to GPU reset! [ 127.251478 < 0.000415>] [drm] PSP is resuming... [ 127.311254 < 0.059776>] [drm] reserve 0x400000 from 0xf7fec00000 for PSP TMR [ 127.342993 < 0.031739>] nvme nvme0: Shutdown timeout set to 8 seconds [ 127.380310 < 0.037317>] nvme nvme0: 32/0/0 default/read/poll queues [ 127.762031 < 0.381721>] [drm] kiq ring mec 2 pipe 1 q 0 [ 127.844977 < 0.082946>] [drm] UVD and UVD ENC initialized successfully. [ 127.944787 < 0.099810>] [drm] VCE initialized successfully. [ 127.944812 < 0.000025>] amdgpu 0000:0b:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0 [ 127.944826 < 0.000014>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 [ 127.944838 < 0.000012>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 [ 127.944849 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0 [ 127.944860 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0 [ 127.944871 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0 [ 127.944882 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0 [ 127.944893 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0 [ 127.944904 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0 [ 127.944915 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0 [ 127.944926 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring 
sdma0 uses VM inv eng 0 on hub 1 [ 127.944937 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1 [ 127.944947 < 0.000010>] amdgpu 0000:0b:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1 [ 127.944958 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1 [ 127.944969 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1 [ 127.944979 < 0.000010>] amdgpu 0000:0b:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1 [ 127.944990 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1 [ 127.945000 < 0.000010>] amdgpu 0000:0b:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 1 [ 127.945011 < 0.000011>] amdgpu 0000:0b:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 1 [ 127.945021 < 0.000010>] amdgpu 0000:0b:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 1 root@andrey-test:~# [ 127.951047 < 0.006026>] amdgpu 0000:0b:00.0: amdgpu: recover vram bo from shadow start [ 127.951304 < 0.000257>] amdgpu 0000:0b:00.0: amdgpu: recover vram bo from shadow done [ 127.951403 < 0.000099>] amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_rescan_done - end [ 127.951801 < 0.000398>] pcieport 0000:00:01.1: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.951824 < 0.000023>] nvme 0000:01:00.0: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.951847 < 0.000023>] pcieport 0000:00:01.3: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951868 < 0.000021>] xhci_hcd 0000:02:00.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951888 < 0.000020>] ahci 0000:02:00.1: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951908 < 0.000020>] pcieport 0000:02:00.2: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951928 < 0.000020>] pcieport 0000:03:00.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951948 < 0.000020>] pcieport 0000:03:04.0: Max Payload Size set to 512/ 
512 (was 512), Max Read Rq 512 [ 127.951972 < 0.000024>] xhci_hcd 0000:05:00.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.951991 < 0.000019>] pcieport 0000:03:06.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.952011 < 0.000020>] pcieport 0000:03:07.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.952038 < 0.000027>] igb 0000:07:00.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.952058 < 0.000020>] pcieport 0000:03:09.0: Max Payload Size set to 512/ 512 (was 512), Max Read Rq 512 [ 127.952077 < 0.000019>] pcieport 0000:00:03.1: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.952097 < 0.000020>] pcieport 0000:09:00.0: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.952116 < 0.000019>] pcieport 0000:0a:00.0: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.952135 < 0.000019>] amdgpu 0000:0b:00.0: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952154 < 0.000019>] pci 0000:0b:00.1: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952172 < 0.000018>] pcieport 0000:00:07.1: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.952190 < 0.000018>] pci 0000:0c:00.0: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952208 < 0.000018>] pci 0000:0c:00.2: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952225 < 0.000017>] xhci_hcd 0000:0c:00.3: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952244 < 0.000019>] pcieport 0000:00:08.1: Max Payload Size set to 256/ 512 (was 256), Max Read Rq 512 [ 127.952262 < 0.000018>] pci 0000:0d:00.0: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952280 < 0.000018>] ahci 0000:0d:00.2: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 [ 127.952298 < 0.000018>] pci 0000:0d:00.3: Max Payload Size set to 256/ 256 (was 256), Max Read Rq 512 ^ permalink raw reply [flat|nested] 15+ 
messages in thread
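[Editorial note: the interesting lines in captures like log1.log above are the "BAR n updated: old -> new" messages printed by the movable-BARs kernel. A small helper to pull just those out of a saved dmesg capture is sketched below; it is illustrative only, not part of the patchset, and the sample lines are copied from the log above.]

```python
import re

# Matches the "BAR n updated" lines emitted during a rescan, e.g.:
#   amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> 0xfc100000
BAR_UPDATE = re.compile(
    r"(?P<dev>\w+ [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+) updated: (?P<old>0x[0-9a-f]+) -> (?P<new>0x[0-9a-f]+)"
)

def bar_moves(dmesg_text):
    """Return (driver+address, BAR index, old base, new base) tuples."""
    return [
        (m["dev"], int(m["bar"]), int(m["old"], 16), int(m["new"], 16))
        for m in map(BAR_UPDATE.search, dmesg_text.splitlines())
        if m
    ]

sample = """\
[  127.245639] nvme 0000:01:00.0: BAR 0 updated: 0xfcf00000 -> 0xfc000000
[  127.247066] amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> 0xfc100000
"""
for dev, bar, old, new in bar_moves(sample):
    print(f"{dev}: BAR {bar} moved {old:#x} -> {new:#x}")
```

Run against a full log1.log, this reduces the rescan noise to the handful of BARs that actually changed address.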
* Re: Question about supporting AMD eGPU hot plug case 2021-03-04 19:49 ` Question about supporting AMD eGPU hot plug case Andrey Grodzovsky @ 2021-03-05 16:08 ` Sergei Miroshnichenko 2021-03-05 17:13 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Sergei Miroshnichenko @ 2021-03-05 16:08 UTC (permalink / raw) To: andrey.grodzovsky Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci, linux On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote: > + linux-pci > > On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: > > On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: > > > On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: > > > > On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: > > > > > On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: > > > > > > ... > > > > > Are you saying that even without hot-plugging, while both > > > > > nvme > > > > > and > > > > > AMD > > > > > card are present > > > > > right from boot, you still get BARs moving and MMIO ranges > > > > > reassigned > > > > > for NVME BARs > > > > > just because amdgpu driver will start resize of AMD card BARs > > > > > and > > > > > this > > > > > will trigger NVMEs BARs move to > > > > > allow AMD card BARs to cover full range of VIDEO RAM ? > > > > Yes. Unconditionally, because it is unknown beforehand if > > > > NVMe's > > > > BAR > > > > movement will help. In this particular case BAR movement is not > > > > needed, > > > > but is done anyway. > > > > > > > > BARs are not moved one by one, but the kernel releases all the > > > > releasable ones, and then recalculates a new BAR layout to fit > > > > them > > > > all. Kernel's algorithm is different from BIOS's, so NVME has > > > > appeared > > > > at a new place. 
> > > > > > > > This is triggered by following: > > > > - at boot, if BIOS had assigned not every BAR; > > > > - during pci_resize_resource(); > > > > - during pci_rescan_bus() -- after a pciehp event or a manual > > > > via > > > > sysfs. > > > > > > By manual via sysfs you mean something like this - 'echo 1 > > > > /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > > > > /sys/bus/pci/rescan ' ? I am looking into how most reliably > > > trigger > > > PCI > > > code to call my callbacks even without having external PCI cage > > > for > > > GPU > > > (will take me some time to get it). > > > > Yeah, this is our way to go when a device can't be physically > > removed > > or unpowered remotely. With just a bit shorter path: > > > > sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove' > > sudo sh -c 'echo 1 > /sys/bus/pci/rescan' > > > > Or, just a second command (rescan) is enough: a BAR movement > > attempt > > will be triggered even if there were no changes in PCI topology. > > > > Serge > > > > Hi Sergei > > Here is a link to initial implementation on top of your tree > (movable_bars_v9.1) - > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=05d6abceed650181bb7fe0a49884a26e378b908e > I am able to pass one re-scan cycle and can use the card afterwards > (see > log1.log). > But, according to your prints only BAR5 which is registers BAR was > updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> > 0xfc100000) > while I am interested to test BAR0 (Graphic RAM) move since this is > where most of the complexity is. Is there a way to hack your code to > force this ? Hi Andrey, Regarding the amdgpu's BAR0 remaining in its place: it seems this is because of fixed BARs starting from fc600000.
So the BAR0 had occupied the closest "comfortable" slots 0xe0000000-0xefffffff, with the resulting bridge window of bus 00 covering all the BARs: pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff window] I'll let you know if I get an idea of how to rearrange that manually. Two GPUs can actually swap their places. What can also make a BAR movable is rmmod'ing its driver. It could be some hack from within a tmux, like: rmmod igb; \ rmmod xhci_hcd; \ rmmod ahci; \ echo 1 > /sys/bus/pci/rescan; \ modprobe igb; \ modprobe xhci_hcd; \ modprobe ahci I think pci_release_resource() should not be in amdgpu_device_unmap_mmio() -- the patched kernel will do that itself for BARs for which amdgpu_device_bar_fixed() returns false. Even more -- the kernel will ensure that all BARs which were working before are reassigned properly, so it needs them to be assigned before the procedure. The same goes for pci_assign_unassigned_bus_resources() in amdgpu_device_remap_mmio(): this callback is invoked from pci_rescan_bus() after pci_assign_unassigned_root_bus_resources(). > When testing with 2 graphic cards and triggering rescan, hard hang of > the system happens during rescan_prepare of the second card when > stopping the HW (see log2.log) - I don't understand why this would > happen as each of them passes fine when they are standalone tested > and > there should be no interdependence between them as far as i know. > Do you have any idea ? What happens with two GPUs is unclear to me as well, nothing looks suspicious. Serge
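[Editorial note: another way to see whether a rescan actually moved anything, without relying on the patchset's prints, is to snapshot each device's sysfs `resource` file before and after and diff the two. That file holds one `start end flags` hex triple per resource slot (e.g. /sys/bus/pci/devices/0000:0b:00.0/resource). The sketch below parses such snapshots; the sample contents are shortened, three-slot files modeled on the amdgpu BAR 5 move from log1.log, not real captures.]

```python
def parse_resources(text):
    """Parse a sysfs PCI 'resource' file: one 'start end flags' line per slot."""
    bars = {}
    for idx, line in enumerate(text.splitlines()):
        start, end, flags = (int(tok, 16) for tok in line.split())
        if start or end:                 # all-zero rows are unassigned slots
            bars[idx] = (start, end)
    return bars

def diff_bars(before, after):
    """Map slot index -> (old base, new base) for every BAR whose base changed."""
    return {
        idx: (before[idx][0], after[idx][0])
        for idx in before
        if idx in after and before[idx][0] != after[idx][0]
    }

# Shortened sample snapshots modeled on:
#   amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> 0xfc100000
before = parse_resources(
    "0x00000000e0000000 0x00000000efffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
    "0x00000000fcc00000 0x00000000fcc7ffff 0x0000000000040200\n"
)
after = parse_resources(
    "0x00000000e0000000 0x00000000efffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
    "0x00000000fc100000 0x00000000fc17ffff 0x0000000000040200\n"
)
for idx, (old, new) in diff_bars(before, after).items():
    print(f"BAR {idx}: {old:#x} -> {new:#x}")
```

Reading the files once before `echo 1 > /sys/bus/pci/rescan` and once after gives the same information as the kernel's "BAR n updated" prints, per device.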
* Re: Question about supporting AMD eGPU hot plug case 2021-03-05 16:08 ` Sergei Miroshnichenko @ 2021-03-05 17:13 ` Andrey Grodzovsky 2021-03-05 19:12 ` Sergei Miroshnichenko 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-05 17:13 UTC (permalink / raw) To: Sergei Miroshnichenko Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote: > On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote: >> + linux-pci >> >> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: >>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: >>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: >>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: >>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: >>>>>>> ... >>>>>> Are you saying that even without hot-plugging, while both >>>>>> nvme >>>>>> and >>>>>> AMD >>>>>> card are present >>>>>> right from boot, you still get BARs moving and MMIO ranges >>>>>> reassigned >>>>>> for NVME BARs >>>>>> just because amdgpu driver will start resize of AMD card BARs >>>>>> and >>>>>> this >>>>>> will trigger NVMEs BARs move to >>>>>> allow AMD card BARs to cover full range of VIDEO RAM ? >>>>> Yes. Unconditionally, because it is unknown beforehand if >>>>> NVMe's >>>>> BAR >>>>> movement will help. In this particular case BAR movement is not >>>>> needed, >>>>> but is done anyway. >>>>> >>>>> BARs are not moved one by one, but the kernel releases all the >>>>> releasable ones, and then recalculates a new BAR layout to fit >>>>> them >>>>> all. Kernel's algorithm is different from BIOS's, so NVME has >>>>> appeared >>>>> at a new place. >>>>> >>>>> This is triggered by following: >>>>> - at boot, if BIOS had assigned not every BAR; >>>>> - during pci_resize_resource(); >>>>> - during pci_rescan_bus() -- after a pciehp event or a manual >>>>> via >>>>> sysfs. 
>>>> By manual via sysfs you mean something like this - 'echo 1 > >>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > >>>> /sys/bus/pci/rescan ' ? I am looking into how most reliably >>>> trigger >>>> PCI >>>> code to call my callbacks even without having external PCI cage >>>> for >>>> GPU >>>> (will take me some time to get it). >>> >>> Yeah, this is our way to go when a device can't be physically >>> removed >>> or unpowered remotely. With just a bit shorter path: >>> >>> sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove' >>> sudo sh -c 'echo 1 > /sys/bus/pci/rescan' >>> >>> Or, just a second command (rescan) is enough: a BAR movement >>> attempt >>> will be triggered even if there were no changes in PCI topology. >>> >>> Serge >>> >> >> Hi Sergei >> >> Here is a link to initial implementation on top of your tree >> (movable_bars_v9.1) - >> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=05d6abceed650181bb7fe0a49884a26e378b908e >> I am able to pass one re-scan cycle and can use the card afterwards >> (see >> log1.log). >> But, according to your prints only BAR5 which is registers BAR was >> updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> >> 0xfc100000) >> while I am interested to test BAR0 (Graphic RAM) move since this is >> where most of the complexity is. Is there a way to hack your code to >> force this ? > > Hi Andrey, > > Regarding the amdgpu's BAR0 remaining on its place: it seems this is > because of fixed BARs starting from fc600000.
The kernel tends to group > the BARs close to each other, making a bridge window as compact as > possible. So the BAR0 had occupied the closest "comfortable" slots > 0xe0000000-0xefffffff, with the resulting bridge window of bus 00 > covering all the BARs: > > pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff window] > > I'll let you know if I get an idea how to rearrange that manually. > > Two GPUs can actually swap their places. What do you mean? > > What also can make a BAR movable -- is rmmod'ing its driver. It could > be some hack from within a tmux, like: > > rmmod igb; \ > rmmod xhci_hcd; \ > rmmod ahci; \ > echo 1 > /sys/bus/pci/rescan; \ > modprobe igb; \ > modprobe xhci_hcd; \ > modprobe ahci But should I also rmmod amdgpu? Or should modprobing the other drivers back cause (hopefully) a BAR 0 move in the AMD graphics card? > > I think pci_release_resource() should not be in > amdgpu_device_unmap_mmio() -- the patched kernel will do that itself > for BARs the amdgpu_device_bar_fixed() returns false. Even more -- the > kernel will ensure that all BARs which were working before, are > reassigned properly, so it needs them to be assigned before the > procedure. > The same for pci_assign_unassigned_bus_resources() in > amdgpu_device_remap_mmio(): this callback is invoked from > pci_rescan_bus() after pci_assign_unassigned_root_bus_resources(). This seems to me to contradict your documentation (see https://github.com/YADRO-KNS/linux/commit/5bc12ba7c74f1c19c11db29b4807bd32acfac2c2 step 1), although step 2 also seems to contradict step 1 with regard to BAR release - so now I am a bit confused. Also, looking at nvme_dev_unmap - it calls pci_release_mem_regions. The symmetrical acquisition happens in nvme_dev_map.
Andrey > >> When testing with 2 graphic cards and triggering rescan, hard hang of >> the system happens during rescan_prepare of the second card when >> stopping the HW (see log2.log) - I don't understand why this would >> happen as each of them passes fine when they are standalone tested >> and >> there should be no interdependence between them as far as I know. >> Do you have any idea ? > > What happens with two GPUs is unclear to me as well, nothing looks > suspicious. > > Serge > ^ permalink raw reply [flat|nested] 15+ messages in thread
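[Editor's note] The sysfs remove-then-rescan trigger discussed above can be wrapped in a small helper. This is only a sketch of the commands quoted in the thread; the function name `pci_rescan` and the `SYSFS` override are hypothetical additions, the latter existing purely so the write sequence can be dry-run against a fake tree without root or real hardware.

```shell
# Sketch of the sysfs remove + rescan trigger from the thread.
# Passing a device BDF is optional: per Sergei, a bare rescan is
# already enough to trigger a BAR movement attempt. On a real
# system SYSFS defaults to /sys and the writes need root.
pci_rescan() {
    sysfs="${SYSFS:-/sys}"
    dev="$1"
    if [ -n "$dev" ]; then
        echo 1 > "$sysfs/bus/pci/devices/$dev/remove"
    fi
    echo 1 > "$sysfs/bus/pci/rescan"
}
```

On real hardware this reduces to the two `sudo sh -c 'echo 1 > ...'` commands shown in the quoted mail.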
* Re: Question about supporting AMD eGPU hot plug case 2021-03-05 17:13 ` Andrey Grodzovsky @ 2021-03-05 19:12 ` Sergei Miroshnichenko 2021-03-05 23:13 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Sergei Miroshnichenko @ 2021-03-05 19:12 UTC (permalink / raw) To: andrey.grodzovsky Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci, linux On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote: > > On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote: > > On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote: > > > + linux-pci > > > > > > On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: > > > > On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: > > > > > On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: > > > > > > On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: > > > > > > > On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: > > > > > > > > ... > > > > > > > Are you saying that even without hot-plugging, while both > > > > > > > nvme > > > > > > > and > > > > > > > AMD > > > > > > > card are present > > > > > > > right from boot, you still get BARs moving and MMIO > > > > > > > ranges > > > > > > > reassigned > > > > > > > for NVME BARs > > > > > > > just because amdgpu driver will start resize of AMD card > > > > > > > BARs > > > > > > > and > > > > > > > this > > > > > > > will trigger NVMEs BARs move to > > > > > > > allow AMD card BARs to cover full range of VIDEO RAM ? > > > > > > Yes. Unconditionally, because it is unknown beforehand if > > > > > > NVMe's > > > > > > BAR > > > > > > movement will help. In this particular case BAR movement is > > > > > > not > > > > > > needed, > > > > > > but is done anyway. > > > > > > > > > > > > BARs are not moved one by one, but the kernel releases all > > > > > > the > > > > > > releasable ones, and then recalculates a new BAR layout to > > > > > > fit > > > > > > them > > > > > > all. 
Kernel's algorithm is different from BIOS's, so NVME > > > > > > has > > > > > > appeared > > > > > > at a new place. > > > > > > > > > > > > This is triggered by following: > > > > > > - at boot, if BIOS had assigned not every BAR; > > > > > > - during pci_resize_resource(); > > > > > > - during pci_rescan_bus() -- after a pciehp event or a > > > > > > manual > > > > > > via > > > > > > sysfs. > > > > > > > > > > By manual via sysfs you mean something like this - 'echo 1 > > > > > > /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > > > > > > /sys/bus/pci/rescan ' ? I am looking into how most reliably > > > > > trigger > > > > > PCI > > > > > code to call my callbacks even without having external PCI > > > > > cage > > > > > for > > > > > GPU > > > > > (will take me some time to get it). > > > > > > > > Yeah, this is our way to go when a device can't be physically > > > > removed > > > > or unpowered remotely. With just a bit shorter path: > > > > > > > > sudo sh -c 'echo 1 > > > > > /sys/bus/pci/devices/0000\:0c\:00.0/remove' > > > > sudo sh -c 'echo 1 > /sys/bus/pci/rescan' > > > > > > > > Or, just a second command (rescan) is enough: a BAR movement > > > > attempt > > > > will be triggered even if there were no changes in PCI > > > > topology. 
> > > > > > > > Serge > > > > > > > > > > Hi Segrei > > > > > > Here is a link to initial implementation on top of your tree > > > (movable_bars_v9.1) - > > > https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D05d6abceed650181bb7fe0a49884a26e378b908e&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C637bf3aab3634ab69dc608d8dff0e0b1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505572958565332%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2F2urbXqCLMzZBKe0lTyAfhgqDqDxiOZN04R9LTVxvI0%3D&reserved=0 > > > I am able to pass one re-scan cycle and can use the card > > > afterwards > > > (see > > > log1.log). > > > But, according to your prints only BAR5 which is registers BAR > > > was > > > updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> > > > 0xfc100000) > > > while I am interested to test BAR0 (Graphic RAM) move since this > > > is > > > where most of the complexity is. Is there a way to hack your code > > > to > > > force this ? > > > > Hi Andrey, > > > > Regarding the amdgpu's BAR0 remaining on its place: it seems this > > is > > because of fixed BARs starting from fc600000. The kernel tends to > > group > > the BARs close to each other, making a bridge window as compact as > > possible. So the BAR0 had occupied the closest "comfortable" slots > > 0xe0000000-0xefffffff, with the resulting bridge window of bus 00 > > covering all the BARs: > > > > pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff > > window] > > > > I'll let you know if I get an idea how to rearrange that manually. > > > > Two GPUs can actually swap their places. > > What do you mean ? I was thinking: when the scenario of a PCI rescan with two GPUs (as was described below) will start working, BAR0 of GPU0 can take place of BAR0 of GPU1 after the first rescan. 
> > What also can make a BAR movable -- is rmmod'ing its driver. It > > could > > be some hack from within a tmux, like: > > > > rmmod igb; \ > > rmmod xhci_hcd; \ > > rmmod ahci; \ > > echo 1 > /sys/bus/pci/rescan; \ > > modprobe igb; \ > > modprobe xhci_hcd; \ > > modprobe ahci > > But should I also rmmod amdgpu ? Or modprobing back the other > drivers > should cause (hopefully) BAR0 move in AMD graphic card ? You have already made the amdgpu movable, so no need to rmmod it -- just those with fixed BARs: xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000- 0xfc6fffff 64bit] igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff] igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff] ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff pref] ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000- 0xfcb9ffff] xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000- 0xfcba7fff 64bit] xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000- 0xfca07fff 64bit] ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000- 0xfce08fff] The expected result is they all move closer to the start of PCI address space. > > I think pci_release_resource() should not be in > > amdgpu_device_unmap_mmio() -- the patched kernel will do that > > itself > > for BARs the amdgpu_device_bar_fixed() returns false. Even more -- > > the > > kernel will ensure that all BARs which were working before, are > > reassigned properly, so it needs them to be assigned before the > > procedure. > > The same for pci_assign_unassigned_bus_resources() in > > amdgpu_device_remap_mmio(): this callback is invoked from > > pci_rescan_bus() after pci_assign_unassigned_root_bus_resources(). > > This seems to me in contrast to your documentation (see > https://github.com/YADRO-KNS/linux/commit/5bc12ba7c74f1c19c11db29b4807bd32acfac2c2 > step 1) although step 2 seems also to contradict step 1 with regard > to > BARs release - so now I am a bit confused. 
Also looking at > nvme_dev_unmap - it calls pci_release_mem_regions. Symmetrical > acquisition happens in nvme_dev_map. Ah, there is a difference between pci_release_region() and pci_release_resource(), so subtle that I had to refresh my memory. You are right, this has to be explained in the documentation! $ sudo cat /proc/iomem ... f0000000-fcffffff : PCI Bus 0000:00 -- root bus resource ... fcf00000-fcffffff : PCI Bus 0000:01 -- bridge window fcf00000-fcf03fff : 0000:01:00.0 -- pci resource (BAR) fcf00000-fcf03fff : nvme -- pci region (reserved by a driver, has its name). So nvme_dev_unmap() uses pci_release_region() to mark that the BAR is not used by the driver anymore -- this actually should be called in every rescan_prepare(). But pci_release_resource() tells the PCI subsystem that the BAR is "released" from the device and has to be assigned to some address before it is used again, and makes pci_resource_start(pdev, released_barno) invalid. Why the quotes: pci_release_resource() doesn't turn off the BAR, doesn't write the registers -- this happens later. I thought at first that pci_release_resource() is not safe in a rescan_prepare(), but then double-checked, and found it's fine, just not needed, as the kernel will do it anyway. And pci_bus_check_bars_assigned(), which compares the bitmasks of successfully assigned BARs, is called *before* the hook. > > > When testing with 2 graphic cards and triggering rescan, hard > > > hang of > > > the system happens during rescan_prepare of the second card when > > > stopping the HW (see log2.log) - I don't understand why this > > > would > > > happen as each of them passes fine when they are standalone > > > tested > > > and > > > there should be no interdependence between them as far as i know. > > > Do you have any idea ? > > > > What happens with two GPUs is unclear for me as well, nothing looks > > suspicious. > > > > Serge > > ^ permalink raw reply [flat|nested] 15+ messages in thread
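[Editor's note] The /proc/iomem listing Sergei walks through encodes the ownership hierarchy (root bus resource, bridge window, the BAR itself, the driver's named region) purely by indentation: two leading spaces per level. A tiny hypothetical helper, shown only to make that nesting explicit for the quoted listing:

```shell
# Nesting depth of a /proc/iomem line: two leading spaces per level.
# Depth 0 = root bus resource, 1 = bridge window, 2 = the BAR
# (pci resource), 3 = the region a driver has requested by name.
iomem_depth() {
    prefix="${1%%[0-9a-f]*}"   # strip everything from the first hex digit on
    echo $(( ${#prefix} / 2 ))
}
```

This is why pci_release_region() (drop the innermost, driver-named entry) and pci_release_resource() (invalidate the BAR entry one level up) are different operations, as explained above.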
* Re: Question about supporting AMD eGPU hot plug case 2021-03-05 19:12 ` Sergei Miroshnichenko @ 2021-03-05 23:13 ` Andrey Grodzovsky 2021-03-11 22:40 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-05 23:13 UTC (permalink / raw) To: Sergei Miroshnichenko Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-05 2:12 p.m., Sergei Miroshnichenko wrote: > On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote: >> >> On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote: >>> On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote: >>>> + linux-pci >>>> >>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: >>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: >>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: >>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: >>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: >>>>>>>>> ... >>>>>>>> Are you saying that even without hot-plugging, while both >>>>>>>> nvme >>>>>>>> and >>>>>>>> AMD >>>>>>>> card are present >>>>>>>> right from boot, you still get BARs moving and MMIO >>>>>>>> ranges >>>>>>>> reassigned >>>>>>>> for NVME BARs >>>>>>>> just because amdgpu driver will start resize of AMD card >>>>>>>> BARs >>>>>>>> and >>>>>>>> this >>>>>>>> will trigger NVMEs BARs move to >>>>>>>> allow AMD card BARs to cover full range of VIDEO RAM ? >>>>>>> Yes. Unconditionally, because it is unknown beforehand if >>>>>>> NVMe's >>>>>>> BAR >>>>>>> movement will help. In this particular case BAR movement is >>>>>>> not >>>>>>> needed, >>>>>>> but is done anyway. >>>>>>> >>>>>>> BARs are not moved one by one, but the kernel releases all >>>>>>> the >>>>>>> releasable ones, and then recalculates a new BAR layout to >>>>>>> fit >>>>>>> them >>>>>>> all. Kernel's algorithm is different from BIOS's, so NVME >>>>>>> has >>>>>>> appeared >>>>>>> at a new place. 
>>>>>>> >>>>>>> This is triggered by following: >>>>>>> - at boot, if BIOS had assigned not every BAR; >>>>>>> - during pci_resize_resource(); >>>>>>> - during pci_rescan_bus() -- after a pciehp event or a >>>>>>> manual >>>>>>> via >>>>>>> sysfs. >>>>>> >>>>>> By manual via sysfs you mean something like this - 'echo 1 > >>>>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > >>>>>> /sys/bus/pci/rescan ' ? I am looking into how most reliably >>>>>> trigger >>>>>> PCI >>>>>> code to call my callbacks even without having external PCI >>>>>> cage >>>>>> for >>>>>> GPU >>>>>> (will take me some time to get it). >>>>> >>>>> Yeah, this is our way to go when a device can't be physically >>>>> removed >>>>> or unpowered remotely. With just a bit shorter path: >>>>> >>>>> sudo sh -c 'echo 1 > >>>>> /sys/bus/pci/devices/0000\:0c\:00.0/remove' >>>>> sudo sh -c 'echo 1 > /sys/bus/pci/rescan' >>>>> >>>>> Or, just a second command (rescan) is enough: a BAR movement >>>>> attempt >>>>> will be triggered even if there were no changes in PCI >>>>> topology. >>>>> >>>>> Serge >>>>> >>>> >>>> Hi Segrei >>>> >>>> Here is a link to initial implementation on top of your tree >>>> (movable_bars_v9.1) - >>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D05d6abceed650181bb7fe0a49884a26e378b908e&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=qEC3qIAM8h1vU4gGEgT6sThXsaCuatTI2UjM9Bb8KGI%3D&reserved=0 >>>> I am able to pass one re-scan cycle and can use the card >>>> afterwards >>>> (see >>>> log1.log). 
>>>> But, according to your prints only BAR5 which is registers BAR >>>> was >>>> updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> >>>> 0xfc100000) >>>> while I am interested to test BAR0 (Graphic RAM) move since this >>>> is >>>> where most of the complexity is. Is there a way to hack your code >>>> to >>>> force this ? >>> >>> Hi Andrey, >>> >>> Regarding the amdgpu's BAR0 remaining on its place: it seems this >>> is >>> because of fixed BARs starting from fc600000. The kernel tends to >>> group >>> the BARs close to each other, making a bridge window as compact as >>> possible. So the BAR0 had occupied the closest "comfortable" slots >>> 0xe0000000-0xefffffff, with the resulting bridge window of bus 00 >>> covering all the BARs: >>> >>> pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff >>> window] >>> >>> I'll let you know if I get an idea how to rearrange that manually. >>> >>> Two GPUs can actually swap their places. >> >> What do you mean ? > > I was thinking: when the scenario of a PCI rescan with two GPUs (as was > described below) will start working, BAR0 of GPU0 can take place of > BAR0 of GPU1 after the first rescan. > >>> What also can make a BAR movable -- is rmmod'ing its driver. It >>> could >>> be some hack from within a tmux, like: >>> >>> rmmod igb; \ >>> rmmod xhci_hcd; \ >>> rmmod ahci; \ >>> echo 1 > /sys/bus/pci/rescan; \ >>> modprobe igb; \ >>> modprobe xhci_hcd; \ >>> modprobe ahci >> >> But should I also rmmod amdgpu ? Or modprobing back the other >> drivers >> should cause (hopefully) BAR0 move in AMD graphic card ? 
> > You have already made the amdgpu movable, so no need to rmmod it -- > just those with fixed BARs: > > xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000- > 0xfc6fffff 64bit] > igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff] > igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff] > ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff > pref] > ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000- > 0xfcb9ffff] > xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000- > 0xfcba7fff 64bit] > xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000- > 0xfca07fff 64bit] > ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000- > 0xfce08fff] > > The expected result is they all move closer to the start of PCI address > space. > Ok, I updated as you described. Also I removed PCI conf command to stop address decoding and restart later as I noticed PCI core does it itself when needed. I tested now also with graphic desktop enabled while submitting 3d draw commands and seems like under this scenario everything still works. Again, this all needs to be tested with VRAM BAR move as then I believe I will see more issues like handling of MMIO mapped VRAM objects (like GART table). In case you do have an AMD card you could also maybe give it a try. In the meanwhile I will add support to ioremapping of those VRAM objects. Andrey >>> I think pci_release_resource() should not be in >>> amdgpu_device_unmap_mmio() -- the patched kernel will do that >>> itself >>> for BARs the amdgpu_device_bar_fixed() returns false. Even more -- >>> the >>> kernel will ensure that all BARs which were working before, are >>> reassigned properly, so it needs them to be assigned before the >>> procedure. >>> The same for pci_assign_unassigned_bus_resources() in >>> amdgpu_device_remap_mmio(): this callback is invoked from >>> pci_rescan_bus() after pci_assign_unassigned_root_bus_resources(). 
>> >> This seems to me in contrast to your documentation (see >> https://github.com/YADRO-KNS/linux/commit/5bc12ba7c74f1c19c11db29b4807bd32acfac2c2 >> step 1) although step 2 seems also to contradict step 1 with regard >> to >> BARs release - so now I am a bit confused. Also looking at >> nvme_dev_unmap - it calls pci_release_mem_regions. Symmetrical >> acquisition happens in nvme_dev_map. > > Ah, there is a difference between pci_release_region() and > pci_release_resource(), so subtle that I had to refresh my memory. You > are right, this has to be explained in the documentation! > > $ sudo cat /proc/iomem > ... > f0000000-fcffffff : PCI Bus 0000:00 -- root bus resource > ... > fcf00000-fcffffff : PCI Bus 0000:01 -- bridge window > fcf00000-fcf03fff : 0000:01:00.0 -- pci resource (BAR) > fcf00000-fcf03fff : nvme -- pci region (reserved by > a driver, has its name). > > So the nvme_dev_unmap() reflects with pci_release_region() that the BAR > is not used by the driver anymore -- this actually should be called in > every rescan_prepare(). > > But the pci_release_resource() tells the PCI subsystem that the BAR > is "released" from the device and has to be assigned to some address > before using again, and makes the pci_resource_start(pdev, > released_barno) invalid. > > Why the quotes: pci_release_resource() doesn't turn off the BAR, > doesn't write the registers -- this happens later. > > I thought at first that pci_release_resource() is not safe in a > rescan_prepare(), but then double-checked, and found it's fine, just > not needed, as the kernel will do it anyway.
And the > pci_bus_check_bars_assigned() to compare the bitmasks of successfully > assigned BARs is called *before* the hook. > >>>> When testing with 2 graphic cards and triggering rescan, hard >>>> hang of >>>> the system happens during rescan_prepare of the second card when >>>> stopping the HW (see log2.log) - I don't understand why this >>>> would >>>> happen as each of them passes fine when they are standalone >>>> tested >>>> and >>>> there should be no interdependence between them as far as i know. >>>> Do you have any idea ? >>> >>> What happens with two GPUs is unclear for me as well, nothing looks >>> suspicious. >>> >>> Serge >>> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-05 23:13 ` Andrey Grodzovsky @ 2021-03-11 22:40 ` Andrey Grodzovsky 2021-03-12 9:03 ` Christian König 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-11 22:40 UTC (permalink / raw) To: Sergei Miroshnichenko Cc: Alexander.Deucher, Christian.Koenig, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-05 6:13 p.m., Andrey Grodzovsky wrote: > > > On 2021-03-05 2:12 p.m., Sergei Miroshnichenko wrote: >> On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote: >>> >>> On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote: >>>> On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote: >>>>> + linux-pci >>>>> >>>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote: >>>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote: >>>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote: >>>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote: >>>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote: >>>>>>>>>> ... >>>>>>>>> Are you saying that even without hot-plugging, while both >>>>>>>>> nvme >>>>>>>>> and >>>>>>>>> AMD >>>>>>>>> card are present >>>>>>>>> right from boot, you still get BARs moving and MMIO >>>>>>>>> ranges >>>>>>>>> reassigned >>>>>>>>> for NVME BARs >>>>>>>>> just because amdgpu driver will start resize of AMD card >>>>>>>>> BARs >>>>>>>>> and >>>>>>>>> this >>>>>>>>> will trigger NVMEs BARs move to >>>>>>>>> allow AMD card BARs to cover full range of VIDEO RAM ? >>>>>>>> Yes. Unconditionally, because it is unknown beforehand if >>>>>>>> NVMe's >>>>>>>> BAR >>>>>>>> movement will help. In this particular case BAR movement is >>>>>>>> not >>>>>>>> needed, >>>>>>>> but is done anyway. >>>>>>>> >>>>>>>> BARs are not moved one by one, but the kernel releases all >>>>>>>> the >>>>>>>> releasable ones, and then recalculates a new BAR layout to >>>>>>>> fit >>>>>>>> them >>>>>>>> all. 
Kernel's algorithm is different from BIOS's, so NVME >>>>>>>> has >>>>>>>> appeared >>>>>>>> at a new place. >>>>>>>> >>>>>>>> This is triggered by following: >>>>>>>> - at boot, if BIOS had assigned not every BAR; >>>>>>>> - during pci_resize_resource(); >>>>>>>> - during pci_rescan_bus() -- after a pciehp event or a >>>>>>>> manual >>>>>>>> via >>>>>>>> sysfs. >>>>>>> >>>>>>> By manual via sysfs you mean something like this - 'echo 1 > >>>>>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 > >>>>>>> /sys/bus/pci/rescan ' ? I am looking into how most reliably >>>>>>> trigger >>>>>>> PCI >>>>>>> code to call my callbacks even without having external PCI >>>>>>> cage >>>>>>> for >>>>>>> GPU >>>>>>> (will take me some time to get it). >>>>>> >>>>>> Yeah, this is our way to go when a device can't be physically >>>>>> removed >>>>>> or unpowered remotely. With just a bit shorter path: >>>>>> >>>>>> sudo sh -c 'echo 1 > >>>>>> /sys/bus/pci/devices/0000\:0c\:00.0/remove' >>>>>> sudo sh -c 'echo 1 > /sys/bus/pci/rescan' >>>>>> >>>>>> Or, just a second command (rescan) is enough: a BAR movement >>>>>> attempt >>>>>> will be triggered even if there were no changes in PCI >>>>>> topology. >>>>>> >>>>>> Serge >>>>>> >>>>> >>>>> Hi Segrei >>>>> >>>>> Here is a link to initial implementation on top of your tree >>>>> (movable_bars_v9.1) - >>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D05d6abceed650181bb7fe0a49884a26e378b908e&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=qEC3qIAM8h1vU4gGEgT6sThXsaCuatTI2UjM9Bb8KGI%3D&reserved=0 >>>>> >>>>> I am able to pass one re-scan cycle and can use the card >>>>> afterwards >>>>> (see >>>>> log1.log). 
>>>>> But, according to your prints only BAR5 which is registers BAR >>>>> was >>>>> updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 -> >>>>> 0xfc100000) >>>>> while I am interested to test BAR0 (Graphic RAM) move since this >>>>> is >>>>> where most of the complexity is. Is there a way to hack your code >>>>> to >>>>> force this ? >>>> >>>> Hi Andrey, >>>> >>>> Regarding the amdgpu's BAR0 remaining on its place: it seems this >>>> is >>>> because of fixed BARs starting from fc600000. The kernel tends to >>>> group >>>> the BARs close to each other, making a bridge window as compact as >>>> possible. So the BAR0 had occupied the closest "comfortable" slots >>>> 0xe0000000-0xefffffff, with the resulting bridge window of bus 00 >>>> covering all the BARs: >>>> >>>> pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff >>>> window] >>>> >>>> I'll let you know if I get an idea how to rearrange that manually. >>>> >>>> Two GPUs can actually swap their places. >>> >>> What do you mean ? >> >> I was thinking: when the scenario of a PCI rescan with two GPUs (as was >> described below) will start working, BAR0 of GPU0 can take place of >> BAR0 of GPU1 after the first rescan. >> >>>> What also can make a BAR movable -- is rmmod'ing its driver. It >>>> could >>>> be some hack from within a tmux, like: >>>> >>>> rmmod igb; \ >>>> rmmod xhci_hcd; \ >>>> rmmod ahci; \ >>>> echo 1 > /sys/bus/pci/rescan; \ >>>> modprobe igb; \ >>>> modprobe xhci_hcd; \ >>>> modprobe ahci >>> >>> But should I also rmmod amdgpu ? Or modprobing back the other >>> drivers >>> should cause (hopefully) BAR0 move in AMD graphic card ? 
>> >> You have already made the amdgpu movable, so no need to rmmod it -- >> just those with fixed BARs: >> >> xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000- >> 0xfc6fffff 64bit] >> igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff] >> igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff] >> ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff >> pref] >> ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000- >> 0xfcb9ffff] >> xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000- >> 0xfcba7fff 64bit] >> xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000- >> 0xfca07fff 64bit] >> ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000- >> 0xfce08fff] >> >> The expected result is they all move closer to the start of PCI address >> space. >> > > Ok, I updated as you described. Also I removed PCI conf command to stop > address decoding and restart later as I noticed PCI core does it itself > when needed. > I tested now also with graphic desktop enabled while submitting > 3d draw commands and seems like under this scenario everything still > works. Again, this all needs to be tested with VRAM BAR move as then > I believe I will see more issues like handling of MMIO mapped VRAM > objects (like GART table). In case you do have an AMD card you could > also maybe give it a try. In the meanwhile I will add support to > ioremapping of those VRAM objects. > > Andrey Just an update, added support for unmapping/remapping of all VRAM objects, both user space mmapped and kernel ioremapped. Seems to work ok but again, without forcing VRAM BAR to move I can't be sure. Alex, Christian - take a look when you have some time to give me some initial feedback on the amdgpu side.
The code is at https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 Andrey > >>>> I think pci_release_resource() should not be in >>>> amdgpu_device_unmap_mmio() -- the patched kernel will do that >>>> itself >>>> for BARs the amdgpu_device_bar_fixed() returns false. Even more -- >>>> the >>>> kernel will ensure that all BARs which were working before, are >>>> reassigned properly, so it needs them to be assigned before the >>>> procedure. >>>> The same for pci_assign_unassigned_bus_resources() in >>>> amdgpu_device_remap_mmio(): this callback is invoked from >>>> pci_rescan_bus() after pci_assign_unassigned_root_bus_resources(). >>> >>> This seems to me in contrast to your documentation (see >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FYADRO-KNS%2Flinux%2Fcommit%2F5bc12ba7c74f1c19c11db29b4807bd32acfac2c2&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=kO1OlRL8iHMTcijuV0jDpODCtXpCCTpJv6YIn%2FuypNQ%3D&reserved=0 >>> >>> step 1) although step 2 seems also to contradict step 1 with regard >>> to >>> BARs release - so now I am a bit confused. Also looking at >>> nvme_dev_unmap - it calls pci_release_mem_regions. Symmetrical >>> acquisition happens in nvme_dev_unmap. >> >> Ah, there is a difference between pci_release_region() and >> pci_release_resource(), so subtle that I had to refresh my memory. You >> are right, this has to be explained in the documentation! >> >> $ sudo cat /proc/iomem >> ... >> f0000000-fcffffff : PCI Bus 0000:00 -- root bus resource >> ... >> fcf00000-fcffffff : PCI Bus 0000:01 -- bridge window >> fcf00000-fcf03fff : 0000:01:00.0 -- pci resource (BAR) >> fcf00000-fcf03fff : nvme -- pci region (reserved by >> a driver, has its name). 
>> >> So the nvme_dev_unmap() reflects with pci_release_region() that the BAR >> is not used by the driver anymore -- this actually should be called in >> every rescan_prepare(). >> >> But the pci_release_resource() tells to the PCI subsystem that the BAR >> is "released" from the device and has to be assigned to some address >> before using again, and makes the pci_resource_start(pdev, >> relased_barno) invalid. >> >> Why the quotes: pci_release_resource() doesn't turn off the BAR, >> doesn't write the registers -- this happens later. >> >> I thouht at first that pci_release_resource() is not safe in a >> rescan_prepare(), but then double-checked, and found it's fine, just >> not needed, as the kernel will do it anyway. And the >> pci_bus_check_bars_assigned() to compare the bitmasks of successfully >> assigned BARs is called *before* the hook. >> >>>>> When testing with 2 graphic cards and triggering rescan, hard >>>>> hang of >>>>> the system happens during rescan_prepare of the second card when >>>>> stopping the HW (see log2.log) - I don't understand why this >>>>> would >>>>> happen as each of them passes fine when they are standalone >>>>> tested >>>>> and >>>>> there should be no interdependence between them as far as i know. >>>>> Do you have any idea ? >>>> >>>> What happens with two GPUs is unclear for me as well, nothing looks >>>> suspicious. >>>> >>>> Serge >>>> ^ permalink raw reply [flat|nested] 15+ messages in thread
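[Editor's note] Sergei's "free the fixed BARs" hack quoted above -- unload the drivers whose BARs are pinned, rescan, reload -- can be sketched as one helper. The function name `rescan_with_unpinned_bars` and the `SYSFS` override are hypothetical; the module list is just the thread's example. On a real box this must run from a session that does not depend on the unloaded drivers (hence the tmux remark in the thread), and `rmmod`/`modprobe` need root.

```shell
# Sketch: unload the drivers whose BARs are pinned ("assigned fixed"),
# trigger a rescan so the kernel can recompute the whole BAR layout,
# then reload the drivers. rmmod/modprobe and SYSFS can be shadowed
# for a dry run without touching real hardware.
rescan_with_unpinned_bars() {
    for m in "$@"; do rmmod "$m" || return 1; done
    echo 1 > "${SYSFS:-/sys}/bus/pci/rescan"
    for m in "$@"; do modprobe "$m" || return 1; done
}
```

Usage would mirror the quoted one-liner: `rescan_with_unpinned_bars igb xhci_hcd ahci`.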
* Re: Question about supporting AMD eGPU hot plug case 2021-03-11 22:40 ` Andrey Grodzovsky @ 2021-03-12 9:03 ` Christian König 2021-03-12 15:34 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Christian König @ 2021-03-12 9:03 UTC (permalink / raw) To: Andrey Grodzovsky, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux On 11.03.21 at 23:40, Andrey Grodzovsky wrote: > [SNIP] >>> The expected result is they all move closer to the start of PCI address >>> space. >>> >> >> Ok, I updated as you described. Also I removed PCI conf command to stop >> address decoding and restart later as I noticed PCI core does it itself >> when needed. >> I tested now also with graphic desktop enabled while submitting >> 3d draw commands and seems like under this scenario everything still >> works. Again, this all needs to be tested with VRAM BAR move as then >> I believe I will see more issues like handling of MMIO mapped VRAM >> objects (like GART table). In case you do have an AMD card you could >> also maybe give it a try. In the meanwhile I will add support to >> ioremapping of those VRAM objects. >> >> Andrey > > Just an update, added support for unmapping/remapping of all VRAM > objects, both user space mmapped and kernel ioremapped. Seems to work > ok but again, without forcing VRAM BAR to move I can't be sure. > Alex, Christian - take a look when you have some time to give me some > initial feedback on the amdgpu side. > > The code is at > https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 Mhm, that lets userspace busy retry until the BAR movement is done. Not sure if that can't live-lock somehow. Christian. > > Andrey ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-12 9:03 ` Christian König @ 2021-03-12 15:34 ` Andrey Grodzovsky 2021-03-15 16:00 ` Andrey Grodzovsky 2021-03-15 16:10 ` Christian König 0 siblings, 2 replies; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-12 15:34 UTC (permalink / raw) To: Christian König, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-12 4:03 a.m., Christian König wrote: > Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >> [SNIP] >>>> The expected result is they all move closer to the start of PCI address >>>> space. >>>> >>> >>> Ok, I updated as you described. Also I removed PCI conf command to stop >>> address decoding and restart later as I noticed PCI core does it itself >>> when needed. >>> I tested now also with graphic desktop enabled while submitting >>> 3d draw commands and seems like under this scenario everything still >>> works. Again, this all needs to be tested with VRAM BAR move as then >>> I believe I will see more issues like handling of MMIO mapped VRAM >>> objects (like GART table). In case you do have an AMD card you could >>> also maybe give it a try. In the meanwhile I will add support to >>> ioremapping of those VRAM objects. >>> >>> Andrey >> >> Just an update, added support for unmaping/remapping of all VRAM >> objects, both user space mmaped and kernel ioremaped. Seems to work >> ok but again, without forcing VRAM BAR to move I can't be sure. >> Alex, Chsristian - take a look when you have some time to give me some >> initial feedback on the amdgpu side. >> >> The code is at >> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >> > > Mhm, that let's userspace busy retry until the BAR movement is done. > > Not sure if that can't live lock somehow. > > Christian. 
In my testing it didn't, but I can instead route them to some global static dummy page while the BARs are moving and then, when everything is done, just invalidate the device address space again and let the page faults fill in valid PFNs again. Andrey > >> >> Andrey > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-12 15:34 ` Andrey Grodzovsky @ 2021-03-15 16:00 ` Andrey Grodzovsky 2021-03-15 16:10 ` Christian König 1 sibling, 0 replies; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-15 16:00 UTC (permalink / raw) To: Christian König, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux Ping Andrey On 2021-03-12 10:34 a.m., Andrey Grodzovsky wrote: > > > On 2021-03-12 4:03 a.m., Christian König wrote: >> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>> [SNIP] >>>>> The expected result is they all move closer to the start of PCI >>>>> address >>>>> space. >>>>> >>>> >>>> Ok, I updated as you described. Also I removed PCI conf command to >>>> stop >>>> address decoding and restart later as I noticed PCI core does it >>>> itself >>>> when needed. >>>> I tested now also with graphic desktop enabled while submitting >>>> 3d draw commands and seems like under this scenario everything still >>>> works. Again, this all needs to be tested with VRAM BAR move as then >>>> I believe I will see more issues like handling of MMIO mapped VRAM >>>> objects (like GART table). In case you do have an AMD card you >>>> could also maybe give it a try. In the meanwhile I will add support >>>> to ioremapping of those VRAM objects. >>>> >>>> Andrey >>> >>> Just an update, added support for unmaping/remapping of all VRAM >>> objects, both user space mmaped and kernel ioremaped. Seems to work >>> ok but again, without forcing VRAM BAR to move I can't be sure. >>> Alex, Chsristian - take a look when you have some time to give me some >>> initial feedback on the amdgpu side. >>> >>> The code is at >>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>> >> >> Mhm, that let's userspace busy retry until the BAR movement is done. >> >> Not sure if that can't live lock somehow. >> >> Christian. 
> > In my testing it didn't but, I can instead route them to some > global static dummy page while BARs are moving and then when everything > done just invalidate the device address space again and let the > pagefaults fill in valid PFNs again. > > Andrey > >> >>> >>> Andrey >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-12 15:34 ` Andrey Grodzovsky 2021-03-15 16:00 ` Andrey Grodzovsky @ 2021-03-15 16:10 ` Christian König 2021-03-15 16:21 ` Andrey Grodzovsky 1 sibling, 1 reply; 15+ messages in thread From: Christian König @ 2021-03-15 16:10 UTC (permalink / raw) To: Andrey Grodzovsky, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: > > > On 2021-03-12 4:03 a.m., Christian König wrote: >> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>> [SNIP] >>>>> The expected result is they all move closer to the start of PCI >>>>> address >>>>> space. >>>>> >>>> >>>> Ok, I updated as you described. Also I removed PCI conf command to >>>> stop >>>> address decoding and restart later as I noticed PCI core does it >>>> itself >>>> when needed. >>>> I tested now also with graphic desktop enabled while submitting >>>> 3d draw commands and seems like under this scenario everything still >>>> works. Again, this all needs to be tested with VRAM BAR move as then >>>> I believe I will see more issues like handling of MMIO mapped VRAM >>>> objects (like GART table). In case you do have an AMD card you >>>> could also maybe give it a try. In the meanwhile I will add support >>>> to ioremapping of those VRAM objects. >>>> >>>> Andrey >>> >>> Just an update, added support for unmaping/remapping of all VRAM >>> objects, both user space mmaped and kernel ioremaped. Seems to work >>> ok but again, without forcing VRAM BAR to move I can't be sure. >>> Alex, Chsristian - take a look when you have some time to give me some >>> initial feedback on the amdgpu side. >>> >>> The code is at >>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>> >> >> Mhm, that let's userspace busy retry until the BAR movement is done. >> >> Not sure if that can't live lock somehow. >> >> Christian. 
> > In my testing it didn't but, I can instead route them to some > global static dummy page while BARs are moving and then when everything > done just invalidate the device address space again and let the > pagefaults fill in valid PFNs again. Well that won't work because the reads/writes which are done in the meantime do need to wait for the BAR to be available again. So waiting for the BAR move to finish is correct, but what we should do is to use a lock instead of an SRCU because that makes lockdep complain when we do something nasty. Christian. > > Andrey > >> >>> >>> Andrey >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-15 16:10 ` Christian König @ 2021-03-15 16:21 ` Andrey Grodzovsky 2021-03-15 16:55 ` Christian König 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-15 16:21 UTC (permalink / raw) To: Christian König, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-15 12:10 p.m., Christian König wrote: > Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: >> >> >> On 2021-03-12 4:03 a.m., Christian König wrote: >>> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>>> [SNIP] >>>>>> The expected result is they all move closer to the start of PCI >>>>>> address >>>>>> space. >>>>>> >>>>> >>>>> Ok, I updated as you described. Also I removed PCI conf command to >>>>> stop >>>>> address decoding and restart later as I noticed PCI core does it >>>>> itself >>>>> when needed. >>>>> I tested now also with graphic desktop enabled while submitting >>>>> 3d draw commands and seems like under this scenario everything still >>>>> works. Again, this all needs to be tested with VRAM BAR move as then >>>>> I believe I will see more issues like handling of MMIO mapped VRAM >>>>> objects (like GART table). In case you do have an AMD card you >>>>> could also maybe give it a try. In the meanwhile I will add >>>>> support to ioremapping of those VRAM objects. >>>>> >>>>> Andrey >>>> >>>> Just an update, added support for unmaping/remapping of all VRAM >>>> objects, both user space mmaped and kernel ioremaped. Seems to work >>>> ok but again, without forcing VRAM BAR to move I can't be sure. >>>> Alex, Chsristian - take a look when you have some time to give me some >>>> initial feedback on the amdgpu side. >>>> >>>> The code is at >>>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>>> >>> >>> Mhm, that let's userspace busy retry until the BAR movement is done. >>> >>> Not sure if that can't live lock somehow. 
>>> >>> Christian. >> >> In my testing it didn't but, I can instead route them to some >> global static dummy page while BARs are moving and then when everything >> done just invalidate the device address space again and let the >> pagefaults fill in valid PFNs again. > > Well that won't work because the reads/writes which are done in the > meantime do need to wait for the BAR to be available again. > > So waiting for the BAR move to finish is correct, but what we should > do is to use a lock instead of an SRCU because that makes lockdep > complain when we do something nasty. > > Christian. Spinlock I assume ? We can't sleep there - it's an interrupt. Andrey > >> >> Andrey >> >>> >>>> >>>> Andrey >>> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-15 16:21 ` Andrey Grodzovsky @ 2021-03-15 16:55 ` Christian König 2021-03-15 17:11 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Christian König @ 2021-03-15 16:55 UTC (permalink / raw) To: Andrey Grodzovsky, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux Am 15.03.21 um 17:21 schrieb Andrey Grodzovsky: > > On 2021-03-15 12:10 p.m., Christian König wrote: >> Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: >>> >>> >>> On 2021-03-12 4:03 a.m., Christian König wrote: >>>> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>>>> [SNIP] >>>>>>> The expected result is they all move closer to the start of PCI >>>>>>> address >>>>>>> space. >>>>>>> >>>>>> >>>>>> Ok, I updated as you described. Also I removed PCI conf command >>>>>> to stop >>>>>> address decoding and restart later as I noticed PCI core does it >>>>>> itself >>>>>> when needed. >>>>>> I tested now also with graphic desktop enabled while submitting >>>>>> 3d draw commands and seems like under this scenario everything still >>>>>> works. Again, this all needs to be tested with VRAM BAR move as then >>>>>> I believe I will see more issues like handling of MMIO mapped >>>>>> VRAM objects (like GART table). In case you do have an AMD card >>>>>> you could also maybe give it a try. In the meanwhile I will add >>>>>> support to ioremapping of those VRAM objects. >>>>>> >>>>>> Andrey >>>>> >>>>> Just an update, added support for unmaping/remapping of all VRAM >>>>> objects, both user space mmaped and kernel ioremaped. Seems to work >>>>> ok but again, without forcing VRAM BAR to move I can't be sure. >>>>> Alex, Chsristian - take a look when you have some time to give me >>>>> some >>>>> initial feedback on the amdgpu side. 
>>>>> >>>>> The code is at >>>>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>>>> >>>> >>>> Mhm, that let's userspace busy retry until the BAR movement is done. >>>> >>>> Not sure if that can't live lock somehow. >>>> >>>> Christian. >>> >>> In my testing it didn't but, I can instead route them to some >>> global static dummy page while BARs are moving and then when everything >>> done just invalidate the device address space again and let the >>> pagefaults fill in valid PFNs again. >> >> Well that won't work because the reads/writes which are done in the >> meantime do need to wait for the BAR to be available again. >> >> So waiting for the BAR move to finish is correct, but what we should >> do is to use a lock instead of an SRCU because that makes lockdep >> complain when we do something nasty. >> >> Christian. > > > Spinlock I assume ? We can't sleep there - it's an interrupt. Mhm, the BAR movement is in interrupt context? Well that is rather bad. I was hoping to rename the GPU reset rw_sem into device_access rw_sem and then use the same lock for both (It's essentially the same problem). But when we need to move the BAR in atomic/interrupt context that makes things a bit more complicated. Christian. > > Andrey > > >> >>> >>> Andrey >>> >>>> >>>>> >>>>> Andrey >>>> >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-15 16:55 ` Christian König @ 2021-03-15 17:11 ` Andrey Grodzovsky 2021-03-16 16:17 ` Christian König 0 siblings, 1 reply; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-15 17:11 UTC (permalink / raw) To: Christian König, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-15 12:55 p.m., Christian König wrote: > > > Am 15.03.21 um 17:21 schrieb Andrey Grodzovsky: >> >> On 2021-03-15 12:10 p.m., Christian König wrote: >>> Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: >>>> >>>> >>>> On 2021-03-12 4:03 a.m., Christian König wrote: >>>>> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>>>>> [SNIP] >>>>>>>> The expected result is they all move closer to the start of PCI >>>>>>>> address >>>>>>>> space. >>>>>>>> >>>>>>> >>>>>>> Ok, I updated as you described. Also I removed PCI conf command >>>>>>> to stop >>>>>>> address decoding and restart later as I noticed PCI core does it >>>>>>> itself >>>>>>> when needed. >>>>>>> I tested now also with graphic desktop enabled while submitting >>>>>>> 3d draw commands and seems like under this scenario everything >>>>>>> still >>>>>>> works. Again, this all needs to be tested with VRAM BAR move as >>>>>>> then >>>>>>> I believe I will see more issues like handling of MMIO mapped >>>>>>> VRAM objects (like GART table). In case you do have an AMD card >>>>>>> you could also maybe give it a try. In the meanwhile I will add >>>>>>> support to ioremapping of those VRAM objects. >>>>>>> >>>>>>> Andrey >>>>>> >>>>>> Just an update, added support for unmaping/remapping of all VRAM >>>>>> objects, both user space mmaped and kernel ioremaped. Seems to work >>>>>> ok but again, without forcing VRAM BAR to move I can't be sure. >>>>>> Alex, Chsristian - take a look when you have some time to give me >>>>>> some >>>>>> initial feedback on the amdgpu side. 
>>>>>> >>>>>> The code is at >>>>>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>>>>> >>>>> >>>>> Mhm, that let's userspace busy retry until the BAR movement is done. >>>>> >>>>> Not sure if that can't live lock somehow. >>>>> >>>>> Christian. >>>> >>>> In my testing it didn't but, I can instead route them to some >>>> global static dummy page while BARs are moving and then when >>>> everything >>>> done just invalidate the device address space again and let the >>>> pagefaults fill in valid PFNs again. >>> >>> Well that won't work because the reads/writes which are done in the >>> meantime do need to wait for the BAR to be available again. >>> >>> So waiting for the BAR move to finish is correct, but what we should >>> do is to use a lock instead of an SRCU because that makes lockdep >>> complain when we do something nasty. >>> >>> Christian. >> >> >> Spinlock I assume ? We can't sleep there - it's an interrupt. > > Mhm, the BAR movement is in interrupt context? No, the BAR move is in task context, I believe. The page faults are in interrupt context and so we can only take a spinlock there I assume, not a mutex which might sleep. But we can't hold a spinlock for the entire BAR move, because HW suspend + ASIC reset is a long process, probably with some sleeps/context switches inside it. > > Well that is rather bad. I was hoping to rename the GPU reset rw_sem > into device_access rw_sem and then use the same lock for both (It's > essentially the same problem). I was thinking about it from day 1, but what looked different to me is that in the GPU reset case there is no technical need to block MMIO accesses, as the BARs are not moving and so the page table entries remain valid. It's true that while the device is in reset those MMIO accesses are meaningless - so this indeed could be a good reason to block access even during GPU reset.
Andrey > > But when we need to move the BAR in atomic/interrupt context that > makes things a bit more complicated. > > Christian. > >> >> Andrey >> >> >>> >>>> >>>> Andrey >>>> >>>>> >>>>>> >>>>>> Andrey >>>>> >>> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-15 17:11 ` Andrey Grodzovsky @ 2021-03-16 16:17 ` Christian König 2021-03-16 17:39 ` Andrey Grodzovsky 0 siblings, 1 reply; 15+ messages in thread From: Christian König @ 2021-03-16 16:17 UTC (permalink / raw) To: Andrey Grodzovsky, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux Am 15.03.21 um 18:11 schrieb Andrey Grodzovsky: > > On 2021-03-15 12:55 p.m., Christian König wrote: >> >> >> Am 15.03.21 um 17:21 schrieb Andrey Grodzovsky: >>> >>> On 2021-03-15 12:10 p.m., Christian König wrote: >>>> Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: >>>>> >>>>> >>>>> On 2021-03-12 4:03 a.m., Christian König wrote: >>>>>> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>>>>>> [SNIP] >>>>>>>>> The expected result is they all move closer to the start of >>>>>>>>> PCI address >>>>>>>>> space. >>>>>>>>> >>>>>>>> >>>>>>>> Ok, I updated as you described. Also I removed PCI conf command >>>>>>>> to stop >>>>>>>> address decoding and restart later as I noticed PCI core does >>>>>>>> it itself >>>>>>>> when needed. >>>>>>>> I tested now also with graphic desktop enabled while submitting >>>>>>>> 3d draw commands and seems like under this scenario everything >>>>>>>> still >>>>>>>> works. Again, this all needs to be tested with VRAM BAR move as >>>>>>>> then >>>>>>>> I believe I will see more issues like handling of MMIO mapped >>>>>>>> VRAM objects (like GART table). In case you do have an AMD card >>>>>>>> you could also maybe give it a try. In the meanwhile I will add >>>>>>>> support to ioremapping of those VRAM objects. >>>>>>>> >>>>>>>> Andrey >>>>>>> >>>>>>> Just an update, added support for unmaping/remapping of all VRAM >>>>>>> objects, both user space mmaped and kernel ioremaped. Seems to work >>>>>>> ok but again, without forcing VRAM BAR to move I can't be sure. 
>>>>>>> Alex, Chsristian - take a look when you have some time to give >>>>>>> me some >>>>>>> initial feedback on the amdgpu side. >>>>>>> >>>>>>> The code is at >>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>>>>>> >>>>>> >>>>>> Mhm, that let's userspace busy retry until the BAR movement is done. >>>>>> >>>>>> Not sure if that can't live lock somehow. >>>>>> >>>>>> Christian. >>>>> >>>>> In my testing it didn't but, I can instead route them to some >>>>> global static dummy page while BARs are moving and then when >>>>> everything >>>>> done just invalidate the device address space again and let the >>>>> pagefaults fill in valid PFNs again. >>>> >>>> Well that won't work because the reads/writes which are done in the >>>> meantime do need to wait for the BAR to be available again. >>>> >>>> So waiting for the BAR move to finish is correct, but what we >>>> should do is to use a lock instead of an SRCU because that makes >>>> lockdep complain when we do something nasty. >>>> >>>> Christian. >>> >>> >>> Spinlock I assume ? We can't sleep there - it's an interrupt. >> >> Mhm, the BAR movement is in interrupt context? > > > No, BARs move is in task context I believe. The page faults are in > interrupt context and so we can only lock a spinlock there I assume, No, page faults are in task context as well! Otherwise you wouldn't be able to sleep for network I/O in a page fault for example. > not a mutex which might sleep. But we can't lock > spinlock for the entire BAR move because HW suspend + asic reset is a > long process with some sleeps/context switches inside it probably. > > >> >> Well that is rather bad. I was hoping to rename the GPU reset rw_sem >> into device_access rw_sem and then use the same lock for both (It's >> essentially the same problem). 
> > > I was thinking about it from day 1 but what looked to me different is > that in GPU reset case there is no technical need to block MMIO > accesses as the BARs are not moving > and so the page table entries remain valid. It's true that while the > device in reset those MMIO accesses are meaninglessness - so this > indeed could be good reason to block > access even during GPU reset. From the experience now I would say that we should block MMIO access during GPU reset as necessary. We can't do things like always taking the lock in each IOCTL, but for low level hardware access it shouldn't be a problem at all. Christian. > > Andrey > > >> >> But when we need to move the BAR in atomic/interrupt context that >> makes things a bit more complicated. >> >> Christian. >> >>> >>> Andrey >>> >>> >>>> >>>>> >>>>> Andrey >>>>> >>>>>> >>>>>>> >>>>>>> Andrey >>>>>> >>>> >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Question about supporting AMD eGPU hot plug case 2021-03-16 16:17 ` Christian König @ 2021-03-16 17:39 ` Andrey Grodzovsky 0 siblings, 0 replies; 15+ messages in thread From: Andrey Grodzovsky @ 2021-03-16 17:39 UTC (permalink / raw) To: Christian König, Sergei Miroshnichenko Cc: Alexander.Deucher, anatoli.antonovitch, helgaas, linux-pci, linux On 2021-03-16 12:17 p.m., Christian König wrote: > Am 15.03.21 um 18:11 schrieb Andrey Grodzovsky: >> >> On 2021-03-15 12:55 p.m., Christian König wrote: >>> >>> >>> Am 15.03.21 um 17:21 schrieb Andrey Grodzovsky: >>>> >>>> On 2021-03-15 12:10 p.m., Christian König wrote: >>>>> Am 12.03.21 um 16:34 schrieb Andrey Grodzovsky: >>>>>> >>>>>> >>>>>> On 2021-03-12 4:03 a.m., Christian König wrote: >>>>>>> Am 11.03.21 um 23:40 schrieb Andrey Grodzovsky: >>>>>>>> [SNIP] >>>>>>>>>> The expected result is they all move closer to the start of >>>>>>>>>> PCI address >>>>>>>>>> space. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Ok, I updated as you described. Also I removed PCI conf >>>>>>>>> command to stop >>>>>>>>> address decoding and restart later as I noticed PCI core does >>>>>>>>> it itself >>>>>>>>> when needed. >>>>>>>>> I tested now also with graphic desktop enabled while submitting >>>>>>>>> 3d draw commands and seems like under this scenario everything >>>>>>>>> still >>>>>>>>> works. Again, this all needs to be tested with VRAM BAR move >>>>>>>>> as then >>>>>>>>> I believe I will see more issues like handling of MMIO mapped >>>>>>>>> VRAM objects (like GART table). In case you do have an AMD >>>>>>>>> card you could also maybe give it a try. In the meanwhile I >>>>>>>>> will add support to ioremapping of those VRAM objects. >>>>>>>>> >>>>>>>>> Andrey >>>>>>>> >>>>>>>> Just an update, added support for unmaping/remapping of all VRAM >>>>>>>> objects, both user space mmaped and kernel ioremaped. Seems to >>>>>>>> work >>>>>>>> ok but again, without forcing VRAM BAR to move I can't be sure. 
>>>>>>>> Alex, Chsristian - take a look when you have some time to give >>>>>>>> me some >>>>>>>> initial feedback on the amdgpu side. >>>>>>>> >>>>>>>> The code is at >>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1 >>>>>>>> >>>>>>> >>>>>>> Mhm, that let's userspace busy retry until the BAR movement is >>>>>>> done. >>>>>>> >>>>>>> Not sure if that can't live lock somehow. >>>>>>> >>>>>>> Christian. >>>>>> >>>>>> In my testing it didn't but, I can instead route them to some >>>>>> global static dummy page while BARs are moving and then when >>>>>> everything >>>>>> done just invalidate the device address space again and let the >>>>>> pagefaults fill in valid PFNs again. >>>>> >>>>> Well that won't work because the reads/writes which are done in >>>>> the meantime do need to wait for the BAR to be available again. >>>>> >>>>> So waiting for the BAR move to finish is correct, but what we >>>>> should do is to use a lock instead of an SRCU because that makes >>>>> lockdep complain when we do something nasty. >>>>> >>>>> Christian. >>>> >>>> >>>> Spinlock I assume ? We can't sleep there - it's an interrupt. >>> >>> Mhm, the BAR movement is in interrupt context? >> >> >> No, BARs move is in task context I believe. The page faults are in >> interrupt context and so we can only lock a spinlock there I assume, > > No, page faults are in task context as well! Otherwise you wouldn't be > able to sleep for network I/O in a page fault for example. Ok, that was a long-standing confusion on my side, especially because 'Understanding the Linux Kernel' states that do_page_fault is an interrupt handler here - https://vistech.net/~champ/online-docs/books/linuxkernel2/060.htm while in fact this is an exception handler which runs in the context of the user process causing it and hence can sleep (as explained here by Robert Love himself) https://www.spinics.net/lists/newbies/msg07287.html > >> not a mutex which might sleep. 
But we can't lock >> spinlock for the entire BAR move because HW suspend + asic reset is a >> long process with some sleeps/context switches inside it probably. >> >> >>> >>> Well that is rather bad. I was hoping to rename the GPU reset rw_sem >>> into device_access rw_sem and then use the same lock for both (It's >>> essentially the same problem). >> >> >> I was thinking about it from day 1 but what looked to me different is >> that in GPU reset case there is no technical need to block MMIO >> accesses as the BARs are not moving >> and so the page table entries remain valid. It's true that while the >> device in reset those MMIO accesses are meaninglessness - so this >> indeed could be good reason to block >> access even during GPU reset. > > From the experience now I would say that we should block MMIO access > during GPU reset as necessary. > > We can't do things like always taking the lock in each IOCTL, but for > low level hardware access it shouldn't be a problem at all. > > Christian. I will update the code then to reuse our adev->reset_sem for this locking. Andrey > >> >> Andrey >> >> >>> >>> But when we need to move the BAR in atomic/interrupt context that >>> makes things a bit more complicated. >>> >>> Christian. >>> >>>> >>>> Andrey >>>> >>>> >>>>> >>>>>> >>>>>> Andrey >>>>>> >>>>>>> >>>>>>>> >>>>>>>> Andrey >>>>>>> >>>>> >>> > ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-03-16 17:40 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <ddef2da4-4726-7321-40fe-5f90788cc836@amd.com> [not found] ` <MN2PR12MB44880052F5C4ECF3EF22B58EF7B69@MN2PR12MB4488.namprd12.prod.outlook.com> [not found] ` <ffb816ae8336ff2373ec5fcf695e698900c3d557.camel@yadro.com> [not found] ` <9c41221f-ecfa-5554-f2ea-6f72cfe7dc7e@amd.com> [not found] ` <dae8dfd8-3a99-620d-f0aa-ceb39923b807@amd.com> [not found] ` <7d9e947648ce47a4ba8d223cdec08197@yadro.com> [not found] ` <c82919f3-5296-cd0a-0b8f-c33614ca3ea9@amd.com> [not found] ` <340062dba82b813b311b2c78022d2d3d0e6f414d.camel@yadro.com> [not found] ` <927d7fbe-756f-a4f1-44cd-c68aecd906d7@amd.com> [not found] ` <dc2de228b92c4e27819c7681f650e1e5@yadro.com> [not found] ` <a6af29ed-179a-7619-3dde-474542c180f4@amd.com> [not found] ` <8f53f1403f0c4121885398487bbfa241@yadro.com> [not found] ` <fc2ea091-8470-9501-242d-8d82714adecb@amd.com> [not found] ` <50afd1079dbabeba36fd35fdef84e6e15470ef45.camel@yadro.com> [not found] ` <c53c23b5-dd37-44f1-dffd-ff9699018a82@amd.com> [not found] ` <8d7e2d7b7982d8d78c0ecaa74b9d40ace4db8453.camel@yadro.com> 2021-03-04 19:49 ` Question about supporting AMD eGPU hot plug case Andrey Grodzovsky 2021-03-05 16:08 ` Sergei Miroshnichenko 2021-03-05 17:13 ` Andrey Grodzovsky 2021-03-05 19:12 ` Sergei Miroshnichenko 2021-03-05 23:13 ` Andrey Grodzovsky 2021-03-11 22:40 ` Andrey Grodzovsky 2021-03-12 9:03 ` Christian König 2021-03-12 15:34 ` Andrey Grodzovsky 2021-03-15 16:00 ` Andrey Grodzovsky 2021-03-15 16:10 ` Christian König 2021-03-15 16:21 ` Andrey Grodzovsky 2021-03-15 16:55 ` Christian König 2021-03-15 17:11 ` Andrey Grodzovsky 2021-03-16 16:17 ` Christian König 2021-03-16 17:39 ` Andrey Grodzovsky