Re: Question about supporting AMD eGPU hot plug case

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Sergei Miroshnichenko <s.miroshnichenko@yadro.com>
Cc: "Alexander.Deucher@amd.com" <Alexander.Deucher@amd.com>,
	"Christian.Koenig@amd.com" <Christian.Koenig@amd.com>,
	"anatoli.antonovitch@amd.com" <anatoli.antonovitch@amd.com>,
	"helgaas@kernel.org" <helgaas@kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux@yadro.com" <linux@yadro.com>
Subject: Re: Question about supporting AMD eGPU hot plug case
Date: Thu, 11 Mar 2021 17:40:55 -0500	[thread overview]
Message-ID: <e3f3de55-8011-77d8-25ac-f16f8256beff@amd.com> (raw)
In-Reply-To: <146844cc-e2d9-aade-8223-db41b37853c5@amd.com>

On 2021-03-05 6:13 p.m., Andrey Grodzovsky wrote:
> 
> 
> On 2021-03-05 2:12 p.m., Sergei Miroshnichenko wrote:
>> On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote:
>>>
>>> On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote:
>>>> On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote:
>>>>> + linux-pci
>>>>>
>>>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
>>>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
>>>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
>>>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
>>>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
>>>>>>>>>> ...
>>>>>>>>> Are you saying that even without hot-plugging, while both
>>>>>>>>> nvme
>>>>>>>>> and
>>>>>>>>> AMD
>>>>>>>>> card are present
>>>>>>>>> right from boot, you still get BARs moving and MMIO
>>>>>>>>> ranges
>>>>>>>>> reassigned
>>>>>>>>> for NVME BARs
>>>>>>>>> just because amdgpu driver will start resize of AMD card
>>>>>>>>> BARs
>>>>>>>>> and
>>>>>>>>> this
>>>>>>>>> will trigger NVMEs BARs move to
>>>>>>>>> allow AMD card BARs to cover full range of VIDEO RAM ?
>>>>>>>> Yes. Unconditionally, because it is unknown beforehand if
>>>>>>>> NVMe's
>>>>>>>> BAR
>>>>>>>> movement will help. In this particular case BAR movement is
>>>>>>>> not
>>>>>>>> needed,
>>>>>>>> but is done anyway.
>>>>>>>>
>>>>>>>> BARs are not moved one by one, but the kernel releases all
>>>>>>>> the
>>>>>>>> releasable ones, and then recalculates a new BAR layout to
>>>>>>>> fit
>>>>>>>> them
>>>>>>>> all. Kernel's algorithm is different from BIOS's, so NVME
>>>>>>>> has
>>>>>>>> appeared
>>>>>>>> at a new place.
>>>>>>>>
>>>>>>>> This is triggered by following:
>>>>>>>> - at boot, if BIOS had assigned not every BAR;
>>>>>>>> - during pci_resize_resource();
>>>>>>>> - during pci_rescan_bus() -- after a pciehp event or a
>>>>>>>> manual
>>>>>>>> via
>>>>>>>> sysfs.
>>>>>>>
>>>>>>> By manual via sysfs you mean something like this - 'echo 1 >
>>>>>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
>>>>>>> /sys/bus/pci/rescan ' ? I am looking into how most reliably
>>>>>>> trigger
>>>>>>> PCI
>>>>>>> code to call my callbacks even without having external PCI
>>>>>>> cage
>>>>>>> for
>>>>>>> GPU
>>>>>>> (will take me some time to get it).
>>>>>>
>>>>>> Yeah, this is our way to go when a device can't be physically
>>>>>> removed
>>>>>> or unpowered remotely. With just a bit shorter path:
>>>>>>
>>>>>>      sudo sh -c 'echo 1 >
>>>>>> /sys/bus/pci/devices/0000\:0c\:00.0/remove'
>>>>>>      sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
>>>>>>
>>>>>> Or, just a second command (rescan) is enough: a BAR movement
>>>>>> attempt
>>>>>> will be triggered even if there were no changes in PCI
>>>>>> topology.
>>>>>>
>>>>>> Serge
>>>>>>
>>>>>
>>>>> Hi Segrei
>>>>>
>>>>> Here is a link to initial implementation on top of your tree
>>>>> (movable_bars_v9.1) -
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D05d6abceed650181bb7fe0a49884a26e378b908e&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=qEC3qIAM8h1vU4gGEgT6sThXsaCuatTI2UjM9Bb8KGI%3D&amp;reserved=0 
>>>>>
>>>>> I am able to pass one re-scan cycle and can use the card
>>>>> afterwards
>>>>> (see
>>>>> log1.log).
>>>>> But, according to your prints only BAR5 which is registers BAR
>>>>> was
>>>>> updated (amdgpu 0000:0b:00.0: BAR 5 updated: 0xfcc00000 ->
>>>>> 0xfc100000)
>>>>> while I am interested to test BAR0 (Graphic RAM) move since this
>>>>> is
>>>>> where most of the complexity is. Is there a way to hack your code
>>>>> to
>>>>> force this ?
>>>>
>>>> Hi Andrey,
>>>>
>>>> Regarding the amdgpu's BAR0 remaining on its place: it seems this
>>>> is
>>>> because of fixed BARs starting from fc600000. The kernel tends to
>>>> group
>>>> the BARs close to each other, making a bridge window as compact as
>>>> possible. So the BAR0 had occupied the closest "comfortable" slots
>>>> 0xe0000000-0xefffffff, with the resulting bridge window of bus 00
>>>> covering all the BARs:
>>>>
>>>>       pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff
>>>> window]
>>>>
>>>> I'll let you know if I get an idea how to rearrange that manually.
>>>>
>>>> Two GPUs can actually swap their places.
>>>
>>> What do you mean ?
>>
>> I was thinking: when the scenario of a PCI rescan with two GPUs (as was
>> described below) will start working, BAR0 of GPU0 can take place of
>> BAR0 of GPU1 after the first rescan.
>>
>>>> What also can make a BAR movable -- is rmmod'ing its driver. It
>>>> could
>>>> be some hack from within a tmux, like:
>>>>
>>>>     rmmod igb; \
>>>>     rmmod xhci_hcd; \
>>>>     rmmod ahci; \
>>>>     echo 1 > /sys/bus/pci/rescan; \
>>>>     modprobe igb; \
>>>>     modprobe xhci_hcd; \
>>>>     modprobe ahci
>>>
>>> But should I also rmmod amdgpu ? Or modprobing back the other
>>> drivers
>>> should cause (hopefully) BAR0 move in AMD graphic card ?
>>
>> You have already made the amdgpu movable, so no need to rmmod it --
>> just those with fixed BARs:
>>
>>      xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000-
>> 0xfc6fffff 64bit]
>>      igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff]
>>      igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff]
>>      ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff
>> pref]
>>      ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000-
>> 0xfcb9ffff]
>>      xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000-
>> 0xfcba7fff 64bit]
>>      xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000-
>> 0xfca07fff 64bit]
>>      ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000-
>> 0xfce08fff]
>>
>> The expected result is they all move closer to the start of PCI address
>> space.
>>
> 
> Ok, I updated as you described. Also I removed PCI conf command to stop
> address decoding and restart later as I noticed PCI core does it itself
> when needed.
> I tested now also with graphic desktop enabled while submitting
> 3d draw commands and seems like under this scenario everything still
> works. Again, this all needs to be tested with VRAM BAR move as then
> I believe I will see more issues like handling of MMIO mapped VRAM 
> objects (like GART table). In case you do have an AMD card you could 
> also maybe give it a try. In the meanwhile I will add support to 
> ioremapping of those VRAM objects.
> 
> Andrey

Just an update, added support for unmaping/remapping of all VRAM
objects, both user space mmaped and kernel ioremaped. Seems to work
ok but again, without forcing VRAM BAR to move I can't be sure.
Alex, Chsristian - take a look when you have some time to give me some
initial feedback on the amdgpu side.

The code is at 
https://cgit.freedesktop.org/~agrodzov/linux/log/?h=yadro%2Fpcie_hotplug%2Fmovable_bars_v9.1

Andrey

> 
>>>> I think pci_release_resource() should not be in
>>>> amdgpu_device_unmap_mmio() -- the patched kernel will do that
>>>> itself
>>>> for BARs the amdgpu_device_bar_fixed() returns false. Even more --
>>>> the
>>>> kernel will ensure that all BARs which were working before, are
>>>> reassigned properly, so it needs them to be assigned before the
>>>> procedure.
>>>> The same for pci_assign_unassigned_bus_resources() in
>>>> amdgpu_device_remap_mmio(): this callback is invoked from
>>>> pci_rescan_bus() after pci_assign_unassigned_root_bus_resources().
>>>
>>> This seems to me in contrast to your documentation (see
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FYADRO-KNS%2Flinux%2Fcommit%2F5bc12ba7c74f1c19c11db29b4807bd32acfac2c2&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C6658f0cc7c344791ce0f08d8e00a96bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637505683386334114%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=kO1OlRL8iHMTcijuV0jDpODCtXpCCTpJv6YIn%2FuypNQ%3D&amp;reserved=0 
>>>
>>> step 1) although step 2 seems also to contradict step 1 with regard
>>> to
>>> BARs release - so now I am a bit confused. Also looking at
>>> nvme_dev_unmap - it calls pci_release_mem_regions. Symmetrical
>>> acquisition happens in nvme_dev_unmap.
>>
>> Ah, there is a difference between pci_release_region() and
>> pci_release_resource(), so subtle that I had to refresh my memory. You
>> are right, this has to be explained in the documentation!
>>
>> $ sudo cat /proc/iomem
>> ...
>> f0000000-fcffffff : PCI Bus 0000:00     -- root bus resource
>> ...
>>    fcf00000-fcffffff : PCI Bus 0000:01   -- bridge window
>>      fcf00000-fcf03fff : 0000:01:00.0    -- pci resource (BAR)
>>        fcf00000-fcf03fff : nvme          -- pci region (reserved by
>>                                             a driver, has its name).
>>
>> So the nvme_dev_unmap() reflects with pci_release_region() that the BAR
>> is not used by the driver anymore -- this actually should be called in
>> every rescan_prepare().
>>
>> But the pci_release_resource() tells to the PCI subsystem that the BAR
>> is "released" from the device and has to be assigned to some address
>> before using again, and makes the pci_resource_start(pdev,
>> relased_barno) invalid.
>>
>> Why the quotes: pci_release_resource() doesn't turn off the BAR,
>> doesn't write the registers -- this happens later.
>>
>> I thouht at first that pci_release_resource() is not safe in a
>> rescan_prepare(), but then double-checked, and found it's fine, just
>> not needed, as the kernel will do it anyway. And the
>> pci_bus_check_bars_assigned() to compare the bitmasks of successfully
>> assigned BARs is called *before* the hook.
>>
>>>>> When testing with 2 graphic cards and triggering rescan, hard
>>>>> hang of
>>>>> the system happens during rescan_prepare of the second card  when
>>>>> stopping the HW (see log2.log) - I don't understand why this
>>>>> would
>>>>> happen as each of them passes fine when they are standalone
>>>>> tested
>>>>> and
>>>>> there should be no interdependence between them as far as i know.
>>>>> Do you have any idea ?
>>>>
>>>> What happens with two GPUs is unclear for me as well, nothing looks
>>>> suspicious.
>>>>
>>>> Serge
>>>>