Re: Question about supporting AMD eGPU hot plug case

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Sergei Miroshnichenko <s.miroshnichenko@yadro.com>
Cc: "Alexander.Deucher@amd.com" <Alexander.Deucher@amd.com>,
	"Christian.Koenig@amd.com" <Christian.Koenig@amd.com>,
	"anatoli.antonovitch@amd.com" <anatoli.antonovitch@amd.com>,
	"helgaas@kernel.org" <helgaas@kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux@yadro.com" <linux@yadro.com>
Subject: Re: Question about supporting AMD eGPU hot plug case
Date: Fri, 5 Mar 2021 18:13:28 -0500
Message-ID: <146844cc-e2d9-aade-8223-db41b37853c5@amd.com>
In-Reply-To: <98ac52f982409e22fbd6e6659e2724f9b1f2fafd.camel@yadro.com>

On 2021-03-05 2:12 p.m., Sergei Miroshnichenko wrote:
> On Fri, 2021-03-05 at 12:13 -0500, Andrey Grodzovsky wrote:
>>
>> On 2021-03-05 11:08 a.m., Sergei Miroshnichenko wrote:
>>> On Thu, 2021-03-04 at 14:49 -0500, Andrey Grodzovsky wrote:
>>>> + linux-pci
>>>>
>>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
>>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
>>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
>>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
>>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
>>>>>>>>> ...
>>>>>>>> Are you saying that even without hot-plugging, while both the
>>>>>>>> NVMe and the AMD card are present right from boot, you still
>>>>>>>> get BARs moving and MMIO ranges reassigned for the NVMe BARs,
>>>>>>>> just because the amdgpu driver starts resizing the AMD card's
>>>>>>>> BARs, and this triggers the NVMe BARs to move so that the AMD
>>>>>>>> card's BARs can cover the full range of video RAM?
>>>>>>> Yes. Unconditionally, because it is unknown beforehand whether
>>>>>>> moving the NVMe's BAR will help. In this particular case BAR
>>>>>>> movement is not needed, but is done anyway.
>>>>>>>
>>>>>>> BARs are not moved one by one: the kernel releases all the
>>>>>>> releasable ones, and then recalculates a new BAR layout to fit
>>>>>>> them all. The kernel's algorithm is different from the BIOS's,
>>>>>>> so the NVMe has appeared at a new place.
>>>>>>>
>>>>>>> This is triggered by the following:
>>>>>>> - at boot, if the BIOS did not assign every BAR;
>>>>>>> - during pci_resize_resource();
>>>>>>> - during pci_rescan_bus() -- after a pciehp event or a manual
>>>>>>>   rescan via sysfs.
>>>>>>
>>>>>> By manual via sysfs you mean something like this - 'echo 1 >
>>>>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
>>>>>> /sys/bus/pci/rescan' ? I am looking into how to most reliably
>>>>>> trigger the PCI code to call my callbacks even without having an
>>>>>> external PCI cage for the GPU (it will take me some time to get
>>>>>> one).
>>>>>
>>>>> Yeah, this is our way to go when a device can't be physically
>>>>> removed or unpowered remotely. With just a bit shorter path:
>>>>>
>>>>>      sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove'
>>>>>      sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
>>>>>
>>>>> Or just the second command (rescan) is enough: a BAR movement
>>>>> attempt will be triggered even if there were no changes in the
>>>>> PCI topology.
>>>>>
>>>>> Serge
>>>>>
>>>>
>>>> Hi Sergei
>>>>
>>>> Here is a link to initial implementation on top of your tree
>>>> (movable_bars_v9.1) -
>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=05d6abceed650181bb7fe0a49884a26e378b908e
>>>> I am able to pass one rescan cycle and can use the card afterwards
>>>> (see log1.log). But, according to your prints, only BAR5, which is
>>>> the registers BAR, was updated (amdgpu 0000:0b:00.0: BAR 5 updated:
>>>> 0xfcc00000 -> 0xfc100000), while I am interested in testing a BAR0
>>>> (graphics RAM) move, since this is where most of the complexity is.
>>>> Is there a way to hack your code to force this?
>>>
>>> Hi Andrey,
>>>
>>> Regarding amdgpu's BAR0 remaining in its place: it seems this is
>>> because of the fixed BARs starting from fc600000. The kernel tends
>>> to group the BARs close to each other, making a bridge window as
>>> compact as possible. So BAR0 occupied the closest "comfortable"
>>> slot, 0xe0000000-0xefffffff, with the resulting bridge window of
>>> bus 00 covering all the BARs:
>>>
>>>       pci_bus 0000:00: resource 10 [mem 0xe0000000-0xfec2ffff window]
>>>
>>> I'll let you know if I get an idea how to rearrange that manually.
>>>
>>> Two GPUs can actually swap their places.
>>
>> What do you mean ?
> 
> I was thinking: once the scenario of a PCI rescan with two GPUs (as
> described below) starts working, BAR0 of GPU0 can take the place of
> BAR0 of GPU1 after the first rescan.
> 
>>> What can also make a BAR movable is rmmod'ing its driver. It could
>>> be some hack from within a tmux, like:
>>>
>>>     rmmod igb; \
>>>     rmmod xhci_hcd; \
>>>     rmmod ahci; \
>>>     echo 1 > /sys/bus/pci/rescan; \
>>>     modprobe igb; \
>>>     modprobe xhci_hcd; \
>>>     modprobe ahci
>>
>> But should I also rmmod amdgpu? Or should modprobing the other
>> drivers back (hopefully) cause a BAR0 move in the AMD graphics card?
> 
> You have already made the amdgpu movable, so no need to rmmod it --
> just those with fixed BARs:
> 
>      xhci_hcd 0000:0c:00.3: BAR 0: assigned fixed [mem 0xfc600000-0xfc6fffff 64bit]
>      igb 0000:07:00.0: BAR 0: assigned fixed [mem 0xfc900000-0xfc91ffff]
>      igb 0000:07:00.0: BAR 3: assigned fixed [mem 0xfc920000-0xfc923fff]
>      ahci 0000:02:00.1: BAR 6: assigned fixed [mem 0xfcb00000-0xfcb7ffff pref]
>      ahci 0000:02:00.1: BAR 5: assigned fixed [mem 0xfcb80000-0xfcb9ffff]
>      xhci_hcd 0000:02:00.0: BAR 0: assigned fixed [mem 0xfcba0000-0xfcba7fff 64bit]
>      xhci_hcd 0000:05:00.0: BAR 0: assigned fixed [mem 0xfca00000-0xfca07fff 64bit]
>      ahci 0000:0d:00.2: BAR 5: assigned fixed [mem 0xfce08000-0xfce08fff]
> 
> The expected result is they all move closer to the start of PCI address
> space.
> 

OK, I updated the code as you described. I also removed the PCI config
command that stops address decoding and restarts it later, as I noticed
that the PCI core does this itself when needed.
I have now also tested with a graphical desktop running while
submitting 3D draw commands, and in this scenario everything still
seems to work. Again, all of this needs to be tested with a VRAM BAR
move, as I believe I will then see more issues, such as the handling of
MMIO-mapped VRAM objects (like the GART table). If you have an AMD
card, maybe you could give it a try as well. In the meantime I will add
support for ioremapping those VRAM objects.
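
To check whether BAR0 (and not only BAR5) really moved across a rescan,
the sysfs 'resource' file of the device can be compared before and
after. Below is a minimal sketch, not part of the patches: the helper
names (parse_resource_table, read_bars, moved_bars) are made up for
illustration, and it assumes that each row of the file holds start, end
and flags in hex and that the row index matches the resource number
(rows 0-5 being the BARs).

```python
from pathlib import Path


def parse_resource_table(text):
    """Parse the text of a sysfs PCI 'resource' file.

    Returns a list of (start, end, flags) tuples, one per row.
    Assumption: every row is three hexadecimal fields.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore anything that is not a start/end/flags row
        rows.append(tuple(int(field, 16) for field in fields))
    return rows


def read_bars(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the resource table of one device, e.g. bdf='0000:0b:00.0'."""
    return parse_resource_table((Path(sysfs_root) / bdf / "resource").read_text())


def moved_bars(before, after):
    """Return {row_index: (old_start, new_start)} for rows whose start changed."""
    moved = {}
    for index, (old, new) in enumerate(zip(before, after)):
        if old[0] != new[0]:
            moved[index] = (old[0], new[0])
    return moved
```

The intended use is to call read_bars() once before 'echo 1 >
/sys/bus/pci/rescan' and once after, and pass both results to
moved_bars(); an empty dict means no resource start address changed.
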

Andrey

>>> I think pci_release_resource() should not be in
>>> amdgpu_device_unmap_mmio() -- the patched kernel will do that itself
>>> for the BARs for which amdgpu_device_bar_fixed() returns false.
>>> Moreover, the kernel will ensure that all BARs which were working
>>> before are reassigned properly, so it needs them to be assigned
>>> before the procedure.
>>> The same goes for pci_assign_unassigned_bus_resources() in
>>> amdgpu_device_remap_mmio(): this callback is invoked from
>>> pci_rescan_bus() after pci_assign_unassigned_root_bus_resources().
>>
>> This seems to me in contrast to your documentation (see
>> https://github.com/YADRO-KNS/linux/commit/5bc12ba7c74f1c19c11db29b4807bd32acfac2c2
>> step 1), although step 2 also seems to contradict step 1 with regard
>> to BAR release - so now I am a bit confused. Also, looking at
>> nvme_dev_unmap - it calls pci_release_mem_regions. The symmetrical
>> acquisition happens in nvme_dev_map.
> 
> Ah, there is a difference between pci_release_region() and
> pci_release_resource(), so subtle that I had to refresh my memory. You
> are right, this has to be explained in the documentation!
> 
> $ sudo cat /proc/iomem
> ...
> f0000000-fcffffff : PCI Bus 0000:00     -- root bus resource
> ...
>    fcf00000-fcffffff : PCI Bus 0000:01   -- bridge window
>      fcf00000-fcf03fff : 0000:01:00.0    -- pci resource (BAR)
>        fcf00000-fcf03fff : nvme          -- pci region (reserved by
>                                             a driver, has its name).
> 
> So nvme_dev_unmap() uses pci_release_region() to reflect that the BAR
> is not used by the driver anymore -- this actually should be called in
> every rescan_prepare().
> 
> But pci_release_resource() tells the PCI subsystem that the BAR
> is "released" from the device and has to be assigned to some address
> before being used again, and it makes pci_resource_start(pdev,
> released_barno) invalid.
> 
> Why the quotes: pci_release_resource() doesn't turn off the BAR and
> doesn't write the registers -- this happens later.
> 
> I thought at first that pci_release_resource() is not safe in a
> rescan_prepare(), but then double-checked and found it's fine, just
> not needed, as the kernel will do it anyway. And
> pci_bus_check_bars_assigned(), which compares the bitmasks of
> successfully assigned BARs, is called *before* the hook.
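
The nesting in the /proc/iomem excerpt above (root bus resource, bridge
window, BAR, driver region) can be checked from a script. The sketch
below is only an illustration under stated assumptions: parse_iomem and
ancestry are hypothetical helper names, and the parser assumes the usual
layout of two spaces of indentation per nesting level and a " : "
separator between the address range and the name.

```python
def parse_iomem(text):
    """Parse /proc/iomem-style text into (depth, start, end, name) tuples.

    Assumption: each nesting level is indented by two spaces.
    """
    entries = []
    for line in text.splitlines():
        if " : " not in line:
            continue  # not a resource row
        span, name = line.split(" : ", 1)
        depth = (len(span) - len(span.lstrip(" "))) // 2
        start, end = (int(part, 16) for part in span.strip().split("-"))
        entries.append((depth, start, end, name.strip()))
    return entries


def ancestry(entries, name):
    """Return the chain of names from the top level down to the first
    entry called `name`, or None if there is no such entry."""
    stack = []
    for depth, _start, _end, entry_name in entries:
        del stack[depth:]          # drop siblings and deeper levels
        stack.append(entry_name)
        if entry_name == name:
            return list(stack)
    return None
```

For the excerpt above, ancestry(entries, "nvme") would list the root bus
resource, the bridge window, the BAR and the driver's region in that
order. Following the explanation above, releasing the region should
remove only the innermost "nvme" row, while the BAR row of 0000:01:00.0
stays in place.
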
> 
>>>> When testing with 2 graphics cards and triggering a rescan, a hard
>>>> hang of the system happens during rescan_prepare of the second card
>>>> when stopping the HW (see log2.log). I don't understand why this
>>>> would happen, as each of them passes fine when tested standalone,
>>>> and there should be no interdependence between them as far as I
>>>> know. Do you have any idea?
>>>
>>> What happens with two GPUs is unclear to me as well; nothing looks
>>> suspicious.
>>>
>>> Serge
>>>