regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
       [not found] ` <20230215052642.6016-3-vasant.hegde@amd.com>
@ 2023-02-16  9:41   ` Thorsten Leemhuis
  2023-02-17  5:56     ` Vasant Hegde
  0 siblings, 1 reply; 6+ messages in thread
From: Thorsten Leemhuis @ 2023-02-16  9:41 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

On 15.02.23 06:26, Vasant Hegde wrote:
> If IOMMU domain for device group is not setup properly then we may hit
> IOMMU page fault. Current page fault handler assumes that domain is
> always setup and it will hit NULL pointer derefence (see below sample log).
> 
> Lets check whether domain is setup or not and log appropriate message.
>
> [...]

Many thx for taking care of this. Is this something that is still
suitable for merging before 6.2 is released? If not it would be good to
directly add a "CC: <stable.." tag to ensure it's backported to 6.2
later, as maybe there are more people affected by this.

> Reported-by: Matt Fagnani <matt.fagnani@bell.net>

There is one small thing to improve, please add the following tags here
to make things easier for future code archaeologists:

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
Link:
https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/

To explain: Linus[1] and others considered proper link tags with the URL
to the report important, as they allow anyone to look into the backstory
of a fix weeks or years later. That's nothing new, the documentation[2]
for some time says to place such tags in cases like this. I care
personally (and made it a bit more explicit in the docs a while ago),
because these tags make my regression tracking efforts a whole lot
easier, as they allow my tracking bot 'regzbot' to automatically connect
reports with patches posted or committed to fix tracked regressions.

Apropos regzbot, let me tell regzbot to monitor this thread:

#regzbot ^backmonitor:
https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/

> Suggested-by: Joerg Roedel <joro@8bytes.org>
> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
> [...]

/me wonders if this also should have a Fixes: tag.


Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

[1] for details, see:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/

[2] see Documentation/process/submitting-patches.rst
(http://docs.kernel.org/process/submitting-patches.html) and
Documentation/process/5.Posting.rst
(https://docs.kernel.org/process/5.Posting.html)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-16  9:41   ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Thorsten Leemhuis
@ 2023-02-17  5:56     ` Vasant Hegde
  2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 6+ messages in thread
From: Vasant Hegde @ 2023-02-17  5:56 UTC (permalink / raw)
  To: Thorsten Leemhuis, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

Hi Thorsten,


On 2/16/2023 3:11 PM, Thorsten Leemhuis wrote:
> On 15.02.23 06:26, Vasant Hegde wrote:
>> If IOMMU domain for device group is not setup properly then we may hit
>> IOMMU page fault. Current page fault handler assumes that domain is
>> always setup and it will hit NULL pointer derefence (see below sample log).
>>
>> Lets check whether domain is setup or not and log appropriate message.
>>
>> [...]
> 
> Many thx for taking care of this. Is this something that is still
> suitable for merging before 6.2 is released? If not it would be good to
> directly add a "CC: <stable.." tag to ensure it's backported to 6.2
> later, as maybe there are more people affected by this.

I think patch 1/3 is candidate for 6.2. I have request Joerg to look into it.


> 
>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
> 
> There is one small thing to improve, please add the following tags here
> to make things easier for future code archaeologists:
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
> Link:
> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
> 

I am sorry. I should have added above two links. I will keep a note so that I
don't miss things next time.

Joerg did added these link before applying. Thanks Joerg.

-Vasant



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-17  5:56     ` Vasant Hegde
@ 2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-02-17  6:11 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

On 17.02.23 06:56, Vasant Hegde wrote:
> On 2/16/2023 3:11 PM, Thorsten Leemhuis wrote:
>> On 15.02.23 06:26, Vasant Hegde wrote:
>>> If IOMMU domain for device group is not setup properly then we may hit
>>> IOMMU page fault. Current page fault handler assumes that domain is
>>> always setup and it will hit NULL pointer derefence (see below sample log).
>>>
>>> Lets check whether domain is setup or not and log appropriate message.
>>>
>>> [...]
>>
>> Many thx for taking care of this. Is this something that is still
>> suitable for merging before 6.2 is released? If not it would be good to
>> directly add a "CC: <stable.." tag to ensure it's backported to 6.2
>> later, as maybe there are more people affected by this.
> 
> I think patch 1/3 is candidate for 6.2. I have request Joerg to look into it.

Great, many thx for clarifying. Sorry for creating confusion, I
sometimes get details wrong, as I juggle with so many different
regression at the same time...

>>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
>>
>> There is one small thing to improve, please add the following tags here
>> to make things easier for future code archaeologists:
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> Link:
>> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
>>
> 
> I am sorry. I should have added above two links. I will keep a note so that I
> don't miss things next time.
> 
> Joerg did added these link before applying. Thanks Joerg.

+1

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
       [not found] ` <0f6e2838-9d7f-dc2c-6e5b-0d02e239f032@amd.com>
@ 2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
  2023-03-13  4:29     ` Vasant Hegde
  0 siblings, 1 reply; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-12 14:42 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani, Linux kernel regressions list

On 17.02.23 06:53, Vasant Hegde wrote:
> 
> [+ CCed few folks from other thread (Bug 216865 - Black screen when amdgpu ...).]
> 
> This patch helps to solve the 'black screen' issue we hit in upstream [1].
> 
> With this patch we are handling error path gracefully. System is booting without
> enabling PASID for AMD gpu. Matt has tested this patch and confirmed its working [2]
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=216865
> [2]
> https://lore.kernel.org/linux-iommu/e17d1ee9-7383-6717-6f10-9465b273b1ed@amd.com/T/#m98390232691e3e55db7f1187073016436fd2176a
> 
> Can you consider this patch for 6.2?
> 
> I did miss Report-by, tested by tags. Can you append below tags or do you want
> me to respin patches?
> 
> --------
> 
> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
> Link:
> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
> Tested-by: Matt Fagnani <matt.fagnani@bell.net>

Thx again for handling this. Matt now seems to run into a different
problem reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=217170
https://gitlab.freedesktop.org/drm/amd/-/issues/2454

In short: amdgpu failed to resume when IOMMU is in use; when
"amd_iommu=off" is used, suspend works fine.

Is this something that might be related to these fixes?

Ciao, Thorsten

> On 2/15/2023 10:56 AM, Vasant Hegde wrote:
>> iommu_attach_group() attaches all devices in a group to domain and then
>> sets group domain (group->domain). Current code (__iommu_attach_group())
>> does not handle error path. This creates problem as devices to domain
>> attachment is in inconsistent state.
>>
>> Flow:
>>   - During boot iommu attach devices to default domain
>>   - Later some device driver (like amd/iommu_v2 or vfio) tries to attach
>>     device to new domain.
>>   - In iommu_attach_group() path we detach device from current domain.
>>     Then it tries to attach devices to new domain.
>>   - If it fails to attach device to new domain then device to domain link
>>     is broken.
>>   - iommu_attach_group() returns error.
>>   - At this stage iommu_attach_group() caller thinks, attaching device to
>>     new domain failed and devices are still attached to old domain.
>>   - But in reality device to old domain link is broken. It will result
>>     in all sort of failures (like IO page fault) later.
>>
>> To recover from this situation, we need to attach all devices back to the
>> old domain. Also log warning if it fails attach device back to old domain.
>>
>> Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
>> ---
>> Robin, Joerg,
>>   As explained in previous version [1], I still kept the changes in
>>   iommu core layer. I hope this is fine with you.
>>
>> [1] https://lore.kernel.org/linux-iommu/f2151603-de54-7717-60f9-748e9205139e@amd.com/
>>
>> -Vasant
>>
>>  drivers/iommu/iommu.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index de91dd88705b..e3336a11b6b9 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2124,8 +2124,22 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>>  
>>  	ret = __iommu_group_for_each_dev(group, domain,
>>  					 iommu_group_do_attach_device);
>> -	if (ret == 0)
>> +	if (ret == 0) {
>>  		group->domain = domain;
>> +	} else {
>> +		/*
>> +		 * To recover from the case when certain device within the
>> +		 * group fails to attach to the new domain, we need force
>> +		 * attaching all devices back to the old domain. The old
>> +		 * domain is compatible for all devices in the group,
>> +		 * hence the iommu driver should always return success.
>> +		 */
>> +		struct iommu_domain *old_domain = group->domain;
>> +
>> +		group->domain = NULL;
>> +		WARN(__iommu_group_set_domain(group, old_domain),
>> +		     "iommu driver failed to attach a compatible domain");
>> +	}
>>  
>>  	return ret;
>>  }
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-03-12 14:42   ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Linux regression tracking (Thorsten Leemhuis)
@ 2023-03-13  4:29     ` Vasant Hegde
  2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 6+ messages in thread
From: Vasant Hegde @ 2023-03-13  4:29 UTC (permalink / raw)
  To: Linux regressions mailing list, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani

Hi Thorsten,


On 3/12/2023 8:12 PM, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 17.02.23 06:53, Vasant Hegde wrote:
>>
>> [+ CCed few folks from other thread (Bug 216865 - Black screen when amdgpu ...).]
>>
>> This patch helps to solve the 'black screen' issue we hit in upstream [1].
>>
>> With this patch we are handling error path gracefully. System is booting without
>> enabling PASID for AMD gpu. Matt has tested this patch and confirmed its working [2]
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> [2]
>> https://lore.kernel.org/linux-iommu/e17d1ee9-7383-6717-6f10-9465b273b1ed@amd.com/T/#m98390232691e3e55db7f1187073016436fd2176a
>>
>> Can you consider this patch for 6.2?
>>
>> I did miss Report-by, tested by tags. Can you append below tags or do you want
>> me to respin patches?
>>
>> --------
>>
>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> Link:
>> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
>> Tested-by: Matt Fagnani <matt.fagnani@bell.net>
> 
> Thx again for handling this. Matt now seems to run into a different
> problem reported here:

As i described in other thread [1] it looks like GPU needs to handle failure in
resume path.


[1]
https://lore.kernel.org/linux-iommu/9b688cbe-ec48-17a7-0e40-5734d58e102d@amd.com/T/#t


-Vasant


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-03-13  4:29     ` Vasant Hegde
@ 2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-13  6:29 UTC (permalink / raw)
  To: Vasant Hegde, Linux regressions mailing list, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani

On 13.03.23 05:29, Vasant Hegde wrote:
> Hi Thorsten,
> 
> As i described in other thread [1] it looks like GPU needs to handle failure in
> resume path.
>
> [1]
> https://lore.kernel.org/linux-iommu/9b688cbe-ec48-17a7-0e40-5734d58e102d@amd.com/T/#t

Thx for that, sorry, I had missed that other thread for some reason (I
really wonder why, I know I looked if Matt reported this to the list,
but seems I missed it somehow; whatever).

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-03-13  6:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230215052642.6016-1-vasant.hegde@amd.com>
     [not found] ` <20230215052642.6016-3-vasant.hegde@amd.com>
2023-02-16  9:41   ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Thorsten Leemhuis
2023-02-17  5:56     ` Vasant Hegde
2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
     [not found] ` <0f6e2838-9d7f-dc2c-6e5b-0d02e239f032@amd.com>
2023-03-12 14:42   ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Linux regression tracking (Thorsten Leemhuis)
2023-03-13  4:29     ` Vasant Hegde
2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).