All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 1/3] iommu: Attach device group to old domain in error path
@ 2023-02-15  5:26 Vasant Hegde
  2023-02-15  5:26 ` [PATCH v3 2/3] iommu/amd: Skip attach device domain is same as new domain Vasant Hegde
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Vasant Hegde @ 2023-02-15  5:26 UTC (permalink / raw)
  To: iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Vasant Hegde

iommu_attach_group() attaches all devices in a group to domain and then
sets group domain (group->domain). Current code (__iommu_attach_group())
does not handle error path. This creates problem as devices to domain
attachment is in inconsistent state.

Flow:
  - During boot iommu attach devices to default domain
  - Later some device driver (like amd/iommu_v2 or vfio) tries to attach
    device to new domain.
  - In iommu_attach_group() path we detach device from current domain.
    Then it tries to attach devices to new domain.
  - If it fails to attach device to new domain then device to domain link
    is broken.
  - iommu_attach_group() returns error.
  - At this stage iommu_attach_group() caller thinks, attaching device to
    new domain failed and devices are still attached to old domain.
  - But in reality device to old domain link is broken. It will result
    in all sort of failures (like IO page fault) later.

To recover from this situation, we need to attach all devices back to the
old domain. Also log warning if it fails attach device back to old domain.

Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
---
Robin, Joerg,
  As explained in previous version [1], I still kept the changes in
  iommu core layer. I hope this is fine with you.

[1] https://lore.kernel.org/linux-iommu/f2151603-de54-7717-60f9-748e9205139e@amd.com/

-Vasant

 drivers/iommu/iommu.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index de91dd88705b..e3336a11b6b9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2124,8 +2124,22 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 
 	ret = __iommu_group_for_each_dev(group, domain,
 					 iommu_group_do_attach_device);
-	if (ret == 0)
+	if (ret == 0) {
 		group->domain = domain;
+	} else {
+		/*
+		 * To recover from the case when certain device within the
+		 * group fails to attach to the new domain, we need force
+		 * attaching all devices back to the old domain. The old
+		 * domain is compatible for all devices in the group,
+		 * hence the iommu driver should always return success.
+		 */
+		struct iommu_domain *old_domain = group->domain;
+
+		group->domain = NULL;
+		WARN(__iommu_group_set_domain(group, old_domain),
+		     "iommu driver failed to attach a compatible domain");
+	}
 
 	return ret;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 2/3] iommu/amd: Skip attach device domain is same as new domain
  2023-02-15  5:26 [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
@ 2023-02-15  5:26 ` Vasant Hegde
  2023-02-15  5:26 ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Vasant Hegde
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Vasant Hegde @ 2023-02-15  5:26 UTC (permalink / raw)
  To: iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Vasant Hegde

If device->domain is same as new domain then we can skip the
device attach process.

Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
---
 drivers/iommu/amd/iommu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 2f8c018e1ffd..26c96b821fa2 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2161,6 +2161,13 @@ static int amd_iommu_attach_device(struct iommu_domain *dom,
 	struct amd_iommu *iommu = rlookup_amd_iommu(dev);
 	int ret;
 
+	/*
+	 * Skip attach device to domain if new domain is same as
+	 * devices current domain
+	 */
+	if (dev_data->domain == domain)
+		return 0;
+
 	dev_data->defer_attach = false;
 
 	if (dev_data->domain)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-15  5:26 [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
  2023-02-15  5:26 ` [PATCH v3 2/3] iommu/amd: Skip attach device domain is same as new domain Vasant Hegde
@ 2023-02-15  5:26 ` Vasant Hegde
  2023-02-16  9:41   ` Thorsten Leemhuis
  2023-02-16 10:18   ` Joerg Roedel
  2023-02-17  5:53 ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
  2023-02-17 13:44 ` Jason Gunthorpe
  3 siblings, 2 replies; 14+ messages in thread
From: Vasant Hegde @ 2023-02-15  5:26 UTC (permalink / raw)
  To: iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Vasant Hegde, Matt Fagnani

If IOMMU domain for device group is not setup properly then we may hit
IOMMU page fault. Current page fault handler assumes that domain is
always setup and it will hit NULL pointer derefence (see below sample log).

Lets check whether domain is setup or not and log appropriate message.

Sample log:
----------
Jan 06 02:07:52 kernel: amdgpu 0000:00:01.0: amdgpu: SE 1, SH per SE 1, CU per SH 8, active_cu_number 6
Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000058
Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
Jan 06 02:07:53 kernel: PGD 0 P4D 0
Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted 6.2.0-rc2+ #89
Jan 06 02:07:53 kernel: Hardware name: xxx
Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90
Jan 06 02:07:53 kernel: Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 ea e8 4a 00 41
Jan 06 02:07:53 kernel: RSP: 0018:ffffb9b7803ebe08 EFLAGS: 00010246
Jan 06 02:07:53 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jan 06 02:07:53 kernel: RDX: 00000001522e2500 RSI: ffff940cc09e30d0 RDI: 0000000000000010
Jan 06 02:07:53 kernel: RBP: 00000001522e2500 R08: ffff940cc12c2b00 R09: 0000000000000050
Jan 06 02:07:53 kernel: R10: ffff940cc021e000 R11: 0000000000000000 R12: 0000000000000000
Jan 06 02:07:53 kernel: R13: ffff940cc005b000 R14: 0000000000000008 R15: 0000000000000000
Jan 06 02:07:53 kernel: FS:  0000000000000000(0000) GS:ffff940db7500000(0000) knlGS:0000000000000000
Jan 06 02:07:53 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 06 02:07:53 kernel: CR2: 0000000000000058 CR3: 000000010230a000 CR4: 00000000001506e0
Jan 06 02:07:53 kernel: Call Trace:
Jan 06 02:07:53 kernel:  <TASK>
Jan 06 02:07:53 kernel:  amd_iommu_int_thread+0x60c/0x760
Jan 06 02:07:53 kernel:  ? __pfx_irq_thread_fn+0x10/0x10
Jan 06 02:07:53 kernel:  irq_thread_fn+0x1f/0x60
Jan 06 02:07:53 kernel:  irq_thread+0xea/0x1a0
Jan 06 02:07:53 kernel:  ? preempt_count_add+0x6a/0xa0
Jan 06 02:07:53 kernel:  ? __pfx_irq_thread_dtor+0x10/0x10
Jan 06 02:07:53 kernel:  ? __pfx_irq_thread+0x10/0x10
Jan 06 02:07:53 kernel:  kthread+0xe9/0x110
Jan 06 02:07:53 kernel:  ? __pfx_kthread+0x10/0x10
Jan 06 02:07:53 kernel:  ret_from_fork+0x2c/0x50
Jan 06 02:07:53 kernel:  </TASK>

Reported-by: Matt Fagnani <matt.fagnani@bell.net>
Suggested-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
---
 drivers/iommu/amd/iommu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 26c96b821fa2..9dc41c48c29d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -558,6 +558,15 @@ static void amd_iommu_report_page_fault(struct amd_iommu *iommu,
 		 * prevent logging it.
 		 */
 		if (IS_IOMMU_MEM_TRANSACTION(flags)) {
+			/* Device not attached to domain properly */
+			if (dev_data->domain == NULL) {
+				pr_err_ratelimited("Event logged [Device not attached to domain properly]\n");
+				pr_err_ratelimited("  device=%04x:%02x:%02x.%x domain=0x%04x\n",
+						   iommu->pci_seg->id, PCI_BUS_NUM(devid), PCI_SLOT(devid),
+						   PCI_FUNC(devid), domain_id);
+				goto out;
+			}
+
 			if (!report_iommu_fault(&dev_data->domain->domain,
 						&pdev->dev, address,
 						IS_WRITE_REQUEST(flags) ?
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-15  5:26 ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Vasant Hegde
@ 2023-02-16  9:41   ` Thorsten Leemhuis
  2023-02-17  5:56     ` Vasant Hegde
  2023-02-16 10:18   ` Joerg Roedel
  1 sibling, 1 reply; 14+ messages in thread
From: Thorsten Leemhuis @ 2023-02-16  9:41 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

On 15.02.23 06:26, Vasant Hegde wrote:
> If IOMMU domain for device group is not setup properly then we may hit
> IOMMU page fault. Current page fault handler assumes that domain is
> always setup and it will hit NULL pointer derefence (see below sample log).
> 
> Lets check whether domain is setup or not and log appropriate message.
>
> [...]

Many thx for taking care of this. Is this something that is still
suitable for merging before 6.2 is released? If not it would be good to
directly add a "CC: <stable.." tag to ensure it's backported to 6.2
later, as maybe there are more people affected by this.

> Reported-by: Matt Fagnani <matt.fagnani@bell.net>

There is one small thing to improve, please add the following tags here
to make things easier for future code archaeologists:

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
Link:
https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/

To explain: Linus[1] and others considered proper link tags with the URL
to the report important, as they allow anyone to look into the backstory
of a fix weeks or years later. That's nothing new, the documentation[2]
for some time says to place such tags in cases like this. I care
personally (and made it a bit more explicit in the docs a while ago),
because these tags make my regression tracking efforts a whole lot
easier, as they allow my tracking bot 'regzbot' to automatically connect
reports with patches posted or committed to fix tracked regressions.

Apropos regzbot, let me tell regzbot to monitor this thread:

#regzbot ^backmonitor:
https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/

> Suggested-by: Joerg Roedel <joro@8bytes.org>
> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
> [...]

/me wonders if this also should have a Fixes: tag.


Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

[1] for details, see:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/

[2] see Documentation/process/submitting-patches.rst
(http://docs.kernel.org/process/submitting-patches.html) and
Documentation/process/5.Posting.rst
(https://docs.kernel.org/process/5.Posting.html)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-15  5:26 ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Vasant Hegde
  2023-02-16  9:41   ` Thorsten Leemhuis
@ 2023-02-16 10:18   ` Joerg Roedel
  1 sibling, 0 replies; 14+ messages in thread
From: Joerg Roedel @ 2023-02-16 10:18 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, baolu.lu, suravee.suthikulpanit, robin.murphy, will, Matt Fagnani

On Wed, Feb 15, 2023 at 05:26:42AM +0000, Vasant Hegde wrote:
> If IOMMU domain for device group is not setup properly then we may hit
> IOMMU page fault. Current page fault handler assumes that domain is
> always setup and it will hit NULL pointer derefence (see below sample log).

Edited the commit message a bit and added Thorstens Link tags. This patch
is now queued for v6.3.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-02-15  5:26 [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
  2023-02-15  5:26 ` [PATCH v3 2/3] iommu/amd: Skip attach device domain is same as new domain Vasant Hegde
  2023-02-15  5:26 ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Vasant Hegde
@ 2023-02-17  5:53 ` Vasant Hegde
  2023-02-18 14:42   ` Joerg Roedel
  2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
  2023-02-17 13:44 ` Jason Gunthorpe
  3 siblings, 2 replies; 14+ messages in thread
From: Vasant Hegde @ 2023-02-17  5:53 UTC (permalink / raw)
  To: iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Thorsten Leemhuis, Christian König,
	Jason Gunthorpe, Bjorn Helgaas, Matt Fagnani

Hi Joerg,

[+ CCed few folks from other thread (Bug 216865 - Black screen when amdgpu ...).]

This patch helps to solve the 'black screen' issue we hit in upstream [1].

With this patch we are handling error path gracefully. System is booting without
enabling PASID for AMD gpu. Matt has tested this patch and confirmed its working [2]

[1] https://bugzilla.kernel.org/show_bug.cgi?id=216865
[2]
https://lore.kernel.org/linux-iommu/e17d1ee9-7383-6717-6f10-9465b273b1ed@amd.com/T/#m98390232691e3e55db7f1187073016436fd2176a


Can you consider this patch for 6.2?

I did miss Report-by, tested by tags. Can you append below tags or do you want
me to respin patches?

--------

Reported-by: Matt Fagnani <matt.fagnani@bell.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
Link:
https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
Tested-by: Matt Fagnani <matt.fagnani@bell.net>


-Vasant


On 2/15/2023 10:56 AM, Vasant Hegde wrote:
> iommu_attach_group() attaches all devices in a group to domain and then
> sets group domain (group->domain). Current code (__iommu_attach_group())
> does not handle error path. This creates problem as devices to domain
> attachment is in inconsistent state.
> 
> Flow:
>   - During boot iommu attach devices to default domain
>   - Later some device driver (like amd/iommu_v2 or vfio) tries to attach
>     device to new domain.
>   - In iommu_attach_group() path we detach device from current domain.
>     Then it tries to attach devices to new domain.
>   - If it fails to attach device to new domain then device to domain link
>     is broken.
>   - iommu_attach_group() returns error.
>   - At this stage iommu_attach_group() caller thinks, attaching device to
>     new domain failed and devices are still attached to old domain.
>   - But in reality device to old domain link is broken. It will result
>     in all sort of failures (like IO page fault) later.
> 
> To recover from this situation, we need to attach all devices back to the
> old domain. Also log warning if it fails attach device back to old domain.
> 
> Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
> ---
> Robin, Joerg,
>   As explained in previous version [1], I still kept the changes in
>   iommu core layer. I hope this is fine with you.
> 
> [1] https://lore.kernel.org/linux-iommu/f2151603-de54-7717-60f9-748e9205139e@amd.com/
> 
> -Vasant
> 
>  drivers/iommu/iommu.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index de91dd88705b..e3336a11b6b9 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2124,8 +2124,22 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>  
>  	ret = __iommu_group_for_each_dev(group, domain,
>  					 iommu_group_do_attach_device);
> -	if (ret == 0)
> +	if (ret == 0) {
>  		group->domain = domain;
> +	} else {
> +		/*
> +		 * To recover from the case when certain device within the
> +		 * group fails to attach to the new domain, we need force
> +		 * attaching all devices back to the old domain. The old
> +		 * domain is compatible for all devices in the group,
> +		 * hence the iommu driver should always return success.
> +		 */
> +		struct iommu_domain *old_domain = group->domain;
> +
> +		group->domain = NULL;
> +		WARN(__iommu_group_set_domain(group, old_domain),
> +		     "iommu driver failed to attach a compatible domain");
> +	}
>  
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-16  9:41   ` Thorsten Leemhuis
@ 2023-02-17  5:56     ` Vasant Hegde
  2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 14+ messages in thread
From: Vasant Hegde @ 2023-02-17  5:56 UTC (permalink / raw)
  To: Thorsten Leemhuis, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

Hi Thorsten,


On 2/16/2023 3:11 PM, Thorsten Leemhuis wrote:
> On 15.02.23 06:26, Vasant Hegde wrote:
>> If IOMMU domain for device group is not setup properly then we may hit
>> IOMMU page fault. Current page fault handler assumes that domain is
>> always setup and it will hit NULL pointer derefence (see below sample log).
>>
>> Lets check whether domain is setup or not and log appropriate message.
>>
>> [...]
> 
> Many thx for taking care of this. Is this something that is still
> suitable for merging before 6.2 is released? If not it would be good to
> directly add a "CC: <stable.." tag to ensure it's backported to 6.2
> later, as maybe there are more people affected by this.

I think patch 1/3 is candidate for 6.2. I have request Joerg to look into it.


> 
>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
> 
> There is one small thing to improve, please add the following tags here
> to make things easier for future code archaeologists:
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
> Link:
> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
> 

I am sorry. I should have added above two links. I will keep a note so that I
don't miss things next time.

Joerg did added these link before applying. Thanks Joerg.

-Vasant



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] iommu/amd: Improve page fault error reporting
  2023-02-17  5:56     ` Vasant Hegde
@ 2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-02-17  6:11 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Matt Fagnani,
	Linux kernel regressions list

On 17.02.23 06:56, Vasant Hegde wrote:
> On 2/16/2023 3:11 PM, Thorsten Leemhuis wrote:
>> On 15.02.23 06:26, Vasant Hegde wrote:
>>> If IOMMU domain for device group is not setup properly then we may hit
>>> IOMMU page fault. Current page fault handler assumes that domain is
>>> always setup and it will hit NULL pointer derefence (see below sample log).
>>>
>>> Lets check whether domain is setup or not and log appropriate message.
>>>
>>> [...]
>>
>> Many thx for taking care of this. Is this something that is still
>> suitable for merging before 6.2 is released? If not it would be good to
>> directly add a "CC: <stable.." tag to ensure it's backported to 6.2
>> later, as maybe there are more people affected by this.
> 
> I think patch 1/3 is candidate for 6.2. I have request Joerg to look into it.

Great, many thx for clarifying. Sorry for creating confusion, I
sometimes get details wrong, as I juggle with so many different
regression at the same time...

>>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
>>
>> There is one small thing to improve, please add the following tags here
>> to make things easier for future code archaeologists:
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> Link:
>> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
>>
> 
> I am sorry. I should have added above two links. I will keep a note so that I
> don't miss things next time.
> 
> Joerg did added these link before applying. Thanks Joerg.

+1

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-02-15  5:26 [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
                   ` (2 preceding siblings ...)
  2023-02-17  5:53 ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
@ 2023-02-17 13:44 ` Jason Gunthorpe
  3 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2023-02-17 13:44 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, joro, baolu.lu, suravee.suthikulpanit, robin.murphy, will

On Wed, Feb 15, 2023 at 05:26:40AM +0000, Vasant Hegde wrote:

> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index de91dd88705b..e3336a11b6b9 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2124,8 +2124,22 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>  
>  	ret = __iommu_group_for_each_dev(group, domain,
>  					 iommu_group_do_attach_device);
> -	if (ret == 0)
> +	if (ret == 0) {
>  		group->domain = domain;
> +	} else {
> +		/*
> +		 * To recover from the case when certain device within the
> +		 * group fails to attach to the new domain, we need force
> +		 * attaching all devices back to the old domain. The old
> +		 * domain is compatible for all devices in the group,
> +		 * hence the iommu driver should always return success.
> +		 */
> +		struct iommu_domain *old_domain = group->domain;
> +
> +		group->domain = NULL;
> +		WARN(__iommu_group_set_domain(group, old_domain),
> +		     "iommu driver failed to attach a compatible domain");
> +	}

OK

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-02-17  5:53 ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
@ 2023-02-18 14:42   ` Joerg Roedel
  2023-02-19 10:17     ` Vasant Hegde
  2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 14+ messages in thread
From: Joerg Roedel @ 2023-02-18 14:42 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, baolu.lu, suravee.suthikulpanit, robin.murphy, will,
	Felix Kuehling, Alex Deucher, Thorsten Leemhuis,
	Christian König, Jason Gunthorpe, Bjorn Helgaas,
	Matt Fagnani

On Fri, Feb 17, 2023 at 11:23:36AM +0530, Vasant Hegde wrote:
> Can you consider this patch for 6.2?

Queued to core branch for 6.2.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-02-18 14:42   ` Joerg Roedel
@ 2023-02-19 10:17     ` Vasant Hegde
  0 siblings, 0 replies; 14+ messages in thread
From: Vasant Hegde @ 2023-02-19 10:17 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: iommu, baolu.lu, suravee.suthikulpanit, robin.murphy, will,
	Felix Kuehling, Alex Deucher, Thorsten Leemhuis,
	Christian König, Jason Gunthorpe, Bjorn Helgaas,
	Matt Fagnani



On 2/18/2023 8:12 PM, Joerg Roedel wrote:
> On Fri, Feb 17, 2023 at 11:23:36AM +0530, Vasant Hegde wrote:
>> Can you consider this patch for 6.2?
> 
> Queued to core branch for 6.2.

Thanks Joerg.

-Vasant

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-02-17  5:53 ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
  2023-02-18 14:42   ` Joerg Roedel
@ 2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
  2023-03-13  4:29     ` Vasant Hegde
  1 sibling, 1 reply; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-12 14:42 UTC (permalink / raw)
  To: Vasant Hegde, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani, Linux kernel regressions list

On 17.02.23 06:53, Vasant Hegde wrote:
> 
> [+ CCed few folks from other thread (Bug 216865 - Black screen when amdgpu ...).]
> 
> This patch helps to solve the 'black screen' issue we hit in upstream [1].
> 
> With this patch we are handling error path gracefully. System is booting without
> enabling PASID for AMD gpu. Matt has tested this patch and confirmed its working [2]
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=216865
> [2]
> https://lore.kernel.org/linux-iommu/e17d1ee9-7383-6717-6f10-9465b273b1ed@amd.com/T/#m98390232691e3e55db7f1187073016436fd2176a
> 
> Can you consider this patch for 6.2?
> 
> I did miss Report-by, tested by tags. Can you append below tags or do you want
> me to respin patches?
> 
> --------
> 
> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
> Link:
> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
> Tested-by: Matt Fagnani <matt.fagnani@bell.net>

Thx again for handling this. Matt now seems to run into a different
problem reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=217170
https://gitlab.freedesktop.org/drm/amd/-/issues/2454

In short: amdgpu failed to resume when IOMMU is in use; when
"amd_iommu=off" is used, suspend works fine.

Is this something that might be related to these fixes?

Ciao, Thorsten

> On 2/15/2023 10:56 AM, Vasant Hegde wrote:
>> iommu_attach_group() attaches all devices in a group to domain and then
>> sets group domain (group->domain). Current code (__iommu_attach_group())
>> does not handle error path. This creates problem as devices to domain
>> attachment is in inconsistent state.
>>
>> Flow:
>>   - During boot iommu attach devices to default domain
>>   - Later some device driver (like amd/iommu_v2 or vfio) tries to attach
>>     device to new domain.
>>   - In iommu_attach_group() path we detach device from current domain.
>>     Then it tries to attach devices to new domain.
>>   - If it fails to attach device to new domain then device to domain link
>>     is broken.
>>   - iommu_attach_group() returns error.
>>   - At this stage iommu_attach_group() caller thinks, attaching device to
>>     new domain failed and devices are still attached to old domain.
>>   - But in reality device to old domain link is broken. It will result
>>     in all sort of failures (like IO page fault) later.
>>
>> To recover from this situation, we need to attach all devices back to the
>> old domain. Also log warning if it fails attach device back to old domain.
>>
>> Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
>> ---
>> Robin, Joerg,
>>   As explained in previous version [1], I still kept the changes in
>>   iommu core layer. I hope this is fine with you.
>>
>> [1] https://lore.kernel.org/linux-iommu/f2151603-de54-7717-60f9-748e9205139e@amd.com/
>>
>> -Vasant
>>
>>  drivers/iommu/iommu.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index de91dd88705b..e3336a11b6b9 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2124,8 +2124,22 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>>  
>>  	ret = __iommu_group_for_each_dev(group, domain,
>>  					 iommu_group_do_attach_device);
>> -	if (ret == 0)
>> +	if (ret == 0) {
>>  		group->domain = domain;
>> +	} else {
>> +		/*
>> +		 * To recover from the case when certain device within the
>> +		 * group fails to attach to the new domain, we need force
>> +		 * attaching all devices back to the old domain. The old
>> +		 * domain is compatible for all devices in the group,
>> +		 * hence the iommu driver should always return success.
>> +		 */
>> +		struct iommu_domain *old_domain = group->domain;
>> +
>> +		group->domain = NULL;
>> +		WARN(__iommu_group_set_domain(group, old_domain),
>> +		     "iommu driver failed to attach a compatible domain");
>> +	}
>>  
>>  	return ret;
>>  }
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-03-13  4:29     ` Vasant Hegde
  2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 14+ messages in thread
From: Vasant Hegde @ 2023-03-13  4:29 UTC (permalink / raw)
  To: Linux regressions mailing list, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani

Hi Thorsten,


On 3/12/2023 8:12 PM, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 17.02.23 06:53, Vasant Hegde wrote:
>>
>> [+ CCed few folks from other thread (Bug 216865 - Black screen when amdgpu ...).]
>>
>> This patch helps to solve the 'black screen' issue we hit in upstream [1].
>>
>> With this patch we are handling error path gracefully. System is booting without
>> enabling PASID for AMD gpu. Matt has tested this patch and confirmed its working [2]
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> [2]
>> https://lore.kernel.org/linux-iommu/e17d1ee9-7383-6717-6f10-9465b273b1ed@amd.com/T/#m98390232691e3e55db7f1187073016436fd2176a
>>
>> Can you consider this patch for 6.2?
>>
>> I did miss Report-by, tested by tags. Can you append below tags or do you want
>> me to respin patches?
>>
>> --------
>>
>> Reported-by: Matt Fagnani <matt.fagnani@bell.net>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> Link:
>> https://lore.kernel.org/lkml/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
>> Tested-by: Matt Fagnani <matt.fagnani@bell.net>
> 
> Thx again for handling this. Matt now seems to run into a different
> problem reported here:

As i described in other thread [1] it looks like GPU needs to handle failure in
resume path.


[1]
https://lore.kernel.org/linux-iommu/9b688cbe-ec48-17a7-0e40-5734d58e102d@amd.com/T/#t


-Vasant


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] iommu: Attach device group to old domain in error path
  2023-03-13  4:29     ` Vasant Hegde
@ 2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-13  6:29 UTC (permalink / raw)
  To: Vasant Hegde, Linux regressions mailing list, iommu, joro, baolu.lu
  Cc: suravee.suthikulpanit, robin.murphy, will, Felix Kuehling,
	Alex Deucher, Christian König, Jason Gunthorpe,
	Bjorn Helgaas, Matt Fagnani

On 13.03.23 05:29, Vasant Hegde wrote:
> Hi Thorsten,
> 
> As i described in other thread [1] it looks like GPU needs to handle failure in
> resume path.
>
> [1]
> https://lore.kernel.org/linux-iommu/9b688cbe-ec48-17a7-0e40-5734d58e102d@amd.com/T/#t

Thx for that, sorry, I had missed that other thread for some reason (I
really wonder why, I know I looked if Matt reported this to the list,
but seems I missed it somehow; whatever).

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-03-13  6:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-15  5:26 [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
2023-02-15  5:26 ` [PATCH v3 2/3] iommu/amd: Skip attach device domain is same as new domain Vasant Hegde
2023-02-15  5:26 ` [PATCH v3 3/3] iommu/amd: Improve page fault error reporting Vasant Hegde
2023-02-16  9:41   ` Thorsten Leemhuis
2023-02-17  5:56     ` Vasant Hegde
2023-02-17  6:11       ` Linux regression tracking (Thorsten Leemhuis)
2023-02-16 10:18   ` Joerg Roedel
2023-02-17  5:53 ` [PATCH v3 1/3] iommu: Attach device group to old domain in error path Vasant Hegde
2023-02-18 14:42   ` Joerg Roedel
2023-02-19 10:17     ` Vasant Hegde
2023-03-12 14:42   ` Linux regression tracking (Thorsten Leemhuis)
2023-03-13  4:29     ` Vasant Hegde
2023-03-13  6:29       ` Linux regression tracking (Thorsten Leemhuis)
2023-02-17 13:44 ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.