All of lore.kernel.org
 help / color / mirror / Atom feed
* [REGRESSION] Failed buffer allocation in Tegra fbdev
@ 2024-01-23 14:33 diogo.ivo
  2024-01-23 15:15 ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: diogo.ivo @ 2024-01-23 14:33 UTC (permalink / raw)
  To: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh, jgg,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions
  Cc: diogo.ivo

Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
at least on the Pixel C with the following error message reporting
a failed buffer allocation:

[    1.857660] drm drm: failed to allocate buffer of size 18432000

This error message is printed from tegra_bo_alloc() which is called by
tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
allocations would fail as well, not just the framebuffer.

This may be connected with an error in of_iommu_configure() that
became visible after commit 6ff6e184f1f4d:

[    1.200004] host1x drm: iommu configuration for device failed with -ENOENT

Best regards,

Diogo

#regzbot introduced: c8cc2655cc6c 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-23 14:33 [REGRESSION] Failed buffer allocation in Tegra fbdev diogo.ivo
@ 2024-01-23 15:15 ` Jason Gunthorpe
  2024-01-23 18:44   ` Diogo Ivo
  2024-01-24  9:13   ` Diogo Ivo
  0 siblings, 2 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2024-01-23 15:15 UTC (permalink / raw)
  To: diogo.ivo
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions

On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt wrote:
> Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
> at least on the Pixel C with the following error message reporting
> a failed buffer allocation:
> 
> [    1.857660] drm drm: failed to allocate buffer of size 18432000
> 
> This error message is printed from tegra_bo_alloc() which is called by
> tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
> allocations would fail as well, not just the framebuffer.

Presumably this is because iommu_map_sgtable() under
tegra_bo_iommu_map() fails?

Which I suspect is because of the logic in
host1x_client_iommu_attach().

After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.

So this:

	if (domain && domain != tegra->domain)
		return 0;

Will happen and the domain will be left at IDENTITY, while I suppose
the tegra_bo_iommu_map() assumes the domain was switched to tegra->domain.

Does this solve your issue?

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index ff36171c8fb700..15c7910b2e1c76 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct host1x_client *client)
         * not the shared IOMMU domain, don't try to attach it to a different
         * domain. This allows using the IOMMU-backed DMA API.
         */
-       if (domain && domain != tegra->domain)
+       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain != tegra->domain)
                return 0;
 
        if (tegra->domain) {

> This may be connected with an error in of_iommu_configure() that
> became visible after commit 6ff6e184f1f4d:
> 
> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT

Hmmm

This is a new logging, so it doesn't necessarily mean something has
changed in the code flow.

It seems the issue is something in there is returning ENOENT when it
probably should be ENODEV, but I haven't been able to guess where it
comes from.

Can you do some tracing and figure out where under
of_iommu_configure() this ENOENT return code is from?

Jason

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-23 15:15 ` Jason Gunthorpe
@ 2024-01-23 18:44   ` Diogo Ivo
  2024-01-24 10:15     ` Jon Hunter
  2024-01-26 19:51     ` Jason Gunthorpe
  2024-01-24  9:13   ` Diogo Ivo
  1 sibling, 2 replies; 17+ messages in thread
From: Diogo Ivo @ 2024-01-23 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions


On 1/23/24 15:15, Jason Gunthorpe wrote:
> On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt wrote:
>> Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
>> at least on the Pixel C with the following error message reporting
>> a failed buffer allocation:
>>
>> [    1.857660] drm drm: failed to allocate buffer of size 18432000
>>
>> This error message is printed from tegra_bo_alloc() which is called by
>> tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
>> allocations would fail as well, not just the framebuffer.
> Presumably this is because iommu_map_sgtable() under
> tegra_bo_iommu_map() fails?
>
> Which I suspect is because of the logic in
> host1x_client_iommu_attach().
>
> After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
>
> So this:
>
> 	if (domain && domain != tegra->domain)
> 		return 0;
>
> Will happen and the domain will be left at IDENTITY, while I suppose
> the tegra_bo_iommu_map() assumes the domain was switched to tegra->domain.
>
> Does this solve your issue?
>
> diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
> index ff36171c8fb700..15c7910b2e1c76 100644
> --- a/drivers/gpu/drm/tegra/drm.c
> +++ b/drivers/gpu/drm/tegra/drm.c
> @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct host1x_client *client)
>           * not the shared IOMMU domain, don't try to attach it to a different
>           * domain. This allows using the IOMMU-backed DMA API.
>           */
> -       if (domain && domain != tegra->domain)
> +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain != tegra->domain)
>                  return 0;
>   
>          if (tegra->domain) {
>
>> This may be connected with an error in of_iommu_configure() that
>> became visible after commit 6ff6e184f1f4d:
>>
>> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> Hmmm
>
> This is a new logging, so it doesn't necessarily mean something has
> changed in the code flow.
>
> It seems the issue is something in there is returning ENOENT when it
> probably should be ENODEV, but I haven't been able to guess where it
> comes from.
>
> Can you do some tracing and figure out where under
> of_iommu_configure() this ENOENT return code is from?
>
> Jason

Yes, this does fix the issue!

As for the -ENOENT I will do the tracing and get back with the results.


Thank you for your time,

Diogo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-23 15:15 ` Jason Gunthorpe
  2024-01-23 18:44   ` Diogo Ivo
@ 2024-01-24  9:13   ` Diogo Ivo
  2024-01-24 10:17     ` Jon Hunter
  2024-01-24 11:46     ` Robin Murphy
  1 sibling, 2 replies; 17+ messages in thread
From: Diogo Ivo @ 2024-01-24  9:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions,
	diogo.ivo

On Tue, Jan 23, 2024 at 11:15:08AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt wrote:
> > Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
> > at least on the Pixel C with the following error message reporting
> > a failed buffer allocation:
> > 
> > [    1.857660] drm drm: failed to allocate buffer of size 18432000
> > 
> > This error message is printed from tegra_bo_alloc() which is called by
> > tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
> > allocations would fail as well, not just the framebuffer.
> 
> Presumably this is because iommu_map_sgtable() under
> tegra_bo_iommu_map() fails?
> 
> Which I suspect is because of the logic in
> host1x_client_iommu_attach().
> 
> After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
> 
> So this:
> 
> 	if (domain && domain != tegra->domain)
> 		return 0;
> 
> Will happen and the domain will be left at IDENTITY, while I suppose
> the tegra_bo_iommu_map() assumes the domain was switched to tegra->domain.
> 
> Does this solve your issue?
> 
> diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
> index ff36171c8fb700..15c7910b2e1c76 100644
> --- a/drivers/gpu/drm/tegra/drm.c
> +++ b/drivers/gpu/drm/tegra/drm.c
> @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct host1x_client *client)
>          * not the shared IOMMU domain, don't try to attach it to a different
>          * domain. This allows using the IOMMU-backed DMA API.
>          */
> -       if (domain && domain != tegra->domain)
> +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain != tegra->domain)
>                 return 0;
>  
>         if (tegra->domain) {
> 
> > This may be connected with an error in of_iommu_configure() that
> > became visible after commit 6ff6e184f1f4d:
> > 
> > [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> 
> Hmmm
> 
> This is a new logging, so it doesn't necessarily mean something has
> changed in the code flow.
> 
> It seems the issue is something in there is returning ENOENT when it
> probably should be ENODEV, but I haven't been able to guess where it
> comes from.
> 
> Can you do some tracing and figure out where under
> of_iommu_configure() this ENOENT return code is from?
>
> Jason

I did the tracing and found that the ENOENT is coming from
sysfs_do_create_link_sd() in the following function call chain:

of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
iommu_init_device() -> iommu_device_link() -> sysfs_add_link_to_group() ->
sysfs_create_link_sd() -> sysfs_do_create_link_sd()

The particular block that emits the error is

static int sysfs_do_create_link_sd()
{
	...

	if (target_kobj->sd) {
		target = target_kobj->sd;
		kernfs_get(target);
	}

	if (!target)
		return -ENOENT;

	...
}

Best regargs,
Diogo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-23 18:44   ` Diogo Ivo
@ 2024-01-24 10:15     ` Jon Hunter
  2024-01-24 10:30       ` Diogo Ivo
  2024-01-26 19:51     ` Jason Gunthorpe
  1 sibling, 1 reply; 17+ messages in thread
From: Jon Hunter @ 2024-01-24 10:15 UTC (permalink / raw)
  To: Diogo Ivo, Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, baolu.lu,
	jsnitsel, jroedel, linux-tegra, iommu, regressions


On 23/01/2024 18:44, Diogo Ivo wrote:
> 
> On 1/23/24 15:15, Jason Gunthorpe wrote:
>> On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt 
>> wrote:
>>> Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
>>> at least on the Pixel C with the following error message reporting
>>> a failed buffer allocation:
>>>
>>> [    1.857660] drm drm: failed to allocate buffer of size 18432000
>>>
>>> This error message is printed from tegra_bo_alloc() which is called by
>>> tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
>>> allocations would fail as well, not just the framebuffer.
>> Presumably this is because iommu_map_sgtable() under
>> tegra_bo_iommu_map() fails?
>>
>> Which I suspect is because of the logic in
>> host1x_client_iommu_attach().
>>
>> After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
>>
>> So this:
>>
>>     if (domain && domain != tegra->domain)
>>         return 0;
>>
>> Will happen and the domain will be left at IDENTITY, while I suppose
>> the tegra_bo_iommu_map() assumes the domain was switched to 
>> tegra->domain.
>>
>> Does this solve your issue?
>>
>> diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
>> index ff36171c8fb700..15c7910b2e1c76 100644
>> --- a/drivers/gpu/drm/tegra/drm.c
>> +++ b/drivers/gpu/drm/tegra/drm.c
>> @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct 
>> host1x_client *client)
>>           * not the shared IOMMU domain, don't try to attach it to a 
>> different
>>           * domain. This allows using the IOMMU-backed DMA API.
>>           */
>> -       if (domain && domain != tegra->domain)
>> +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain 
>> != tegra->domain)
>>                  return 0;
>>          if (tegra->domain) {
>>
>>> This may be connected with an error in of_iommu_configure() that
>>> became visible after commit 6ff6e184f1f4d:
>>>
>>> [    1.200004] host1x drm: iommu configuration for device failed with 
>>> -ENOENT
>> Hmmm
>>
>> This is a new logging, so it doesn't necessarily mean something has
>> changed in the code flow.
>>
>> It seems the issue is something in there is returning ENOENT when it
>> probably should be ENODEV, but I haven't been able to guess where it
>> comes from.
>>
>> Can you do some tracing and figure out where under
>> of_iommu_configure() this ENOENT return code is from?
>>
>> Jason
> 
> Yes, this does fix the issue!


I am seeing the same error message on Tegra194 AGX Jetson. However, the 
above does not seem to fix it. In this case it is slightly different as 
I am not seeing the DRM allocation failure. In this case, it appears to 
be the of_dma_configure_id() call from the 
host1x_memory_context_list_init() function that is failing.

Jon

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24  9:13   ` Diogo Ivo
@ 2024-01-24 10:17     ` Jon Hunter
  2024-01-24 11:46     ` Robin Murphy
  1 sibling, 0 replies; 17+ messages in thread
From: Jon Hunter @ 2024-01-24 10:17 UTC (permalink / raw)
  To: Diogo Ivo, Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, baolu.lu,
	jsnitsel, jroedel, linux-tegra, iommu, regressions


On 24/01/2024 09:13, Diogo Ivo wrote:

...

> I did the tracing and found that the ENOENT is coming from
> sysfs_do_create_link_sd() in the following function call chain:
> 
> of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
> iommu_init_device() -> iommu_device_link() -> sysfs_add_link_to_group() ->
> sysfs_create_link_sd() -> sysfs_do_create_link_sd()
> 
> The particular block that emits the error is
> 
> static int sysfs_do_create_link_sd()
> {
> 	...
> 
> 	if (target_kobj->sd) {
> 		target = target_kobj->sd;
> 		kernfs_get(target);
> 	}
> 
> 	if (!target)
> 		return -ENOENT;
> 
> 	...
> }


I believe that this is the same in the case that I am seeing as well.

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 10:15     ` Jon Hunter
@ 2024-01-24 10:30       ` Diogo Ivo
  2024-01-24 10:36         ` Jon Hunter
  0 siblings, 1 reply; 17+ messages in thread
From: Diogo Ivo @ 2024-01-24 10:30 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Jason Gunthorpe, thierry.reding, vdumpa, joro, will,
	robin.murphy, baolu.lu, jsnitsel, jroedel, linux-tegra, iommu,
	regressions, diogo.ivo

On Wed, Jan 24, 2024 at 10:15:34AM +0000, Jon Hunter wrote:
> 
> On 23/01/2024 18:44, Diogo Ivo wrote:
> > 
> > On 1/23/24 15:15, Jason Gunthorpe wrote:
> > > On Tue, Jan 23, 2024 at 02:33:15PM +0000,
> > > diogo.ivo@tecnico.ulisboa.pt wrote:
> > > > Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
> > > > at least on the Pixel C with the following error message reporting
> > > > a failed buffer allocation:
> > > > 
> > > > [    1.857660] drm drm: failed to allocate buffer of size 18432000
> > > > 
> > > > This error message is printed from tegra_bo_alloc() which is called by
> > > > tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
> > > > allocations would fail as well, not just the framebuffer.
> > > Presumably this is because iommu_map_sgtable() under
> > > tegra_bo_iommu_map() fails?
> > > 
> > > Which I suspect is because of the logic in
> > > host1x_client_iommu_attach().
> > > 
> > > After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
> > > 
> > > So this:
> > > 
> > >     if (domain && domain != tegra->domain)
> > >         return 0;
> > > 
> > > Will happen and the domain will be left at IDENTITY, while I suppose
> > > the tegra_bo_iommu_map() assumes the domain was switched to
> > > tegra->domain.
> > > 
> > > Does this solve your issue?
> > > 
> > > diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
> > > index ff36171c8fb700..15c7910b2e1c76 100644
> > > --- a/drivers/gpu/drm/tegra/drm.c
> > > +++ b/drivers/gpu/drm/tegra/drm.c
> > > @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct
> > > host1x_client *client)
> > >           * not the shared IOMMU domain, don't try to attach it to a
> > > different
> > >           * domain. This allows using the IOMMU-backed DMA API.
> > >           */
> > > -       if (domain && domain != tegra->domain)
> > > +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
> > > domain != tegra->domain)
> > >                  return 0;
> > >          if (tegra->domain) {
> > > 
> > > > This may be connected with an error in of_iommu_configure() that
> > > > became visible after commit 6ff6e184f1f4d:
> > > > 
> > > > [    1.200004] host1x drm: iommu configuration for device failed
> > > > with -ENOENT
> > > Hmmm
> > > 
> > > This is a new logging, so it doesn't necessarily mean something has
> > > changed in the code flow.
> > > 
> > > It seems the issue is something in there is returning ENOENT when it
> > > probably should be ENODEV, but I haven't been able to guess where it
> > > comes from.
> > > 
> > > Can you do some tracing and figure out where under
> > > of_iommu_configure() this ENOENT return code is from?
> > > 
> > > Jason
> > 
> > Yes, this does fix the issue!
> 
> 
> I am seeing the same error message on Tegra194 AGX Jetson. However, the
> above does not seem to fix it. In this case it is slightly different as I am
> not seeing the DRM allocation failure. In this case, it appears to be the
> of_dma_configure_id() call from the host1x_memory_context_list_init()
> function that is failing.
> 
> Jon

Just to clarify, when you mention the same error message which one are
you referring to? In my case only the drm allocation is fixed with the
above patch, and I am still getting the ENOENT.

Diogo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 10:30       ` Diogo Ivo
@ 2024-01-24 10:36         ` Jon Hunter
  0 siblings, 0 replies; 17+ messages in thread
From: Jon Hunter @ 2024-01-24 10:36 UTC (permalink / raw)
  To: Diogo Ivo
  Cc: Jason Gunthorpe, thierry.reding, vdumpa, joro, will,
	robin.murphy, baolu.lu, jsnitsel, jroedel, linux-tegra, iommu,
	regressions


On 24/01/2024 10:30, Diogo Ivo wrote:

...

>>> Yes, this does fix the issue!
>>
>>
>> I am seeing the same error message on Tegra194 AGX Jetson. However, the
>> above does not seem to fix it. In this case it is slightly different as I am
>> not seeing the DRM allocation failure. In this case, it appears to be the
>> of_dma_configure_id() call from the host1x_memory_context_list_init()
>> function that is failing.
>>
>> Jon
> 
> Just to clarify, when you mention the same error message which one are
> you referring to? In my case only the drm allocation is fixed with the
> above patch, and I am still getting the ENOENT.


OK, got it. In my case, I only see ...

  ERR KERN host1x drm: iommu configuration for device failed with -ENOENT

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24  9:13   ` Diogo Ivo
  2024-01-24 10:17     ` Jon Hunter
@ 2024-01-24 11:46     ` Robin Murphy
  2024-01-24 12:56       ` Diogo Ivo
  2024-01-24 17:03       ` Jason Gunthorpe
  1 sibling, 2 replies; 17+ messages in thread
From: Robin Murphy @ 2024-01-24 11:46 UTC (permalink / raw)
  To: Diogo Ivo, Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, jonathanh, baolu.lu,
	jsnitsel, jroedel, linux-tegra, iommu, regressions

On 2024-01-24 9:13 am, Diogo Ivo wrote:
> On Tue, Jan 23, 2024 at 11:15:08AM -0400, Jason Gunthorpe wrote:
>> On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt wrote:
>>> Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
>>> at least on the Pixel C with the following error message reporting
>>> a failed buffer allocation:
>>>
>>> [    1.857660] drm drm: failed to allocate buffer of size 18432000
>>>
>>> This error message is printed from tegra_bo_alloc() which is called by
>>> tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
>>> allocations would fail as well, not just the framebuffer.
>>
>> Presumably this is because iommu_map_sgtable() under
>> tegra_bo_iommu_map() fails?
>>
>> Which I suspect is because of the logic in
>> host1x_client_iommu_attach().
>>
>> After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
>>
>> So this:
>>
>> 	if (domain && domain != tegra->domain)
>> 		return 0;
>>
>> Will happen and the domain will be left at IDENTITY, while I suppose
>> the tegra_bo_iommu_map() assumes the domain was switched to tegra->domain.
>>
>> Does this solve your issue?
>>
>> diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
>> index ff36171c8fb700..15c7910b2e1c76 100644
>> --- a/drivers/gpu/drm/tegra/drm.c
>> +++ b/drivers/gpu/drm/tegra/drm.c
>> @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct host1x_client *client)
>>           * not the shared IOMMU domain, don't try to attach it to a different
>>           * domain. This allows using the IOMMU-backed DMA API.
>>           */
>> -       if (domain && domain != tegra->domain)
>> +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain != tegra->domain)
>>                  return 0;
>>   
>>          if (tegra->domain) {
>>
>>> This may be connected with an error in of_iommu_configure() that
>>> became visible after commit 6ff6e184f1f4d:
>>>
>>> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
>>
>> Hmmm
>>
>> This is a new logging, so it doesn't necessarily mean something has
>> changed in the code flow.
>>
>> It seems the issue is something in there is returning ENOENT when it
>> probably should be ENODEV, but I haven't been able to guess where it
>> comes from.
>>
>> Can you do some tracing and figure out where under
>> of_iommu_configure() this ENOENT return code is from?
>>
>> Jason
> 
> I did the tracing and found that the ENOENT is coming from
> sysfs_do_create_link_sd() in the following function call chain:
> 
> of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->

What's the call path leading up to that? If it's the one from 
host1x_device_add() then it's expected and benign - for fiddly reasons, 
iommu_probe_device() ends up being called too early, but will soon be 
run again in the correct circumstances once we proceed into 
host1x_subdev_register()->device_add(). That will have been happening 
for years, we just never reported errors in that spot before (and 
frankly I'm not convinced it's valuable to have added it now).

Thanks,
Robin.

> iommu_init_device() -> iommu_device_link() -> sysfs_add_link_to_group() ->
> sysfs_create_link_sd() -> sysfs_do_create_link_sd()
> 
> The particular block that emits the error is
> 
> static int sysfs_do_create_link_sd()
> {
> 	...
> 
> 	if (target_kobj->sd) {
> 		target = target_kobj->sd;
> 		kernfs_get(target);
> 	}
> 
> 	if (!target)
> 		return -ENOENT;
> 
> 	...
> }
> 
> Best regargs,
> Diogo
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 11:46     ` Robin Murphy
@ 2024-01-24 12:56       ` Diogo Ivo
  2024-02-29 14:50         ` Jon Hunter
  2024-01-24 17:03       ` Jason Gunthorpe
  1 sibling, 1 reply; 17+ messages in thread
From: Diogo Ivo @ 2024-01-24 12:56 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, thierry.reding, vdumpa, joro, will, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions,
	diogo.ivo

On Wed, Jan 24, 2024 at 11:46:59AM +0000, Robin Murphy wrote:
> On 2024-01-24 9:13 am, Diogo Ivo wrote:
> > On Tue, Jan 23, 2024 at 11:15:08AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 23, 2024 at 02:33:15PM +0000, diogo.ivo@tecnico.ulisboa.pt wrote:
> > > > Commit c8cc2655cc6c in the recent IOMMU changes breaks Tegra fbdev
> > > > at least on the Pixel C with the following error message reporting
> > > > a failed buffer allocation:
> > > > 
> > > > [    1.857660] drm drm: failed to allocate buffer of size 18432000
> > > > 
> > > > This error message is printed from tegra_bo_alloc() which is called by
> > > > tegra_bo_create() in tegra_fbdev_probe(), which may indicate that other
> > > > allocations would fail as well, not just the framebuffer.
> > > 
> > > Presumably this is because iommu_map_sgtable() under
> > > tegra_bo_iommu_map() fails?
> > > 
> > > Which I suspect is because of the logic in
> > > host1x_client_iommu_attach().
> > > 
> > > After c8cc2655cc6c iommu_get_domain_for_dev() will never return NULL.
> > > 
> > > So this:
> > > 
> > > 	if (domain && domain != tegra->domain)
> > > 		return 0;
> > > 
> > > Will happen and the domain will be left at IDENTITY, while I suppose
> > > the tegra_bo_iommu_map() assumes the domain was switched to tegra->domain.
> > > 
> > > Does this solve your issue?
> > > 
> > > diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
> > > index ff36171c8fb700..15c7910b2e1c76 100644
> > > --- a/drivers/gpu/drm/tegra/drm.c
> > > +++ b/drivers/gpu/drm/tegra/drm.c
> > > @@ -960,7 +960,7 @@ int host1x_client_iommu_attach(struct host1x_client *client)
> > >           * not the shared IOMMU domain, don't try to attach it to a different
> > >           * domain. This allows using the IOMMU-backed DMA API.
> > >           */
> > > -       if (domain && domain != tegra->domain)
> > > +       if (domain && domain->type != IOMMU_DOMAIN_IDENTITY && domain != tegra->domain)
> > >                  return 0;
> > >          if (tegra->domain) {
> > > 
> > > > This may be connected with an error in of_iommu_configure() that
> > > > became visible after commit 6ff6e184f1f4d:
> > > > 
> > > > [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> > > 
> > > Hmmm
> > > 
> > > This is a new logging, so it doesn't necessarily mean something has
> > > changed in the code flow.
> > > 
> > > It seems the issue is something in there is returning ENOENT when it
> > > probably should be ENODEV, but I haven't been able to guess where it
> > > comes from.
> > > 
> > > Can you do some tracing and figure out where under
> > > of_iommu_configure() this ENOENT return code is from?
> > > 
> > > Jason
> > 
> > I did the tracing and found that the ENOENT is coming from
> > sysfs_do_create_link_sd() in the following function call chain:
> > 
> > of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
> 
> What's the call path leading up to that? If it's the one from
> host1x_device_add() then it's expected and benign - for fiddly reasons,
> iommu_probe_device() ends up being called too early, but will soon be run
> again in the correct circumstances once we proceed into
> host1x_subdev_register()->device_add(). That will have been happening for
> years, we just never reported errors in that spot before (and frankly I'm
> not convinced it's valuable to have added it now).
> 
> Thanks,
> Robin.

Yes, it is the one called from host1x_device_add(), so this
is solved and only the patch sent above needs to be merged.

Thanks,
Diogo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 11:46     ` Robin Murphy
  2024-01-24 12:56       ` Diogo Ivo
@ 2024-01-24 17:03       ` Jason Gunthorpe
  2024-02-08  1:22         ` Robin Murphy
  1 sibling, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2024-01-24 17:03 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Diogo Ivo, thierry.reding, vdumpa, joro, will, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions

On Wed, Jan 24, 2024 at 11:46:59AM +0000, Robin Murphy wrote:
> > > > This may be connected with an error in of_iommu_configure() that
> > > > became visible after commit 6ff6e184f1f4d:
> > > > 
> > > > [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> > > 
> > > Hmmm
> > > 
> > > This is a new logging, so it doesn't necessarily mean something has
> > > changed in the code flow.
> > > 
> > > It seems the issue is something in there is returning ENOENT when it
> > > probably should be ENODEV, but I haven't been able to guess where it
> > > comes from.
> > > 
> > > Can you do some tracing and figure out where under
> > > of_iommu_configure() this ENOENT return code is from?
> > 
> > I did the tracing and found that the ENOENT is coming from
> > sysfs_do_create_link_sd() in the following function call chain:
> > 
> > of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
> 
> What's the call path leading up to that? If it's the one from
> host1x_device_add() then it's expected and benign - for fiddly reasons,
> iommu_probe_device() ends up being called too early, but will soon be run
> again in the correct circumstances once we proceed into
> host1x_subdev_register()->device_add(). That will have been happening for
> years, we just never reported errors in that spot before (and frankly I'm
> not convinced it's valuable to have added it now).

Hmm. Prior to

commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
iommu_init/deinit_device()")

The error from iommu_device_link() was ignored. It seems like for most
of the years the probe actually succeeded, just with a mangled sysfs?

Though that host1x_device_add() ignored the return code does make me
wonder..

This is the only clue I see:

commit c95469aa5a188384ccf8ac520ece931c66caf8aa
Author: Alexandre Courbot <acourbot@nvidia.com>
Date:   Fri Feb 26 18:06:53 2016 +0900

    gpu: host1x: Set DMA ops on device creation
    
    Currently host1x-instanciated devices have their dma_ops left to NULL,
    which makes any DMA operation (like buffer import) on ARM64 fallback
    to the dummy_dma_ops and fail with an error.
    
    This patch calls of_dma_configure() with the host1x node when creating
    such a device, so the proper DMA operations are set.
    
    Suggested-by: Thierry Reding <thierry.reding@gmail.com>
    Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
    Signed-off-by: Thierry Reding <treding@nvidia.com>

Which is no longer happening anymore as failure of
iommu_probe_device() will not cause the dma ops to be setup.

So, if everything still works and something else is calling
of_dma_configure() prior to using the struct device for any DMA
operations (eg because a driver is always probed?) then we should just
delete this call.

Robin do you know more? Specifically where is the "soon be run again"?
Was the above issue fixed in commit 07397df29e57 ("dma-mapping: move
dma configuration to bus infrastructure") ?

Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-23 18:44   ` Diogo Ivo
  2024-01-24 10:15     ` Jon Hunter
@ 2024-01-26 19:51     ` Jason Gunthorpe
  2024-01-29 12:07       ` Diogo Ivo
  1 sibling, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2024-01-26 19:51 UTC (permalink / raw)
  To: Diogo Ivo
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions

On Tue, Jan 23, 2024 at 06:44:42PM +0000, Diogo Ivo wrote:

> Yes, this does fix the issue!

Thanks, just trying to pin down how the fix should be..

What kind of kernel are you running, 

Is this 32 bit?

Is CONFIG_ARM_DMA_USE_IOMMU set?

Is CONFIG_IOMMU_DMA set?

I'm guessing yes/yes/no?

Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-26 19:51     ` Jason Gunthorpe
@ 2024-01-29 12:07       ` Diogo Ivo
  0 siblings, 0 replies; 17+ messages in thread
From: Diogo Ivo @ 2024-01-29 12:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: thierry.reding, vdumpa, joro, will, robin.murphy, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions,
	diogo.ivo

On Fri, Jan 26, 2024 at 03:51:42PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 23, 2024 at 06:44:42PM +0000, Diogo Ivo wrote:
> 
> > Yes, this does fix the issue!
> 
> Thanks, just trying to pin down how the fix should be..
> 
> What kind of kernel are you running, 
> 
> Is this 32 bit?
> 
> Is CONFIG_ARM_DMA_USE_IOMMU set?
> 
> Is CONFIG_IOMMU_DMA set?
> 
> I'm guessing yes/yes/no?

It is a 64-bit kernel, CONFIG_ARM_DMA_USE_IOMMU=n and
CONFIG_IOMMU_DMA=y.

Diogo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 17:03       ` Jason Gunthorpe
@ 2024-02-08  1:22         ` Robin Murphy
  2024-02-08  2:31           ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2024-02-08  1:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Diogo Ivo, thierry.reding, vdumpa, joro, will, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions

On 24/01/2024 5:03 pm, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 11:46:59AM +0000, Robin Murphy wrote:
>>>>> This may be connected with an error in of_iommu_configure() that
>>>>> became visible after commit 6ff6e184f1f4d:
>>>>>
>>>>> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
>>>>
>>>> Hmmm
>>>>
>>>> This is a new logging, so it doesn't necessarily mean something has
>>>> changed in the code flow.
>>>>
>>>> It seems the issue is something in there is returning ENOENT when it
>>>> probably should be ENODEV, but I haven't been able to guess where it
>>>> comes from.
>>>>
>>>> Can you do some tracing and figure out where under
>>>> of_iommu_configure() this ENOENT return code is from?
>>>
>>> I did the tracing and found that the ENOENT is coming from
>>> sysfs_do_create_link_sd() in the following function call chain:
>>>
>>> of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
>>
>> What's the call path leading up to that? If it's the one from
>> host1x_device_add() then it's expected and benign - for fiddly reasons,
>> iommu_probe_device() ends up being called too early, but will soon be run
>> again in the correct circumstances once we proceed into
>> host1x_subdev_register()->device_add(). That will have been happening for
>> years, we just never reported errors in that spot before (and frankly I'm
>> not convinced it's valuable to have added it now).
> 
> Hmm. Prior to
> 
> commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
> iommu_init/deinit_device()")
> 
> The error from iommu_device_link() was ignored. It seems like for most
> of the years the probe actually succeeded, just with a mangled sysfs?

Hmm, yeah, I suppose the iommu_group/<n>/devices/ dir just didn't get 
populated and nobody noticed or cared (which TBH doesn't really surprise 
me for a case where a kernel driver is expected to manage the IOMMU 
anyway such that it's pretty much irrelevant to userspace).

> Though that host1x_device_add() ignored the return code does make me
> wonder..
> 
> This is the only clue I see:
> 
> commit c95469aa5a188384ccf8ac520ece931c66caf8aa
> Author: Alexandre Courbot <acourbot@nvidia.com>
> Date:   Fri Feb 26 18:06:53 2016 +0900
> 
>      gpu: host1x: Set DMA ops on device creation
>      
>      Currently host1x-instanciated devices have their dma_ops left to NULL,
>      which makes any DMA operation (like buffer import) on ARM64 fallback
>      to the dummy_dma_ops and fail with an error.
>      
>      This patch calls of_dma_configure() with the host1x node when creating
>      such a device, so the proper DMA operations are set.
>      
>      Suggested-by: Thierry Reding <thierry.reding@gmail.com>
>      Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>      Signed-off-by: Thierry Reding <treding@nvidia.com>
> 
> Which is no longer happening anymore as failure of
> iommu_probe_device() will not cause the dma ops to be setup.
> 
> So, if everything still works and something else is calling
> of_dma_configure() prior to using the struct device for any DMA
> operations (eg because a driver is always probed?) then we should just
> delete this call.

I've considered that before - arguably it could have been removed when 
Mikko implemented the full bus model, but now I kind of don't want to 
since it's one of the remaining few that *is* still in the right place 
where it's always meant to be. The correct fix to meet the original 
expectations *should* be simply:

----->8-----
diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 84d042796d2e..6cab950690a0 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -455,11 +455,11 @@ static int host1x_device_add(struct host1x *host1x,
  	device->dev.dma_mask = &device->dev.coherent_dma_mask;
  	dev_set_name(&device->dev, "%s", driver->driver.name);
  	device->dev.release = host1x_device_release;
-	device->dev.bus = &host1x_bus_type;
  	device->dev.parent = host1x->dev;

  	of_dma_configure(&device->dev, host1x->dev->of_node, true);

+	device->dev.bus = &host1x_bus_type;
  	device->dev.dma_parms = &device->dma_parms;
  	dma_set_max_seg_size(&device->dev, UINT_MAX);

-----8<-----

...except you've now also broken that at the same time by removing the 
dev->bus check from of_iommu_configure() :(

> Robin do you know more? Specifically where is the "soon be run again"?

I can't say why it's done in quite such a roundabout manner by 
host1x_subdev_register(), but as I called out, we quickly get to the 
actual device_add() call, wherein things proceed as normal. The only 
thing going wrong here is that we "replay" the iommu_probe_device() call 
*before* it ever could have happened normally.

Frankly, for all you protest whenever I call you out for demonstrating a 
lack of understanding of this tangled mess, I sure do seem to spend a 
lot of time explaining it to you... :/

I've never claimed it's *not* a horrible mess, but at the risk of 
repeating myself, it *is* fragile, and the consequences of mucking about 
with it are tricky to reason about even when one does understand all the 
history of how it's intended to work vs. what actually happens. To coin 
a phrase I find enjoyable, this is definitely "F around and find out" 
code; as this and other threads show, now you're well into the finding 
out part.

> Was the above issue fixed in commit 07397df29e57 ("dma-mapping: move
> dma configuration to bus infrastructure") ?

Um, no. That commit was no more than code movement essentially 
separating PCI from non-PCI configuration; merely host1x didn't need to 
share the full path which ended up as platform_dma_configure() since 
host1x doesn't support ACPI. Once again the fact that I've had to 
explain that drives me to utter despair...

Robin.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-02-08  1:22         ` Robin Murphy
@ 2024-02-08  2:31           ` Jason Gunthorpe
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2024-02-08  2:31 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Diogo Ivo, thierry.reding, vdumpa, joro, will, jonathanh,
	baolu.lu, jsnitsel, jroedel, linux-tegra, iommu, regressions

On Thu, Feb 08, 2024 at 01:22:47AM +0000, Robin Murphy wrote:
> > So, if everything still works and something else is calling
> > of_dma_configure() prior to using the struct device for any DMA
> > operations (eg because a driver is always probed?) then we should just
> > delete this call.
> 
> I've considered that before - arguably it could have been removed when Mikko
> implemented the full bus model, but now I kind of don't want to since it's
> one of the remaining few that *is* still in the right place where it's
> always meant to be. The correct fix to meet the original expectations
> *should* be simply:

It is a bit jarring to hear you call a rare place, where it doesn't
even actually work right, "meet the original expectations". :\

It is pretty clear the function doesn't do what it was originally
expected too anymore, and there are now so many call sites doing
different things I don't think that past is so relevant.

> ----->8-----
> diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
> index 84d042796d2e..6cab950690a0 100644
> --- a/drivers/gpu/host1x/bus.c
> +++ b/drivers/gpu/host1x/bus.c
> @@ -455,11 +455,11 @@ static int host1x_device_add(struct host1x *host1x,
>  	device->dev.dma_mask = &device->dev.coherent_dma_mask;
>  	dev_set_name(&device->dev, "%s", driver->driver.name);
>  	device->dev.release = host1x_device_release;
> -	device->dev.bus = &host1x_bus_type;
>  	device->dev.parent = host1x->dev;
> 
>  	of_dma_configure(&device->dev, host1x->dev->of_node, true);
> 
> +	device->dev.bus = &host1x_bus_type;
>  	device->dev.dma_parms = &device->dma_parms;
>  	dma_set_max_seg_size(&device->dev, UINT_MAX);
> 
> -----8<-----
> 
> ...except you've now also broken that at the same time by removing the
> dev->bus check from of_iommu_configure() :(

Umm. Is there examples in the tree of this ordering? I looked for a
while and I only found a bunch of counter examples..

Trying to order it before the bus is set feels like a big hack,
something so subtle should not be part of a driver facing API.

At this point, if it is to stay here, the functionality it needs
probably wants a new function name and no entanglement with the iommu
layer.

> Frankly, for all you protest whenever I call you out for demonstrating a
> lack of understanding of this tangled mess, I sure do seem to spend a lot of
> time explaining it to you... :/

You never do actually explain it though. You throw some insults and
make a few statements and often don't answer any followup
questions.

The code itself is not enlightening, there are no comments explaining
these entanglements, these patterns you point to as "right" are not
really followed. The designs you explain as "right" don't fully
work. There are lots of little accumulated hacks to trip on.

It is a huge red flag that something so important as probing the iommu
is so completely baroque.

> I've never claimed it's *not* a horrible mess, but at the risk of repeating
> myself, it *is* fragile, and the consequences of mucking about with it are
> tricky to reason about even when one does understand all the history of how
> it's intended to work vs. what actually happens. To coin a phrase I find
> enjoyable, this is definitely "F around and find out" code; as this and
> other threads show, now you're well into the finding out part.

Well sure, fragile is how this kind of kernel work usually goes. How
many times has Thomas broke Xen refactoring x86 stuff? This is normal.

It is impossible to know what every crazy driver has done, and stuff
is getting broken. The point is to slowly make progress toward a
robust and understandable design in the core layer and facing the
drivers, especially one that releases the drivers of any complex
understanding.

We've made good progress on this front. The iommu drivers especially
are slowly getting more uniform.

Here we got a log message that points to something really WTF going
on. John's case is different and seems to have identified a real
race. I think we are doing pretty good on this.

It would be nice to come to conclusion on what should be done here
that improves and makes the driver facing API more robust. But it is
just a log message so maybe it is fine to leave it.

> > Was the above issue fixed in commit 07397df29e57 ("dma-mapping: move
> > dma configuration to bus infrastructure") ?
> 
> Um, no. That commit was no more than code movement essentially separating
> PCI from non-PCI configuration; merely host1x didn't need to share the full
> path which ended up as platform_dma_configure() since host1x doesn't support
> ACPI. Once again the fact that I've had to explain that drives me to utter
> despair...

Okay, I didn't go through the whole patch, I was only looking at the
hunks in drivers/gpu/host1x/bus.c that added a new of_dma_configure()
call.

Regards,
Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-01-24 12:56       ` Diogo Ivo
@ 2024-02-29 14:50         ` Jon Hunter
  2024-02-29 16:46           ` Thierry Reding
  0 siblings, 1 reply; 17+ messages in thread
From: Jon Hunter @ 2024-02-29 14:50 UTC (permalink / raw)
  To: Diogo Ivo, Robin Murphy
  Cc: Jason Gunthorpe, thierry.reding, vdumpa, joro, will, baolu.lu,
	jsnitsel, jroedel, linux-tegra, iommu, regressions


On 24/01/2024 12:56, Diogo Ivo wrote:

...

>>> I did the tracing and found that the ENOENT is coming from
>>> sysfs_do_create_link_sd() in the following function call chain:
>>>
>>> of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
>>
>> What's the call path leading up to that? If it's the one from
>> host1x_device_add() then it's expected and benign - for fiddly reasons,
>> iommu_probe_device() ends up being called too early, but will soon be run
>> again in the correct circumstances once we proceed into
>> host1x_subdev_register()->device_add(). That will have been happening for
>> years, we just never reported errors in that spot before (and frankly I'm
>> not convinced it's valuable to have added it now).
>>
>> Thanks,
>> Robin.
> 
> Yes, it is the one called from host1x_device_add(), so this
> is solved and only the patch sent above needs to be merged.


Sorry for the delay in getting back to this. I have been doing more
testing and the backtrace I see from this warning is ...

[    7.001380]  drm: iommu configuration for device failed with -ENOENT
[    7.001550] CPU: 4 PID: 263 Comm: systemd-udevd Not tainted 6.8.0-rc6-gbbe953beb8b9-dirty #2
[    7.001559] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT)
[    7.001564] Call trace:
[    7.001568]  dump_backtrace.part.6+0x84/0xdc
[    7.001583]  show_stack+0x14/0x1c
[    7.001590]  dump_stack_lvl+0x48/0x5c
[    7.001600]  dump_stack+0x14/0x1c
[    7.001606]  of_dma_configure_id+0x218/0x400
[    7.001636]  host1x_attach_driver+0x150/0x2d0 [host1x]
[    7.001664]  host1x_driver_register_full+0x7c/0xdc [host1x]
[    7.001711]  host1x_drm_init+0x3c/0x1000 [tegra_drm]
[    7.001746]  do_one_initcall+0x58/0x1c0
[    7.001752]  do_init_module+0x54/0x1d8
[    7.001761]  load_module+0x18b8/0x18ec
[    7.001770]  init_module_from_file+0x8c/0xc8
[    7.001777]  __arm64_sys_finit_module+0x1c4/0x29c
[    7.001784]  invoke_syscall+0x40/0xf4
[    7.001792]  el0_svc_common.constprop.1+0xc4/0xec
[    7.001814]  do_el0_svc+0x18/0x20
[    7.001820]  el0_svc+0x28/0x90
[    7.001826]  el0t_64_sync_handler+0x9c/0xc0
[    7.001845]  el0t_64_sync+0x160/0x164


I could have sworn that this was coming from
host1x_memory_context_list_init() but that is not the case.

Anyway, we have a test that checks for warnings/errors and this
is causing that test to fail. Even if this particular instance
of error is benign we would still like to trap any instances
that are not. So is there something we can fix here to avoid
this?

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] Failed buffer allocation in Tegra fbdev
  2024-02-29 14:50         ` Jon Hunter
@ 2024-02-29 16:46           ` Thierry Reding
  0 siblings, 0 replies; 17+ messages in thread
From: Thierry Reding @ 2024-02-29 16:46 UTC (permalink / raw)
  To: Jon Hunter, Diogo Ivo, Robin Murphy
  Cc: Jason Gunthorpe, vdumpa, joro, will, baolu.lu, jsnitsel, jroedel,
	linux-tegra, iommu, regressions

[-- Attachment #1: Type: text/plain, Size: 3180 bytes --]

On Thu Feb 29, 2024 at 3:50 PM CET, Jon Hunter wrote:
>
> On 24/01/2024 12:56, Diogo Ivo wrote:
>
> ...
>
> >>> I did the tracing and found that the ENOENT is coming from
> >>> sysfs_do_create_link_sd() in the following function call chain:
> >>>
> >>> of_iommu_configure() -> iommu_probe_device() -> __iommu_probe_device() ->
> >>
> >> What's the call path leading up to that? If it's the one from
> >> host1x_device_add() then it's expected and benign - for fiddly reasons,
> >> iommu_probe_device() ends up being called too early, but will soon be run
> >> again in the correct circumstances once we proceed into
> >> host1x_subdev_register()->device_add(). That will have been happening for
> >> years, we just never reported errors in that spot before (and frankly I'm
> >> not convinced it's valuable to have added it now).
> >>
> >> Thanks,
> >> Robin.
> > 
> > Yes, it is the one called from host1x_device_add(), so this
> > is solved and only the patch sent above needs to be merged.
>
>
> Sorry for the delay in getting back to this. I have been doing more
> testing and the backtrace I see from this warning is ...
>
> [    7.001380]  drm: iommu configuration for device failed with -ENOENT
> [    7.001550] CPU: 4 PID: 263 Comm: systemd-udevd Not tainted 6.8.0-rc6-gbbe953beb8b9-dirty #2
> [    7.001559] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT)
> [    7.001564] Call trace:
> [    7.001568]  dump_backtrace.part.6+0x84/0xdc
> [    7.001583]  show_stack+0x14/0x1c
> [    7.001590]  dump_stack_lvl+0x48/0x5c
> [    7.001600]  dump_stack+0x14/0x1c
> [    7.001606]  of_dma_configure_id+0x218/0x400
> [    7.001636]  host1x_attach_driver+0x150/0x2d0 [host1x]
> [    7.001664]  host1x_driver_register_full+0x7c/0xdc [host1x]
> [    7.001711]  host1x_drm_init+0x3c/0x1000 [tegra_drm]
> [    7.001746]  do_one_initcall+0x58/0x1c0
> [    7.001752]  do_init_module+0x54/0x1d8
> [    7.001761]  load_module+0x18b8/0x18ec
> [    7.001770]  init_module_from_file+0x8c/0xc8
> [    7.001777]  __arm64_sys_finit_module+0x1c4/0x29c
> [    7.001784]  invoke_syscall+0x40/0xf4
> [    7.001792]  el0_svc_common.constprop.1+0xc4/0xec
> [    7.001814]  do_el0_svc+0x18/0x20
> [    7.001820]  el0_svc+0x28/0x90
> [    7.001826]  el0t_64_sync_handler+0x9c/0xc0
> [    7.001845]  el0t_64_sync+0x160/0x164
>
>
> I could have sworn that this was coming from
> host1x_memory_context_list_init() but that is not the case.
>
> Anyway, we have a test that checks for warnings/errors and this
> is causing that test to fail. Even if this particular instance
> of error is benign we would still like to trap any instances
> that are not. So is there something we can fix here to avoid
> this?

I was wondering why I wasn't seeing this and looking through some of the
code I noticed that I have commented out the of_dma_configure() call in
host1x_device_add() in my local development tree. I probably came across
this at some point while trying to fix something else with the intention
of getting back to it but then never did.

Anyway, let me try to refresh my memory and take a stab at fixing this.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-02-29 16:46 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-23 14:33 [REGRESSION] Failed buffer allocation in Tegra fbdev diogo.ivo
2024-01-23 15:15 ` Jason Gunthorpe
2024-01-23 18:44   ` Diogo Ivo
2024-01-24 10:15     ` Jon Hunter
2024-01-24 10:30       ` Diogo Ivo
2024-01-24 10:36         ` Jon Hunter
2024-01-26 19:51     ` Jason Gunthorpe
2024-01-29 12:07       ` Diogo Ivo
2024-01-24  9:13   ` Diogo Ivo
2024-01-24 10:17     ` Jon Hunter
2024-01-24 11:46     ` Robin Murphy
2024-01-24 12:56       ` Diogo Ivo
2024-02-29 14:50         ` Jon Hunter
2024-02-29 16:46           ` Thierry Reding
2024-01-24 17:03       ` Jason Gunthorpe
2024-02-08  1:22         ` Robin Murphy
2024-02-08  2:31           ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.