* [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Dmitry Osipenko @ 2019-12-12 21:25 UTC
  To: Thierry Reding, linux-tegra, DRI Development

Hello Thierry,

Commit [1] introduced a severe GPU performance regression on Tegra20 and
Tegra30.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3

Interestingly the performance is okay on Tegra30 if
CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
Tegra20.

I was telling you about this problem on the #tegra IRC some time ago and
you asked to report it in a trackable form, so finally here it is.

You can reproduce the problem by running [2] like this:
`grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
display resolution, but currently it's ~11 FPS.

[2]
https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c

Previously I was seeing some memory errors coming from Host1x DMA, but
I don't see any errors at all right now.

I don't see anything done horribly wrong in the offending commit.

Unfortunately I haven't been able to dedicate enough time to sit down
and debug the problem thoroughly yet. Please let me know if you find a
solution; I'll be happy to test it. Thanks in advance!


* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Thierry Reding @ 2019-12-13 15:10 UTC
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development


On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
> Hello Thierry,
> 
> Commit [1] introduced a severe GPU performance regression on Tegra20 and
> Tegra30.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
> 
> Interestingly the performance is okay on Tegra30 if
> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
> Tegra20.
> 
> I was telling you about this problem on the #tegra IRC some time ago and
> you asked to report it in a trackable form, so finally here it is.
> 
> You can reproduce the problem by running [2] like this:
> `grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
> display resolution, but currently it's ~11 FPS.
> 
> [2]
> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
> 
> Previously I was seeing some memory errors coming from Host1x DMA, but
> I don't see any errors at all right now.
> 
> I don't see anything done horribly wrong in the offending commit.
> 
> Unfortunately I haven't been able to dedicate enough time to sit down
> and debug the problem thoroughly yet. Please let me know if you find a
> solution; I'll be happy to test it. Thanks in advance!

I suspect that the problem here is that we're now using the DMA API,
which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
that that code doesn't coalesce entries in the SG table, so we may end
up calling iommu_map() a lot of times, and miss out on much of the
advantage that the ->iotlb_sync_map() gives us on Tegra20.

At the same time dma_map_sg() will flush caches, which we didn't do
before. This we should be able to improve by passing the attribute
DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
maintenance isn't needed.
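
For example, a sketch of what that could look like at the dma_map_sg()
call site in host1x's pin_job() (the error handling mirrors the existing
code; treat this as an illustration, not a tested patch):

        /* skip CPU cache maintenance when the buffer is known not to
         * need it */
        err = dma_map_sg_attrs(host->dev, sgt->sgl, sgt->nents,
                               DMA_TO_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
        if (!err) {
                err = -ENOMEM;
                goto unpin;
        }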

And while thinking about it, one other difference is that with the DMA
API we actually map/unmap the buffers for every submission. This is
because the DMA API semantics require that buffers be mapped/unmapped
every time you use them. Previously we would basically only map each
buffer once (at allocation time) and only have to deal with cache
maintenance, so the overhead per submission was drastically lower.

If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
may want to restore explicit IOMMU usage, at least on anything prior to
Tegra124 where we're unlikely to ever use different IOMMU domains anyway
(because they are such a scarce resource).

Thierry


* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Dmitry Osipenko @ 2019-12-13 15:35 UTC
  To: Thierry Reding; +Cc: linux-tegra, DRI Development

On 13.12.2019 18:10, Thierry Reding wrote:
> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
>> Hello Thierry,
>>
>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
>> Tegra30.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
>>
>> Interestingly the performance is okay on Tegra30 if
>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
>> Tegra20.
>>
>> I was telling you about this problem on the #tegra IRC some time ago and
>> you asked to report it in a trackable form, so finally here it is.
>>
>> You can reproduce the problem by running [2] like this:
>> `grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
>> display resolution, but currently it's ~11 FPS.
>>
>> [2]
>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
>>
>> Previously I was seeing some memory errors coming from Host1x DMA, but
>> I don't see any errors at all right now.
>>
>> I don't see anything done horribly wrong in the offending commit.
>>
>> Unfortunately I haven't been able to dedicate enough time to sit down
>> and debug the problem thoroughly yet. Please let me know if you find a
>> solution; I'll be happy to test it. Thanks in advance!
> 
> I suspect that the problem here is that we're now using the DMA API,
> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
> that that code doesn't coalesce entries in the SG table, so we may end
> up calling iommu_map() a lot of times, and miss out on much of the
> advantage that the ->iotlb_sync_map() gives us on Tegra20.
> 
> At the same time dma_map_sg() will flush caches, which we didn't do
> before. This we should be able to improve by passing the attribute
> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
> maintenance isn't needed.
> 
> And while thinking about it, one other difference is that with the DMA
> API we actually map/unmap the buffers for every submission. This is
> because the DMA API semantics require that buffers be mapped/unmapped
> every time you use them. Previously we would basically only map each
> buffer once (at allocation time) and only have to deal with cache
> maintenance, so the overhead per submission was drastically lower.
> 
> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
> may want to restore explicit IOMMU usage, at least on anything prior to
> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
> (because they are such a scarce resource).

Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
think that it's the root of the problem. Disabling IOMMU for Tegra30
also didn't help (IIRC).

The offending patch shouldn't change anything with regard to the DMA API,
unless I'm missing something. Strange...

Please keep me up-to-date!

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Dmitry Osipenko @ 2020-01-20  2:53 UTC
  To: Thierry Reding; +Cc: linux-tegra, DRI Development

On 13.12.2019 18:35, Dmitry Osipenko wrote:
> On 13.12.2019 18:10, Thierry Reding wrote:
>> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
>>> Hello Thierry,
>>>
>>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
>>> Tegra30.
>>>
>>> [1]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
>>>
>>> Interestingly the performance is okay on Tegra30 if
>>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
>>> Tegra20.
>>>
>>> I was telling you about this problem on the #tegra IRC some time ago and
>>> you asked to report it in a trackable form, so finally here it is.
>>>
>>> You can reproduce the problem by running [2] like this:
>>> `grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
>>> display resolution, but currently it's ~11 FPS.
>>>
>>> [2]
>>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
>>>
>>> Previously I was seeing some memory errors coming from Host1x DMA, but
>>> I don't see any errors at all right now.
>>>
>>> I don't see anything done horribly wrong in the offending commit.
>>>
>>> Unfortunately I haven't been able to dedicate enough time to sit down
>>> and debug the problem thoroughly yet. Please let me know if you find a
>>> solution; I'll be happy to test it. Thanks in advance!
>>
>> I suspect that the problem here is that we're now using the DMA API,
>> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
>> that that code doesn't coalesce entries in the SG table, so we may end
>> up calling iommu_map() a lot of times, and miss out on much of the
>> advantage that the ->iotlb_sync_map() gives us on Tegra20.
>>
>> At the same time dma_map_sg() will flush caches, which we didn't do
>> before. This we should be able to improve by passing the attribute
>> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
>> maintenance isn't needed.
>>
>> And while thinking about it, one other difference is that with the DMA
>> API we actually map/unmap the buffers for every submission. This is
>> because the DMA API semantics require that buffers be mapped/unmapped
>> every time you use them. Previously we would basically only map each
>> buffer once (at allocation time) and only have to deal with cache
>> maintenance, so the overhead per submission was drastically lower.
>>
>> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
>> may want to restore explicit IOMMU usage, at least on anything prior to
>> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
>> (because they are such a scarce resource).
> 
> Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
> think that it's the root of the problem. Disabling IOMMU for Tegra30
> also didn't help (IIRC).
> 
> The offending patch shouldn't change anything with regard to the DMA API,
> unless I'm missing something. Strange...
> 
> Please keep me up-to-date!
> 

Hello Thierry,

I took another look at the problem and here is what I found:

1) The "Optionally attach clients to the IOMMU" patch is wrong because:

    1. host1x_drm_probe() is invoked *before* any of the
       host1x_client_iommu_attach() calls happen, so there is no way
       on earth the 'use_explicit_iommu' could ever be true.

    2. Not attaching DRM clients to IOMMU if HOST1x isn't
       attached is wrong because it is never attached in the case
       of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also
       makes no sense for T20/30, which do not support LPAE.

[1]
https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205

2) Because of the above problems, the DRM clients are erroneously not
getting attached to the IOMMU at all and thus CMA is getting used for the
BO allocations. Here come the problems introduced by the "gpu: host1x:
Support DMA mapping of buffers" patch, which makes the DMA API perform
CPU cache maintenance on each job submission, and apparently this is
super bad for performance. This also makes no sense in comparison to the
case of an enabled IOMMU, where cache maintenance isn't performed at all
(as it should be).

Please let me know if you're going to fix the problems or if you'd
prefer me to create the patches.

Here is a draft of the fix for #2; it doesn't cover the case of imported
buffers (which should be statically mapped, IIUC):

@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
         * If we've manually mapped the buffer object through the IOMMU, make
         * sure to return the IOVA address of our mapping.
         */
-       if (phys && obj->mm) {
+       if (phys && (obj->mm || obj->vaddr)) {
                *phys = obj->iova;
                return NULL;
        }
diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 25ca54de8fc5..69adfd66196b 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)

        for (i = 0; i < job->num_relocs; i++) {
                struct host1x_reloc *reloc = &job->relocs[i];
-               dma_addr_t phys_addr, *phys;
+               dma_addr_t phys_addr;
                struct sg_table *sgt;

                reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
                        goto unpin;
                }

-               if (client->group)
-                       phys = &phys_addr;
-               else
-                       phys = NULL;
-
-               sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+               sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
                if (IS_ERR(sgt)) {
                        err = PTR_ERR(sgt);
                        goto unpin;
@@ -184,7 +179,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
                        goto unpin;
                }

-               sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+               sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
                if (IS_ERR(sgt)) {
                        err = PTR_ERR(sgt);
                        goto unpin;
@@ -214,7 +209,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)

                        job->unpins[job->num_unpins].size = gather_size;
                        phys_addr = iova_dma_addr(&host->iova, alloc);
-               } else {
+               } else if (sgt) {
                        err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
                                         DMA_TO_DEVICE);
                        if (!err) {


* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Dmitry Osipenko @ 2020-01-28 15:43 UTC
  To: Thierry Reding
  Cc: linux-tegra, DRI Development, Daniel Vetter, Dave Airlie

On 20.01.2020 05:53, Dmitry Osipenko wrote:
> On 13.12.2019 18:35, Dmitry Osipenko wrote:
>> On 13.12.2019 18:10, Thierry Reding wrote:
>>> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
>>>> Hello Thierry,
>>>>
>>>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
>>>> Tegra30.
>>>>
>>>> [1]
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
>>>>
>>>> Interestingly the performance is okay on Tegra30 if
>>>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
>>>> Tegra20.
>>>>
>>>> I was telling you about this problem on the #tegra IRC some time ago and
>>>> you asked to report it in a trackable form, so finally here it is.
>>>>
>>>> You can reproduce the problem by running [2] like this:
>>>> `grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
>>>> display resolution, but currently it's ~11 FPS.
>>>>
>>>> [2]
>>>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
>>>>
>>>> Previously I was seeing some memory errors coming from Host1x DMA, but
>>>> I don't see any errors at all right now.
>>>>
>>>> I don't see anything done horribly wrong in the offending commit.
>>>>
>>>> Unfortunately I haven't been able to dedicate enough time to sit down
>>>> and debug the problem thoroughly yet. Please let me know if you find a
>>>> solution; I'll be happy to test it. Thanks in advance!
>>>
>>> I suspect that the problem here is that we're now using the DMA API,
>>> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
>>> that that code doesn't coalesce entries in the SG table, so we may end
>>> up calling iommu_map() a lot of times, and miss out on much of the
>>> advantage that the ->iotlb_sync_map() gives us on Tegra20.
>>>
>>> At the same time dma_map_sg() will flush caches, which we didn't do
>>> before. This we should be able to improve by passing the attribute
>>> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
>>> maintenance isn't needed.
>>>
>>> And while thinking about it, one other difference is that with the DMA
>>> API we actually map/unmap the buffers for every submission. This is
>>> because the DMA API semantics require that buffers be mapped/unmapped
>>> every time you use them. Previously we would basically only map each
>>> buffer once (at allocation time) and only have to deal with cache
>>> maintenance, so the overhead per submission was drastically lower.
>>>
>>> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
>>> may want to restore explicit IOMMU usage, at least on anything prior to
>>> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
>>> (because they are such a scarce resource).
>>
>> Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
>> think that it's the root of the problem. Disabling IOMMU for Tegra30
>> also didn't help (IIRC).
>>
>> The offending patch shouldn't change anything with regard to the DMA API,
>> unless I'm missing something. Strange...
>>
>> Please keep me up-to-date!
>>
> 
> Hello Thierry,
> 
> I took another look at the problem and here is what I found:
> 
> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
> 
>     1. host1x_drm_probe() is invoked *before* any of the
>        host1x_client_iommu_attach() calls happen, so there is no way
>        on earth the 'use_explicit_iommu' could ever be true.
> 
>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
>        attached is wrong because it is never attached in the case
>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also
>        makes no sense for T20/30, which do not support LPAE.
> 
> [1]
> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
> 
> 2) Because of the above problems, the DRM clients are erroneously not
> getting attached to the IOMMU at all and thus CMA is getting used for the
> BO allocations. Here come the problems introduced by the "gpu: host1x:
> Support DMA mapping of buffers" patch, which makes the DMA API perform
> CPU cache maintenance on each job submission, and apparently this is
> super bad for performance. This also makes no sense in comparison to the
> case of an enabled IOMMU, where cache maintenance isn't performed at all
> (as it should be).
> 
> Please let me know if you're going to fix the problems or if you'd
> prefer me to create the patches.
> 
> Here is a draft of the fix for #2; it doesn't cover the case of imported
> buffers (which should be statically mapped, IIUC):
...

v5.5 has now been released with the GPU driver unusable. Thierry, could
you please let me know if you're planning to do something about it?
Should I help?


* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
From: Thierry Reding @ 2020-01-29 12:39 UTC
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development

On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
> On 13.12.2019 18:35, Dmitry Osipenko wrote:
> > On 13.12.2019 18:10, Thierry Reding wrote:
> >> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
> >>> Hello Thierry,
> >>>
> >>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
> >>> Tegra30.
> >>>
> >>> [1]
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
> >>>
> >>> Interestingly the performance is okay on Tegra30 if
> >>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for
> >>> Tegra20.
> >>>
> >>> I was telling you about this problem on the #tegra IRC some time ago and
> >>> you asked to report it in a trackable form, so finally here it is.
> >>>
> >>> You can reproduce the problem by running [2] like this:
> >>> `grate/texture-filter -f -s`. It should produce over 100 FPS at 720p
> >>> display resolution, but currently it's ~11 FPS.
> >>>
> >>> [2]
> >>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
> >>>
> >>> Previously I was seeing some memory errors coming from Host1x DMA, but
> >>> I don't see any errors at all right now.
> >>>
> >>> I don't see anything done horribly wrong in the offending commit.
> >>>
> >>> Unfortunately I haven't been able to dedicate enough time to sit down
> >>> and debug the problem thoroughly yet. Please let me know if you find a
> >>> solution; I'll be happy to test it. Thanks in advance!
> >>
> >> I suspect that the problem here is that we're now using the DMA API,
> >> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
> >> that that code doesn't coalesce entries in the SG table, so we may end
> >> up calling iommu_map() a lot of times, and miss out on much of the
> >> advantage that the ->iotlb_sync_map() gives us on Tegra20.
> >>
> >> At the same time dma_map_sg() will flush caches, which we didn't do
> >> before. This we should be able to improve by passing the attribute
> >> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
> >> maintenance isn't needed.
> >>
> >> And while thinking about it, one other difference is that with the DMA
> >> API we actually map/unmap the buffers for every submission. This is
> >> because the DMA API semantics require that buffers be mapped/unmapped
> >> every time you use them. Previously we would basically only map each
> >> buffer once (at allocation time) and only have to deal with cache
> >> maintenance, so the overhead per submission was drastically lower.
> >>
> >> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
> >> may want to restore explicit IOMMU usage, at least on anything prior to
> >> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
> >> (because they are such a scarce resource).
> > 
> > Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
> > think that it's the root of the problem. Disabling IOMMU for Tegra30
> > also didn't help (IIRC).
> > 
> > The offending patch shouldn't change anything with regard to the DMA API,
> > unless I'm missing something. Strange...
> > 
> > Please keep me up-to-date!
> > 
> 
> Hello Thierry,
> 
> I took another look at the problem and here is what I found:
> 
> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
> 
>     1. host1x_drm_probe() is invoked *before* any of the
>        host1x_client_iommu_attach() calls happen, so there is no way
>        on earth the 'use_explicit_iommu' could ever be true.

That's not correct. host1x_client_iommu_attach() happens during
host1x_device_init(), which is called during host1x_drm_probe(). The
idea is that host1x_drm_probe() sets up the minimum IOMMU so that all
devices can attach, if they want to. If any of them do (because
they aren't already attached via something like the DMA/IOMMU glue),
then tegra->use_explicit_iommu is set to true, in which case the
IOMMU domain setup for explicit IOMMU API usage is completed. If, on
the other hand, none of the clients have a need for the explicit IOMMU
domain, there's no need to set it up and host1x_drm_probe() will just
discard it.
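
To make the ordering concrete, here is a standalone model of that flow
(plain C, not driver code; the names mirror the driver, everything else
is simplified):

#include <stdbool.h>
#include <stdio.h>

static bool use_explicit_iommu;

/* called from a client's ->init() when the client isn't already
 * attached through something like the DMA/IOMMU glue */
static void host1x_client_iommu_attach(void)
{
        use_explicit_iommu = true;
}

/* runs every client's ->init(), which may call the attach above */
static void host1x_device_init(void)
{
        host1x_client_iommu_attach();
}

int main(void)
{
        /* host1x_drm_probe(): set up a provisional IOMMU domain ... */
        host1x_device_init();

        /* ... and only keep it if at least one client opted in */
        if (use_explicit_iommu)
                puts("finish explicit IOMMU domain setup");
        else
                puts("discard the provisional domain");

        return 0;
}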

>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
>        attached is wrong because it is never attached in the case
>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also
>        makes no sense for T20/30, which do not support LPAE.

It's not at all wrong. Take for example the case of Tegra124 and
Tegra210 where host1x and its clients can address 34 bits. In those
cases, allocating individual pages via shmem has a high probability of
hitting physical addresses beyond the 32-bit range, which means that the
host1x can not access them unless it is also attached to an IOMMU where
physical addresses to >= 4 GiB addresses can be translated into < 4 GiB
virtual addresses. This is a very real problem that I was running into
when testing on Tegra124 and Tegra210.

But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We
should be able to remedy the situation on Tegra20 and Tegra30 by adding
another check, based on the DMA mask. Something like the below should
work:

--- >8 ---
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index aa9e49f04988..bd268028fb3d 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1037,23 +1037,9 @@ void tegra_drm_free(struct tegra_drm *tegra, size_t size, void *virt,
 	free_pages((unsigned long)virt, get_order(size));
 }
 
-static int host1x_drm_probe(struct host1x_device *dev)
+static bool host1x_drm_wants_iommu(struct host1x_device *dev)
 {
-	struct drm_driver *driver = &tegra_drm_driver;
 	struct iommu_domain *domain;
-	struct tegra_drm *tegra;
-	struct drm_device *drm;
-	int err;
-
-	drm = drm_dev_alloc(driver, &dev->dev);
-	if (IS_ERR(drm))
-		return PTR_ERR(drm);
-
-	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
-	if (!tegra) {
-		err = -ENOMEM;
-		goto put;
-	}
 
 	/*
 	 * If the Tegra DRM clients are backed by an IOMMU, push buffers are
@@ -1082,9 +1068,38 @@ static int host1x_drm_probe(struct host1x_device *dev)
 	 * up the device tree appropriately. This is considered an problem
 	 * of integration, so care must be taken for the DT to be consistent.
 	 */
-	domain = iommu_get_domain_for_dev(drm->dev->parent);
+	domain = iommu_get_domain_for_dev(dev->dev.parent);
+
+	/*
+	 * Tegra20 and Tegra30 don't support addressing memory beyond the
+	 * 32-bit boundary, so the regular GATHER opcodes will always be
+	 * sufficient and whether or not the host1x is attached to an IOMMU
+	 * doesn't matter.
+	 */
+	if (!domain && dma_get_mask(dev->dev.parent) <= DMA_BIT_MASK(32))
+		return true;
+
+	return domain != NULL;
+}
+
+static int host1x_drm_probe(struct host1x_device *dev)
+{
+	struct drm_driver *driver = &tegra_drm_driver;
+	struct tegra_drm *tegra;
+	struct drm_device *drm;
+	int err;
+
+	drm = drm_dev_alloc(driver, &dev->dev);
+	if (IS_ERR(drm))
+		return PTR_ERR(drm);
+
+	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
+	if (!tegra) {
+		err = -ENOMEM;
+		goto put;
+	}
 
-	if (domain && iommu_present(&platform_bus_type)) {
+	if (host1x_drm_wants_iommu(dev) && iommu_present(&platform_bus_type)) {
 		tegra->domain = iommu_domain_alloc(&platform_bus_type);
 		if (!tegra->domain) {
 			err = -ENOMEM;
--- >8 ---

> [1]
> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
> 
> 2) Because of the above problems, the DRM clients are erroneously not
> getting attached to the IOMMU at all and thus CMA is getting used for the
> BO allocations. Here come the problems introduced by the "gpu: host1x:
> Support DMA mapping of buffers" patch, which makes the DMA API perform
> CPU cache maintenance on each job submission, and apparently this is
> super bad for performance. This also makes no sense in comparison to the
> case of an enabled IOMMU, where cache maintenance isn't performed at all
> (as it should be).

It actually does make a lot of sense. Very strictly speaking we were
violating the DMA API prior to the above patch because we were not DMA
mapping the buffers at all. Whenever you pass a buffer to hardware you
need to map it for the device. At that point, the kernel does not know
whether or not the buffer is dirty, so it has to perform a cache flush.
Similarily, when the hardware is done with a buffer, we need to unmap it
so that the CPU can access it again. This typically requires a cache
invalidate.
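
In code terms, the DMA API expects every hardware access to be bracketed
roughly like this (a generic sketch; submit_job() is a hypothetical
stand-in, not a driver function):

static int submit_job(struct device *dev, struct sg_table *sgt)
{
        int nents;

        /* map for the device: this is where the cache flush happens */
        nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
        if (nents == 0)
                return -ENOMEM;

        /* ... hardware reads and/or writes the buffer ... */

        /* unmap for the CPU: this is where the invalidate happens */
        dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
        return 0;
}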

That things even worked to begin with was more by accident than by design.

So yes, this is different from what we were doing before, but it's
actually the right thing to do. That said, I'm sure we can find ways to
optimize this. For example, as part of the DMA API conversion series I
added the possibility to set direction flags for relocation buffers. In
cases where it is known that a certain buffer will only be used for
reading, we should be able to avoid at least the cache invalidate
operation after a job is done, since the hardware won't have modified
the contents (when using an SMMU this can even be enforced). It's
slightly trickier to avoid cache flushes. For buffers that are only
going to be written, there's no need to flush the cache because the CPUs
changes can be assumed to be overwritten by the hardware anyway. However
we still need to make sure that we invalidate the caches in that case to
ensure subsequent cache flushes don't overwrite data already written by
hardware.
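
A sketch of how the direction could be derived from those flags
(HOST1X_RELOC_READ/WRITE are the relocation flags from the conversion
series; the helper name is made up for illustration):

static enum dma_data_direction reloc_dma_direction(unsigned long flags)
{
        switch (flags & (HOST1X_RELOC_READ | HOST1X_RELOC_WRITE)) {
        case HOST1X_RELOC_READ:
                /* hardware only reads: flush before, no invalidate after */
                return DMA_TO_DEVICE;
        case HOST1X_RELOC_WRITE:
                /* hardware only writes: no flush needed before the job */
                return DMA_FROM_DEVICE;
        default:
                return DMA_BIDIRECTIONAL;
        }
}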

One other potential optimization I can imagine is to add flags to make
cache maintenance optional on buffers when we know it's safe to do so.
I'm not sure we can always know, so this is going to require further
thought.

> Please let me know if you're going to fix the problems or if you'd
> prefer me to create the patches.
> 
> Here is a draft of the fix for #2; it doesn't cover the case of imported
> buffers (which should be statically mapped, IIUC):
> 
> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
>          * If we've manually mapped the buffer object through the IOMMU, make
>          * sure to return the IOVA address of our mapping.
>          */
> -       if (phys && obj->mm) {
> +       if (phys && (obj->mm || obj->vaddr)) {
>                 *phys = obj->iova;

This doesn't work for the case where we use the DMA API for mapping. Or
at least it isn't going to work in the general case. The reason is
that obj->iova is only valid for whichever device mapped
or allocated the buffer, which in this case is the host1x device, which
isn't even a real device, so it won't work. The only case where it does
work is if we're not behind an IOMMU, so obj->iova will actually be the
physical address.

So what this basically ends up doing is avoiding dma_map_*() all the time,
which I guess is what you're trying to achieve. But it also gives you
the wrong I/O virtual address in any case where an IOMMU is involved.
Also, as discussed above, avoiding cache maintenance isn't correct.

Thierry

>                 return NULL;
>         }
> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> index 25ca54de8fc5..69adfd66196b 100644
> --- a/drivers/gpu/host1x/job.c
> +++ b/drivers/gpu/host1x/job.c
> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>         for (i = 0; i < job->num_relocs; i++) {
>                 struct host1x_reloc *reloc = &job->relocs[i];
> -               dma_addr_t phys_addr, *phys;
> +               dma_addr_t phys_addr;
>                 struct sg_table *sgt;
> 
>                 reloc->target.bo = host1x_bo_get(reloc->target.bo);
> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>                         goto unpin;
>                 }
> 
> -               if (client->group)
> -                       phys = &phys_addr;
> -               else
> -                       phys = NULL;
> -
> -               sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> +               sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>                 if (IS_ERR(sgt)) {
>                         err = PTR_ERR(sgt);
>                         goto unpin;
> @@ -184,7 +179,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>                         goto unpin;
>                 }
> 
> -               sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> +               sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>                 if (IS_ERR(sgt)) {
>                         err = PTR_ERR(sgt);
>                         goto unpin;
> @@ -214,7 +209,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>                         job->unpins[job->num_unpins].size = gather_size;
>                         phys_addr = iova_dma_addr(&host->iova, alloc);
> -               } else {
> +               } else if (sgt) {
>                         err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>                                          DMA_TO_DEVICE);
>                         if (!err) {

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
@ 2020-01-29 12:39             ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2020-01-29 12:39 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development


[-- Attachment #1.1: Type: text/plain, Size: 13489 bytes --]

On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
> 13.12.2019 18:35, Dmitry Osipenko пишет:
> > 13.12.2019 18:10, Thierry Reding пишет:
> >> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
> >>> Hello Thierry,
> >>>
> >>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
> >>> Tegra30 using.
> >>>
> >>> [1]
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
> >>>
> >>> Interestingly the performance is okay on Tegra30 if
> >>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make difference for
> >>> Tegra20.
> >>>
> >>> I was telling you about this problem on the #tegra IRC sometime ago and
> >>> you asked to report it in a trackable form, so finally here it is.
> >>>
> >>> You could reproduce the problem by running [2] like this
> >>> `grate/texture-filter -f -s` which should produce over 100 FPS for 720p
> >>> display resolution and currently it's ~11 FPS.
> >>>
> >>> [2]
> >>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
> >>>
> >>> Previously I was seeing some memory errors coming from Host1x DMA, but
> >>> don't see any errors at all right now.
> >>>
> >>> I don't see anything done horribly wrong in the offending commit.
> >>>
> >>> Unfortunately I couldn't dedicate enough time to sit down and debug the
> >>> problem thoroughly yet. Please let me know if you'll find a solution,
> >>> I'll be happy to test it. Thanks in advance!
> >>
> >> I suspect that the problem here is that we're now using the DMA API,
> >> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
> >> that that code doesn't coalesce entries in the SG table, so we may end
> >> up calling iommu_map() a lot of times, and miss out on much of the
> >> advantages that the ->iotlb_sync_map() gives us on Tegra20.
> >>
> >> At the same time dma_map_sg() will flush caches, which we didn't do
> >> before. This we should be able to improve by passing the attribute
> >> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
> >> maintenance isn't needed.
> >>
> >> And while thinking about it, one other difference is that with the DMA
> >> API we actually map/unmap the buffers for every submission. This is
> >> because the DMA API semantics require that buffers be mapped/unmapped
> >> every time you use them. Previously we would basically only map each
> >> buffer once (at allocation time) and only have to deal with cache
> >> maintenance, so the overhead per submission was drastically lower.
> >>
> >> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
> >> may want to restore explicit IOMMU usage, at least on anything prior to
> >> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
> >> (because they are such a scarce resource).
> > 
> > Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
> > think that it's the root of the problem. Disabling IOMMU for Tegra30
> > also didn't help (IIRC).
> > 
> > The offending patch shouldn't change anything in regards to the DMA API,
> > if I'm not missing something. Strange..
> > 
> > Please keep me up-to-date!
> > 
> 
> Hello Thierry,
> 
> I took another look at the problem and here what was found:
> 
> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
> 
>     1. host1x_drm_probe() is invoked *before* any of the
>        host1x_client_iommu_attach() happens, so there is no way
>        on earth the 'use_explicit_iommu' could ever be true.

That's not correct. host1x_client_iommu_attach() happens during
host1x_device_init(), which is called during host1x_drm_probe(). The
idea is that host1x_drm_probe() sets up the minimum IOMMU so that all
devices can attach, if they want to. If any of them connect (because
they aren't already attached via something like the DMA/IOMMU glue)
then the tegra->use_explicit_iommu is set to true, in which case the
IOMMU domain setup for explicit IOMMU API usage is completed. If, on
the other hand, none of the clients have a need for the explicit IOMMU
domain, there's no need to set it up and host1x_drm_probe() will just
discard it.

>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
>        attached is wrong because it never attached in the case
>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1] and this also
>        makes no sense for T20/30 that do not support LPAE.

It's not at all wrong. Take for example the case of Tegra124 and
Tegra210 where host1x and its clients can address 34 bits. In those
cases, allocating individual pages via shmem has a high probability of
hitting physical addresses beyond the 32-bit range, which means that the
host1x can not access them unless it is also attached to an IOMMU where
physical addresses to >= 4 GiB addresses can be translated into < 4 GiB
virtual addresses. This is a very real problem that I was running into
when testing on Tegra124 and Tegra210.

But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We
should be able to remedy the situation on Tegra20 and Tegra30 by adding
another check, based on the DMA mask. Something like the below should
work:

--- >8 ---
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index aa9e49f04988..bd268028fb3d 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1037,23 +1037,9 @@ void tegra_drm_free(struct tegra_drm *tegra, size_t size, void *virt,
 	free_pages((unsigned long)virt, get_order(size));
 }
 
-static int host1x_drm_probe(struct host1x_device *dev)
+static bool host1x_drm_wants_iommu(struct host1x_device *dev)
 {
-	struct drm_driver *driver = &tegra_drm_driver;
 	struct iommu_domain *domain;
-	struct tegra_drm *tegra;
-	struct drm_device *drm;
-	int err;
-
-	drm = drm_dev_alloc(driver, &dev->dev);
-	if (IS_ERR(drm))
-		return PTR_ERR(drm);
-
-	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
-	if (!tegra) {
-		err = -ENOMEM;
-		goto put;
-	}
 
 	/*
 	 * If the Tegra DRM clients are backed by an IOMMU, push buffers are
@@ -1082,9 +1068,38 @@ static int host1x_drm_probe(struct host1x_device *dev)
 	 * up the device tree appropriately. This is considered an problem
 	 * of integration, so care must be taken for the DT to be consistent.
 	 */
-	domain = iommu_get_domain_for_dev(drm->dev->parent);
+	domain = iommu_get_domain_for_dev(dev->dev.parent);
+
+	/*
+	 * Tegra20 and Tegra30 don't support addressing memory beyond the
+	 * 32-bit boundary, so the regular GATHER opcodes will always be
+	 * sufficient and whether or not the host1x is attached to an IOMMU
+	 * doesn't matter.
+	 */
+	if (!domain && dma_get_mask(dev->dev.parent) <= DMA_BIT_MASK(32))
+		return true;
+
+	return domain != NULL;
+}
+
+static int host1x_drm_probe(struct host1x_device *dev)
+{
+	struct drm_driver *driver = &tegra_drm_driver;
+	struct tegra_drm *tegra;
+	struct drm_device *drm;
+	int err;
+
+	drm = drm_dev_alloc(driver, &dev->dev);
+	if (IS_ERR(drm))
+		return PTR_ERR(drm);
+
+	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
+	if (!tegra) {
+		err = -ENOMEM;
+		goto put;
+	}
 
-	if (domain && iommu_present(&platform_bus_type)) {
+	if (host1x_drm_wants_iommu(dev) && iommu_present(&platform_bus_type)) {
 		tegra->domain = iommu_domain_alloc(&platform_bus_type);
 		if (!tegra->domain) {
 			err = -ENOMEM;
--- >8 ---

> [1]
> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
> 
> 2) Because of the above problems, the DRM clients are erroneously not
> getting attached to IOMMU at all and thus CMA is getting used for the BO
> allocations. Here comes the problems introduced by the "gpu: host1x:
> Support DMA mapping of buffers" patch, which makes DMA API to perform
> CPU cache maintenance on each job submission and apparently this is
> super bad for performance. This also makes no sense in comparison to the
> case of enabled IOMMU, where cache maintenance isn't performed at all
> (like it should be).

It actually does make a lot of sense. Very strictly speaking we were
violating the DMA API prior to the above patch because we were not DMA
mapping the buffers at all. Whenever you pass a buffer to hardware you
need to map it for the device. At that point, the kernel does not know
whether or not the buffer is dirty, so it has to perform a cache flush.
Similarly, when the hardware is done with a buffer, we need to unmap it
so that the CPU can access it again. This typically requires a cache
invalidate.
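
In other words, buffers follow the usual streaming-DMA pattern. A
minimal sketch of that contract (not the actual driver code, just the
rule being described):

#include <linux/dma-mapping.h>

static int submit_and_wait(struct device *dev, struct sg_table *sgt)
{
	int nents;

	/* map for the device: writes back dirty CPU cache lines */
	nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	if (nents == 0)
		return -ENOMEM;

	/* ... hardware reads and writes the buffer here ... */

	/* unmap: invalidates, so the CPU sees what the hardware wrote */
	dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);

	return 0;
}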

That things even worked to begin with was more by accident than by design.

So yes, this is different from what we were doing before, but it's
actually the right thing to do. That said, I'm sure we can find ways to
optimize this. For example, as part of the DMA API conversion series I
added the possibility to set direction flags for relocation buffers. In
cases where it is known that a certain buffer will only be used for
reading, we should be able to avoid at least the cache invalidate
operation after a job is done, since the hardware won't have modified
the contents (when using an SMMU this can even be enforced). It's
slightly trickier to avoid cache flushes. For buffers that are only
going to be written, there's no need to flush the cache because the CPU's
changes can be assumed to be overwritten by the hardware anyway. However
we still need to make sure that we invalidate the caches in that case to
ensure subsequent cache flushes don't overwrite data already written by
hardware.
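
Concretely, that optimization would boil down to picking the DMA
direction from per-buffer usage flags, along these lines (a sketch; the
HOST1X_RELOC_READ/WRITE flag names are the ones added by the conversion
series, the rest is illustrative only):

/* pick the cheapest correct DMA direction for a relocation buffer */
static enum dma_data_direction reloc_direction(unsigned long flags)
{
	if ((flags & HOST1X_RELOC_READ) && (flags & HOST1X_RELOC_WRITE))
		return DMA_BIDIRECTIONAL;

	if (flags & HOST1X_RELOC_WRITE)
		return DMA_FROM_DEVICE;	/* device writes: invalidate, no flush */

	return DMA_TO_DEVICE;		/* device reads: flush, no invalidate */
}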

One other potential optimization I can imagine is to add flags to make
cache maintenance optional on buffers when we know it's safe to do so.
I'm not sure we can always know, so this is going to require further
thought.

> Please let me know if you're going to fix the problems or if you'd
> prefer me to create the patches.
> 
> Here is a draft of the fix for #2; it doesn't cover the case of imported
> buffers (which should be statically mapped, IIUC):
> 
> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
>          * If we've manually mapped the buffer object through the IOMMU, make
>          * sure to return the IOVA address of our mapping.
>          */
> -       if (phys && obj->mm) {
> +       if (phys && (obj->mm || obj->vaddr)) {
>                 *phys = obj->iova;

This doesn't work for the case where we use the DMA API for mapping. Or
at least it isn't going to work in the general case. The reason is
that obj->iova is only valid for whatever device mapped
or allocated the buffer, which in this case is the host1x device, which
isn't even a real device, so it won't work. The only case where it does
work is if we're not behind an IOMMU, so obj->iova will actually be the
physical address.

So what this basically ends up doing is avoiding dma_map_*() all the time,
which I guess is what you're trying to achieve. But it also gives you
the wrong I/O virtual address in any case where an IOMMU is involved.
Also, as discussed above, avoiding cache maintenance isn't correct.

Thierry

>                 return NULL;
>         }
> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> index 25ca54de8fc5..69adfd66196b 100644
> --- a/drivers/gpu/host1x/job.c
> +++ b/drivers/gpu/host1x/job.c
> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>         for (i = 0; i < job->num_relocs; i++) {
>                 struct host1x_reloc *reloc = &job->relocs[i];
> -               dma_addr_t phys_addr, *phys;
> +               dma_addr_t phys_addr;
>                 struct sg_table *sgt;
> 
>                 reloc->target.bo = host1x_bo_get(reloc->target.bo);
> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>                         goto unpin;
>                 }
> 
> -               if (client->group)
> -                       phys = &phys_addr;
> -               else
> -                       phys = NULL;
> -
> -               sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> +               sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>                 if (IS_ERR(sgt)) {
>                         err = PTR_ERR(sgt);
>                         goto unpin;
> @@ -184,7 +179,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>                         goto unpin;
>                 }
> 
> -               sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> +               sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>                 if (IS_ERR(sgt)) {
>                         err = PTR_ERR(sgt);
>                         goto unpin;
> @@ -214,7 +209,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>                         job->unpins[job->num_unpins].size = gather_size;
>                         phys_addr = iova_dma_addr(&host->iova, alloc);
> -               } else {
> +               } else if (sgt) {
>                         err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>                                          DMA_TO_DEVICE);
>                         if (!err) {


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
  2020-01-29 12:39             ` Thierry Reding
@ 2020-01-30  4:36               ` Dmitry Osipenko
  -1 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2020-01-30  4:36 UTC (permalink / raw)
  To: Thierry Reding; +Cc: linux-tegra, DRI Development

29.01.2020 15:39, Thierry Reding wrote:
> On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
>> 13.12.2019 18:35, Dmitry Osipenko wrote:
...
>> Hello Thierry,
>>
>> I took another look at the problem and here is what was found:
>>
>> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
>>
>>     1. host1x_drm_probe() is invoked *before* any of the
>>        host1x_client_iommu_attach() happens, so there is no way
>>        on earth the 'use_explicit_iommu' could ever be true.
> 
> That's not correct. host1x_client_iommu_attach() happens during
> host1x_device_init(), which is called during host1x_drm_probe().

Looks like I previously got confused by accident, my bad.

> The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all
> devices can attach, if they want to. If any of them connect (because
> they aren't already attached via something like the DMA/IOMMU glue)
> then the tegra->use_explicit_iommu is set to true, in which case the
> IOMMU domain setup for explicit IOMMU API usage is completed. If, on
> the other hand, none of the clients have a need for the explicit IOMMU
> domain, there's no need to set it up and host1x_drm_probe() will just
> discard it.

This matches my understanding of what you wanted to achieve, thanks.

>>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
>>        attached is wrong because it is never attached in the case
>>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1] and this also
>>        makes no sense for T20/30 that do not support LPAE.
> 
> It's not at all wrong. Take for example the case of Tegra124 and
> Tegra210 where host1x and its clients can address 34 bits. In those
> cases, allocating individual pages via shmem has a high probability of
> hitting physical addresses beyond the 32-bit range, which means that the
> host1x cannot access them unless it is also attached to an IOMMU where
> physical addresses >= 4 GiB can be translated into < 4 GiB virtual
> addresses. This is a very real problem that I was running into when
> testing on Tegra124 and Tegra210.

Why not set the DMA mask to 32 bits if the IOMMU is unavailable?

I'm a bit puzzled by the actual need to support the case where Host1x is
backed by an IOMMU and the clients are not. How could we ever end up with
this scenario in the upstream kernel?

What about the reverse scenario? You won't be able to patch the
cmdstream properly for >32-bit addresses.

The root of the problem is that the Tegra DRM UAPI doesn't support 64-bit
addresses, so you can't use "wide" opcodes and can't patch the cmdstream.

Perhaps it is better not to add any new things or quirks to the Host1x /
Tegra DRM for now. The drivers need a serious cleanup, otherwise the mess
will only continue to grow. Don't you think so?

> But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We
> should be able to remedy the situation on Tegra20 and Tegra30 by adding
> another check, based on the DMA mask. Something like the below should
> work:
> 
> --- >8 ---
[snip]
> --- >8 ---

This works, thanks.

>> [1]
>> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
>>
>> 2) Because of the above problems, the DRM clients are erroneously not
>> getting attached to IOMMU at all and thus CMA is getting used for the BO
>> allocations. Here come the problems introduced by the "gpu: host1x:
>> Support DMA mapping of buffers" patch, which makes the DMA API perform
>> CPU cache maintenance on each job submission and apparently this is
>> super bad for performance. This also makes no sense in comparison to the
>> case of enabled IOMMU, where cache maintenance isn't performed at all
>> (like it should be).
> 
> It actually does make a lot of sense. Very strictly speaking we were
> violating the DMA API prior to the above patch because we were not DMA
> mapping the buffers at all. Whenever you pass a buffer to hardware you
> need to map it for the device. At that point, the kernel does not know
> whether or not the buffer is dirty, so it has to perform a cache flush.
> Similarly, when the hardware is done with a buffer, we need to unmap it
> so that the CPU can access it again. This typically requires a cache
> invalidate.
> 
> That things even worked to begin with was more by accident than by design.
> 
> So yes, this is different from what we were doing before, but it's
> actually the right thing to do. That said, I'm sure we can find ways to
> optimize this. For example, as part of the DMA API conversion series I
> added the possibility to set direction flags for relocation buffers. In
> cases where it is known that a certain buffer will only be used for
> reading, we should be able to avoid at least the cache invalidate
> operation after a job is done, since the hardware won't have modified
> the contents (when using an SMMU this can even be enforced). It's
> slightly trickier to avoid cache flushes. For buffers that are only
> going to be written, there's no need to flush the cache because the CPU's
> changes can be assumed to be overwritten by the hardware anyway. However
> we still need to make sure that we invalidate the caches in that case to
> ensure subsequent cache flushes don't overwrite data already written by
> hardware.
> 
> One other potential optimization I can imagine is to add flags to make
> cache maintenance optional on buffers when we know it's safe to do so.
> I'm not sure we can always know, so this is going to require further
> thought.

That doesn't sound good to me; it's not going to be good for GPU drivers.
All cache maintenance should be under the control of userspace: userspace
should tell the kernel driver when it needs CPU access and when it has
finished that access. DMA-BUF has a generic UAPI for these
synchronizations, although a mature GPU driver may need more than that.
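
For reference, the generic UAPI I mean is DMA_BUF_IOCTL_SYNC from
<linux/dma-buf.h>; userspace is supposed to bracket its CPU accesses
roughly like this (minimal sketch):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

static int dmabuf_cpu_access(int fd, bool begin)
{
	struct dma_buf_sync sync = {
		.flags = (begin ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END) |
			 DMA_BUF_SYNC_RW,
	};

	/* the kernel performs whatever cache maintenance is required */
	return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}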

Today the Tegra DRM driver supports only write-combined BO allocations,
and thus we don't need to do more than flush the CPU write buffers before
executing a HW job.

>> Please let me know if you're going to fix the problems or if you'd
>> prefer me to create the patches.
>>
>> Here is a draft of the fix for #2; it doesn't cover the case of imported
>> buffers (which should be statically mapped, IIUC):
>>
>> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
>>          * If we've manually mapped the buffer object through the IOMMU, make
>>          * sure to return the IOVA address of our mapping.
>>          */
>> -       if (phys && obj->mm) {
>> +       if (phys && (obj->mm || obj->vaddr)) {
>>                 *phys = obj->iova;
> 
> This doesn't work for the case where we use the DMA API for mapping. Or
> at least it isn't going to work in the general case.

Right, looks like I'll need to refresh my memory on the DMA API usage.

> The reason is that obj->iova is only valid for whatever device mapped
> or allocated the buffer, which in this case is the host1x device, which
> isn't even a real device, so it won't work. The only case where it does
> work is if we're not behind an IOMMU, so obj->iova will actually be the
> physical address.

But why do you need to dynamically map/unmap the statically allocated
buffers on each job submission? Could you please explain the point?
Perhaps it's a temporary workaround just to get a minimum of things
working for the implicit IOMMU case?

All buffers should be statically allocated and statically mapped, and
when there is a need to sync an already mapped buffer, the dma_sync_*
API should be used.
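
I.e. map once at allocation/import time and then only synchronize around
submissions, along these lines (sketch only):

#include <linux/dma-mapping.h>

/* once, when the buffer is created or first handed to the device */
static int bo_map_static(struct device *dev, struct sg_table *sgt)
{
	if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL))
		return -ENOMEM;

	return 0;
}

/* per job: cheap cache maintenance on the already-mapped buffer */
static void bo_sync_for_job(struct device *dev, struct sg_table *sgt)
{
	dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
}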

Like I said above, the syncing should be done by userspace for the
buffers that are under the control of userspace.

> So what this basically ends up doing is avoiding dma_map_*() all the time,
> which I guess is what you're trying to achieve. But it also gives you
> the wrong I/O virtual address in any case where an IOMMU is involved.
> Also, as discussed above, avoiding cache maintenance isn't correct.

Alright, then right now we need to bypass dma_map_*() in the case of a
non-implicit IOMMU, in order to bring back the good old behavior (at
least temporarily, until there is a more comprehensive solution).

What do you think about this variant:

--- >8 ---
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 1237df157e05..555a6424e9fa 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 				     dma_addr_t *phys)
 {
 	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
+	struct tegra_drm *tegra = obj->gem.dev->dev_private;
 	struct sg_table *sgt;
 	int err;

-	/*
-	 * If we've manually mapped the buffer object through the IOMMU, make
-	 * sure to return the IOVA address of our mapping.
-	 */
-	if (phys && obj->mm) {
-		*phys = obj->iova;
-		return NULL;
+	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
+		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
+		if (obj->vaddr) {
+			*phys = obj->iova;
+			return NULL;
+		}
+
+		/*
+		 * If we've manually mapped the buffer object through the
+		 * IOMMU, make sure to return the IOVA address of our mapping.
+		 */
+		if (obj->mm) {
+			*phys = obj->iova;
+			return NULL;
+		}
 	}

 	/*
diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
index 6d1872ddef17..91304b9034fa 100644
--- a/drivers/gpu/drm/tegra/plane.c
+++ b/drivers/gpu/drm/tegra/plane.c
@@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)

 	for (i = 0; i < state->base.fb->format->num_planes; i++) {
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
+		struct sg_table *sgt;

-		if (!dc->client.group) {
-			struct sg_table *sgt;
-
-			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
-			if (IS_ERR(sgt)) {
-				err = PTR_ERR(sgt);
-				goto unpin;
-			}
+		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
+		if (IS_ERR(sgt)) {
+			err = PTR_ERR(sgt);
+			goto unpin;
+		}

+		if (sgt) {
 			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (err == 0) {
@@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)

 			state->iova[i] = sg_dma_address(sgt->sgl);
 			state->sgt[i] = sgt;
-		} else {
-			state->iova[i] = bo->iova;
 		}
 	}

@@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
 		struct sg_table *sgt = state->sgt[i];

-		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
-		host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		if (sgt) {
+			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
+				     DMA_TO_DEVICE);
+			host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		}

 		state->iova[i] = DMA_MAPPING_ERROR;
 		state->sgt[i] = NULL;
diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 60b2fedd0061..538c0f0b40a4 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)

 	for (i = 0; i < job->num_relocs; i++) {
 		struct host1x_reloc *reloc = &job->relocs[i];
-		dma_addr_t phys_addr, *phys;
+		dma_addr_t phys_addr;
 		struct sg_table *sgt;

 		reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}

-		if (client->group)
-			phys = &phys_addr;
-		else
-			phys = NULL;
-
-		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 		job->num_unpins++;
 	}

+	/*
+	 * When the firewall is enabled, all user gathers are copied into
+	 * the secured job->gather_copy_mapped.
+	 */
+	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
+		return 0;
+
 	for (i = 0; i < job->num_gathers; i++) {
 		struct host1x_job_gather *g = &job->gathers[i];
 		size_t gather_size = 0;
@@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}

-		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
 		}

-		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
+		if (host->domain) {
 			for_each_sg(sgt->sgl, sg, sgt->nents, j)
 				gather_size += sg->length;
 			gather_size = iova_align(&host->iova, gather_size);
@@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)

 			job->unpins[job->num_unpins].size = gather_size;
 			phys_addr = iova_dma_addr(&host->iova, alloc);
-		} else {
+		} else if (sgt) {
 			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (!err) {

--- >8 ---

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
  2020-01-30  4:36               ` Dmitry Osipenko
@ 2020-01-30 12:08                   ` Thierry Reding
  -1 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2020-01-30 12:08 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development

On Thu, Jan 30, 2020 at 07:36:36AM +0300, Dmitry Osipenko wrote:
> 29.01.2020 15:39, Thierry Reding wrote:
> > On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
> >> 13.12.2019 18:35, Dmitry Osipenko wrote:
> ...
> >> Hello Thierry,
> >>
> >> I took another look at the problem and here is what was found:
> >>
> >> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
> >>
> >>     1. host1x_drm_probe() is invoked *before* any of the
> >>        host1x_client_iommu_attach() happens, so there is no way
> >>        on earth the 'use_explicit_iommu' could ever be true.
> > 
> > That's not correct. host1x_client_iommu_attach() happens during
> > host1x_device_init(), which is called during host1x_drm_probe().
> 
> Looks like I previously got confused by accident, my bad.
> 
> > The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all
> > devices can attach, if they want to. If any of them connect (because
> > they aren't already attached via something like the DMA/IOMMU glue)
> > then the tegra->use_explicit_iommu is set to true, in which case the
> > IOMMU domain setup for explicit IOMMU API usage is completed. If, on
> > the other hand, none of the clients have a need for the explicit IOMMU
> > domain, there's no need to set it up and host1x_drm_probe() will just
> > discard it.
> 
> This matches my understanding of what you wanted to achieve, thanks.
> 
> >>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
> >>        attached is wrong because it is never attached in the case
> >>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1] and this also
> >>        makes no sense for T20/30 that do not support LPAE.
> > 
> > It's not at all wrong. Take for example the case of Tegra124 and
> > Tegra210 where host1x and its clients can address 34 bits. In those
> > cases, allocating individual pages via shmem has a high probability of
> > hitting physical addresses beyond the 32-bit range, which means that the
> > host1x cannot access them unless it is also attached to an IOMMU where
> > physical addresses >= 4 GiB can be translated into < 4 GiB virtual
> > addresses. This is a very real problem that I was running into when
> > testing on Tegra124 and Tegra210.
> 
> Why not set the DMA mask to 32 bits if the IOMMU is unavailable?

We already do that. If you look at host1x_iommu_init() in
drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
available and the host1x doesn't support wide GATHER opcodes, then
we limit the DMA mask to 32 bits.
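
Roughly, what happens there is the following (paraphrased from memory,
not the literal code; "wide_gather" stands in for the actual capability
flag):

#include <linux/dma-mapping.h>

static int host1x_limit_dma_mask(struct host1x *host,
				 struct iommu_domain *domain,
				 bool wide_gather)
{
	u64 mask = DMA_BIT_MASK(34);

	/*
	 * Without an IOMMU and without wide GATHER opcodes, host1x must
	 * never be handed buffers above the 32-bit boundary.
	 */
	if (!domain && !wide_gather)
		mask = DMA_BIT_MASK(32);

	return dma_set_mask_and_coherent(host->dev, mask);
}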

But that's not enough, see below.

> I'm a bit puzzled by the actual need to support the case where Host1x is
> backed by an IOMMU and the clients are not. How could we ever end up with
> this scenario in the upstream kernel?

That's not what we're doing here. The fundamental problem is that we
have a couple of generations where the hardware is mismatched in that
clients support 34-bit addresses while host1x can only use 32-bit
addresses in the GATHER opcode. The only way to get around this mismatch
is by using an IOMMU.

However, with an IOMMU enabled for clients, we can run into the case
where sparse pages would be allocated via shmem and end up beyond the
32-bit boundary. If the host1x is not attached to an IOMMU, there's no
way for it to access these pages with standard GATHER opcodes.

This is what used to happen prior to this change when the host1x
firewall was enabled. Since we were not attaching it to an IOMMU in that
case, we would end up with sparse buffers allocated from pages that the
host1x couldn't address.

> What about the reverse scenario? You won't be able to patch the
> cmdstream properly for >32-bit addresses.

I don't think that scenario exists. I'm not aware of a Tegra device that
has system memory outside of the CPU-addressable region.

> The root of the problem is that the Tegra DRM UAPI doesn't support 64-bit
> addresses, so you can't use "wide" opcodes and can't patch the cmdstream.

There's nothing in the UAPI that deals with addresses directly. We only
pass around handles and these are resolved to buffer objects in the
kernel where the address of the buffers can be 32-bit or 64-bit.
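
In kernel terms the resolution is simply (a sketch using the driver's
helpers for illustration):

#include <drm/drm_gem.h>

static struct tegra_bo *resolve_handle(struct drm_file *file, u32 handle)
{
	/* takes a reference on the GEM object backing the handle */
	struct drm_gem_object *gem = drm_gem_object_lookup(file, handle);

	/*
	 * bo->iova may be a 32-bit or a 64-bit address; the UAPI only
	 * ever carries the handle.
	 */
	return gem ? to_tegra_bo(gem) : NULL;
}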

And we do in fact support wide opcodes and patch command streams just
fine on 64-bit systems.

I mean, it's not like I've been doing this just for the fun of it. There
are actual configurations where this is needed in order for it to work.

> Perhaps it is better not to add any new things or quirks to the Host1x /
> Tegra DRM for now. The drivers need a serious cleanup, otherwise the mess
> will only continue to grow. Don't you think so?

This isn't anything new or a quirk. These are bug fixes to ensure that the
driver works in (now hopefully) all configurations. Previously it was a
matter of getting the configuration just right in order for it to work.
All the work I did here (starting with the wide opcode support and then
the DMA API and IOMMU work) was to make sure it would safely work in any
setup.

And I do consider these changes to also be cleanups and incremental
improvements over the previous state. Again, I don't consider a
rewrite a serious cleanup.

I'm fully aware that the driver has been collecting dust for a while and
it isn't perfect. But it's also not overly messy. It's perhaps a bit
more complex than your average driver, but it's also some pretty complex
hardware.

> > But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We
> > should be able to remedy the situation on Tegra20 and Tegra30 by adding
> > another check, based on the DMA mask. Something like the below should
> > work:
> > 
> > --- >8 ---
> [snip]
> > --- >8 ---
> 
> This works, thanks.

Great, I'll send this out then.

> >> [1]
> >> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
> >>
> >> 2) Because of the above problems, the DRM clients are erroneously not
> >> getting attached to IOMMU at all and thus CMA is getting used for the BO
> >> allocations. Here come the problems introduced by the "gpu: host1x:
> >> Support DMA mapping of buffers" patch, which makes the DMA API perform
> >> CPU cache maintenance on each job submission and apparently this is
> >> super bad for performance. This also makes no sense in comparison to the
> >> case of enabled IOMMU, where cache maintenance isn't performed at all
> >> (like it should be).
> > 
> > It actually does make a lot of sense. Very strictly speaking we were
> > violating the DMA API prior to the above patch because we were not DMA
> > mapping the buffers at all. Whenever you pass a buffer to hardware you
> > need to map it for the device. At that point, the kernel does not know
> > whether or not the buffer is dirty, so it has to perform a cache flush.
> > Similarly, when the hardware is done with a buffer, we need to unmap it
> > so that the CPU can access it again. This typically requires a cache
> > invalidate.
> > 
> > That things even worked to begin with was more by accident than by design.
> > 
> > So yes, this is different from what we were doing before, but it's
> > actually the right thing to do. That said, I'm sure we can find ways to
> > optimize this. For example, as part of the DMA API conversion series I
> > added the possibility to set direction flags for relocation buffers. In
> > cases where it is known that a certain buffer will only be used for
> > reading, we should be able to avoid at least the cache invalidate
> > operation after a job is done, since the hardware won't have modified
> > the contents (when using an SMMU this can even be enforced). It's
> > slightly trickier to avoid cache flushes. For buffers that are only
> > going to be written, there's no need to flush the cache because the CPU's
> > changes can be assumed to be overwritten by the hardware anyway. However
> > we still need to make sure that we invalidate the caches in that case to
> > ensure subsequent cache flushes don't overwrite data already written by
> > hardware.
> > 
> > One other potential optimization I can imagine is to add flags to make
> > cache maintenance optional on buffers when we know it's safe to do so.
> > I'm not sure we can always know, so this is going to require further
> > thought.
> 
> That doesn't sound good to me; it's not going to be good for GPU drivers.
> All cache maintenance should be under the control of userspace: userspace
> should tell the kernel driver when it needs CPU access and when it has
> finished that access. DMA-BUF has a generic UAPI for these
> synchronizations, although a mature GPU driver may need more than that.

I agree. But that's not something that we can do at this point. We don't
have a way of passing information such as this to the driver, so the
driver has to assume that caches are dirty for all buffers, otherwise it
will not be able to guarantee that random cache flushes won't corrupt
the data that it's passing to the hardware.

So yes, when we do have a way of explicitly flushing the caches for
buffers, then we can add a mechanism to pass that information to the
kernel so that it can optimize. But until then we just can't be sure.
And I prefer a kernel driver that gives me slow and reliable, rather
than fast but unpredictable results.

> Today the Tegra DRM driver supports only write-combined BO allocations,
> and thus we don't need to do more than flush the CPU write buffers before
> executing a HW job.

That's only true when you allocate with the DMA API. When you allocate
from a shmem mapping you don't get write-combined memory, so we do have
to perform cache maintenance on the pages.
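
The two allocation paths differ in exactly that respect (sketch):

#include <linux/dma-mapping.h>
#include <linux/shmem_fs.h>

/* no IOMMU: contiguous, write-combined, no per-job cache maintenance */
static void *alloc_contiguous(struct device *dev, size_t size,
			      dma_addr_t *iova)
{
	return dma_alloc_wc(dev, size, iova, GFP_KERNEL | __GFP_NOWARN);
}

/*
 * IOMMU: individual, ordinarily cached shmem pages that need to be
 * flushed/invalidated around each DMA transfer.
 */
static struct page *alloc_backing_page(struct address_space *mapping,
				       pgoff_t index)
{
	return shmem_read_mapping_page(mapping, index);
}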

> >> Please let me know if you're going to fix the problems or if you'd
> >> prefer me to create the patches.
> >>
> >> Here is a draft of the fix for #2; it doesn't cover the case of imported
> >> buffers (which should be statically mapped, IIUC):
> >>
> >> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
> >>          * If we've manually mapped the buffer object through the IOMMU, make
> >>          * sure to return the IOVA address of our mapping.
> >>          */
> >> -       if (phys && obj->mm) {
> >> +       if (phys && (obj->mm || obj->vaddr)) {
> >>                 *phys = obj->iova;
> > 
> > This doesn't work for the case where we use the DMA API for mapping. Or
> > at least it isn't going to work in the general case.
> 
> Right, looks like I'll need to refresh my memory on the DMA API usage.
> 
> > The reason is that obj->iova is only valid for whatever device mapped
> > or allocated the buffer, which in this case is the host1x device, which
> > isn't even a real device, so it won't work. The only case where it does
> > work is if we're not behind an IOMMU, so obj->iova will actually be the
> > physical address.
> 
> But why do you need to dynamically map/unmap the statically allocated
> buffers on each job submission? Could you please explain the point?
> Perhaps it's a temporary workaround just to get a minimum of things
> working for the implicit IOMMU case?

It's primarily because we don't really know if a buffer has been mapped
for a specific device. We always map at allocation time for the Tegra
DRM parent device (which isn't a real device) but before it's used by
any other host1x client, it has to be mapped for that device as well.
That's important in case any of these devices have different IOMMU
domains.

Actually, given that the device isn't a real device, the DMA handle
returned from dma_alloc_wc() is actually a physical address and not
valid for any device behind an IOMMU. So even in the case where we
share an IOMMU domain among multiple devices, the mapping created by
dma_alloc_wc() is useless for them.

Because of the above and probably a bunch of other reasons, it's also a
requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
API will start printing a bunch of errors if you violate those and they
typically indicate that what you're doing may not work. That doesn't
mean it can't work, but it usually only does so accidentally.

> All buffers should be statically allocated and statically mapped, and
> when there is a need to sync an already mapped buffer, the dma_sync_*
> API should be used.

That's my understanding as well. However, it's slightly more complicated
than that. Once you move away from the assumption that a mapping for a
buffer is the same for all devices, you can no longer do that. For example,
while the statically allocated buffer may be mapped for the Tegra DRM
device (again, it's not a real device), it's not mapped for any of the
clients yet. So before a client can use it, its driver has to go and map
the buffer for the device. The logical point to do that is during
host1x_job_pin(). Once the client no longer has a use for the buffer it
should also unmap the buffer again because it will otherwise occupy the
IOVA space unnecessarily. The logical point to do that is during
host1x_job_unpin().
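
Schematically, per job and per client (a sketch of that lifecycle, not
the actual code):

#include <linux/dma-mapping.h>

static int job_pin_bo(struct device *client, struct sg_table *sgt,
		      dma_addr_t *iova)
{
	/* map the BO into this client's IOMMU domain; flushes CPU caches */
	if (!dma_map_sg(client, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL))
		return -ENOMEM;

	*iova = sg_dma_address(sgt->sgl);
	return 0;
}

static void job_unpin_bo(struct device *client, struct sg_table *sgt)
{
	/* release the IOVA before the job drops its BO reference */
	dma_unmap_sg(client, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}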

host1x_job_unpin() is also the point at which the job releases its
reference to the buffer, so the backing memory could go away at any
point after that, which means that the IOVA mapping could point at
invalid memory if we didn't unmap the buffer.

I'm not aware of an easy way to optimize this while at the same time
making sure that everything is still consistent. I suppose one way to do
this would be to keep a cache of buffer objects and their mappings for
each device and avoid mapping/unmapping them for every job. The problem
with that is that we also don't want to hold on to buffer objects
indefinitely because that will potentially cause a lot of memory and
IOVA space to be used.
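
If we ever try that, a cache entry would presumably look something like
this (purely hypothetical):

#include <linux/kref.h>
#include <linux/list.h>

struct bo_mapping {
	struct device *dev;	/* client this mapping is valid for */
	struct sg_table *sgt;
	dma_addr_t iova;
	struct kref ref;	/* drop the mapping with the last user */
	struct list_head list;	/* chained off the BO */
};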

> Like I said above, the syncing should be done by userspace for the
> buffers that are under the control of userspace.

Yes, I agree. We already have an implementation of the .begin_cpu_access
and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
something like that for native buffers. Alternatively, I suppose user-
space could be required to flush/invalidate using the DMA-BUF, but that
potentially has the drawback of having to export a DMA-BUF for every
single buffer.

A better option may be to add a driver-specific IOCTL to do cache
maintenance. I think other drivers already do that.
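
E.g. something along these lines (names and numbers purely hypothetical,
not an actual proposal):

#include <drm/drm.h>

struct drm_tegra_gem_sync {
	__u32 handle;	/* GEM handle to synchronize */
	__u32 flags;	/* e.g. SYNC_FOR_CPU or SYNC_FOR_DEVICE */
};

#define DRM_TEGRA_GEM_SYNC	0x0f
#define DRM_IOCTL_TEGRA_GEM_SYNC \
	DRM_IOWR(DRM_COMMAND_BASE + DRM_TEGRA_GEM_SYNC, struct drm_tegra_gem_sync)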

> > So what this basically ends up doing is avoiding dma_map_*() all the time,
> > which I guess is what you're trying to achieve. But it also gives you
> > the wrong I/O virtual address in any case where an IOMMU is involved.
> > Also, as discussed above, avoiding cache maintenance isn't correct.
> 
> Alright, then right now we need to bypass dma_map_*() in the case of a
> non-implicit IOMMU, in order to bring back the good old behavior (at
> least temporarily, until there is a more comprehensive solution).

But it's not the good old behavior. You're still going to side-step the
cache maintenance and violate the DMA API semantics.

Thierry

> 
> What do you think about this variant:
> 
> --- >8 ---
> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
> index 1237df157e05..555a6424e9fa 100644
> --- a/drivers/gpu/drm/tegra/gem.c
> +++ b/drivers/gpu/drm/tegra/gem.c
> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
>  				     dma_addr_t *phys)
>  {
>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
>  	struct sg_table *sgt;
>  	int err;
> 
> -	/*
> -	 * If we've manually mapped the buffer object through the IOMMU, make
> -	 * sure to return the IOVA address of our mapping.
> -	 */
> -	if (phys && obj->mm) {
> -		*phys = obj->iova;
> -		return NULL;
> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
> +		if (obj->vaddr) {
> +			*phys = obj->iova;
> +			return NULL;
> +		}
> +
> +		/*
> +		 * If we've manually mapped the buffer object through the
> +		 * IOMMU, make sure to return the IOVA address of our mapping.
> +		 */
> +		if (obj->mm) {
> +			*phys = obj->iova;
> +			return NULL;
> +		}
>  	}
> 
>  	/*
> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> index 6d1872ddef17..91304b9034fa 100644
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> 
>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> +		struct sg_table *sgt;
> 
> -		if (!dc->client.group) {
> -			struct sg_table *sgt;
> -
> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
> -			if (IS_ERR(sgt)) {
> -				err = PTR_ERR(sgt);
> -				goto unpin;
> -			}
> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
> +		if (IS_ERR(sgt)) {
> +			err = PTR_ERR(sgt);
> +			goto unpin;
> +		}
> 
> +		if (sgt) {
>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
>  					 DMA_TO_DEVICE);
>  			if (err == 0) {
> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> 
>  			state->iova[i] = sg_dma_address(sgt->sgl);
>  			state->sgt[i] = sgt;
> -		} else {
> -			state->iova[i] = bo->iova;
>  		}
>  	}
> 
> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>  		struct sg_table *sgt = state->sgt[i];
> 
> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
> +		if (sgt) {
> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
> +				     DMA_TO_DEVICE);
> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
> +		}
> 
>  		state->iova[i] = DMA_MAPPING_ERROR;
>  		state->sgt[i] = NULL;
> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> index 60b2fedd0061..538c0f0b40a4 100644
> --- a/drivers/gpu/host1x/job.c
> +++ b/drivers/gpu/host1x/job.c
> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>  	for (i = 0; i < job->num_relocs; i++) {
>  		struct host1x_reloc *reloc = &job->relocs[i];
> -		dma_addr_t phys_addr, *phys;
> +		dma_addr_t phys_addr;
>  		struct sg_table *sgt;
> 
>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>  			goto unpin;
>  		}
> 
> -		if (client->group)
> -			phys = &phys_addr;
> -		else
> -			phys = NULL;
> -
> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>  		if (IS_ERR(sgt)) {
>  			err = PTR_ERR(sgt);
>  			goto unpin;
> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>  		job->num_unpins++;
>  	}
> 
> +	/*
> +	 * When the firewall is enabled, all user gathers are copied into
> +	 * the secured job->gather_copy_mapped.
> +	 */
> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
> +		return 0;
> +
>  	for (i = 0; i < job->num_gathers; i++) {
>  		struct host1x_job_gather *g = &job->gathers[i];
>  		size_t gather_size = 0;
> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>  			goto unpin;
>  		}
> 
> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>  		if (IS_ERR(sgt)) {
>  			err = PTR_ERR(sgt);
>  			goto unpin;
>  		}
> 
> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
> +		if (host->domain) {
>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
>  				gather_size += sg->length;
>  			gather_size = iova_align(&host->iova, gather_size);
> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> 
>  			job->unpins[job->num_unpins].size = gather_size;
>  			phys_addr = iova_dma_addr(&host->iova, alloc);
> -		} else {
> +		} else if (sgt) {
>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>  					 DMA_TO_DEVICE);
>  			if (!err) {
> 
> --- >8 ---


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
@ 2020-01-30 12:08                   ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2020-01-30 12:08 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development


[-- Attachment #1.1: Type: text/plain, Size: 20532 bytes --]

On Thu, Jan 30, 2020 at 07:36:36AM +0300, Dmitry Osipenko wrote:
> 29.01.2020 15:39, Thierry Reding пишет:
> > On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
> >> 13.12.2019 18:35, Dmitry Osipenko пишет:
> ...
> >> Hello Thierry,
> >>
> >> I took another look at the problem and here what was found:
> >>
> >> 1) The "Optionally attach clients to the IOMMU" patch is wrong because:
> >>
> >>     1. host1x_drm_probe() is invoked *before* any of the
> >>        host1x_client_iommu_attach() happens, so there is no way
> >>        on earth the 'use_explicit_iommu' could ever be true.
> > 
> > That's not correct. host1x_client_iommu_attach() happens during
> > host1x_device_init(), which is called during host1x_drm_probe().
> 
> Looks like I previously got confused by accident, my bad.
> 
> > The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all
> > devices can attach, if they want to. If any of them connect (because
> > they aren't already attached via something like the DMA/IOMMU glue)
> > then the tegra->use_explicit_iommu is set to true, in which case the
> > IOMMU domain setup for explicit IOMMU API usage is completed. If, on
> > the other hand, none of the clients have a need for the explicit IOMMU
> > domain, there's no need to set it up and host1x_drm_probe() will just
> > discard it.
> 
> This matches my understanding of what you wanted to achieve, thanks.
> 
> >>     2. Not attaching DRM clients to IOMMU if HOST1x isn't
> >>        attached is wrong because it never attached in the case
> >>        of CONFIG_TEGRA_HOST1X_FIREWALL=y [1] and this also
> >>        makes no sense for T20/30 that do not support LPAE.
> > 
> > It's not at all wrong. Take for example the case of Tegra124 and
> > Tegra210 where host1x and its clients can address 34 bits. In those
> > cases, allocating individual pages via shmem has a high probability of
> > hitting physical addresses beyond the 32-bit range, which means that the
> > host1x can not access them unless it is also attached to an IOMMU where
> > physical addresses to >= 4 GiB addresses can be translated into < 4 GiB
> > virtual addresses. This is a very real problem that I was running into
> > when testing on Tegra124 and Tegra210.
> 
> Why not to set the DMA mask to 32bits if IOMMU is unavailable?

We already do that. If you look at host1x_iommu_init() in
drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
available and the host1x doesn't support wide GATHER opcodes, then
we limit the DMA Mask to 32 bits.

But that's not enough, see below.

> I'm a bit puzzled by the actual need to support the case where Host1x is
> backed by IOMMU and clients not.. How we could ever end up with this
> scenario in the upstream kernel?

That's not what we're doing here. The fundamental problem is that we
have a couple of generations where the hardware is mismatched in that
clients support 34-bit addresses while host1x can only use 32-bit
addresses in the GATHER opcode. The only way to get around this mismatch
is by using an IOMMU.

However, with an IOMMU enabled for clients, we can run into the case
where sparse pages would be allocated via shmem and end up beyond the
32-bit boundary. If the host1x is not attached to an IOMMU, there's no
way for it to access these pages with standard GATHER opcodes.

This is what used to happen prior to this change when the host1x
firewall was enabled. Since we were not attaching it to an IOMMU in that
case, we would end up with sparse buffers allocated from pages that the
host1x couldn't address.
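
To illustrate with a simplified sketch (helper names as in the gem.c
sparse path), the backing pages come from shmem and nothing there bounds
their physical addresses:

    /* Pages come from shmem; on machines with RAM above 4 GiB, any of
     * them may end up at a physical address that a 32-bit GATHER
     * opcode cannot express.
     */
    struct page **pages = drm_gem_get_pages(&bo->gem);
    phys_addr_t phys = page_to_phys(pages[0]); /* may be >= 4 GiB */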

> What about the reverse scenario? You won't be able to patch cmdstream
> properly for >32bit addresses.

I don't think that scenario exists. I'm not aware of a Tegra device that
has system memory outside of the CPU-addressable region.

> The root of the problem is that Tegra DRM UAPI doesn't support 64bit
> addresses, so you can't use "wide" opcodes and can't patch cmdstream.

There's nothing in the UAPI that deals with addresses directly. We only
pass around handles and these are resolved to buffer objects in the
kernel where the address of the buffers can be 32-bit or 64-bit.

And we do in fact support wide opcodes and patch command streams just
fine on 64-bit systems.

I mean, it's not like I've been doing this just for the fun of it. There
are actual configurations where this is needed in order for it to work.

> Perhaps it is better not to add any new things or quirks to the Host1x /
> Tegra DRM for now. The drivers need a serious clean-up; otherwise the
> mess only continues to grow. Don't you think so?

This isn't anything new or a quirk. This is bug fixes to ensure that the
driver works in (now hopefully) all configurations. Previously it was a
matter of getting the configuration just right in order for it to work.
All the work I did here (starting with the wide opcode support and then
the DMA API and IOMMU work) was to make sure it would safely work in any
setup.

And I do consider these changes to also be cleanups and incremental
improvements of what the state was before. Again, I don't consider a
rewrite a serious cleanup.

I'm fully aware that the driver has been collecting dust for a while and
it isn't perfect. But it's also not overly messy. It's perhaps a bit
more complex than your average driver, but it's also some pretty complex
hardware.

> > But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We
> > should be able to remedy the situation on Tegra20 and Tegra30 by adding
> > another check, based on the DMA mask. Something like the below should
> > work:
> > 
> > --- >8 ---
> [snip]
> > --- >8 ---
> 
> This works, thanks.

Great, I'll send this out then.

> >> [1]
> >> https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205
> >>
> >> 2) Because of the above problems, the DRM clients are erroneously not
> >> getting attached to IOMMU at all and thus CMA is getting used for the BO
> >> allocations. Here come the problems introduced by the "gpu: host1x:
> >> Support DMA mapping of buffers" patch, which makes the DMA API perform
> >> CPU cache maintenance on each job submission, and apparently this is
> >> super bad for performance. This also makes no sense in comparison to the
> >> case of enabled IOMMU, where cache maintenance isn't performed at all
> >> (like it should be).
> > 
> > It actually does make a lot of sense. Very strictly speaking we were
> > violating the DMA API prior to the above patch because we were not DMA
> > mapping the buffers at all. Whenever you pass a buffer to hardware you
> > need to map it for the device. At that point, the kernel does not know
> > whether or not the buffer is dirty, so it has to perform a cache flush.
> > Similarly, when the hardware is done with a buffer, we need to unmap it
> > so that the CPU can access it again. This typically requires a cache
> > invalidate.
> > 
> > That things even worked to begin with was more by accident than by design.
> > 
> > So yes, this is different from what we were doing before, but it's
> > actually the right thing to do. That said, I'm sure we can find ways to
> > optimize this. For example, as part of the DMA API conversion series I
> > added the possibility to set direction flags for relocation buffers. In
> > cases where it is known that a certain buffer will only be used for
> > reading, we should be able to avoid at least the cache invalidate
> > operation after a job is done, since the hardware won't have modified
> > the contents (when using an SMMU this can even be enforced). It's
> > slightly trickier to avoid cache flushes. For buffers that are only
> > going to be written, there's no need to flush the cache because the CPU's
> > changes can be assumed to be overwritten by the hardware anyway. However
> > we still need to make sure that we invalidate the caches in that case to
> > ensure subsequent cache flushes don't overwrite data already written by
> > hardware.
> > 
> > One other potential optimization I can imagine is to add flags to make
> > cache maintenance optional on buffers when we know it's safe to do so.
> > I'm not sure we can always know, so this is going to require further
> > thought.
> 
> That doesn't sound good to me; it is not going to be good for GPU drivers.
> All cache maintenance should be under the control of userspace: userspace
> should tell the kernel driver when it needs CPU access and when it is
> finished with the access. DMABUF has a generic UAPI for synchronization,
> although a mature GPU driver may need more than that.

I agree. But that's not something that we can do at this point. We don't
have a way of passing information such as this to the driver, so the
driver has to assume that caches are dirty for all buffers, otherwise it
will not be able to guarantee that random cache flushes won't corrupt
the data that it's passing to the hardware.

So yes, when we do have a way of explicitly flushing the caches for
buffers, then we can add a mechanism to pass that information to the
kernel so that it can optimize. But until then we just can't be sure.
And I prefer a kernel driver that gives me slow and reliable, rather
than fast but unpredictable results.

> Today Tegra DRM driver supports only write-combined BO allocations, and
> thus, we don't need to do more than to flush CPU buffers before
> executing HW job.

That's only true when you allocate with the DMA API. When you allocate
from a shmem mapping you don't get write-combined memory, so we do have
to perform cache maintenance on the pages.
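
Concretely, that maintenance is what the dma_map_sg()/dma_unmap_sg()
pair does for cacheable pages (simplified sketch):

    /* Mapping for the device flushes the CPU caches for the pages;
     * unmapping invalidates them again once the hardware is done.
     */
    err = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
    /* ... hardware reads the buffer ... */
    dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);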

> >> Please let me know if you're going to fix the problems or if you'd
> >> prefer me to create the patches.
> >>
> >> Here is a draft of the fix for #2, it doesn't cover case of imported
> >> buffers (which should be statically mapped, IIUC):
> >>
> >> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device
> >> *dev, struct host1x_bo *bo,
> >>          * If we've manually mapped the buffer object through the IOMMU,
> >> make
> >>          * sure to return the IOVA address of our mapping.
> >>          */
> >> -       if (phys && obj->mm) {
> >> +       if (phys && (obj->mm || obj->vaddr)) {
> >>                 *phys = obj->iova;
> > 
> > This doesn't work for the case where we use the DMA API for mapping. Or
> > at least it isn't going to work in the general case.
> 
> Right, looks like I'll need to update my memory about the DMA API usage.
> 
> > The reason is that obj->iova is only valid for whatever device mapped
> > or allocated the buffer, which in this case is the host1x device, which
> > isn't even a real device, so it won't work. The only case where it does
> > work is if we're not behind an IOMMU, so obj->iova will actually be the
> > physical address.
> 
> But why do you need to dynamically map/unmap the statically-allocated
> buffers on each job submission? Could you please explain what the
> point is? Perhaps it's a temporary workaround just to get a minimum of
> things working for the case of implicit IOMMU?

It's primarily because we don't really know if a buffer has been mapped
for a specific device. We always map at allocation time for the Tegra
DRM parent device (which isn't a real device) but before it's used by
any other host1x client, it has to be mapped for that device as well.
That's important in case any of these devices have different IOMMU
domains.

Actually, given that the device isn't a real device, the DMA handle
returned from dma_alloc_wc() is actually a physical address and not
valid for any device behind an IOMMU. So even in the case where we
share an IOMMU domain among multiple devices, the mapping created by
dma_alloc_wc() is useless for them.

Because of the above and probably a bunch of other reasons, it's also a
requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
API will start printing a bunch of errors if you violate those and they
typically indicate that what you're doing may not work. That doesn't
mean it can't work, but it usually only does so accidentally.
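
For reference, that's:

    CONFIG_DMA_API_DEBUG=y

in the kernel configuration.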

> All buffers should be statically allocated and statically mapped, and
> when there is a need to sync an already mapped buffer, the dma_sync_*
> API should be used.

That's my understanding as well. However, it's slightly more complicated
than that. Once you move away from the assumption that a mapping for a
buffer is the same for all devices, you can no longer do that. For example,
while the statically allocated buffer may be mapped for the Tegra DRM
device (again, it's not a real device), it's not mapped for any of the
clients yet. So before a client can use it, its driver has to go and map
the buffer for the device. The logical point to do that is during
host1x_job_pin(). Once the client no longer has a use for the buffer it
should also unmap the buffer again because it will otherwise occupy the
IOVA space unnecessarily. The logical point to do that is during
host1x_job_unpin().

host1x_job_unpin() is also the point at which the job releases its
reference to the buffer, so the backing memory could go away at any
point after that, which means that the IOVA mapping could point at
invalid memory if we didn't unmap the buffer.

I'm not aware of an easy way to optimize this while at the same time
making sure that everything is still consistent. I suppose one way to do
this would be to keep a cache of buffer objects and their mappings for
each device and avoid mapping/unmapping them for every job. The problem
with that is that we also don't want to hold on to buffer objects
indefinitely because that will potentially cause a lot of memory and
IOVA space to be used.

> Like I said above, the syncing should be done by userspace for the
> buffers that are in control of userspace.

Yes, I agree. We already have an implementation of the .begin_cpu_access
and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
something like that for native buffers. Alternatively, I suppose user-
space could be required to flush/invalidate using the DMA-BUF, but that
potentially has the drawback of having to export a DMA-BUF for every
single buffer.

A better option may be to add a driver-specific IOCTL to do cache
maintenance. I think other drivers already do that.
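
Something along these lines, perhaps (a purely hypothetical sketch;
none of these names exist in the UAPI today):

    /* Hypothetical cache maintenance UAPI, names invented here. */
    struct drm_tegra_cpu_access {
        __u32 handle; /* GEM handle of the buffer */
        __u32 flags;  /* read and/or write access */
    };

    #define DRM_IOCTL_TEGRA_BEGIN_CPU_ACCESS \
        DRM_IOW(DRM_COMMAND_BASE + 0x40, struct drm_tegra_cpu_access)
    #define DRM_IOCTL_TEGRA_END_CPU_ACCESS \
        DRM_IOW(DRM_COMMAND_BASE + 0x41, struct drm_tegra_cpu_access)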

> > So what this basically ends up doing is avoid dma_map_*() all the time,
> > which I guess is what you're trying to achieve. But it also gives you
> > the wrong I/O virtual address in any case where an IOMMU is involved.
> > Also, as discussed above, avoiding cache maintenance isn't correct.
> 
> Alright, then right now we need to bypass the dma_map_*() in the case of
> a non-implicit IOMMU, in order to bring back the good old behavior (at
> least temporarily, until there is a more comprehensive solution).

But it's not good old behaviour. You're still going to side-step the
cache maintenance and violate the DMA API semantics.

Thierry

> 
> What do you think about this variant:
> 
> --- >8 ---
> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
> index 1237df157e05..555a6424e9fa 100644
> --- a/drivers/gpu/drm/tegra/gem.c
> +++ b/drivers/gpu/drm/tegra/gem.c
> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device
> *dev, struct host1x_bo *bo,
>  				     dma_addr_t *phys)
>  {
>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
>  	struct sg_table *sgt;
>  	int err;
> 
> -	/*
> -	 * If we've manually mapped the buffer object through the IOMMU, make
> -	 * sure to return the IOVA address of our mapping.
> -	 */
> -	if (phys && obj->mm) {
> -		*phys = obj->iova;
> -		return NULL;
> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
> +		if (obj->vaddr) {
> +			*phys = obj->iova;
> +			return NULL;
> +		}
> +
> +		/*
> +		 * If we've manually mapped the buffer object through the
> +		 * IOMMU, make sure to return the IOVA address of our mapping.
> +		 */
> +		if (obj->mm) {
> +			*phys = obj->iova;
> +			return NULL;
> +		}
>  	}
> 
>  	/*
> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> index 6d1872ddef17..91304b9034fa 100644
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> tegra_plane_state *state)
> 
>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> +		struct sg_table *sgt;
> 
> -		if (!dc->client.group) {
> -			struct sg_table *sgt;
> -
> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
> -			if (IS_ERR(sgt)) {
> -				err = PTR_ERR(sgt);
> -				goto unpin;
> -			}
> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
> +		if (IS_ERR(sgt)) {
> +			err = PTR_ERR(sgt);
> +			goto unpin;
> +		}
> 
> +		if (sgt) {
>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
>  					 DMA_TO_DEVICE);
>  			if (err == 0) {
> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> tegra_plane_state *state)
> 
>  			state->iova[i] = sg_dma_address(sgt->sgl);
>  			state->sgt[i] = sgt;
> -		} else {
> -			state->iova[i] = bo->iova;
>  		}
>  	}
> 
> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> tegra_plane_state *state)
>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>  		struct sg_table *sgt = state->sgt[i];
> 
> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
> +		if (sgt) {
> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
> +				     DMA_TO_DEVICE);
> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
> +		}
> 
>  		state->iova[i] = DMA_MAPPING_ERROR;
>  		state->sgt[i] = NULL;
> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> index 60b2fedd0061..538c0f0b40a4 100644
> --- a/drivers/gpu/host1x/job.c
> +++ b/drivers/gpu/host1x/job.c
> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host,
> struct host1x_job *job)
> 
>  	for (i = 0; i < job->num_relocs; i++) {
>  		struct host1x_reloc *reloc = &job->relocs[i];
> -		dma_addr_t phys_addr, *phys;
> +		dma_addr_t phys_addr;
>  		struct sg_table *sgt;
> 
>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host,
> struct host1x_job *job)
>  			goto unpin;
>  		}
> 
> -		if (client->group)
> -			phys = &phys_addr;
> -		else
> -			phys = NULL;
> -
> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>  		if (IS_ERR(sgt)) {
>  			err = PTR_ERR(sgt);
>  			goto unpin;
> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host,
> struct host1x_job *job)
>  		job->num_unpins++;
>  	}
> 
> +	/*
> +	 * In the case of an enabled firewall, all user gathers are copied into
> +	 * the secured job->gather_copy_mapped.
> +	 */
> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
> +		return 0;
> +
>  	for (i = 0; i < job->num_gathers; i++) {
>  		struct host1x_job_gather *g = &job->gathers[i];
>  		size_t gather_size = 0;
> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host,
> struct host1x_job *job)
>  			goto unpin;
>  		}
> 
> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>  		if (IS_ERR(sgt)) {
>  			err = PTR_ERR(sgt);
>  			goto unpin;
>  		}
> 
> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
> +		if (host->domain) {
>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
>  				gather_size += sg->length;
>  			gather_size = iova_align(&host->iova, gather_size);
> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host,
> struct host1x_job *job)
> 
>  			job->unpins[job->num_unpins].size = gather_size;
>  			phys_addr = iova_dma_addr(&host->iova, alloc);
> -		} else {
> +		} else if (sgt) {
>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>  					 DMA_TO_DEVICE);
>  			if (!err) {
> 
> --- >8 ---

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
  2020-01-30 12:08                   ` Thierry Reding
@ 2020-02-02 21:00                     ` Dmitry Osipenko
  -1 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2020-02-02 21:00 UTC (permalink / raw)
  To: Thierry Reding; +Cc: linux-tegra, DRI Development

30.01.2020 15:08, Thierry Reding wrote:
...
>> Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
> 
> We already do that. If you look at host1x_iommu_init() in
> drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
> available and the host1x doesn't support wide GATHER opcodes, then
> we limit the DMA Mask to 32 bits.
> 
> But that's not enough, see below.
> 
>> I'm a bit puzzled by the actual need to support the case where Host1x is
>> backed by an IOMMU and the clients are not. How could we ever end up with
>> this scenario in the upstream kernel?
> 
> That's not what we're doing here. The fundamental problem is that we
> have a couple of generations where the hardware is mismatched in that
> clients support 34-bit addresses while host1x can only use 32-bit
> addresses in the GATHER opcode. The only way to get around this mismatch
> is by using an IOMMU.

Isn't it possible to limit the DMA mask to 32 bits for both the clients and Host1x?

diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 6a995db51d6d..4253dd53ea2e 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct
host1x_device *device,
  */
 int host1x_device_init(struct host1x_device *device)
 {
+	struct host1x *host = dev_get_drvdata(device->dev.parent);
+	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
 	struct host1x_client *client;
+	u64 mask;
 	int err;

 	mutex_lock(&device->clients_lock);

 	list_for_each_entry(client, &device->clients, list) {
+		mask = dma_get_mask(client->dev);
+
+		if (!domain && !host->info->has_wide_gather &&
+		    mask > DMA_BIT_MASK(32)) {
+			err = dma_coerce_mask_and_coherent(client->dev,
+							   DMA_BIT_MASK(32));
+			if (err < 0) {
+				dev_err(&device->dev,
+					"failed to set DMA mask: %d\n", err);
+				goto teardown;
+			}
+		}
+
 		if (client->ops && client->ops->init) {
 			err = client->ops->init(client);
 			if (err < 0) {

> However, with an IOMMU enabled for clients, we can run into the case
> where sparse pages would be allocated via shmem and end up beyond the
> 32-bit boundary. If the host1x is not attached to an IOMMU, there's no
> way for it to access these pages with standard GATHER opcodes.
> 
> This is what used to happen prior to this change when the host1x
> firewall was enabled.

I assume you mean the *disabled* firewall, because gathers are
copied into Host1x's DMA buffer if the firewall is enabled.

> Since we were not attaching it to an IOMMU in that
> case, we would end up with sparse buffers allocated from pages that the
> host1x couldn't address.

Could you please clarify what you mean by shmem? If you mean
get_pages(), then one option would be to simply enforce the Host1x
firewall so that gathers are always copied, and thus we could remove
the TEGRA_HOST1X_FIREWALL Kconfig option.

The other option could be *not* to use get_pages() whenever it can't work.

It should also be worthwhile to simply restrict the kernel's
configuration by disabling all functionality that requires an IOMMU when
the IOMMU is not available. Just error out with an informative message
in that case, saying that the kernel/device-tree configuration is wrong.

Please note that the grate-drivers do not work with the firewall
disabled anyway, because it's simply impractical/unusable from a
userspace perspective to track the liveness of the gather BO.

Also, take a look at the grate-kernel's WIP patch where Host1x is
separated from Tegra DRM and Host1x maintains gather buffer allocations
by itself. In the case of legacy UAPI, the gather commands are copied
from DRM GEM to the Host1x's BO, similarly to what we have now in
upstream for the case of the enabled firewall, but in a more optimized
manner (using a pre-allocated pool, etc.). I'd like to see the upstream
driver do the same if we don't end up dropping the legacy staging
UAPI entirely.

>> What about the reverse scenario? You won't be able to patch cmdstream
>> properly for >32bit addresses.
> 
> I don't think that scenario exists. I'm not aware of a Tegra device that
> has system memory outside of the CPU-addressable region.
> 
>> The root of the problem is that Tegra DRM UAPI doesn't support 64bit
>> addresses, so you can't use "wide" opcodes and can't patch cmdstream.
> 
> There's nothing in the UAPI that deals with addresses directly. We only
> pass around handles and these are resolved to buffer objects in the
> kernel where the address of the buffers can be 32-bit or 64-bit.

If a buffer address is 64-bit, then the client's address registers take
two 32-bit hardware registers, and thus the cmdstream should reserve two
words for the patched handles.

The Host1x driver doesn't support patching of 64-bit addresses, and
there is no way to tell userspace to reserve two words per address using
the current UAPI.
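
To illustrate the problem (hypothetical fragment): a 64-bit address
occupies two words in the command stream, so patching would have to
write both, and userspace would have had to reserve both:

    /* Hypothetical: patching a 64-bit IOVA takes two words. */
    cmdbuf[offset + 0] = lower_32_bits(iova);
    cmdbuf[offset + 1] = upper_32_bits(iova);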

> And we do in fact support wide opcodes and patch command streams just
> fine on 64-bit systems.
> 
> I mean, it's not like I've been doing this just for the fun of it. There
> are actual configurations where this is needed in order for it to work.
> 
>> Perhaps it is better not to add any new things or quirks to the Host1x /
>> Tegra DRM for now. The drivers need a serious clean-up; otherwise the
>> mess only continues to grow. Don't you think so?
> 
> This isn't anything new or a quirk. This is bug fixes to ensure that the
> driver works in (now hopefully) all configurations. Previously it was a
> matter of getting the configuration just right in order for it to work.
> All the work I did here (starting with the wide opcode support and then
> the DMA API and IOMMU work) was to make sure it would safely work in any
> setup.
> 
> And I do consider these changes to also be cleanups and incremental
> improvements of what the state was before. Again, I don't consider a
> rewrite a serious cleanup.
> 
> I'm fully aware that the driver has been collecting dust for a while and
> it isn't perfect. But it's also not overly messy. It's perhaps a bit
> more complex than your average driver, but it's also some pretty complex
> hardware.
...
>> That doesn't sound good to me; it is not going to be good for GPU drivers.
>> All cache maintenance should be under the control of userspace: userspace
>> should tell the kernel driver when it needs CPU access and when it is
>> finished with the access. DMABUF has a generic UAPI for synchronization,
>> although a mature GPU driver may need more than that.
> 
> I agree. But that's not something that we can do at this point. We don't
> have a way of passing information such as this to the driver, so the
> driver has to assume that caches are dirty for all buffers, otherwise it
> will not be able to guarantee that random cache flushes won't corrupt
> the data that it's passing to the hardware.
> 
> So yes, when we do have a way of explicitly flushing the caches for
> buffers, then we can add a mechanism to pass that information to the
> kernel so that it can optimize. But until then we just can't be sure.

It worked fine for the last 7 years on T20/30 and continues to work if
the offending patches are reverted; no problems here.

> And I prefer a kernel driver that gives me slow and reliable, rather
> than fast but unpredictable results.

Your changes introduced a regression on Tegra20/30 without solving any
problem. This is unacceptable; it's not an improvement, please don't do
it :)

Apparently some problem exists only for the later Tegra generations, but
I still can't work out what it is from your words.

Please give a concrete real-world example of the trouble you're trying
to solve.

>> Today Tegra DRM driver supports only write-combined BO allocations, and
>> thus, we don't need to do more than to flush CPU buffers before
>> executing HW job.
> 
> That's only true when you allocate with the DMA API. When you allocate
> from a shmem mapping you don't get write-combined memory, so we do have
> to perform cache maintenance on the pages.

In all cases the GEM memory mappings are write-combined, including the
get_pages() case. The Host1x driver performs the maintenance by doing a
wmb() before executing a job.
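
I.e., simplified:

    /* Drain the CPU write buffers for the write-combined mappings
     * before the job is handed to the hardware.
     */
    wmb();
    host1x_job_submit(job);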

If you mean that imported buffers could be backed by cacheable pages,
then userspace must use the DMABUF syncing UAPI or any other means to
sync the CPU cache; there is no way around it. You should assume that
your userspace is broken if it doesn't do that.

>>>> Please let me know if you're going to fix the problems or if you'd
>>>> prefer me to create the patches.
>>>>
>>>> Here is a draft of the fix for #2, it doesn't cover case of imported
>>>> buffers (which should be statically mapped, IIUC):
>>>>
>>>> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device
>>>> *dev, struct host1x_bo *bo,
>>>>          * If we've manually mapped the buffer object through the IOMMU,
>>>> make
>>>>          * sure to return the IOVA address of our mapping.
>>>>          */
>>>> -       if (phys && obj->mm) {
>>>> +       if (phys && (obj->mm || obj->vaddr)) {
>>>>                 *phys = obj->iova;
>>>
>>> This doesn't work for the case where we use the DMA API for mapping. Or
>>> at least it isn't going to work in the general case.
>>
>> Right, looks like I'll need to update my memory about the DMA API usage.
>>
>>> The reason is that obj->iova is only valid for whatever device mapped
>>> or allocated the buffer, which in this case is the host1x device, which
>>> isn't even a real device, so it won't work. The only case where it does
>>> work is if we're not behind an IOMMU, so obj->iova will actually be the
>>> physical address.
>>
>> But why do you need to dynamically map/unmap the statically-allocated
>> buffers on each job submission? Could you please explain what the
>> point is? Perhaps it's a temporary workaround just to get a minimum of
>> things working for the case of implicit IOMMU?
> 
> It's primarily because we don't really know if a buffer has been mapped
> for a specific device. We always map at allocation time for the Tegra
> DRM parent device (which isn't a real device) but before it's used by
> any other host1x client, it has to be mapped for that device as well.

> That's important in case any of these devices have different IOMMU
> domains.

This case should be covered by the solution I'm proposing below.

Today's mainline kernel has all clients in the same shared IOMMU domain.
Please note that I'm focused on solving the primary regression problem;
the potential future problems are secondary.

> Actually, given that the device isn't a real device, the DMA handle
> returned from dma_alloc_wc() is actually a physical address and not
> valid for any device behind an IOMMU. So even in the case where we
> share an IOMMU domain among multiple devices, the mapping created by
> dma_alloc_wc() is useless for them.

What about picking any DRM sub-device and then statically mapping the
DMA buffer for that device during BO allocation / import? All other DRM
sub-devices can access that mapping because of the shared domain.
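
A rough sketch of what I mean (the BO field names are assumed here):

    /* Map once at allocation / import time against one real client
     * device; with the shared domain, the resulting IOVA is then
     * valid for every other client as well.
     */
    bo->sgt = drm_prime_pages_to_sg(bo->pages, bo->num_pages);
    err = dma_map_sg(subdev, bo->sgt->sgl, bo->sgt->nents,
                     DMA_BIDIRECTIONAL);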

> Because of the above and probably a bunch of other reasons, it's also a
> requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
> API will start printing a bunch of errors if you violate those and they
> typically indicate that what you're doing may not work. That doesn't
> mean it can't work, but it usually only does so accidentally.
> 
>> All buffers should be statically allocated and statically mapped, and
>> when there is a need to sync an already mapped buffer, the dma_sync_*
>> API should be used.
> 
> That's my understanding as well. However, it's slightly more complicated
> than that. Once you move away from the assumption that a mapping for a
> buffer is the same for all devices, you can no longer do that. For example,
> while the statically allocated buffer may be mapped for the Tegra DRM
> device (again, it's not a real device), it's not mapped for any of the
> clients yet.

Same as what I wrote above: just pick any DRM sub-device and map the
buffer for that device during allocation / import.

> So before a client can use it, its driver has to go and map
> the buffer for the device. The logical point to do that is during
> host1x_job_pin(). Once the client no longer has a use for the buffer it
> should also unmap the buffer again because it will otherwise occupy the
> IOVA space unnecessarily. The logical point to do that is during
> host1x_job_unpin().

The IOVA space problem really exists only for Tegra20.

I suppose that the page table size is ~10-20 MB per 1 GB of allocations;
this is not something to worry about.

> host1x_job_unpin() is also the point at which the job releases its
> reference to the buffer, so the backing memory could go away at any
> point after that, which means that the IOVA mapping could point at
> invalid memory if we didn't unmap the buffer.

Backing memory can't go away before the buffer is unmapped; otherwise it
is a refcounting bug.

> I'm not aware of an easy way to optimize this while at the same time
> making sure that everything is still consistent. I suppose one way to do
> this would be to keep a cache of buffer objects and their mappings for
> each device and avoid mapping/unmapping them for every job. The problem
> with that is that we also don't want to hold on to buffer objects
> indefinitely because that will potentially cause a lot of memory and
> IOVA space to be used.

The case of explicit IOMMU is nicer because we can use DRM's MM scanner
to maintain a limited window of IOMMU mappings. Take a look at
https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/gart.c
for example.

An LRU cache should be easy to implement; it should also be useful if
we get around to supporting a memory shrinker (BO swap-out).
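
A minimal sketch of such a cache (hypothetical types and names):

    /* Hypothetical per-device LRU of BO mappings. */
    struct bo_mapping {
        struct list_head lru;
        struct host1x_bo *bo;
        struct sg_table *sgt;
        dma_addr_t iova;
    };

    static struct bo_mapping *bo_mapping_lookup(struct list_head *cache,
                                                struct host1x_bo *bo)
    {
        struct bo_mapping *m;

        list_for_each_entry(m, cache, lru) {
            if (m->bo == bo) {
                /* most recently used goes to the front */
                list_move(&m->lru, cache);
                return m;
            }
        }

        return NULL; /* caller maps the BO and adds a new entry */
    }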

>> Like I said above, the syncing should be done by userspace for the
>> buffers that are in control of userspace.
> 
> Yes, I agree. We already have an implementation of the .begin_cpu_access
> and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
> something like that for native buffers. Alternatively, I suppose user-
> space could be required to flush/invalidate using the DMA-BUF, but that
> potentially has the drawback of having to export a DMA-BUF for every
> single buffer.
> 
> A better option may be to add a driver-specific IOCTL to do cache
> maintenance. I think other drivers already do that.

Perhaps, but it needs a real-world use case first.

>>> So what this basically ends up doing is avoid dma_map_*() all the time,
>>> which I guess is what you're trying to achieve. But it also gives you
>>> the wrong I/O virtual address in any case where an IOMMU is involved.
>>> Also, as discussed above, avoiding cache maintenance isn't correct.
>>
>> Alright, then right now we need to bypass the dma_map_*() in the case of
>> a non-implicit IOMMU, in order to bring back the good old behavior (at
>> least temporarily, until there is a more comprehensive solution).
> 
> But it's not good old behaviour. You're still going to side-step the
> cache maintenance and violate the DMA API semantics.

The DMA API usage that you introduced in v5.5 is wrong too, because
it makes the driver unusable :)

From a user's perspective it is exactly the good old behaviour. I tried
to load KDE Plasma 5 using the Xorg Opentegra driver, which worked
perfectly fine before, and now it is absolutely unusable.

There are a couple of things we could do in order to prevent this
situation in the future:

  1. Better automated testing. The grate-tests could be useful for this
particular case, once they get support for offscreen testing.

  2. Better review and real-world testing of the patches. Do you mind if
I propose myself as a reviewer for the Tegra DRM / Host1x drivers?

I suggest we take a step back and revert to the old behaviour for now,
at least on T20/30. I don't know about ARM64, but ARM32 permits
violating the DMA API semantics for the case of an unavailable IOMMU; of
course that may change in the future, and by that time I hope we'll
manage to get the DMA API usage done right.

The solution below restores the old behaviour for the case of an
unavailable IOMMU and for the case of an explicit IOMMU domain; it
should work in the case of an implicit IOMMU as well.

>> What do you think about this variant:
>>
>> --- >8 ---
>> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
>> index 1237df157e05..555a6424e9fa 100644
>> --- a/drivers/gpu/drm/tegra/gem.c
>> +++ b/drivers/gpu/drm/tegra/gem.c
>> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device
>> *dev, struct host1x_bo *bo,
>>  				     dma_addr_t *phys)
>>  {
>>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
>> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
>>  	struct sg_table *sgt;
>>  	int err;
>>
>> -	/*
>> -	 * If we've manually mapped the buffer object through the IOMMU, make
>> -	 * sure to return the IOVA address of our mapping.
>> -	 */
>> -	if (phys && obj->mm) {
>> -		*phys = obj->iova;
>> -		return NULL;
>> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
>> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */

>> +		if (obj->vaddr) {

If this is unacceptable on ARM64 (dma_handle != phys_addr), then it
could be:

		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {

>> +			*phys = obj->iova;
>> +			return NULL;
>> +		}
>> +
>> +		/*
>> +		 * If we've manually mapped the buffer object through the
>> +		 * IOMMU, make sure to return the IOVA address of our mapping.
>> +		 */
>> +		if (obj->mm) {
>> +			*phys = obj->iova;
>> +			return NULL;
>> +		}
>>  	}
>>
>>  	/*
>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
>> index 6d1872ddef17..91304b9034fa 100644
>> --- a/drivers/gpu/drm/tegra/plane.c
>> +++ b/drivers/gpu/drm/tegra/plane.c
>> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
>> tegra_plane_state *state)
>>
>>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
>>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>> +		struct sg_table *sgt;
>>
>> -		if (!dc->client.group) {
>> -			struct sg_table *sgt;
>> -
>> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
>> -			if (IS_ERR(sgt)) {
>> -				err = PTR_ERR(sgt);
>> -				goto unpin;
>> -			}
>> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
>> +		if (IS_ERR(sgt)) {
>> +			err = PTR_ERR(sgt);
>> +			goto unpin;
>> +		}
>>
>> +		if (sgt) {
>>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
>>  					 DMA_TO_DEVICE);
>>  			if (err == 0) {
>> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
>> tegra_plane_state *state)
>>
>>  			state->iova[i] = sg_dma_address(sgt->sgl);
>>  			state->sgt[i] = sgt;
>> -		} else {
>> -			state->iova[i] = bo->iova;
>>  		}
>>  	}
>>
>> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
>> tegra_plane_state *state)
>>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>>  		struct sg_table *sgt = state->sgt[i];
>>
>> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
>> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
>> +		if (sgt) {
>> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
>> +				     DMA_TO_DEVICE);
>> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
>> +		}
>>
>>  		state->iova[i] = DMA_MAPPING_ERROR;
>>  		state->sgt[i] = NULL;
>> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
>> index 60b2fedd0061..538c0f0b40a4 100644
>> --- a/drivers/gpu/host1x/job.c
>> +++ b/drivers/gpu/host1x/job.c
>> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host,
>> struct host1x_job *job)
>>
>>  	for (i = 0; i < job->num_relocs; i++) {
>>  		struct host1x_reloc *reloc = &job->relocs[i];
>> -		dma_addr_t phys_addr, *phys;
>> +		dma_addr_t phys_addr;
>>  		struct sg_table *sgt;
>>
>>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
>> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host,
>> struct host1x_job *job)
>>  			goto unpin;
>>  		}
>>
>> -		if (client->group)
>> -			phys = &phys_addr;
>> -		else
>> -			phys = NULL;
>> -
>> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
>> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>>  		if (IS_ERR(sgt)) {
>>  			err = PTR_ERR(sgt);
>>  			goto unpin;
>> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host,
>> struct host1x_job *job)
>>  		job->num_unpins++;
>>  	}
>>
>> +	/*
>> +	 * In the case of an enabled firewall, all user gathers are copied into
>> +	 * the secured job->gather_copy_mapped.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
>> +		return 0;
>> +
>>  	for (i = 0; i < job->num_gathers; i++) {
>>  		struct host1x_job_gather *g = &job->gathers[i];
>>  		size_t gather_size = 0;
>> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host,
>> struct host1x_job *job)
>>  			goto unpin;
>>  		}
>>
>> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
>> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>>  		if (IS_ERR(sgt)) {
>>  			err = PTR_ERR(sgt);
>>  			goto unpin;
>>  		}
>>
>> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
>> +		if (host->domain) {
>>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
>>  				gather_size += sg->length;
>>  			gather_size = iova_align(&host->iova, gather_size);
>> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host,
>> struct host1x_job *job)
>>
>>  			job->unpins[job->num_unpins].size = gather_size;
>>  			phys_addr = iova_dma_addr(&host->iova, alloc);
>> -		} else {
>> +		} else if (sgt) {
>>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>>  					 DMA_TO_DEVICE);
>>  			if (!err) {
>>
>> --- >8 ---

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
@ 2020-02-02 21:00                     ` Dmitry Osipenko
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2020-02-02 21:00 UTC (permalink / raw)
  To: Thierry Reding; +Cc: linux-tegra, DRI Development

30.01.2020 15:08, Thierry Reding пишет:
...
>> Why not to set the DMA mask to 32bits if IOMMU is unavailable?
> 
> We already do that. If you look at host1x_iommu_init() in
> drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
> available and the host1x doesn't support wide GATHER opcodes, then
> we limit the DMA Mask to 32 bits.
> 
> But that's not enough, see below.
> 
>> I'm a bit puzzled by the actual need to support the case where Host1x is
>> backed by IOMMU and clients not.. How we could ever end up with this
>> scenario in the upstream kernel?
> 
> That's not what we're doing here. The fundamental problem is that we
> have a couple of generations where the hardware is mismatched in that
> clients support 34-bit addresses while host1x can only use 32-bit
> addresses in the GATHER opcode. The only way to get around this mismatch
> is by using an IOMMU.

Isn't it possible to limit DMA mask to 32-bit for both clients and Host1x?

diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 6a995db51d6d..4253dd53ea2e 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct
host1x_device *device,
  */
 int host1x_device_init(struct host1x_device *device)
 {
+	struct host1x *host = dev_get_drvdata(device->dev.parent);
+	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
 	struct host1x_client *client;
+	u64 mask;
 	int err;

 	mutex_lock(&device->clients_lock);

 	list_for_each_entry(client, &device->clients, list) {
+		mask = dma_get_mask(client->dev);
+
+		if (!domain && !host->info->has_wide_gather &&
+		    mask > DMA_BIT_MASK(32)) {
+			err = dma_coerce_mask_and_coherent(client->dev,
+							   DMA_BIT_MASK(32));
+			if (err < 0) {
+				dev_err(&device->dev,
+					"failed to set DMA mask: %d\n", err);
+				goto teardown;
+			}
+		}
+
 		if (client->ops && client->ops->init) {
 			err = client->ops->init(client);
 			if (err < 0) {

> However, with an IOMMU enabled for clients, we can run into the case
> where sparse pages would be allocated via shmem and end up beyond the
> 32-bit boundary. If the host1x is not attached to an IOMMU, there's no
> way for it to access these pages with standard GATHER opcodes.
> 
> This is what used to happen prior to this change when the host1x
> firewall was enabled.

I assume you're meaning the *disabled* firewall because gathers are
copied into the Host1x's DMA buffer if firewall is enabled.

> Since we were not attaching it to an IOMMU in that
> case, we would end up with sparse buffers allocated from pages that the
> host1x couldn't address.

Could you please clarify what do you mean by the shmem? If you mean the
get_pages(), then one option will be to simply enforce Host1x firewall
in order to get gathers always copied, and thus, we can remove the
TEGRA_HOST1X_FIREWALL Kconfig option.

The other option could be *not* to use get_pages() whenever it can't work.

It also should be worthwhile to simply restrict kernel's configuration
by disabling all functionality which requires IOMMU if IOMMU is not
available. Just error out with informative message in that case, telling
that kernel/device-tree configuration is wrong.

Please notice that grate-drivers do not work with the disabled firewall
anyways, because it's simply impractical/unusable from a userspace
perspective to track liveness of the gather BO.

Also, take a look at the grate-kernel's WIP patch where Host1x is
separated from Tegra DRM and Host1x maintains gather buffer allocations
by itself. In the case of legacy UAPI, the gather commands are copied
from DRM GEM to the Host1x's BO, similarly to what we have now in
upstream for the case of the enabled firewall, but in a more optimized
manner (using pre-allocated pool and etc). I'd like to see the upstream
driver doing the same if we won't end up dropping the legacy staging
UAPI entirely.

>> What about the reverse scenario? You won't be able to patch cmdstream
>> properly for >32bit addresses.
> 
> I don't think that scenario exists. I'm not aware of a Tegra device that
> has system memory outside of the CPU-addressable region.
> 
>> The root of the problem is that Tegra DRM UAPI doesn't support 64bit
>> addresses, so you can't use "wide" opcodes and can't patch cmdstream.
> 
> There's nothing in the UAPI that deals with addresses directly. We only
> pass around handles and these are resolved to buffer objects in the
> kernel where the address of the buffers can be 32-bit or 64-bit.

If buffer is 64-bit, then client's address-registers take two 32-bit
hardware registers, and thus, cmdstream should reserve two words for the
patched handles.

Host1x driver doesn't support patching of 64-bit addresses and there is
no way to tell userspace to reserve two words per-address using the
current UAPI.

> And we do in fact support wide opcodes and patch command streams just
> fine on 64-bit systems.
> 
> I mean, it's not like I've been doing this just for the fun of it. There
> are actual configurations where this is needed in order for it to work.
> 
>> Perhaps it is better not to add any new things or quirks to the Host1x /
>> Tegra DRM for now. The drivers need a serious clean up, otherwise mess
>> only continues to grow up. Don't you think so?
> 
> This isn't anything new or a quirk. This is bug fixes to ensure that the
> driver works in (now hopefully) all configurations. Previously it was a
> matter of getting the configuration just right in order for it to work.
> All the work I did here (starting with the wide opcode support and then
> the DMA API and IOMMU work) was to make sure it would safely work in any
> setup.
> 
> And I do consider these changes to also be cleanups and incremental
> improvements of what the state was before. Again, I don't consider a
> rewrite a serious cleanup.
> 
> I'm fully aware that the driver has been collecting dust for a while and
> it isn't perfect. But it's also not overly messy. It's perhaps a bit
> more complex than your average driver, but it's also some pretty complex
> hardware.
...
>> Doesn't sound good to me.. this is not going to be good for GPU drivers.
>> All cache maintenance should be in control of userspace, the userspace
>> should be telling kernel driver when it needs to get CPU access and when
>> to finish the access. DMABUF has generic UAPI for the synchronizations,
>> although a mature GPU driver may need more than that.
> 
> I agree. But that's not something that we can do at this point. We don't
> have a way of passing information such as this to the driver, so the
> driver has to assume that caches are dirty for all buffers, otherwise it
> will not be able to guarantee that random cache flushes won't corrupt
> the data that it's passing to the hardware.
> 
> So yes, when we do have a way of explicitly flushing the caches for
> buffers, then we can add a mechanism to pass that information to the
> kernel so that it can optimize. But until then we just can't be sure.

It worked fine for the last 7 years on T20/30 and continues to work if
the offending patches are reverted, no problems here.

> And I prefer a kernel driver that gives me slow and reliable, rather
> than fast but unpredictable results.

Your changes introduced regression on Tegra20/30 without solving any
problem. This is unacceptable, it's not an improvement, please don't do
it :)

Apparently some problem exists only for later Tegra generations, but I
still can't recognize what it's is from your words.

Please give a concrete real-world example of the trouble you're trying
to solve.

>> Today Tegra DRM driver supports only write-combined BO allocations, and
>> thus, we don't need to do more than to flush CPU buffers before
>> executing HW job.
> 
> That's only true when you allocate with the DMA API. When you allocate
> from a shmem mapping you don't get write-combined memory, so we do have
> to perform cache maintenance on the pages.

In all cases GEM's memory mappings are write-combined, including the
case of get_pages(). Host1x driver performs the maintenance by doing
wmb() before executing job.

If you're meaning that imported buffers could be backed by cacheable
pages, then userspace must use DMABUF syncing UAPI or any other means to
sync CPU cache, there is no way around it. You should assume that yours
userpspace is broken if it doesn't do it.

>>>> Please let me know if you're going to fix the problems or if you'd
>>>> prefer me to create the patches.
>>>>
>>>> Here is a draft of the fix for #2, it doesn't cover case of imported
>>>> buffers (which should be statically mapped, IIUC):
>>>>
>>>> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device
>>>> *dev, struct host1x_bo *bo,
>>>>          * If we've manually mapped the buffer object through the IOMMU,
>>>> make
>>>>          * sure to return the IOVA address of our mapping.
>>>>          */
>>>> -       if (phys && obj->mm) {
>>>> +       if (phys && (obj->mm || obj->vaddr)) {
>>>>                 *phys = obj->iova;
>>>
>>> This doesn't work for the case where we use the DMA API for mapping. Or
>>> at least it isn't going to work in the general case.
>>
>> Right, looks like I'll need to update my memory about the DMA API usage.
>>
>>> The reason is because obj->iova is only valid for whatever the device was that mapped
>>> or allocated the buffer, which in this case is the host1x device, which
>>> isn't even a real device, so it won't work. The only case where it does
>>> work is if we're not behind an IOMMU, so obj->iova will actually be the
>>> physical address.
>>
>> But why do you need to dynamically map/unmap the statically-allocated
>> buffers on each job submission, could you please explain what is the
>> point? Perhaps it's a temporary workaround just to get a minimum of
>> things working for the case of implicit IOMMU?
> 
> It's primarily because we don't really know if a buffer has been mapped
> for a specific device. We always map at allocation time for the Tegra
> DRM parent device (which isn't a real device) but before it's used by
> any other host1x client, it has to be mapped for that device as well.

> That's important in case any of these devices have different IOMMU
> domains.

This case should be covered by the solution I'm proposing below.

Today's mainline kernel has all clients in the same shared IOMMU domain.
Please note that I'm focused on solving the primary regression problem,
the potential future problems are secondary.

> Actually, given that the device isn't a real device, the DMA handle
> returned from dma_alloc_wc() is actually a physical address and not
> valid for any device behind an IOMMU. So even in the case where we
> share an IOMMU domain among multiple device, the mapping created by
> dma_alloc_wc() is useless for them.

What about to pick any DRM sub-device and then statically map DMA buffer
for this picked device during of BO allocation / import? All other DRM
sub-devices can access that mapping because of the shared domain.

> Because of the above and probably a bunch of other reasons, it's also a
> requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
> API will start printing a bunch of errors if you violate those and they
> typically indicate that what you're doing may not work. That doesn't
> mean it can't work, but it usually only does so accidentally.
> 
>> All buffers should be statically allocated and statically mapped, and
>> when there is a need to sync an already mapped buffer, the dma_sync_*
>> API should be used.
> 
> That's my understanding as well. However, it's slightly more complicated
> than that. Once you move away from the assumption that a mapping for a
> buffer the same for all devices, you can no longer do that. For example,
> while the statically allocated buffer may be mapped for the Tegra DRM
> device (again, it's not a real device), it's not mapped for any of the
> clients yet.

The same what I wrote above. Just pick any DRM sub-device and map buffer
for that device during allocation / import.

> So before a client can use it, its driver has to go and map
> the buffer for the device. The logical point to do that is during
> host1x_job_pin(). Once the client no longer has a use for the buffer it
> should also unmap the buffer again because it will otherwise occupy the
> IOVA space unnecessarily. The logical point to do that is during
> host1x_job_unpin().

The IOVA space problem really exists only for Tegra20.

I suppose that page table size is ~10-20MB per 1GB of allocations, this
is not something to worry about.

> host1x_job_unpin() is also the point at which the job releases its
> reference to the buffer, so the backing memory could go away at any
> point after that, which means that the IOVA mapping could point at
> invalid memory if we didn't unmap the buffer.

Backing memory can't go away before buffer is unmapped, it is a refcount
bug otherwise.

> I'm not aware of an easy way to optimize this while at the same time
> making sure that everything is still consistent. I suppose one way to do
> this would be to keep a cache of buffer objects and their mappings for
> each device and avoid mapping/unmapping them for every job. The problem
> with that is that we also don't want to hold on to buffer objects
> indefinitely because that will potentially cause a lot of memory and
> IOVA space to be used.

The case of explicit IOMMU is nicer because we can use DRM's MM scanner
to maintain a limited window of IOMMU mappings. Take a look at
https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/gart.c
for example.

The LRU cache should be easy to implement, it also should be useful if
we'll get around to supporting memory shrinker (BO swap-out).

>> Like I said above, the syncing should be done by userspace for the
>> buffers that are in control of userspace.
> 
> Yes, I agree. We already have an implementation of the .begin_cpu_access
> and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
> something like that for native buffers. Alternatively, I suppose user-
> space could be required to flush/invalidate using the DMA-BUF, but that
> potentially has the drawback of having to export a DMA-BUF for every
> single buffer.> A better option may be to add a driver-specific IOCTL to do cache
> maintenance. I think other drivers already do that.
Perhaps.. but it needs a real-world use-case first.

>>> So what this basically ends up doing is avoid dma_map_*() all the time,
>>> which I guess is what you're trying to achieve. But it also gives you
>>> the wrong I/O virtual address in any case where an IOMMU is involved.
>>> Also, as discussed above, avoiding cache maintenance isn't correct.
>>
>> Alright, then right now we need to bypass the dma_map_*() in a case of a
>> non-implicit IOMMU, in order to bring back the good old behavior (at
>> least temporary, until there will be a more comprehensive solution).
> 
> But it's not good old behaviour. You're still going to side-step the
> cache maintenance and violate the DMA API semantics.

The DMA API usage that you introduced in the v5.5 is wrong too because
it makes driver unusable :)

From a user's perspective it is exactly the good old behaviour. I tried
to load KDE Plasma 5 using Xorg Opentegra driver, which worked perfectly
fine before, and now it is absolutely unusable.

There are couple things we could do in order to prevent this situation
in the future:

  1. Better automated testing. The grate-tests could be useful for this
particular case, once it will get support for the offscreen testing.

  2. Better review and real-world testing of the patches. Don't you mind
if I'll propose myself as a reviewer for the Tegra DRM / Host1x drivers?

I'm suggesting to take a step back and revert to the old behaviour for
now, at least on T20/30. I don't know about ARM64, but ARM32 permits to
violate the DMA API semantics for the case of unavailable IOMMU, of
course that may change in the future and by that time I hope we'll
manage to get the DMA API usage done right.

The solution below restores old behavior for the case of unavailable
IOMMU and for the case of explicit IOMMU domain, it should work in the
case of implicit IOMMU as well.

>> What do you think about this variant:
>>
>> --- >8 ---
>> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
>> index 1237df157e05..555a6424e9fa 100644
>> --- a/drivers/gpu/drm/tegra/gem.c
>> +++ b/drivers/gpu/drm/tegra/gem.c
>> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device
>> *dev, struct host1x_bo *bo,
>>  				     dma_addr_t *phys)
>>  {
>>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
>> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
>>  	struct sg_table *sgt;
>>  	int err;
>>
>> -	/*
>> -	 * If we've manually mapped the buffer object through the IOMMU, make
>> -	 * sure to return the IOVA address of our mapping.
>> -	 */
>> -	if (phys && obj->mm) {
>> -		*phys = obj->iova;
>> -		return NULL;
>> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
>> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */

>> +		if (obj->vaddr) {

If this is unacceptable on ARM64 (dma_handle != phys_addr), then it
could be:

		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {

>> +			*phys = obj->iova;
>> +			return NULL;
>> +		}
>> +
>> +		/*
>> +		 * If we've manually mapped the buffer object through the
>> +		 * IOMMU, make sure to return the IOVA address of our mapping.
>> +		 */
>> +		if (obj->mm) {
>> +			*phys = obj->iova;
>> +			return NULL;
>> +		}
>>  	}
>>
>>  	/*
>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
>> index 6d1872ddef17..91304b9034fa 100644
>> --- a/drivers/gpu/drm/tegra/plane.c
>> +++ b/drivers/gpu/drm/tegra/plane.c
>> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>
>>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
>>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>> +		struct sg_table *sgt;
>>
>> -		if (!dc->client.group) {
>> -			struct sg_table *sgt;
>> -
>> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
>> -			if (IS_ERR(sgt)) {
>> -				err = PTR_ERR(sgt);
>> -				goto unpin;
>> -			}
>> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
>> +		if (IS_ERR(sgt)) {
>> +			err = PTR_ERR(sgt);
>> +			goto unpin;
>> +		}
>>
>> +		if (sgt) {
>>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
>>  					 DMA_TO_DEVICE);
>>  			if (err == 0) {
>> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>
>>  			state->iova[i] = sg_dma_address(sgt->sgl);
>>  			state->sgt[i] = sgt;
>> -		} else {
>> -			state->iova[i] = bo->iova;
>>  		}
>>  	}
>>
>> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
>>  		struct sg_table *sgt = state->sgt[i];
>>
>> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
>> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
>> +		if (sgt) {
>> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
>> +				     DMA_TO_DEVICE);
>> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
>> +		}
>>
>>  		state->iova[i] = DMA_MAPPING_ERROR;
>>  		state->sgt[i] = NULL;
>> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
>> index 60b2fedd0061..538c0f0b40a4 100644
>> --- a/drivers/gpu/host1x/job.c
>> +++ b/drivers/gpu/host1x/job.c
>> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>>
>>  	for (i = 0; i < job->num_relocs; i++) {
>>  		struct host1x_reloc *reloc = &job->relocs[i];
>> -		dma_addr_t phys_addr, *phys;
>> +		dma_addr_t phys_addr;
>>  		struct sg_table *sgt;
>>
>>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
>> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>>  			goto unpin;
>>  		}
>>
>> -		if (client->group)
>> -			phys = &phys_addr;
>> -		else
>> -			phys = NULL;
>> -
>> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
>> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
>>  		if (IS_ERR(sgt)) {
>>  			err = PTR_ERR(sgt);
>>  			goto unpin;
>> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>>  		job->num_unpins++;
>>  	}
>>
>> +	/*
>> +	 * In the case of an enabled firewall, all user gathers are copied into
>> +	 * the secured job->gather_copy_mapped.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
>> +		return 0;
>> +
>>  	for (i = 0; i < job->num_gathers; i++) {
>>  		struct host1x_job_gather *g = &job->gathers[i];
>>  		size_t gather_size = 0;
>> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>>  			goto unpin;
>>  		}
>>
>> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
>> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
>>  		if (IS_ERR(sgt)) {
>>  			err = PTR_ERR(sgt);
>>  			goto unpin;
>>  		}
>>
>> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
>> +		if (host->domain) {
>>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
>>  				gather_size += sg->length;
>>  			gather_size = iova_align(&host->iova, gather_size);
>> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
>>
>>  			job->unpins[job->num_unpins].size = gather_size;
>>  			phys_addr = iova_dma_addr(&host->iova, alloc);
>> -		} else {
>> +		} else if (sgt) {
>>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
>>  					 DMA_TO_DEVICE);
>>  			if (!err) {
>>
>> --- >8 ---


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
  2020-02-02 21:00                     ` Dmitry Osipenko
@ 2020-02-03 12:56                         ` Thierry Reding
  -1 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2020-02-03 12:56 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA, DRI Development

[-- Attachment #1: Type: text/plain, Size: 31689 bytes --]

On Mon, Feb 03, 2020 at 12:00:49AM +0300, Dmitry Osipenko wrote:
> On 30.01.2020 15:08, Thierry Reding wrote:
> ...
> >> Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
> > 
> > We already do that. If you look at host1x_iommu_init() in
> > drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
> > available and the host1x doesn't support wide GATHER opcodes, then
> > we limit the DMA Mask to 32 bits.
> > 
> > But that's not enough, see below.
> > 
> >> I'm a bit puzzled by the actual need to support the case where Host1x
> >> is backed by an IOMMU and the clients are not. How could we ever end up
> >> with this scenario in the upstream kernel?
> > 
> > That's not what we're doing here. The fundamental problem is that we
> > have a couple of generations where the hardware is mismatched in that
> > clients support 34-bit addresses while host1x can only use 32-bit
> > addresses in the GATHER opcode. The only way to get around this mismatch
> > is by using an IOMMU.
> 
> Isn't it possible to limit the DMA mask to 32-bit for both the clients
> and Host1x?

Many things are possible, but I'm trying to find the best and most
intuitive way of encoding these restrictions in the driver. The problem
isn't so much that the hardware can only access 32-bits because the
memory interface is larger. It's more that the platform requires 32-bit
addresses because of integration mismatches like the above.

> diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
> index 6a995db51d6d..4253dd53ea2e 100644
> --- a/drivers/gpu/host1x/bus.c
> +++ b/drivers/gpu/host1x/bus.c
> @@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct host1x_device *device,
>   */
>  int host1x_device_init(struct host1x_device *device)
>  {
> +	struct host1x *host = dev_get_drvdata(device->dev.parent);
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
>  	struct host1x_client *client;
> +	u64 mask;
>  	int err;
> 
>  	mutex_lock(&device->clients_lock);
> 
>  	list_for_each_entry(client, &device->clients, list) {
> +		mask = dma_get_mask(client->dev);
> +
> +		if (!domain && !host->info->has_wide_gather &&
> +		    mask > DMA_BIT_MASK(32)) {
> +			err = dma_coerce_mask_and_coherent(client->dev,
> +							   DMA_BIT_MASK(32));
> +			if (err < 0) {
> +				dev_err(&device->dev,
> +					"failed to set DMA mask: %d\n", err);
> +				goto teardown;
> +			}
> +		}
> +
>  		if (client->ops && client->ops->init) {
>  			err = client->ops->init(client);
>  			if (err < 0) {

So yes, something like this would probably work as well, but it's not an
accurate description of what's going on, in my opinion.

> > However, with an IOMMU enabled for clients, we can run into the case
> > where sparse pages would be allocated via shmem and end up beyond the
> > 32-bit boundary. If the host1x is not attached to an IOMMU, there's no
> > way for it to access these pages with standard GATHER opcodes.
> > 
> > This is what used to happen prior to this change when the host1x
> > firewall was enabled.
> 
> I assume you mean the *disabled* firewall, because gathers are
> copied into the Host1x DMA buffer if the firewall is enabled.
> 
> > Since we were not attaching it to an IOMMU in that
> > case, we would end up with sparse buffers allocated from pages that the
> > host1x couldn't address.
> 
> Could you please clarify what you mean by shmem? If you mean
> get_pages(), then one option would be to simply enforce the Host1x
> firewall so that gathers are always copied, and thus we could remove the
> TEGRA_HOST1X_FIREWALL Kconfig option.

The firewall is useful for other things than just copying the memory.
That said, I agree that always copying gathers doesn't sound like a bad
idea.

> The other option could be *not* to use get_pages() whenever it can't work.

That's basically what we're doing. drm_gem_get_pages() is only used when
IOMMU support is enabled. Otherwise the DMA API will be used and that
will respect the DMA mask that was previously configured.
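
In other words, the allocation path splits roughly like this (a
simplified sketch of the logic, not the literal code):

/* Sketch of the BO allocation split described above. */
if (tegra->domain) {
	/* IOMMU available: sparse shmem-backed pages, to be mapped
	 * through the IOMMU */
	bo->pages = drm_gem_get_pages(&bo->gem);
} else {
	/* no IOMMU: contiguous write-combined memory from the DMA API,
	 * which honours the DMA mask configured earlier */
	bo->vaddr = dma_alloc_wc(drm->dev, bo->gem.size, &bo->iova,
				 GFP_KERNEL | __GFP_NOWARN);
}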

So another way you could think of this is that we set the logical host1x
device's DMA mask to the intersection of the DMA masks for all the sub-
devices. That ensures that all devices can access the memory allocated
by the DMA API.
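
Expressed as code, that would be something along these lines (sketch
only):

/* Narrow the logical device's DMA mask to what every client can
 * address. */
u64 mask = DMA_BIT_MASK(64);

list_for_each_entry(client, &device->clients, list)
	mask &= dma_get_mask(client->dev);

err = dma_coerce_mask_and_coherent(&device->dev, mask);
if (err < 0)
	dev_err(&device->dev, "failed to set DMA mask: %d\n", err);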

> It should also be worthwhile to simply restrict the kernel configuration
> by disabling all functionality that requires an IOMMU if no IOMMU is
> available. Just error out with an informative message in that case,
> telling the user that the kernel/device-tree configuration is wrong.

What we currently do is to gracefully fall back to CMA allocations if
the IOMMU functionality isn't there. That sounds like a better option
to me than to fail and request people to rebuild the kernel.

Also note that we don't usually know at compile-time whether or not an
IOMMU is enabled.

> Please note that the grate drivers do not work with the firewall
> disabled anyway, because it's simply impractical/unusable from a
> userspace perspective to track the liveness of the gather BO.

Um... that's certainly a little strange. There's never going to be a
reliable way for userspace to detect whether or not the firewall is
enabled. Userspace also shouldn't care, because nothing in the
UABI should in any way indicate the presence of the firewall.

What exactly makes it impractical/unusable to track liveness of a gather
BO? The buffer can't go away until you destroy the GEM object, so it
sounds rather simple to me. As for ownership, userspace needs to
consider the gather buffer as owned by the kernel/hardware until the job
to which it was submitted has finished running.

> Also, take a look at the grate-kernel's WIP patch where Host1x is
> separated from Tegra DRM and Host1x maintains gather buffer allocations
> by itself. In the case of the legacy UAPI, the gather commands are
> copied from the DRM GEM to the Host1x BO, similarly to what we have now
> in upstream for the case of the enabled firewall, but in a more
> optimized manner (using a pre-allocated pool, etc.). I'd like to see the
> upstream driver doing the same if we don't end up dropping the legacy
> staging UAPI entirely.

Do you have any pointers?

> >> What about the reverse scenario? You won't be able to patch the
> >> cmdstream properly for >32-bit addresses.
> > 
> > I don't think that scenario exists. I'm not aware of a Tegra device that
> > has system memory outside of the CPU-addressable region.
> > 
> >> The root of the problem is that the Tegra DRM UAPI doesn't support
> >> 64-bit addresses, so you can't use "wide" opcodes and can't patch the
> >> cmdstream.
> > 
> > There's nothing in the UAPI that deals with addresses directly. We only
> > pass around handles and these are resolved to buffer objects in the
> > kernel where the address of the buffers can be 32-bit or 64-bit.
> 
> If a buffer address is 64-bit, then the client's address registers take
> two 32-bit hardware registers, and thus the cmdstream should reserve two
> words for each patched handle.
> 
> The Host1x driver doesn't support patching of 64-bit addresses and there
> is no way to tell userspace to reserve two words per address using the
> current UAPI.

That's only partially true. Yes, we don't support patching of 64-bit
addresses, but that's because we don't have to. On SoC generations where
the memory interface is wider than 32 bits, the host1x clients use a
relocation shift (typically of 8 bits) to accommodate the 32-bit width
restriction of the registers.

We currently deal with that in the ABI and I'm honestly not sure that's
a good solution. But it works perfectly fine with 64-bit BO addresses.
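
As a worked example (the address is made up): with an 8-bit relocation
shift, the kernel patches the right-shifted IOVA into the command stream
and the client shifts it back internally, so a 32-bit register covers a
40-bit range at 256-byte granularity:

u64 iova = 0x340000000ULL;	/* 34-bit buffer address */
u32 reg  = (u32)(iova >> 8);	/* value patched into the stream */
u64 back = (u64)reg << 8;	/* address the client computes */
/* back == iova, provided the buffer is 256-byte aligned */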

> > And we do in fact support wide opcodes and patch command streams just
> > fine on 64-bit systems.
> > 
> > I mean, it's not like I've been doing this just for the fun of it. There
> > are actual configurations where this is needed in order for it to work.
> > 
> >> Perhaps it is better not to add any new things or quirks to the Host1x /
> >> Tegra DRM for now. The drivers need a serious clean-up, otherwise the
> >> mess only continues to grow. Don't you think so?
> > 
> > This isn't anything new or a quirk. These are bug fixes to ensure that the
> > driver works in (now hopefully) all configurations. Previously it was a
> > matter of getting the configuration just right in order for it to work.
> > All the work I did here (starting with the wide opcode support and then
> > the DMA API and IOMMU work) was to make sure it would safely work in any
> > setup.
> > 
> > And I do consider these changes to also be cleanups and incremental
> > improvements of what the state was before. Again, I don't consider a
> > rewrite a serious cleanup.
> > 
> > I'm fully aware that the driver has been collecting dust for a while and
> > it isn't perfect. But it's also not overly messy. It's perhaps a bit
> > more complex than your average driver, but it's also some pretty complex
> > hardware.
> ...
> >> Doesn't sound good to me... this is not going to be good for GPU drivers.
> >> All cache maintenance should be under the control of userspace; the
> >> userspace should be telling the kernel driver when it needs to get CPU
> >> access and when it is done with the access. DMABUF has a generic UAPI for
> >> the synchronizations, although a mature GPU driver may need more than that.
> > 
> > I agree. But that's not something that we can do at this point. We don't
> > have a way of passing information such as this to the driver, so the
> > driver has to assume that caches are dirty for all buffers, otherwise it
> > will not be able to guarantee that random cache flushes won't corrupt
> > the data that it's passing to the hardware.
> > 
> > So yes, when we do have a way of explicitly flushing the caches for
> > buffers, then we can add a mechanism to pass that information to the
> > kernel so that it can optimize. But until then we just can't be sure.
> 
> It worked fine for the last 7 years on T20/30 and continues to work if
> the offending patches are reverted; no problems here.

I do recall problems with cache maintenance during early development.
Note that I've never seen these issues when using only the DMA API. But
with explicit IOMMU usage there would typically be short horizontal
lines that were the wrong color because CPU caches hadn't been properly
flushed. We used to work around that by using dma_sync_sg_for_device(),
but that causes the DMA API to report abuses in debug mode.
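
For reference, the pattern that the DMA API expects for a buffer the CPU
touches between jobs is roughly the following (a sketch; the device and
scatterlist names are placeholders):

nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);

/* ... device uses the buffer ... */

dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
/* ... CPU reads/writes the buffer ... */
dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);

/* ... device uses the buffer again ... */

dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);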

Again, back at the time I was working on Tegra20 and Tegra30, and I was
definitely seeing those cache-related issues there as well.

> > And I prefer a kernel driver that gives me slow and reliable, rather
> > than fast but unpredictable results.
> 
> Your changes introduced a regression on Tegra20/30 without solving any
> problem. This is unacceptable; it's not an improvement, please don't do
> it :)

It's perhaps a performance regression, but functionally it should still
be working fine. And see above for the issues that this solves.

Now, it may be that there are some cases where cache maintenance at this
level is not necessary, but keep in mind that we have a generic driver
that needs to work properly on many different setups at the same time. A
mechanism like the DMA API debug facilities is there for a reason. It
points out when you take shortcuts that may only work under some
circumstances but break terribly under others.

So I consider this an improvement in that it ensures that things always
work correctly, regardless of any assumptions.

> Apparently some problem exists only for later Tegra generations, but I
> still can't tell what it is from your words.
> 
> Please give a concrete real-world example of the trouble you're trying
> to solve.

See above. And, again, this is a problem that also exists on earlier
generations. I /think/ I've never seen cache-related issues when the
IOMMU is disabled, and that's probably because we have write-combine
mappings in that case. What you're saying seems to confirm that.

> >> Today the Tegra DRM driver supports only write-combined BO allocations,
> >> and thus we don't need to do more than flush the CPU buffers before
> >> executing a HW job.
> > 
> > That's only true when you allocate with the DMA API. When you allocate
> > from a shmem mapping you don't get write-combined memory, so we do have
> > to perform cache maintenance on the pages.
> 
> In all cases the GEM's memory mappings are write-combined, including the
> case of get_pages(). The Host1x driver performs the maintenance by doing
> a wmb() before executing the job.

No, that's just not true. You don't get write-combined mappings for
drm_gem_get_pages(). The wmb() doesn't help in that case.
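
For comparison, a write-combined CPU mapping has to be requested
explicitly in the mmap/fault path, along the lines of this sketch:

/* shmem pages are cacheable unless the mapping asks otherwise */
vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));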

> If you mean that imported buffers could be backed by cacheable pages,
> then userspace must use the DMABUF syncing UAPI or some other means to
> sync the CPU cache; there is no way around it. You should assume that
> your userspace is broken if it doesn't do that.

I don't think it's supposed to work that way. Again, in practice this
may work fine most of the time. However, in the more general case it may
not.

The DMA-BUF sync UABI is really all about userspace synchronizing CPU
accesses with the exporter of the DMA-BUF. So, yes, if userspace makes
modifications to the buffer, it needs to tell the exporter about it with
the sync IOCTL.

But importers of the DMA-BUF aren't automatically on the safe side. So
there could be cases where an importer needs to do additional cache
maintenance in order to properly use a given buffer.

I'm not aware of any cases like that on Tegra, but that shouldn't be an
excuse for taking shortcuts. We should still be using the APIs in the
way they were intended. If they end up doing too much or even the wrong
thing, then I think we need to address that in their implementation and
not by abusing the API.

> >>>> Please let me know if you're going to fix the problems or if you'd
> >>>> prefer me to create the patches.
> >>>>
> >>>> Here is a draft of the fix for #2, it doesn't cover case of imported
> >>>> buffers (which should be statically mapped, IIUC):
> >>>>
> >>>> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
> >>>>          * If we've manually mapped the buffer object through the IOMMU, make
> >>>>          * sure to return the IOVA address of our mapping.
> >>>>          */
> >>>> -       if (phys && obj->mm) {
> >>>> +       if (phys && (obj->mm || obj->vaddr)) {
> >>>>                 *phys = obj->iova;
> >>>
> >>> This doesn't work for the case where we use the DMA API for mapping. Or
> >>> at least it isn't going to work in the general case.
> >>
> >> Right, looks like I'll need to update my memory about the DMA API usage.
> >>
> >>> The reason is that obj->iova is only valid for whatever device mapped
> >>> or allocated the buffer, which in this case is the host1x device, which
> >>> isn't even a real device, so it won't work. The only case where it does
> >>> work is if we're not behind an IOMMU, so obj->iova will actually be the
> >>> physical address.
> >>
> >> But why do you need to dynamically map/unmap the statically allocated
> >> buffers on each job submission? Could you please explain what the
> >> point is? Perhaps it's a temporary workaround just to get a minimum of
> >> things working for the case of an implicit IOMMU?
> > 
> > It's primarily because we don't really know if a buffer has been mapped
> > for a specific device. We always map at allocation time for the Tegra
> > DRM parent device (which isn't a real device) but before it's used by
> > any other host1x client, it has to be mapped for that device as well.
> 
> > That's important in case any of these devices have different IOMMU
> > domains.
> 
> This case should be covered by the solution I'm proposing below.
> 
> Today's mainline kernel has all clients in the same shared IOMMU domain.
> Please note that I'm focused on solving the primary regression problem,
> the potential future problems are secondary.

Again, that's only true up to and including Tegra210. For Tegra186 and
Tegra194 that's no longer possible.

> > Actually, given that the device isn't a real device, the DMA handle
> > returned from dma_alloc_wc() is actually a physical address and not
> > valid for any device behind an IOMMU. So even in the case where we
> > share an IOMMU domain among multiple device, the mapping created by
> > dma_alloc_wc() is useless for them.
> 
> What about picking any DRM sub-device and then statically mapping the
> DMA buffer for that device during BO allocation / import? All other DRM
> sub-devices can access that mapping because of the shared domain.

Again, this only works if all devices share a common IOMMU domain. Since
that's not the case we can't do it.

> > Because of the above and probably a bunch of other reasons, it's also a
> > requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
> > API will start printing a bunch of errors if you violate those and they
> > typically indicate that what you're doing may not work. That doesn't
> > mean it can't work, but it usually only does so accidentally.
> > 
> >> All buffers should be statically allocated and statically mapped, and
> >> when there is a need to sync an already mapped buffer, the dma_sync_*
> >> API should be used.
> > 
> > That's my understanding as well. However, it's slightly more complicated
> > than that. Once you move away from the assumption that a mapping for a
> > buffer is the same for all devices, you can no longer do that. For example,
> > while the statically allocated buffer may be mapped for the Tegra DRM
> > device (again, it's not a real device), it's not mapped for any of the
> > clients yet.
> 
> The same as what I wrote above. Just pick any DRM sub-device and map the
> buffer for that device during allocation / import.

That won't work in the general case where we have separate IOMMU
domains, because you need to map the buffer once for each domain.

> > So before a client can use it, its driver has to go and map
> > the buffer for the device. The logical point to do that is during
> > host1x_job_pin(). Once the client no longer has a use for the buffer it
> > should also unmap the buffer again because it will otherwise occupy the
> > IOVA space unnecessarily. The logical point to do that is during
> > host1x_job_unpin().
> 
> The IOVA space problem really exists only for Tegra20.
> 
> I suppose that the page table size is ~10-20 MB per 1 GB of allocations,
> which is not something to worry about.

The IOVA space problem is going to be real for other devices once we
start leaking IOVA memory addresses.

> > host1x_job_unpin() is also the point at which the job releases its
> > reference to the buffer, so the backing memory could go away at any
> > point after that, which means that the IOVA mapping could point at
> > invalid memory if we didn't unmap the buffer.
> 
> Backing memory can't go away before the buffer is unmapped; otherwise it
> is a refcounting bug.

Right. However, my point was that if you don't unmap the buffer at
host1x_job_unpin() time, when would you unmap it? It's really at unpin
time that you know the buffer won't be used anymore. It might be used
again in the future, but the driver doesn't know that. Similarly, the
driver only knows at host1x_job_pin() time that a buffer will be used,
and by what device. It's not possible to predict ahead of time when a
buffer will be used, by which device and for how long.

> > I'm not aware of an easy way to optimize this while at the same time
> > making sure that everything is still consistent. I suppose one way to do
> > this would be to keep a cache of buffer objects and their mappings for
> > each device and avoid mapping/unmapping them for every job. The problem
> > with that is that we also don't want to hold on to buffer objects
> > indefinitely because that will potentially cause a lot of memory and
> > IOVA space to be used.
> 
> The case of explicit IOMMU is nicer because we can use DRM's MM scanner
> to maintain a limited window of IOMMU mappings. Take a look at
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/gart.c
> for example.
> 
> An LRU cache should be easy to implement, and it should also be useful
> if we get around to supporting a memory shrinker (BO swap-out).
> 
> >> Like I said above, the syncing should be done by userspace for the
> >> buffers that are in control of userspace.
> > 
> > Yes, I agree. We already have an implementation of the .begin_cpu_access
> > and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
> > something like that for native buffers. Alternatively, I suppose user-
> > space could be required to flush/invalidate using the DMA-BUF, but that
> > potentially has the drawback of having to export a DMA-BUF for every
> > single buffer.
> > 
> > A better option may be to add a driver-specific IOCTL to do cache
> > maintenance. I think other drivers already do that.
> Perhaps, but it needs a real-world use case first.
> 
> >>> So what this basically ends up doing is avoid dma_map_*() all the time,
> >>> which I guess is what you're trying to achieve. But it also gives you
> >>> the wrong I/O virtual address in any case where an IOMMU is involved.
> >>> Also, as discussed above, avoiding cache maintenance isn't correct.
> >>
> >> Alright, then right now we need to bypass dma_map_*() in the case of a
> >> non-implicit IOMMU, in order to bring back the good old behaviour (at
> >> least temporarily, until there is a more comprehensive solution).
> > 
> > But it's not good old behaviour. You're still going to side-step the
> > cache maintenance and violate the DMA API semantics.
> 
> The DMA API usage that you introduced in v5.5 is wrong too, because
> it makes the driver unusable :)

No, it's not wrong. It does the right thing from an API point of view.
If it causes a performance regression that's because previously we did
too little and just got lucky that it still worked. So I think we need
to find out why exactly we could get away with how things worked before
and find a way to replicate that behaviour while still allowing the API
to be correctly used.

So if for whatever reason cache maintenance isn't necessary in this
case, can we make the DMA API not perform the cache maintenance? At this
point, I'm not even sure that cache maintenance is the issue. We'll need
to run some more tests to find out.
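
If it does turn out to be cache maintenance, one knob worth trying (an
untested sketch) would be DMA_ATTR_SKIP_CPU_SYNC for buffers that are
known to be write-combined:

/* Keep the IOVA mapping but skip the CPU cache maintenance. */
nents = dma_map_sg_attrs(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE,
			 DMA_ATTR_SKIP_CPU_SYNC);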

> From a user's perspective it is exactly the good old behaviour: I tried
> to load KDE Plasma 5 using the Xorg Opentegra driver, which worked
> perfectly fine before, and now it is absolutely unusable.
> 
> There are a couple of things we could do to prevent this situation
> in the future:
> 
>   1. Better automated testing. The grate-tests could be useful for this
> particular case, once they get support for offscreen testing.

Definitely. Right now we don't really have a good way of testing at all.
We also don't have a good way of benchmarking. All of that would
be a massive improvement.

FWIW, I've been holding back on the VIC test cases for libdrm because I
didn't want to port them to the old ABI, assuming we could fairly
quickly settle on a new ABI and get that merged. However, looking at the
current state of affairs, it may be worth porting them over to using the
old ABI so that we can also run tests on Tegra124 and later where grate
can't be supported anymore.

>   2. Better review and real-world testing of the patches. Would you mind
> if I propose myself as a reviewer for the Tegra DRM / Host1x drivers?

You already review most of the patches anyway. And it would've been
pretty hard to find out during review that these patches would cause a
performance regression like this. To be honest, I'm still surprised that
the impact is this big, since I'm not seeing any impact at all doing
things like modeset -v on something like Tegra210, Tegra186 or Tegra194.

Granted, that's probably not the best test case since we're not really
doing any cache maintenance (hopefully) after the first two page-flips.
However, it does entail mapping/unmapping a 32 MiB framebuffer to/from
the IOMMU per frame, which should account for quite a bit of work as
well.
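
(Back-of-the-envelope, assuming 4 KiB pages: 32 MiB is 8192 pages, so at
60 fps that is roughly half a million page-table entries written and
torn down per second.)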

I guess that if you do a lot of updates from userspace and have to flush
the cache a lot for every frame, that could have some impact. But you
were saying the difference was going from 100+ down to below 10 frames
per second, which seems a bit much for *only* cache maintenance. Sounds
to me like something else is going on here.

> I'm suggesting that we take a step back and revert to the old behaviour
> for now, at least on T20/30. I don't know about ARM64, but ARM32 permits
> violating the DMA API semantics when no IOMMU is available; of course
> that may change in the future, and by then I hope we'll have managed to
> get the DMA API usage right.

Okay, let's see if we can temporarily restore the performance on older
devices. But I'd also like to eventually figure out why the performance
is regressing that badly.

Thierry

> The solution below restores the old behaviour for the case of an
> unavailable IOMMU and for the case of an explicit IOMMU domain, and it
> should work in the case of an implicit IOMMU as well.
> 
> >> What do you think about this variant:
> >>
> >> --- >8 ---
> >> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
> >> index 1237df157e05..555a6424e9fa 100644
> >> --- a/drivers/gpu/drm/tegra/gem.c
> >> +++ b/drivers/gpu/drm/tegra/gem.c
> >> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
> >>  				     dma_addr_t *phys)
> >>  {
> >>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
> >> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
> >>  	struct sg_table *sgt;
> >>  	int err;
> >>
> >> -	/*
> >> -	 * If we've manually mapped the buffer object through the IOMMU, make
> >> -	 * sure to return the IOVA address of our mapping.
> >> -	 */
> >> -	if (phys && obj->mm) {
> >> -		*phys = obj->iova;
> >> -		return NULL;
> >> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
> >> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
> 
> >> +		if (obj->vaddr) {
> 
> If this is unacceptable on ARM64 (dma_handle != phys_addr), then it
> could be:
> 
> 		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {
> 
> >> +			*phys = obj->iova;
> >> +			return NULL;
> >> +		}
> >> +
> >> +		/*
> >> +		 * If we've manually mapped the buffer object through the
> >> +		 * IOMMU, make sure to return the IOVA address of our mapping.
> >> +		 */
> >> +		if (obj->mm) {
> >> +			*phys = obj->iova;
> >> +			return NULL;
> >> +		}
> >>  	}
> >>
> >>  	/*
> >> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> >> index 6d1872ddef17..91304b9034fa 100644
> >> --- a/drivers/gpu/drm/tegra/plane.c
> >> +++ b/drivers/gpu/drm/tegra/plane.c
> >> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >>
> >>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
> >>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> >> +		struct sg_table *sgt;
> >>
> >> -		if (!dc->client.group) {
> >> -			struct sg_table *sgt;
> >> -
> >> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
> >> -			if (IS_ERR(sgt)) {
> >> -				err = PTR_ERR(sgt);
> >> -				goto unpin;
> >> -			}
> >> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
> >> +		if (IS_ERR(sgt)) {
> >> +			err = PTR_ERR(sgt);
> >> +			goto unpin;
> >> +		}
> >>
> >> +		if (sgt) {
> >>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
> >>  					 DMA_TO_DEVICE);
> >>  			if (err == 0) {
> >> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >>
> >>  			state->iova[i] = sg_dma_address(sgt->sgl);
> >>  			state->sgt[i] = sgt;
> >> -		} else {
> >> -			state->iova[i] = bo->iova;
> >>  		}
> >>  	}
> >>
> >> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> >>  		struct sg_table *sgt = state->sgt[i];
> >>
> >> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
> >> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
> >> +		if (sgt) {
> >> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
> >> +				     DMA_TO_DEVICE);
> >> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
> >> +		}
> >>
> >>  		state->iova[i] = DMA_MAPPING_ERROR;
> >>  		state->sgt[i] = NULL;
> >> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> >> index 60b2fedd0061..538c0f0b40a4 100644
> >> --- a/drivers/gpu/host1x/job.c
> >> +++ b/drivers/gpu/host1x/job.c
> >> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> >>
> >>  	for (i = 0; i < job->num_relocs; i++) {
> >>  		struct host1x_reloc *reloc = &job->relocs[i];
> >> -		dma_addr_t phys_addr, *phys;
> >> +		dma_addr_t phys_addr;
> >>  		struct sg_table *sgt;
> >>
> >>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
> >> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> >>  			goto unpin;
> >>  		}
> >>
> >> -		if (client->group)
> >> -			phys = &phys_addr;
> >> -		else
> >> -			phys = NULL;
> >> -
> >> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> >> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
> >>  		if (IS_ERR(sgt)) {
> >>  			err = PTR_ERR(sgt);
> >>  			goto unpin;
> >> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> >>  		job->num_unpins++;
> >>  	}
> >>
> >> +	/*
> >> +	 * In the case of an enabled firewall, all user gathers are copied into
> >> +	 * the secured job->gather_copy_mapped.
> >> +	 */
> >> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
> >> +		return 0;
> >> +
> >>  	for (i = 0; i < job->num_gathers; i++) {
> >>  		struct host1x_job_gather *g = &job->gathers[i];
> >>  		size_t gather_size = 0;
> >> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> >>  			goto unpin;
> >>  		}
> >>
> >> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> >> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
> >>  		if (IS_ERR(sgt)) {
> >>  			err = PTR_ERR(sgt);
> >>  			goto unpin;
> >>  		}
> >>
> >> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
> >> +		if (host->domain) {
> >>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
> >>  				gather_size += sg->length;
> >>  			gather_size = iova_align(&host->iova, gather_size);
> >> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
> >>
> >>  			job->unpins[job->num_unpins].size = gather_size;
> >>  			phys_addr = iova_dma_addr(&host->iova, alloc);
> >> -		} else {
> >> +		} else if (sgt) {
> >>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
> >>  					 DMA_TO_DEVICE);
> >>  			if (!err) {
> >>
> >> --- >8 ---
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30
@ 2020-02-03 12:56                         ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2020-02-03 12:56 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: linux-tegra, DRI Development


[-- Attachment #1.1: Type: text/plain, Size: 31689 bytes --]

On Mon, Feb 03, 2020 at 12:00:49AM +0300, Dmitry Osipenko wrote:
> 30.01.2020 15:08, Thierry Reding пишет:
> ...
> >> Why not to set the DMA mask to 32bits if IOMMU is unavailable?
> > 
> > We already do that. If you look at host1x_iommu_init() in
> > drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is
> > available and the host1x doesn't support wide GATHER opcodes, then
> > we limit the DMA Mask to 32 bits.
> > 
> > But that's not enough, see below.
> > 
> >> I'm a bit puzzled by the actual need to support the case where Host1x is
> >> backed by IOMMU and clients not.. How we could ever end up with this
> >> scenario in the upstream kernel?
> > 
> > That's not what we're doing here. The fundamental problem is that we
> > have a couple of generations where the hardware is mismatched in that
> > clients support 34-bit addresses while host1x can only use 32-bit
> > addresses in the GATHER opcode. The only way to get around this mismatch
> > is by using an IOMMU.
> 
> Isn't it possible to limit DMA mask to 32-bit for both clients and Host1x?

Many things are possible, but I'm trying to find the best and most
intuitive way of encoding these restrictions in the driver. The problem
isn't so much that the hardware can only access 32-bits because the
memory interface is larger. It's more that the platform requires 32-bit
addresses because of integration mismatches like the above.

> diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
> index 6a995db51d6d..4253dd53ea2e 100644
> --- a/drivers/gpu/host1x/bus.c
> +++ b/drivers/gpu/host1x/bus.c
> @@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct
> host1x_device *device,
>   */
>  int host1x_device_init(struct host1x_device *device)
>  {
> +	struct host1x *host = dev_get_drvdata(device->dev.parent);
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
>  	struct host1x_client *client;
> +	u64 mask;
>  	int err;
> 
>  	mutex_lock(&device->clients_lock);
> 
>  	list_for_each_entry(client, &device->clients, list) {
> +		mask = dma_get_mask(client->dev);
> +
> +		if (!domain && !host->info->has_wide_gather &&
> +		    mask > DMA_BIT_MASK(32)) {
> +			err = dma_coerce_mask_and_coherent(client->dev,
> +							   DMA_BIT_MASK(32));
> +			if (err < 0) {
> +				dev_err(&device->dev,
> +					"failed to set DMA mask: %d\n", err);
> +				goto teardown;
> +			}
> +		}
> +
>  		if (client->ops && client->ops->init) {
>  			err = client->ops->init(client);
>  			if (err < 0) {

So yes, something like this would probably work as well, but it's not an
accurate description of what's going on, in my opinion.

> > However, with an IOMMU enabled for clients, we can run into the case
> > where sparse pages would be allocated via shmem and end up beyond the
> > 32-bit boundary. If the host1x is not attached to an IOMMU, there's no
> > way for it to access these pages with standard GATHER opcodes.
> > 
> > This is what used to happen prior to this change when the host1x
> > firewall was enabled.
> 
> I assume you're meaning the *disabled* firewall because gathers are
> copied into the Host1x's DMA buffer if firewall is enabled.
> 
> > Since we were not attaching it to an IOMMU in that
> > case, we would end up with sparse buffers allocated from pages that the
> > host1x couldn't address.
> 
> Could you please clarify what do you mean by the shmem? If you mean the
> get_pages(), then one option will be to simply enforce Host1x firewall
> in order to get gathers always copied, and thus, we can remove the
> TEGRA_HOST1X_FIREWALL Kconfig option.

The firewall is useful for other things that just copying the memory.
That said, I agree that always copying gathers doesn't sound like a bad
idea.

> The other option could be *not* to use get_pages() whenever it can't work.

That's basically what we're doing. drm_gem_get_pages() is only used when
IOMMU support is enabled. Otherwise the DMA API will be used and that
will respect the DMA mask that was previously configured.

So another way you could think of this is that we set the logical host1x
device's DMA mask to the intersection of the DMA masks for all the sub-
devices. That ensures that all devices can access the memory allocated
by the DMA API.

> It also should be worthwhile to simply restrict kernel's configuration
> by disabling all functionality which requires IOMMU if IOMMU is not
> available. Just error out with informative message in that case, telling
> that kernel/device-tree configuration is wrong.

What we currently do is to gracefully fall back to CMA allocations if
the IOMMU functionality isn't there. That sounds like a better option
to me than to fail and request people to rebuild the kernel.

Also note that we don't usually know at compile-time whether or not an
IOMMU is enabled.

> Please notice that grate-drivers do not work with the disabled firewall
> anyways, because it's simply impractical/unusable from a userspace
> perspective to track liveness of the gather BO.

Um... that's certainly a little strange. There's never going to be a
reliable way for userspace to detect whether or not the firewall is
enabled or not. Userspace also shouldn't care because nothing in the
UABI should in anyway indicate the presence of the firewall.

What exactly makes it impractical/unusable to track liveness of a gather
BO? The buffer can't go away until you destroy the GEM object, so it
sounds rather simple to me. As for ownership, userspace needs to
consider the gather buffer as owned by the kernel/hardware until the job
to which it was submitted has finished running.

> Also, take a look at the grate-kernel's WIP patch where Host1x is
> separated from Tegra DRM and Host1x maintains gather buffer allocations
> by itself. In the case of legacy UAPI, the gather commands are copied
> from DRM GEM to the Host1x's BO, similarly to what we have now in
> upstream for the case of the enabled firewall, but in a more optimized
> manner (using pre-allocated pool and etc). I'd like to see the upstream
> driver doing the same if we won't end up dropping the legacy staging
> UAPI entirely.

Do you have any pointers?

> >> What about the reverse scenario? You won't be able to patch cmdstream
> >> properly for >32bit addresses.
> > 
> > I don't think that scenario exists. I'm not aware of a Tegra device that
> > has system memory outside of the CPU-addressable region.
> > 
> >> The root of the problem is that Tegra DRM UAPI doesn't support 64bit
> >> addresses, so you can't use "wide" opcodes and can't patch cmdstream.
> > 
> > There's nothing in the UAPI that deals with addresses directly. We only
> > pass around handles and these are resolved to buffer objects in the
> > kernel where the address of the buffers can be 32-bit or 64-bit.
> 
> If buffer is 64-bit, then client's address-registers take two 32-bit
> hardware registers, and thus, cmdstream should reserve two words for the
> patched handles.
> 
> Host1x driver doesn't support patching of 64-bit addresses and there is
> no way to tell userspace to reserve two words per-address using the
> current UAPI.

That's only partially true. Yes, we don't support patching of 64-bit
addresses, but the reason is because we don't have to. On SoC
generations where the memory interface is wider than 32 bits, the host1x
clients use a relocation shift (typically of 8 bits) to accomodate for
the 32-bit width restriction of the registers.

We currently deal with that in the ABI and I'm honestly not sure that's
a good solution. But it works perfectly fine with 64-bit BO addresses.

> > And we do in fact support wide opcodes and patch command streams just
> > fine on 64-bit systems.
> > 
> > I mean, it's not like I've been doing this just for the fun of it. There
> > are actual configurations where this is needed in order for it to work.
> > 
> >> Perhaps it is better not to add any new things or quirks to the Host1x /
> >> Tegra DRM for now. The drivers need a serious clean up, otherwise mess
> >> only continues to grow up. Don't you think so?
> > 
> > This isn't anything new or a quirk. This is bug fixes to ensure that the
> > driver works in (now hopefully) all configurations. Previously it was a
> > matter of getting the configuration just right in order for it to work.
> > All the work I did here (starting with the wide opcode support and then
> > the DMA API and IOMMU work) was to make sure it would safely work in any
> > setup.
> > 
> > And I do consider these changes to also be cleanups and incremental
> > improvements of what the state was before. Again, I don't consider a
> > rewrite a serious cleanup.
> > 
> > I'm fully aware that the driver has been collecting dust for a while and
> > it isn't perfect. But it's also not overly messy. It's perhaps a bit
> > more complex than your average driver, but it's also some pretty complex
> > hardware.
> ...
> >> Doesn't sound good to me.. this is not going to be good for GPU drivers.
> >> All cache maintenance should be in control of userspace, the userspace
> >> should be telling kernel driver when it needs to get CPU access and when
> >> to finish the access. DMABUF has generic UAPI for the synchronizations,
> >> although a mature GPU driver may need more than that.
> > 
> > I agree. But that's not something that we can do at this point. We don't
> > have a way of passing information such as this to the driver, so the
> > driver has to assume that caches are dirty for all buffers, otherwise it
> > will not be able to guarantee that random cache flushes won't corrupt
> > the data that it's passing to the hardware.
> > 
> > So yes, when we do have a way of explicitly flushing the caches for
> > buffers, then we can add a mechanism to pass that information to the
> > kernel so that it can optimize. But until then we just can't be sure.
> 
> It worked fine for the last 7 years on T20/30 and continues to work if
> the offending patches are reverted, no problems here.

I do recall problems with cache maintenance during early development.
Note that I've never seen these issues when using only the DMA API. But
with explicit IOMMU usage there would typically be short horizontal
lines that were the wrong color because CPU caches hadn't been properly
flushed. We used to work around that by using dma_sync_sg_for_device(),
but that causes the DMA API to report abuses in debug mode.

Again, back at the time I was working on Tegra20 and Tegra30, and I was
definitely seeing those cache-related issues there as well.

> > And I prefer a kernel driver that gives me slow and reliable, rather
> > than fast but unpredictable results.
> 
> Your changes introduced regression on Tegra20/30 without solving any
> problem. This is unacceptable, it's not an improvement, please don't do
> it :)

It's perhaps a performance regression, but functionally it should still
be working fine. And see above for the issues that this solves.

Now, it may be that there are some cases where cache maintenance at this
level is not necessary, but keep in mind that we have a generic driver
that needs to work properly on many different setups at the same time. A
mechanism like the DMA API debug facilities is there for a reason. It
points out when you take shortcuts that may only work under some
circumstances but break terribly under others.

So I consider this an improvement in that it ensures that things always
work correctly, regardless of any assumptions.

> Apparently some problem exists only for later Tegra generations, but I
> still can't recognize what it's is from your words.
> 
> Please give a concrete real-world example of the trouble you're trying
> to solve.

See above. And, again, this is a problem that also exists on earlier
generations. I /think/ I've never seen cache-related issues when the
IOMMU is disabled, and that's probably because we have write-combine
mappings in that case. What you're saying seems to confirm that.

> >> Today Tegra DRM driver supports only write-combined BO allocations, and
> >> thus, we don't need to do more than to flush CPU buffers before
> >> executing HW job.
> > 
> > That's only true when you allocate with the DMA API. When you allocate
> > from a shmem mapping you don't get write-combined memory, so we do have
> > to perform cache maintenance on the pages.
> 
> In all cases GEM's memory mappings are write-combined, including the
> case of get_pages(). Host1x driver performs the maintenance by doing
> wmb() before executing job.

No, that's just not true. You don't get write-combined mappings for
drm_gem_get_pages(). The wmb() doesn't help in that case.

> If you're meaning that imported buffers could be backed by cacheable
> pages, then userspace must use DMABUF syncing UAPI or any other means to
> sync CPU cache, there is no way around it. You should assume that yours
> userpspace is broken if it doesn't do it.

I don't think it's supposed to work that way. Again, in practice this
may work fine most of the time. However, in the more general case it may
not.

The DMA-BUF sync UABI is really all about userspace synchronizing CPU
accesses with the exporter of the DMA-BUF. So, yes, if userspace makes
modifications to the buffer, it needs to tell the exporter about it with
the sync IOCTL.

But importers of the DMA-BUF aren't automatically on the safe side. So
there could be cases where an importer needs to do additional cache
maintenance in order to properly use a given buffer.

I'm not aware of any cases like that on Tegra, but that shouldn't be an
excuse for taking shortcuts. We should still be using the APIs in the
way they were intended. If they end up doing too much or even the wrong
thing, then I think we need to address that in their implementation and
not by abusing the API.

> >>>> Please let me know if you're going to fix the problems or if you'd
> >>>> prefer me to create the patches.
> >>>>
> >>>> Here is a draft of the fix for #2, it doesn't cover case of imported
> >>>> buffers (which should be statically mapped, IIUC):
> >>>>
> >>>> @@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device
> >>>> *dev, struct host1x_bo *bo,
> >>>>          * If we've manually mapped the buffer object through the IOMMU,
> >>>> make
> >>>>          * sure to return the IOVA address of our mapping.
> >>>>          */
> >>>> -       if (phys && obj->mm) {
> >>>> +       if (phys && (obj->mm || obj->vaddr)) {
> >>>>                 *phys = obj->iova;
> >>>
> >>> This doesn't work for the case where we use the DMA API for mapping. Or
> >>> at least it isn't going to work in the general case.
> >>
> >> Right, looks like I'll need to update my memory about the DMA API usage.
> >>
> >>> The reason is because obj->iova is only valid for whatever the device was that mapped
> >>> or allocated the buffer, which in this case is the host1x device, which
> >>> isn't even a real device, so it won't work. The only case where it does
> >>> work is if we're not behind an IOMMU, so obj->iova will actually be the
> >>> physical address.
> >>
> >> But why do you need to dynamically map/unmap the statically-allocated
> >> buffers on each job submission, could you please explain what is the
> >> point? Perhaps it's a temporary workaround just to get a minimum of
> >> things working for the case of implicit IOMMU?
> > 
> > It's primarily because we don't really know if a buffer has been mapped
> > for a specific device. We always map at allocation time for the Tegra
> > DRM parent device (which isn't a real device) but before it's used by
> > any other host1x client, it has to be mapped for that device as well.
> 
> > That's important in case any of these devices have different IOMMU
> > domains.
> 
> This case should be covered by the solution I'm proposing below.
> 
> Today's mainline kernel has all clients in the same shared IOMMU domain.
> Please note that I'm focused on solving the primary regression problem,
> the potential future problems are secondary.

Again, that's only true up to and including Tegra210. For Tegra186 and
Tegra194 that's no longer possible.

> > Actually, given that the device isn't a real device, the DMA handle
> > returned from dma_alloc_wc() is actually a physical address and not
> > valid for any device behind an IOMMU. So even in the case where we
> > share an IOMMU domain among multiple device, the mapping created by
> > dma_alloc_wc() is useless for them.
> 
> What about to pick any DRM sub-device and then statically map DMA buffer
> for this picked device during of BO allocation / import? All other DRM
> sub-devices can access that mapping because of the shared domain.

Again, this only works if all devices share a common IOMMU domain. Since
that's not the case we can't do it.

> > Because of the above and probably a bunch of other reasons, it's also a
> > requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA
> > API will start printing a bunch of errors if you violate those and they
> > typically indicate that what you're doing may not work. That doesn't
> > mean it can't work, but it usually only does so accidentally.
> > 
> >> All buffers should be statically allocated and statically mapped, and
> >> when there is a need to sync an already mapped buffer, the dma_sync_*
> >> API should be used.
> > 
> > That's my understanding as well. However, it's slightly more complicated
> > than that. Once you move away from the assumption that a mapping for a
> > buffer the same for all devices, you can no longer do that. For example,
> > while the statically allocated buffer may be mapped for the Tegra DRM
> > device (again, it's not a real device), it's not mapped for any of the
> > clients yet.
> 
> The same what I wrote above. Just pick any DRM sub-device and map buffer
> for that device during allocation / import.

That won't work in the general case where we have separate IOMMU
domains, because you need to map the buffer once for each domain.

> > So before a client can use it, its driver has to go and map
> > the buffer for the device. The logical point to do that is during
> > host1x_job_pin(). Once the client no longer has a use for the buffer it
> > should also unmap the buffer again because it will otherwise occupy the
> > IOVA space unnecessarily. The logical point to do that is during
> > host1x_job_unpin().
> 
> The IOVA space problem really exists only for Tegra20.
> 
> I suppose that page table size is ~10-20MB per 1GB of allocations, this
> is not something to worry about.

The IOVA space problem is going to be real for other devices once we
start leaking IOVA memory addresses.

> > host1x_job_unpin() is also the point at which the job releases its
> > reference to the buffer, so the backing memory could go away at any
> > point after that, which means that the IOVA mapping could point at
> > invalid memory if we didn't unmap the buffer.
> 
> Backing memory can't go away before buffer is unmapped, it is a refcount
> bug otherwise.

Right. However, my point was that if you don't unmap the buffer at
host1x_job_unpin() time, when would you unmap it? It's really at unpin
time that you know the buffer won't be used anymore. It might be used
again in the future, but the driver doesn't know that. Similarily, the
driver only knows at host1x_job_pin() time that a buffer will be used,
and by what device. It's not possible to predict ahead of time when a
buffer will be used, by which device and for how long.

> > I'm not aware of an easy way to optimize this while at the same time
> > making sure that everything is still consistent. I suppose one way to do
> > this would be to keep a cache of buffer objects and their mappings for
> > each device and avoid mapping/unmapping them for every job. The problem
> > with that is that we also don't want to hold on to buffer objects
> > indefinitely because that will potentially cause a lot of memory and
> > IOVA space to be used.
> 
> The case of explicit IOMMU is nicer because we can use DRM's MM scanner
> to maintain a limited window of IOMMU mappings. Take a look at
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/gart.c
> for example.
> 
> The LRU cache should be easy to implement, it also should be useful if
> we'll get around to supporting memory shrinker (BO swap-out).
> 
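An LRU cache along those lines could look roughly like the sketch below;
untested, and all of the names are made up for illustration:

struct bo_mapping {
	struct list_head lru;	/* most recently used at the head */
	struct host1x_bo *bo;
	struct sg_table *sgt;
	dma_addr_t iova;
};

/* return a cached mapping and refresh its LRU position, if one exists */
static struct bo_mapping *bo_mapping_lookup(struct list_head *cache,
					    struct host1x_bo *bo)
{
	struct bo_mapping *map;

	list_for_each_entry(map, cache, lru) {
		if (map->bo == bo) {
			list_move(&map->lru, cache);
			return map;
		}
	}

	/* miss: the caller maps the buffer and adds an entry, evicting
	 * (unmapping) from the tail once some size budget is exceeded */
	return NULL;
}

Eviction from the tail would also be the natural place to hook up a
shrinker later on.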
> >> Like I said above, the syncing should be done by userspace for the
> >> buffers that are in control of userspace.
> > 
> > Yes, I agree. We already have an implementation of the .begin_cpu_access
> > and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add
> > something like that for native buffers. Alternatively, I suppose user-
> > space could be required to flush/invalidate using the DMA-BUF, but that
> > potentially has the drawback of having to export a DMA-BUF for every
> > single buffer.
> > 
> > A better option may be to add a driver-specific IOCTL to do cache
> > maintenance. I think other drivers already do that.
> Perhaps, but it needs a real-world use case first.
> 
> >>> So what this basically ends up doing is avoid dma_map_*() all the time,
> >>> which I guess is what you're trying to achieve. But it also gives you
> >>> the wrong I/O virtual address in any case where an IOMMU is involved.
> >>> Also, as discussed above, avoiding cache maintenance isn't correct.
> >>
> >> Alright, then right now we need to bypass the dma_map_*() in the case
> >> of a non-implicit IOMMU, in order to bring back the good old behavior
> >> (at least temporarily, until there is a more comprehensive solution).
> > 
> > But it's not good old behaviour. You're still going to side-step the
> > cache maintenance and violate the DMA API semantics.
> 
> The DMA API usage that you introduced in v5.5 is wrong too because it
> makes the driver unusable :)

No, it's not wrong. It does the right thing from an API point of view.
If it causes a performance regression, that's because previously we did
too little and just got lucky that it still worked. So I think we need
to find out why exactly we could get away with how things worked before
and find a way to replicate that behaviour while still allowing the API
to be correctly used.

So if for whatever reason cache maintenance isn't necessary in this
case, can we make the DMA API not perform it? At this
point, I'm not even sure that cache maintenance is the issue. We'll need
to run some more tests to find out.
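
One such test (strictly a measurement hack, not a fix) would be to elide
the CPU cache maintenance on the map path via DMA_ATTR_SKIP_CPU_SYNC and
see whether the frame rate recovers; a minimal sketch:

/* EXPERIMENT ONLY: map without performing CPU cache maintenance */
static int map_skip_cpu_sync(struct device *dev, struct sg_table *sgt)
{
	int nents = dma_map_sg_attrs(dev, sgt->sgl, sgt->nents,
				     DMA_TO_DEVICE,
				     DMA_ATTR_SKIP_CPU_SYNC);

	return nents ? nents : -ENOMEM;
}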

> From a user's perspective it is exactly the good old behaviour. I tried
> to load KDE Plasma 5 using the Xorg Opentegra driver, which worked
> perfectly fine before, and now it is absolutely unusable.
> 
> There are a couple of things we could do in order to prevent this
> situation in the future:
> 
>   1. Better automated testing. The grate-tests could be useful for this
> particular case, once they get support for offscreen testing.

Definitely. Right now we don't really have a good way of testing at all.
We also don't have a good way of benchmarking. All of that would be a
massive improvement.

FWIW, I've been holding back on the VIC test cases for libdrm because I
didn't want to port them to the old ABI, assuming we could fairly
quickly settle on a new ABI and get that merged. However, looking at the
current state of affairs, it may be worth porting them over to using the
old ABI so that we can also run tests on Tegra124 and later where grate
can't be supported anymore.

>   2. Better review and real-world testing of the patches. Would you
> mind if I proposed myself as a reviewer for the Tegra DRM / Host1x
> drivers?

You already review most of the patches anyway. And it would've been
pretty hard to find out during review that these patches would cause a
performance regression like this. To be honest, I'm still surprised that
the impact is this big, since I'm not seeing any impact at all doing
things like modetest -v on something like Tegra210, Tegra186 or Tegra194.

Granted, that's probably not the best test case since we're not really
doing any cache maintenance (hopefully) after the first two page-flips.
However, it does entail mapping/unmapping a 32 MiB framebuffer to/from
the IOMMU per frame, i.e. 8192 4 KiB pages and thus on the order of 8192
PTE updates each way, which should account for quite a bit of work as
well.

I guess that if you do a lot of updates from userspace and have to flush
the cache a lot for every frame, that could have some impact. But you
were saying the difference was going from over 100 to under 10 frames
per second, which seems a bit much for *only* cache maintenance. Sounds
to me like something else is going on here.

> I'm suggesting we take a step back and revert to the old behaviour for
> now, at least on T20/30. I don't know about ARM64, but ARM32 permits
> violating the DMA API semantics when no IOMMU is available; of course
> that may change in the future, and by that time I hope we'll manage to
> get the DMA API usage done right.

Okay, let's see if we can temporarily restore the performance on older
devices. But I'd also like to eventually figure out why performance is
regressing that badly in the first place.

Thierry

> The solution below restores the old behavior for the cases of an
> unavailable IOMMU and an explicit IOMMU domain; it should work in the
> case of an implicit IOMMU as well.
> 
> >> What do you think about this variant:
> >>
> >> --- >8 ---
> >> diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
> >> index 1237df157e05..555a6424e9fa 100644
> >> --- a/drivers/gpu/drm/tegra/gem.c
> >> +++ b/drivers/gpu/drm/tegra/gem.c
> >> @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device
> >> *dev, struct host1x_bo *bo,
> >>  				     dma_addr_t *phys)
> >>  {
> >>  	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
> >> +	struct tegra_drm *tegra = obj->gem.dev->dev_private;
> >>  	struct sg_table *sgt;
> >>  	int err;
> >>
> >> -	/*
> >> -	 * If we've manually mapped the buffer object through the IOMMU, make
> >> -	 * sure to return the IOVA address of our mapping.
> >> -	 */
> >> -	if (phys && obj->mm) {
> >> -		*phys = obj->iova;
> >> -		return NULL;
> >> +	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
> >> +		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
> 
> >> +		if (obj->vaddr) {
> 
> If this is unacceptable on ARM64 (dma_handle != phys_addr), then it
> could be:
> 
> 		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {
> 
> >> +			*phys = obj->iova;
> >> +			return NULL;
> >> +		}
> >> +
> >> +		/*
> >> +		 * If we've manually mapped the buffer object through the
> >> +		 * IOMMU, make sure to return the IOVA address of our mapping.
> >> +		 */
> >> +		if (obj->mm) {
> >> +			*phys = obj->iova;
> >> +			return NULL;
> >> +		}
> >>  	}
> >>
> >>  	/*
> >> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> >> index 6d1872ddef17..91304b9034fa 100644
> >> --- a/drivers/gpu/drm/tegra/plane.c
> >> +++ b/drivers/gpu/drm/tegra/plane.c
> >> @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> >> tegra_plane_state *state)
> >>
> >>  	for (i = 0; i < state->base.fb->format->num_planes; i++) {
> >>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> >> +		struct sg_table *sgt;
> >>
> >> -		if (!dc->client.group) {
> >> -			struct sg_table *sgt;
> >> -
> >> -			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
> >> -			if (IS_ERR(sgt)) {
> >> -				err = PTR_ERR(sgt);
> >> -				goto unpin;
> >> -			}
> >> +		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
> >> +		if (IS_ERR(sgt)) {
> >> +			err = PTR_ERR(sgt);
> >> +			goto unpin;
> >> +		}
> >>
> >> +		if (sgt) {
> >>  			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
> >>  					 DMA_TO_DEVICE);
> >>  			if (err == 0) {
> >> @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> >> tegra_plane_state *state)
> >>
> >>  			state->iova[i] = sg_dma_address(sgt->sgl);
> >>  			state->sgt[i] = sgt;
> >> -		} else {
> >> -			state->iova[i] = bo->iova;
> >>  		}
> >>  	}
> >>
> >> @@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct
> >> tegra_plane_state *state)
> >>  		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
> >>  		struct sg_table *sgt = state->sgt[i];
> >>
> >> -		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
> >> -		host1x_bo_unpin(dc->dev, &bo->base, sgt);
> >> +		if (sgt) {
> >> +			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
> >> +				     DMA_TO_DEVICE);
> >> +			host1x_bo_unpin(dc->dev, &bo->base, sgt);
> >> +		}
> >>
> >>  		state->iova[i] = DMA_MAPPING_ERROR;
> >>  		state->sgt[i] = NULL;
> >> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> >> index 60b2fedd0061..538c0f0b40a4 100644
> >> --- a/drivers/gpu/host1x/job.c
> >> +++ b/drivers/gpu/host1x/job.c
> >> @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host,
> >> struct host1x_job *job)
> >>
> >>  	for (i = 0; i < job->num_relocs; i++) {
> >>  		struct host1x_reloc *reloc = &job->relocs[i];
> >> -		dma_addr_t phys_addr, *phys;
> >> +		dma_addr_t phys_addr;
> >>  		struct sg_table *sgt;
> >>
> >>  		reloc->target.bo = host1x_bo_get(reloc->target.bo);
> >> @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host,
> >> struct host1x_job *job)
> >>  			goto unpin;
> >>  		}
> >>
> >> -		if (client->group)
> >> -			phys = &phys_addr;
> >> -		else
> >> -			phys = NULL;
> >> -
> >> -		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
> >> +		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
> >>  		if (IS_ERR(sgt)) {
> >>  			err = PTR_ERR(sgt);
> >>  			goto unpin;
> >> @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host,
> >> struct host1x_job *job)
> >>  		job->num_unpins++;
> >>  	}
> >>
> >> +	/*
> >> +	 * In a case of enabled firewall, all user gathers are copied into
> >> +	 * the secured job->gather_copy_mapped.
> >> +	 */
> >> +	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
> >> +		return 0;
> >> +
> >>  	for (i = 0; i < job->num_gathers; i++) {
> >>  		struct host1x_job_gather *g = &job->gathers[i];
> >>  		size_t gather_size = 0;
> >> @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host,
> >> struct host1x_job *job)
> >>  			goto unpin;
> >>  		}
> >>
> >> -		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
> >> +		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
> >>  		if (IS_ERR(sgt)) {
> >>  			err = PTR_ERR(sgt);
> >>  			goto unpin;
> >>  		}
> >>
> >> -		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
> >> +		if (host->domain) {
> >>  			for_each_sg(sgt->sgl, sg, sgt->nents, j)
> >>  				gather_size += sg->length;
> >>  			gather_size = iova_align(&host->iova, gather_size);
> >> @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host,
> >> struct host1x_job *job)
> >>
> >>  			job->unpins[job->num_unpins].size = gather_size;
> >>  			phys_addr = iova_dma_addr(&host->iova, alloc);
> >> -		} else {
> >> +		} else if (sgt) {
> >>  			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
> >>  					 DMA_TO_DEVICE);
> >>  			if (!err) {
> >>
> >> --- >8 ---
> 


end of thread (newest message: 2020-02-03 12:56 UTC)

Thread overview: 18+ messages
2019-12-12 21:25 [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30 Dmitry Osipenko
2019-12-12 21:25 ` Dmitry Osipenko
2019-12-13 15:10 ` Thierry Reding
2019-12-13 15:35   ` Dmitry Osipenko
     [not found]     ` <d03876b8-b0d1-850b-7ae8-a61302e23843-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-01-20  2:53       ` Dmitry Osipenko
2020-01-20  2:53         ` Dmitry Osipenko
     [not found]         ` <2f5c6fda-adf9-c6c3-7601-fa912813ce1f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-01-28 15:43           ` Dmitry Osipenko
2020-01-28 15:43             ` Dmitry Osipenko
2020-01-29 12:39           ` Thierry Reding
2020-01-29 12:39             ` Thierry Reding
2020-01-30  4:36             ` Dmitry Osipenko
2020-01-30  4:36               ` Dmitry Osipenko
     [not found]               ` <d69df90f-37c9-0242-708a-689a8789a30b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-01-30 12:08                 ` Thierry Reding
2020-01-30 12:08                   ` Thierry Reding
2020-02-02 21:00                   ` Dmitry Osipenko
2020-02-02 21:00                     ` Dmitry Osipenko
     [not found]                     ` <7b710148-15db-f2d8-4221-62b032542426-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-02-03 12:56                       ` Thierry Reding
2020-02-03 12:56                         ` Thierry Reding
