dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] drm/mgag200: Add a workaround for low-latency
@ 2024-02-08  9:51 Jocelyn Falempe
  2024-02-08 11:49 ` Thomas Zimmermann
  2024-03-12 12:56 ` [v2] " Sui Jingfeng
  0 siblings, 2 replies; 5+ messages in thread
From: Jocelyn Falempe @ 2024-02-08  9:51 UTC (permalink / raw)
  To: dri-devel, airlied, tzimmermann, daniel; +Cc: Jocelyn Falempe

We found a regression in v5.10 on real-time server, using the
rt-kernel and the mgag200 driver. It's some really specialized
workload, with <10us latency expectation on isolated core.
After the v5.10, the real time tasks missed their <10us latency
when something prints on the screen (fbcon or printk)

The regression has been bisected to 2 commits:
commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")

The first one changed the system memory framebuffer from Write-Combine
to the default caching.
Before the second commit, the mgag200 driver used to unmap the
framebuffer after each frame, which implicitly does a cache flush.
Both regressions are fixed by this commit, which restore WC mapping
for the framebuffer in system memory, and add a cache flush.
This is only needed on x86_64, for low-latency workload,
so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
PREEMPT_RT and X86.

For more context, the whole thread can be found here [1]

Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
Link: https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1
---
 drivers/gpu/drm/mgag200/Kconfig        | 12 ++++++++++++
 drivers/gpu/drm/mgag200/mgag200_drv.c  | 17 +++++++++++++++++
 drivers/gpu/drm/mgag200/mgag200_mode.c |  8 ++++++++
 3 files changed, 37 insertions(+)

diff --git a/drivers/gpu/drm/mgag200/Kconfig b/drivers/gpu/drm/mgag200/Kconfig
index b28c5e4828f4..5e4d48df4854 100644
--- a/drivers/gpu/drm/mgag200/Kconfig
+++ b/drivers/gpu/drm/mgag200/Kconfig
@@ -11,3 +11,15 @@ config DRM_MGAG200
 	 MGA G200 desktop chips and the server variants. It requires 0.3.0
 	 of the modesetting userspace driver, and a version of mga driver
 	 that will fail on KMS enabled devices.
+
+config DRM_MGAG200_IOBURST_WORKAROUND
+	bool "Disable buffer caching"
+	depends on DRM_MGAG200 && PREEMPT_RT && X86
+	help
+	  Enable a workaround to avoid I/O bursts within the mgag200 driver at
+	  the expense of overall display performance.
+	  It restores the <v5.10 behavior, by mapping the framebuffer in system
+	  RAM as Write-Combining, and flushing the cache after each write.
+	  This is only useful on x86_64 if you want to run processes with
+	  deterministic latency.
+	  If unsure, say N.
diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.c b/drivers/gpu/drm/mgag200/mgag200_drv.c
index 54fce00e2136..573dbe256aa8 100644
--- a/drivers/gpu/drm/mgag200/mgag200_drv.c
+++ b/drivers/gpu/drm/mgag200/mgag200_drv.c
@@ -84,6 +84,20 @@ resource_size_t mgag200_probe_vram(void __iomem *mem, resource_size_t size)
 	return offset - 65536;
 }
 
+#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
+static struct drm_gem_object *mgag200_create_object(struct drm_device *dev, size_t size)
+{
+	struct drm_gem_shmem_object *shmem;
+
+	shmem = kzalloc(sizeof(*shmem), GFP_KERNEL);
+	if (!shmem)
+		return NULL;
+
+	shmem->map_wc = true;
+	return &shmem->base;
+}
+#endif
+
 /*
  * DRM driver
  */
@@ -99,6 +113,9 @@ static const struct drm_driver mgag200_driver = {
 	.major = DRIVER_MAJOR,
 	.minor = DRIVER_MINOR,
 	.patchlevel = DRIVER_PATCHLEVEL,
+#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
+	.gem_create_object = mgag200_create_object,
+#endif
 	DRM_GEM_SHMEM_DRIVER_OPS,
 };
 
diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c b/drivers/gpu/drm/mgag200/mgag200_mode.c
index 0eb769dd76ce..e17cb4c5f774 100644
--- a/drivers/gpu/drm/mgag200/mgag200_mode.c
+++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
@@ -13,6 +13,7 @@
 
 #include <drm/drm_atomic.h>
 #include <drm/drm_atomic_helper.h>
+#include <drm/drm_cache.h>
 #include <drm/drm_damage_helper.h>
 #include <drm/drm_edid.h>
 #include <drm/drm_format_helper.h>
@@ -436,6 +437,13 @@ static void mgag200_handle_damage(struct mga_device *mdev, const struct iosys_ma
 
 	iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0], fb->format, clip));
 	drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip);
+
+	/* Flushing the cache greatly improves latency on x86_64 */
+#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
+	if (!vmap->is_iomem)
+		drm_clflush_virt_range(vmap->vaddr + clip->y1 * fb->pitches[0],
+				       drm_rect_height(clip) * fb->pitches[0]);
+#endif
 }
 
 /*

base-commit: 1f36d634670d8001a45fe2f2dcae546819f9c7d8
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/mgag200: Add a workaround for low-latency
  2024-02-08  9:51 [PATCH v2] drm/mgag200: Add a workaround for low-latency Jocelyn Falempe
@ 2024-02-08 11:49 ` Thomas Zimmermann
  2024-02-26 15:53   ` Jocelyn Falempe
  2024-03-12 12:56 ` [v2] " Sui Jingfeng
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Zimmermann @ 2024-02-08 11:49 UTC (permalink / raw)
  To: Jocelyn Falempe, dri-devel, airlied, daniel



Am 08.02.24 um 10:51 schrieb Jocelyn Falempe:
> We found a regression in v5.10 on real-time server, using the
> rt-kernel and the mgag200 driver. It's some really specialized
> workload, with <10us latency expectation on isolated core.
> After the v5.10, the real time tasks missed their <10us latency
> when something prints on the screen (fbcon or printk)
>
> The regression has been bisected to 2 commits:
> commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
> commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>
> The first one changed the system memory framebuffer from Write-Combine
> to the default caching.
> Before the second commit, the mgag200 driver used to unmap the
> framebuffer after each frame, which implicitly does a cache flush.
> Both regressions are fixed by this commit, which restore WC mapping
> for the framebuffer in system memory, and add a cache flush.
> This is only needed on x86_64, for low-latency workload,
> so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
> PREEMPT_RT and X86.
>
> For more context, the whole thread can be found here [1]
>
> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
> Link: https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

> ---
>   drivers/gpu/drm/mgag200/Kconfig        | 12 ++++++++++++
>   drivers/gpu/drm/mgag200/mgag200_drv.c  | 17 +++++++++++++++++
>   drivers/gpu/drm/mgag200/mgag200_mode.c |  8 ++++++++
>   3 files changed, 37 insertions(+)
>
> diff --git a/drivers/gpu/drm/mgag200/Kconfig b/drivers/gpu/drm/mgag200/Kconfig
> index b28c5e4828f4..5e4d48df4854 100644
> --- a/drivers/gpu/drm/mgag200/Kconfig
> +++ b/drivers/gpu/drm/mgag200/Kconfig
> @@ -11,3 +11,15 @@ config DRM_MGAG200
>   	 MGA G200 desktop chips and the server variants. It requires 0.3.0
>   	 of the modesetting userspace driver, and a version of mga driver
>   	 that will fail on KMS enabled devices.
> +
> +config DRM_MGAG200_IOBURST_WORKAROUND
> +	bool "Disable buffer caching"
> +	depends on DRM_MGAG200 && PREEMPT_RT && X86
> +	help
> +	  Enable a workaround to avoid I/O bursts within the mgag200 driver at
> +	  the expense of overall display performance.
> +	  It restores the <v5.10 behavior, by mapping the framebuffer in system
> +	  RAM as Write-Combining, and flushing the cache after each write.
> +	  This is only useful on x86_64 if you want to run processes with
> +	  deterministic latency.
> +	  If unsure, say N.
> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.c b/drivers/gpu/drm/mgag200/mgag200_drv.c
> index 54fce00e2136..573dbe256aa8 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_drv.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.c
> @@ -84,6 +84,20 @@ resource_size_t mgag200_probe_vram(void __iomem *mem, resource_size_t size)
>   	return offset - 65536;
>   }
>   
> +#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
> +static struct drm_gem_object *mgag200_create_object(struct drm_device *dev, size_t size)
> +{
> +	struct drm_gem_shmem_object *shmem;
> +
> +	shmem = kzalloc(sizeof(*shmem), GFP_KERNEL);
> +	if (!shmem)
> +		return NULL;
> +
> +	shmem->map_wc = true;
> +	return &shmem->base;
> +}
> +#endif
> +
>   /*
>    * DRM driver
>    */
> @@ -99,6 +113,9 @@ static const struct drm_driver mgag200_driver = {
>   	.major = DRIVER_MAJOR,
>   	.minor = DRIVER_MINOR,
>   	.patchlevel = DRIVER_PATCHLEVEL,
> +#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
> +	.gem_create_object = mgag200_create_object,
> +#endif
>   	DRM_GEM_SHMEM_DRIVER_OPS,
>   };
>   
> diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c b/drivers/gpu/drm/mgag200/mgag200_mode.c
> index 0eb769dd76ce..e17cb4c5f774 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_mode.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
> @@ -13,6 +13,7 @@
>   
>   #include <drm/drm_atomic.h>
>   #include <drm/drm_atomic_helper.h>
> +#include <drm/drm_cache.h>
>   #include <drm/drm_damage_helper.h>
>   #include <drm/drm_edid.h>
>   #include <drm/drm_format_helper.h>
> @@ -436,6 +437,13 @@ static void mgag200_handle_damage(struct mga_device *mdev, const struct iosys_ma
>   
>   	iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0], fb->format, clip));
>   	drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip);
> +
> +	/* Flushing the cache greatly improves latency on x86_64 */
> +#if defined(CONFIG_DRM_MGAG200_IOBURST_WORKAROUND)
> +	if (!vmap->is_iomem)
> +		drm_clflush_virt_range(vmap->vaddr + clip->y1 * fb->pitches[0],
> +				       drm_rect_height(clip) * fb->pitches[0]);
> +#endif
>   }
>   
>   /*
>
> base-commit: 1f36d634670d8001a45fe2f2dcae546819f9c7d8

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/mgag200: Add a workaround for low-latency
  2024-02-08 11:49 ` Thomas Zimmermann
@ 2024-02-26 15:53   ` Jocelyn Falempe
  0 siblings, 0 replies; 5+ messages in thread
From: Jocelyn Falempe @ 2024-02-26 15:53 UTC (permalink / raw)
  To: Thomas Zimmermann, dri-devel, airlied, daniel


On 08/02/2024 12:49, Thomas Zimmermann wrote:
> 
> 
> Am 08.02.24 um 10:51 schrieb Jocelyn Falempe:
>> We found a regression in v5.10 on real-time server, using the
>> rt-kernel and the mgag200 driver. It's some really specialized
>> workload, with <10us latency expectation on isolated core.
>> After the v5.10, the real time tasks missed their <10us latency
>> when something prints on the screen (fbcon or printk)
>>
>> The regression has been bisected to 2 commits:
>> commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>> commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>
>> The first one changed the system memory framebuffer from Write-Combine
>> to the default caching.
>> Before the second commit, the mgag200 driver used to unmap the
>> framebuffer after each frame, which implicitly does a cache flush.
>> Both regressions are fixed by this commit, which restore WC mapping
>> for the framebuffer in system memory, and add a cache flush.
>> This is only needed on x86_64, for low-latency workload,
>> so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
>> PREEMPT_RT and X86.
>>
>> For more context, the whole thread can be found here [1]
>>
>> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
>> Link: 
>> https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1
> 
> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

Applied to drm-misc-next.

Thanks,

-- 

Jocelyn


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [v2] drm/mgag200: Add a workaround for low-latency
  2024-02-08  9:51 [PATCH v2] drm/mgag200: Add a workaround for low-latency Jocelyn Falempe
  2024-02-08 11:49 ` Thomas Zimmermann
@ 2024-03-12 12:56 ` Sui Jingfeng
  2024-03-12 14:25   ` Jocelyn Falempe
  1 sibling, 1 reply; 5+ messages in thread
From: Sui Jingfeng @ 2024-03-12 12:56 UTC (permalink / raw)
  To: Jocelyn Falempe, dri-devel, airlied, tzimmermann, daniel

Hi,


Interesting patch! I know this patch already merged.
While study this patch, I have a few questions.


On 2024/2/8 17:51, Jocelyn Falempe wrote:
> We found a regression in v5.10 on real-time server, using the
> rt-kernel and the mgag200 driver. It's some really specialized
> workload, with <10us latency expectation on isolated core.
> After the v5.10, the real time tasks missed their <10us latency
> when something prints on the screen (fbcon or printk)
>
> The regression has been bisected to 2 commits:
> commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
> commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>
> The first one changed the system memory framebuffer from Write-Combine
> to the default caching.
> Before the second commit, the mgag200 driver used to unmap the
> framebuffer after each frame, which implicitly does a cache flush.


I don't know why it need to do a cache flush, where is the code.
I'm asking because I want to study this technique.

Generally speaking, X86-64 platform's default page caching is cached.
And I think the cached mapping is fastest for software rendering. And
the platform guaranteed the coherency for us, right?

Because X86-64 platform(or CPU)'s write buffer is implemented on the
top of cache? I'm means that for ARM(or other) CPU, when using Write-combine
the data will has nothing to do with cache.

> Both regressions are fixed by this commit, which restore WC mapping
> for the framebuffer in system memory, and add a cache flush.

So switch back to WC probably will decrease overall performance, I think.
And the cache flush operation should not have a impact. Except X86-64's
Write-Combine is different other platform's Write-Combine?


> This is only needed on x86_64, for low-latency workload,
> so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
> PREEMPT_RT and X86.
>
> For more context, the whole thread can be found here [1]
>
> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
> Link: https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1
> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [v2] drm/mgag200: Add a workaround for low-latency
  2024-03-12 12:56 ` [v2] " Sui Jingfeng
@ 2024-03-12 14:25   ` Jocelyn Falempe
  0 siblings, 0 replies; 5+ messages in thread
From: Jocelyn Falempe @ 2024-03-12 14:25 UTC (permalink / raw)
  To: Sui Jingfeng, dri-devel, airlied, tzimmermann, daniel



On 12/03/2024 13:56, Sui Jingfeng wrote:
> Hi,
> 
> 
> Interesting patch! I know this patch already merged.
> While study this patch, I have a few questions.
> 
> 
> On 2024/2/8 17:51, Jocelyn Falempe wrote:
>> We found a regression in v5.10 on real-time server, using the
>> rt-kernel and the mgag200 driver. It's some really specialized
>> workload, with <10us latency expectation on isolated core.
>> After the v5.10, the real time tasks missed their <10us latency
>> when something prints on the screen (fbcon or printk)
>>
>> The regression has been bisected to 2 commits:
>> commit 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>> commit 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>
>> The first one changed the system memory framebuffer from Write-Combine
>> to the default caching.
>> Before the second commit, the mgag200 driver used to unmap the
>> framebuffer after each frame, which implicitly does a cache flush.
> 
> 
> I don't know why it need to do a cache flush, where is the code.
> I'm asking because I want to study this technique.
> 
> Generally speaking, X86-64 platform's default page caching is cached.
> And I think the cached mapping is fastest for software rendering. And
> the platform guaranteed the coherency for us, right?
> 
> Because X86-64 platform(or CPU)'s write buffer is implemented on the
> top of cache? I'm means that for ARM(or other) CPU, when using 
> Write-combine
> the data will has nothing to do with cache.
> 
>> Both regressions are fixed by this commit, which restore WC mapping
>> for the framebuffer in system memory, and add a cache flush.
> 
> So switch back to WC probably will decrease overall performance, I think.
> And the cache flush operation should not have a impact. Except X86-64's
> Write-Combine is different other platform's Write-Combine?

Yes this patch is a bit weird. Usually you want your VRAM mapping to be 
Write-Combine. Here it also set the system memory framebuffer as 
Write-Combine. On most x86-64, Write Combine uses its own hardware 
buffer that is not in L1/L2/L3. So when it copies the framebuffer from 
WC system memory to VRAM, it doesn't involve the cache, and have less 
impact on latency for other tasks running on other CPU.
Also I think the cache flush is important to flush those WC buffers, so 
when the next frame comes, it won't have to wait for the buffers to be 
copied to the slow VRAM.
When running the latency tests, it's obvious that both are needed.
This is how I understand it, but I may be wrong.

-- 

Jocelyn

> 
> 
>> This is only needed on x86_64, for low-latency workload,
>> so the new kconfig DRM_MGAG200_IOBURST_WORKAROUND depends on
>> PREEMPT_RT and X86.
>>
>> For more context, the whole thread can be found here [1]
>>
>> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
>> Link: 
>> https://lore.kernel.org/dri-devel/20231019135655.313759-1-jfalempe@redhat.com/ # 1
>> Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-03-12 14:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-08  9:51 [PATCH v2] drm/mgag200: Add a workaround for low-latency Jocelyn Falempe
2024-02-08 11:49 ` Thomas Zimmermann
2024-02-26 15:53   ` Jocelyn Falempe
2024-03-12 12:56 ` [v2] " Sui Jingfeng
2024-03-12 14:25   ` Jocelyn Falempe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).