* [PATCH] drm/mgag200: Flush the cache to improve latency
@ 2023-10-19 13:55 Jocelyn Falempe
  2023-10-20 12:06 ` Thomas Zimmermann
  0 siblings, 1 reply; 9+ messages in thread
From: Jocelyn Falempe @ 2023-10-19 13:55 UTC (permalink / raw)
To: dri-devel, tzimmermann, airlied, daniel; +Cc: Jocelyn Falempe

We found a regression in v5.10 on a real-time server, using the
rt-kernel and the mgag200 driver. It's a really specialized workload,
with a <10us latency expectation on an isolated core. After v5.10, the
real-time tasks missed their <10us latency target whenever something
printed on the screen (fbcon or printk).

The regression has been bisected to 2 commits:
0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")

The first one changed the system-memory framebuffer from Write-Combine
to the default caching. Before the second commit, the mgag200 driver
used to unmap the framebuffer after each frame, which implicitly does a
cache flush. Both regressions are fixed by the following patch, which
forces a cache flush after each frame, reverting to almost-v5.9
behavior.

This is necessary only if you have strong real-time constraints, so I
put the cache flush under the CONFIG_PREEMPT_RT config flag. Also,
clflush is only available on x86 (and this issue has only been
reproduced on x86_64), so it's also under the CONFIG_X86 config flag.
Fixes: 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
Fixes: 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
---
 drivers/gpu/drm/mgag200/mgag200_mode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c b/drivers/gpu/drm/mgag200/mgag200_mode.c
index af3ce5a6a636..11660cd29cea 100644
--- a/drivers/gpu/drm/mgag200/mgag200_mode.c
+++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
@@ -13,6 +13,7 @@
 
 #include <drm/drm_atomic.h>
 #include <drm/drm_atomic_helper.h>
+#include <drm/drm_cache.h>
 #include <drm/drm_damage_helper.h>
 #include <drm/drm_format_helper.h>
 #include <drm/drm_fourcc.h>
@@ -436,6 +437,10 @@ static void mgag200_handle_damage(struct mga_device *mdev, const struct iosys_ma
 
 	iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0], fb->format, clip));
 	drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip);
+	/* On RT systems, flushing the cache reduces the latency for other RT tasks */
+#if defined(CONFIG_X86) && defined(CONFIG_PREEMPT_RT)
+	drm_clflush_virt_range(vmap, fb->height * fb->pitches[0]);
+#endif
 }
 
 /*

base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
-- 
2.41.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency
  2023-10-19 13:55 [PATCH] drm/mgag200: Flush the cache to improve latency Jocelyn Falempe
@ 2023-10-20 12:06 ` Thomas Zimmermann
  2023-10-23  8:30   ` Jocelyn Falempe
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Zimmermann @ 2023-10-20 12:06 UTC (permalink / raw)
To: Jocelyn Falempe, dri-devel, airlied, daniel; +Cc: Linux Kernel Mailing List

(cc'ing lkml for feedback)

Hi Jocelyn

On 19.10.23 15:55, Jocelyn Falempe wrote:
> We found a regression in v5.10 on a real-time server, using the
> rt-kernel and the mgag200 driver. It's a really specialized
> workload, with a <10us latency expectation on an isolated core.
> After v5.10, the real-time tasks missed their <10us latency
> when something printed on the screen (fbcon or printk)

I'd like to hear the opinion of the RT devs on this patch. Because AFAIK
we never did such a workaround in other drivers. And AFAIK printk is a
PITA anyway.

IMHO if that RT system cannot handle differences in framebuffer caching,
it's under-powered. It's just a matter of time until something else
changes and the problem returns. And (honest question) as it's an
x86-64, how do they handle System Management Mode?

> The regression has been bisected to 2 commits:
> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>
> The first one changed the system-memory framebuffer from Write-Combine
> to the default caching.
> Before the second commit, the mgag200 driver used to unmap the
> framebuffer after each frame, which implicitly does a cache flush.
> Both regressions are fixed by the following patch, which forces a
> cache flush after each frame, reverting to almost-v5.9 behavior.

With that second commit, we essentially never unmap an active
framebuffer console. But with commit

  359c6649cd9a ("drm/gem: Implement shadow-plane {begin,end}_fb_access
  with vmap")

we now again unmap the console framebuffer after the pageflip happened.

So how does the latest kernel behave wrt the problem?

> This is necessary only if you have strong real-time constraints, so I
> put the cache flush under the CONFIG_PREEMPT_RT config flag.
> Also, clflush is only available on x86 (and this issue has only been
> reproduced on x86_64), so it's also under the CONFIG_X86 config flag.
>
> Fixes: 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
> Fixes: 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
> ---
>   drivers/gpu/drm/mgag200/mgag200_mode.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c b/drivers/gpu/drm/mgag200/mgag200_mode.c
> index af3ce5a6a636..11660cd29cea 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_mode.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
> @@ -13,6 +13,7 @@
>   
>   #include <drm/drm_atomic.h>
>   #include <drm/drm_atomic_helper.h>
> +#include <drm/drm_cache.h>
>   #include <drm/drm_damage_helper.h>
>   #include <drm/drm_format_helper.h>
>   #include <drm/drm_fourcc.h>
> @@ -436,6 +437,10 @@ static void mgag200_handle_damage(struct mga_device *mdev, const struct iosys_ma
>   
>   	iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0], fb->format, clip));
>   	drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip);
> +	/* On RT systems, flushing the cache reduces the latency for other RT tasks */
> +#if defined(CONFIG_X86) && defined(CONFIG_PREEMPT_RT)
> +	drm_clflush_virt_range(vmap, fb->height * fb->pitches[0]);
> +#endif

Your second commit is part of a larger patchset that updates several
drivers. They might all be affected. So if anything, the patch should go
here before the unmap call:

https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_gem_atomic_helper.c#L377

with a much expanded comment.

But I'd really like to hear other peoples' opinions on the matter.

Best regards
Thomas

> }
>
> /*
>
> base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency
  2023-10-20 12:06 ` Thomas Zimmermann
@ 2023-10-23  8:30   ` Jocelyn Falempe
  2023-11-06 10:46     ` Jocelyn Falempe
  0 siblings, 1 reply; 9+ messages in thread
From: Jocelyn Falempe @ 2023-10-23 8:30 UTC (permalink / raw)
To: Thomas Zimmermann, dri-devel, airlied, daniel; +Cc: Linux Kernel Mailing List

On 20/10/2023 14:06, Thomas Zimmermann wrote:
> (cc'ing lkml for feedback)
>
> Hi Jocelyn
>
> On 19.10.23 15:55, Jocelyn Falempe wrote:
>> We found a regression in v5.10 on a real-time server, using the
>> rt-kernel and the mgag200 driver. It's a really specialized
>> workload, with a <10us latency expectation on an isolated core.
>> After v5.10, the real-time tasks missed their <10us latency
>> when something printed on the screen (fbcon or printk)
>
> I'd like to hear the opinion of the RT devs on this patch. Because AFAIK
> we never did such a workaround in other drivers. And AFAIK printk is a
> PITA anyway.

Most other drivers use DMA, which means this workaround can't apply to
them.

> IMHO if that RT system cannot handle differences in framebuffer caching,
> it's under-powered. It's just a matter of time until something else
> changes and the problem returns. And (honest question) as it's an
> x86-64, how do they handle System Management Mode?

I think it's no big news that the Matrox G200 from 1999 is
under-powered.
I was also a bit surprised that flushing the cache would have such an
effect on latency. The tests we are doing can run for 24h with the
workaround, without any interrupt taking more than 10us. Without the
workaround, every ~30s an interrupt misses its 10us target.

>> The regression has been bisected to 2 commits:
>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>
>> The first one changed the system-memory framebuffer from Write-Combine
>> to the default caching.
>> Before the second commit, the mgag200 driver used to unmap the
>> framebuffer after each frame, which implicitly does a cache flush.
>> Both regressions are fixed by the following patch, which forces a
>> cache flush after each frame, reverting to almost-v5.9 behavior.
>
> With that second commit, we essentially never unmap an active
> framebuffer console. But with commit
>
>   359c6649cd9a ("drm/gem: Implement shadow-plane {begin,end}_fb_access
>   with vmap")
>
> we now again unmap the console framebuffer after the pageflip happened.
>
> So how does the latest kernel behave wrt the problem?

The regression was found when upgrading the server from v5.4 to v5.14,
so we didn't test with later kernels.
We will test with v6.3 (which should have 359c6649cd9a) and see what
happens.

>> This is necessary only if you have strong real-time constraints, so I
>> put the cache flush under the CONFIG_PREEMPT_RT config flag.
>> Also, clflush is only available on x86 (and this issue has only been
>> reproduced on x86_64), so it's also under the CONFIG_X86 config flag.
>>
>> Fixes: 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>> Fixes: 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
>> ---
>>   drivers/gpu/drm/mgag200/mgag200_mode.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c
>> b/drivers/gpu/drm/mgag200/mgag200_mode.c
>> index af3ce5a6a636..11660cd29cea 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_mode.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_mode.c
>> @@ -13,6 +13,7 @@
>>   #include <drm/drm_atomic.h>
>>   #include <drm/drm_atomic_helper.h>
>> +#include <drm/drm_cache.h>
>>   #include <drm/drm_damage_helper.h>
>>   #include <drm/drm_format_helper.h>
>>   #include <drm/drm_fourcc.h>
>> @@ -436,6 +437,10 @@ static void mgag200_handle_damage(struct
>> mga_device *mdev, const struct iosys_ma
>>   	iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0],
>> fb->format, clip));
>>   	drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip);
>> +	/* On RT systems, flushing the cache reduces the latency for
>> other RT tasks */
>> +#if defined(CONFIG_X86) && defined(CONFIG_PREEMPT_RT)
>> +	drm_clflush_virt_range(vmap, fb->height * fb->pitches[0]);
>> +#endif
>
> Your second commit is part of a larger patchset that updates several
> drivers. They might all be affected. So if anything, the patch should go
> here before the unmap call:
>
> https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_gem_atomic_helper.c#L377
>

The regression has only been found with the G200 so far, so I don't
want to apply it blindly to other drivers.

Thanks for your help,

Best regards,

-- 

Jocelyn

> with a much expanded comment.
>
> But I'd really like to hear other peoples' opinions on the matter.
>
> Best regards
> Thomas
>
>> }
>>   /*
>>
>> base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2023-10-23 8:30 ` Jocelyn Falempe @ 2023-11-06 10:46 ` Jocelyn Falempe 2023-12-11 9:31 ` Jocelyn Falempe 2024-02-05 7:02 ` Dave Airlie 0 siblings, 2 replies; 9+ messages in thread From: Jocelyn Falempe @ 2023-11-06 10:46 UTC (permalink / raw) To: Thomas Zimmermann, dri-devel, airlied, daniel; +Cc: Linux Kernel Mailing List On 23/10/2023 10:30, Jocelyn Falempe wrote: > On 20/10/2023 14:06, Thomas Zimmermann wrote: >> (cc'ing lkml for feedback) >> >> Hi Jocelyn >> >> Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: >>> We found a regression in v5.10 on real-time server, using the >>> rt-kernel and the mgag200 driver. It's some really specialized >>> workload, with <10us latency expectation on isolated core. >>> After the v5.10, the real time tasks missed their <10us latency >>> when something prints on the screen (fbcon or printk) >> >> I'd like to hear the opinion of the RT-devs on this patch. Because >> AFAIK we never did such a workaround in other drivers. And AFAIK >> printk is a PITA anyway. > > Most other drivers uses DMA, which means this workaround can't apply to > them. > >> >> IMHO if that RT system cannot handle differences in framebuffer >> caching, it's under-powered. It's just a matter of time until >> something else changes and the problem returns. And (honest question) >> as it's an x86-64, how do they handle System Management Mode? > > I think it's not a big news, that the Matrox G200 from 1999 is > under-powered. > I was also a bit surprised that flushing the cache would have such > effect on latency. The tests we are doing can run 24h with the > workaround, without any interrupt taking more than 10us. Without the > workaround, every ~30s the interrupt failed its 10us target. 
>
>>
>>>
>>> The regression has been bisected to 2 commits:
>>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>>
>>> The first one changed the system-memory framebuffer from Write-Combine
>>> to the default caching.
>>> Before the second commit, the mgag200 driver used to unmap the
>>> framebuffer after each frame, which implicitly does a cache flush.
>>> Both regressions are fixed by the following patch, which forces a
>>> cache flush after each frame, reverting to almost-v5.9 behavior.
>>
>> With that second commit, we essentially never unmap an active
>> framebuffer console. But with commit
>>
>>   359c6649cd9a ("drm/gem: Implement shadow-plane {begin,end}_fb_access
>>   with vmap")
>>
>> we now again unmap the console framebuffer after the pageflip happened.
>>
>> So how does the latest kernel behave wrt the problem?
>
> The regression was found when upgrading the server from v5.4 to v5.14,
> so we didn't test with later kernels.
> We will test with v6.3 (which should have 359c6649cd9a) and see what
> happens.

I don't have a clear explanation, but testing with v6.3 and forcing
Write-Combine doesn't fix the latency issue, so forcing the cache flush
is still needed.

Also, some systems use "isolated CPUs" to handle RT tasks, but with a
standard kernel (so without CONFIG_PREEMPT_RT).
So I'm wondering if we can use a kernel module parameter for this, so
that users who want to achieve low latency can opt in?

Something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ?

Best regards,

-- 

Jocelyn

>>
>>> This is necessary only if you have strong real-time constraints, so I
>>> put the cache flush under the CONFIG_PREEMPT_RT config flag.
>>> Also, clflush is only available on x86 (and this issue has only been
>>> reproduced on x86_64), so it's also under the CONFIG_X86 config flag.
>>> >>> Fixes: 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages") >>> Fixes: 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail") >>> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com> >>> --- >>> drivers/gpu/drm/mgag200/mgag200_mode.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/mgag200/mgag200_mode.c >>> b/drivers/gpu/drm/mgag200/mgag200_mode.c >>> index af3ce5a6a636..11660cd29cea 100644 >>> --- a/drivers/gpu/drm/mgag200/mgag200_mode.c >>> +++ b/drivers/gpu/drm/mgag200/mgag200_mode.c >>> @@ -13,6 +13,7 @@ >>> #include <drm/drm_atomic.h> >>> #include <drm/drm_atomic_helper.h> >>> +#include <drm/drm_cache.h> >>> #include <drm/drm_damage_helper.h> >>> #include <drm/drm_format_helper.h> >>> #include <drm/drm_fourcc.h> >>> @@ -436,6 +437,10 @@ static void mgag200_handle_damage(struct >>> mga_device *mdev, const struct iosys_ma >>> iosys_map_incr(&dst, drm_fb_clip_offset(fb->pitches[0], >>> fb->format, clip)); >>> drm_fb_memcpy(&dst, fb->pitches, vmap, fb, clip); >>> + /* On RT systems, flushing the cache reduces the latency for >>> other RT tasks */ >>> +#if defined(CONFIG_X86) && defined(CONFIG_PREEMPT_RT) >>> + drm_clflush_virt_range(vmap, fb->height * fb->pitches[0]); >>> +#endif >> >> Your second commit is part of a larger patchset that updates several >> drivers. They might all be affected. So if anything, the patch should >> go here before the unmap call: >> >> https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_gem_atomic_helper.c#L377 >> > The regression was found only with G200 currently, so I don't want to > apply it blindly on other drivers. > > Thanks for your help, > > Best regards, > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2023-11-06 10:46 ` Jocelyn Falempe @ 2023-12-11 9:31 ` Jocelyn Falempe 2024-02-06 13:33 ` Daniel Vetter 2024-02-05 7:02 ` Dave Airlie 1 sibling, 1 reply; 9+ messages in thread From: Jocelyn Falempe @ 2023-12-11 9:31 UTC (permalink / raw) To: Thomas Zimmermann, dri-devel, airlied, daniel; +Cc: Linux Kernel Mailing List On 06/11/2023 11:46, Jocelyn Falempe wrote: > On 23/10/2023 10:30, Jocelyn Falempe wrote: >> On 20/10/2023 14:06, Thomas Zimmermann wrote: >>> (cc'ing lkml for feedback) >>> >>> Hi Jocelyn >>> >>> Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: >>>> We found a regression in v5.10 on real-time server, using the >>>> rt-kernel and the mgag200 driver. It's some really specialized >>>> workload, with <10us latency expectation on isolated core. >>>> After the v5.10, the real time tasks missed their <10us latency >>>> when something prints on the screen (fbcon or printk) >>> >>> I'd like to hear the opinion of the RT-devs on this patch. Because >>> AFAIK we never did such a workaround in other drivers. And AFAIK >>> printk is a PITA anyway. >> >> Most other drivers uses DMA, which means this workaround can't apply >> to them. >> >>> >>> IMHO if that RT system cannot handle differences in framebuffer >>> caching, it's under-powered. It's just a matter of time until >>> something else changes and the problem returns. And (honest question) >>> as it's an x86-64, how do they handle System Management Mode? >> >> I think it's not a big news, that the Matrox G200 from 1999 is >> under-powered. >> I was also a bit surprised that flushing the cache would have such >> effect on latency. The tests we are doing can run 24h with the >> workaround, without any interrupt taking more than 10us. Without the >> workaround, every ~30s the interrupt failed its 10us target. 
>>
>>>
>>>>
>>>> The regression has been bisected to 2 commits:
>>>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages")
>>>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail")
>>>>
>>>> The first one changed the system-memory framebuffer from Write-Combine
>>>> to the default caching.
>>>> Before the second commit, the mgag200 driver used to unmap the
>>>> framebuffer after each frame, which implicitly does a cache flush.
>>>> Both regressions are fixed by the following patch, which forces a
>>>> cache flush after each frame, reverting to almost-v5.9 behavior.
>>>
>>> With that second commit, we essentially never unmap an active
>>> framebuffer console. But with commit
>>>
>>>   359c6649cd9a ("drm/gem: Implement shadow-plane {begin,end}_fb_access
>>>   with vmap")
>>>
>>> we now again unmap the console framebuffer after the pageflip happened.
>>>
>>> So how does the latest kernel behave wrt the problem?
>>
>> The regression was found when upgrading the server from v5.4 to v5.14,
>> so we didn't test with later kernels.
>> We will test with v6.3 (which should have 359c6649cd9a) and see what
>> happens.
>
> I don't have a clear explanation, but testing with v6.3 and forcing
> Write-Combine doesn't fix the latency issue, so forcing the cache flush
> is still needed.
>
> Also, some systems use "isolated CPUs" to handle RT tasks, but with a
> standard kernel (so without CONFIG_PREEMPT_RT).
> So I'm wondering if we can use a kernel module parameter for this, so
> that users who want to achieve low latency can opt in?
>
> Something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ?

Hi,

I have now access to a server that reproduces the issue, and I was able
to test different workarounds.

So it is definitely related to the "Write Combine" mode of the MGA
internal RAM.
If I comment out the two lines that enable WC:
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/mgag200/mgag200_drv.c#L150,
then the latency is <10us (but performance is worse, from 20ms to 87ms
to draw a full frame).

I also tried to flush the VRAM using:

drm_clflush_virt_range(mdev->vram + clip->y1 * fb->pitches[0],
                       drm_rect_height(clip) * fb->pitches[0]);

That lowers the latency to ~20us, but it's not enough.

I tried "sfence", which I thought would flush the WC buffers of the
CPU, but it has no effect in practice.

I think I can send a new patch to not map the VRAM as Write-Combine,
either if CONFIG_PREEMPT_RT is set or if a module parameter is set.
What do you think is the best approach?

My test setup:

I either flood the console with "cat /dev/urandom | base64" from the
BMC, or write to fb0 with "while true; do dd if=/dev/urandom of=/dev/fb0 ; done"

then I run:

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo 950000 > hwlat_detector/width
echo hwlat > current_tracer
echo 10 > tracing_thresh
echo 1 > tracing_on
sleep 300
cat trace
echo 0 > tracing_on
echo nop > current_tracer

Test results:

v6.6 (~40us latency)

# tracer: hwlat
#
# entries-in-buffer/entries-written: 6/6   #P:56
#
#                              _-----=> irqs-off/BH-disabled
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| / _-=> migrate-disable
#                            |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
           <...>-8795    [001] dn...   613.749741: #1     inner/outer(us):   41/41    ts:1702045267.016266287 count:9976
           <...>-8804    [000] dn...   675.201727: #2     inner/outer(us):   40/40    ts:1702045328.469859973 count:6506
           <...>-8804    [000] dn...   731.435258: #3     inner/outer(us):   40/41    ts:1702045384.704861049 count:9578
           <...>-8804    [000] d....   787.684533: #4     inner/outer(us):   41/41    ts:1702045440.955603974 count:22591
           <...>-8804    [000] d....   844.303041: #5     inner/outer(us):   41/41    ts:1702045497.575594006 count:33324
           <...>-8804    [000] d....   900.494844: #6     inner/outer(us):   40/40    ts:1702045553.768864888 count:1924

v6.6 + clflush mdev->vram (~20us latency)

# tracer: hwlat
#
# entries-in-buffer/entries-written: 3/3   #P:56
#
#                              _-----=> irqs-off/BH-disabled
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| / _-=> migrate-disable
#                            |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
           <...>-8983    [001] d....   534.595415: #1     inner/outer(us):   21/20    ts:1702024432.844243126 count:3018
           <...>-8983    [000] dn...   758.560453: #2     inner/outer(us):   21/21    ts:1702024656.824367474 count:22238
           <...>-8983    [000] dn...   815.117057: #3     inner/outer(us):   21/21    ts:1702024713.373220580 count:26455

v6.6 + no Write-Combine for VRAM (<10us latency)

# tracer: hwlat
#
# entries-in-buffer/entries-written: 0/0   #P:56
#
#                              _-----=> irqs-off/BH-disabled
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| / _-=> migrate-disable
#                            |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |

Best regards,

-- 

Jocelyn

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2023-12-11 9:31 ` Jocelyn Falempe @ 2024-02-06 13:33 ` Daniel Vetter 2024-02-06 13:51 ` Jocelyn Falempe 0 siblings, 1 reply; 9+ messages in thread From: Daniel Vetter @ 2024-02-06 13:33 UTC (permalink / raw) To: Jocelyn Falempe Cc: Thomas Zimmermann, dri-devel, airlied, daniel, Linux Kernel Mailing List On Mon, Dec 11, 2023 at 10:31:28AM +0100, Jocelyn Falempe wrote: > On 06/11/2023 11:46, Jocelyn Falempe wrote: > > On 23/10/2023 10:30, Jocelyn Falempe wrote: > > > On 20/10/2023 14:06, Thomas Zimmermann wrote: > > > > (cc'ing lkml for feedback) > > > > > > > > Hi Jocelyn > > > > > > > > Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: > > > > > We found a regression in v5.10 on real-time server, using the > > > > > rt-kernel and the mgag200 driver. It's some really specialized > > > > > workload, with <10us latency expectation on isolated core. > > > > > After the v5.10, the real time tasks missed their <10us latency > > > > > when something prints on the screen (fbcon or printk) > > > > > > > > I'd like to hear the opinion of the RT-devs on this patch. > > > > Because AFAIK we never did such a workaround in other drivers. > > > > And AFAIK printk is a PITA anyway. > > > > > > Most other drivers uses DMA, which means this workaround can't apply > > > to them. > > > > > > > > > > > IMHO if that RT system cannot handle differences in framebuffer > > > > caching, it's under-powered. It's just a matter of time until > > > > something else changes and the problem returns. And (honest > > > > question) as it's an x86-64, how do they handle System > > > > Management Mode? > > > > > > I think it's not a big news, that the Matrox G200 from 1999 is > > > under-powered. > > > I was also a bit surprised that flushing the cache would have such > > > effect on latency. The tests we are doing can run 24h with the > > > workaround, without any interrupt taking more than 10us. 
Without the > > > workaround, every ~30s the interrupt failed its 10us target. > > > > > > > > > > > > > > > > > The regression has been bisected to 2 commits: > > > > > 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages") > > > > > 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail") > > > > > > > > > > The first one changed the system memory framebuffer from Write-Combine > > > > > to the default caching. > > > > > Before the second commit, the mgag200 driver used to unmap the > > > > > framebuffer after each frame, which implicitly does a cache flush. > > > > > Both regressions are fixed by the following patch, which forces a > > > > > cache flush after each frame, reverting to almost v5.9 behavior. > > > > > > > > With that second commit, we essentially never unmap an active > > > > framebuffer console. But with commit > > > > > > > > 359c6649cd9a ("drm/gem: Implement shadow-plane {begin, > > > > end}_fb_access with vmap") > > > > > > > > we now again unmap the console framebuffer after the pageflip happened. > > > > > > > > So how does the latest kernel behave wrt to the problem? > > > > > > The regression was found when upgrading the server from v5.4 to > > > v5.14, so we didn't test with later kernels. > > > We will test with v6.3 (which should have 359c6649cd9a ) and see > > > what it gives. > > > > I don't have a clear explanation, but testing with v6.3, and forcing the > > Write Combine, doesn't fix the latency issue. So forcing the cache flush > > is still needed. > > > > Also, on some systems, they use "isolated cpu" to handle RT task, but > > with a standard kernel (so without the CONFIG_PREEMPT_RT). > > So I'm wondering if we can use a kernel module parameter for this, > > so that users that wants to achieve low latency, can opt-in ? > > > > something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ? > > Hi, > > I have now access to a server that reproduce the issue, and I was able to > test different workarounds. 
>
> So it is definitely related to the "Write Combine" mode of the MGA
> internal RAM.
> If I comment out the two lines that enable WC:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/mgag200/mgag200_drv.c#L150,
> then the latency is <10us (but performance is worse, from 20ms to 87ms
> to draw a full frame).

Ok, this is very strange, but at least it starts to make sense.
Apparently, streaming a _lot_ of writes from WB to WC memory on one CPU
results in high latencies on other CPUs. And the only way to fix that
is to artificially make the WB source suffer cache misses, by flushing
them out.

> I also tried to flush the VRAM using:
>
> drm_clflush_virt_range(mdev->vram + clip->y1 * fb->pitches[0],
>                        drm_rect_height(clip) * fb->pitches[0]);
>
> That lowers the latency to ~20us, but it's not enough.
>
> I tried "sfence", which I thought would flush the WC buffers of the
> CPU, but it has no effect in practice.
>
> I think I can send a new patch to not map the VRAM as Write-Combine,
> either if CONFIG_PREEMPT_RT is set or if a module parameter is set.
> What do you think is the best approach?

I think an mgag200 module option like Dave suggested is best. Plus, the
entire debug story above belongs in the commit message, especially the
things you've figured out in your latest testing (apologies for missing
your mail from December; please ping again if things get dropped like
that), since it kinda explains what's going on. It still doesn't make
much sense that a CPU doing a lot of WB->WC transfers can hurt other
cores like this, but at least it seems technically plausible.

Also, please link to this thread for all the details on the test setup.
I think the above is enough as a summary for the commit message, but if
you want you can include all the details below too.
Cheers, Sima > My tests setup: > > I either flood the console with "cat /dev/urandom | base64" from the BMC, or > write to fb0 with "while true; do dd if=/dev/urandom of=/dev/fb0 ; done" > > then I run: > > cd /sys/kernel/debug/tracing > echo 0 > tracing_on > echo 950000 > hwlat_detector/width > echo hwlat > current_tracer > echo 10 > tracing_thresh > echo 1 > tracing_on > sleep 300 > cat trace > echo 0 > tracing_on > echo nop > current_tracer > > > Test Results: > > V6.6 (~40us latency) > > # tracer: hwlat > # > # entries-in-buffer/entries-written: 6/6 #P:56 > # > # _-----=> irqs-off/BH-disabled > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / _-=> migrate-disable > # |||| / delay > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > # | | | ||||| | | > <...>-8795 [001] dn... 613.749741: #1 inner/outer(us): > 41/41 ts:1702045267.016266287 count:9976 > <...>-8804 [000] dn... 675.201727: #2 inner/outer(us): > 40/40 ts:1702045328.469859973 count:6506 > <...>-8804 [000] dn... 731.435258: #3 inner/outer(us): > 40/41 ts:1702045384.704861049 count:9578 > <...>-8804 [000] d.... 787.684533: #4 inner/outer(us): > 41/41 ts:1702045440.955603974 count:22591 > <...>-8804 [000] d.... 844.303041: #5 inner/outer(us): > 41/41 ts:1702045497.575594006 count:33324 > <...>-8804 [000] d.... 900.494844: #6 inner/outer(us): > 40/40 ts:1702045553.768864888 count:1924 > > > V6.6 + clfush mdev->vram (~20us latency) > > # tracer: hwlat > # > # entries-in-buffer/entries-written: 3/3 #P:56 > # > # _-----=> irqs-off/BH-disabled > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / _-=> migrate-disable > # |||| / delay > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > # | | | ||||| | | > <...>-8983 [001] d.... 534.595415: #1 inner/outer(us): > 21/20 ts:1702024432.844243126 count:3018 > <...>-8983 [000] dn... 758.560453: #2 inner/outer(us): > 21/21 ts:1702024656.824367474 count:22238 > <...>-8983 [000] dn... 
815.117057: #3 inner/outer(us): > 21/21 ts:1702024713.373220580 count:26455 > > > V6.6 + no Write Combine for VRAM (<10us latency) > > # tracer: hwlat > # > # entries-in-buffer/entries-written: 0/0 #P:56 > # > # _-----=> irqs-off/BH-disabled > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / _-=> migrate-disable > # |||| / delay > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > # | | | ||||| | | > > > Best regards, > > -- > > Jocelyn > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2024-02-06 13:33 ` Daniel Vetter @ 2024-02-06 13:51 ` Jocelyn Falempe 0 siblings, 0 replies; 9+ messages in thread From: Jocelyn Falempe @ 2024-02-06 13:51 UTC (permalink / raw) To: Thomas Zimmermann, dri-devel, airlied, Linux Kernel Mailing List On 06/02/2024 14:33, Daniel Vetter wrote: > On Mon, Dec 11, 2023 at 10:31:28AM +0100, Jocelyn Falempe wrote: >> On 06/11/2023 11:46, Jocelyn Falempe wrote: >>> On 23/10/2023 10:30, Jocelyn Falempe wrote: >>>> On 20/10/2023 14:06, Thomas Zimmermann wrote: >>>>> (cc'ing lkml for feedback) >>>>> >>>>> Hi Jocelyn >>>>> >>>>> Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: >>>>>> We found a regression in v5.10 on real-time server, using the >>>>>> rt-kernel and the mgag200 driver. It's some really specialized >>>>>> workload, with <10us latency expectation on isolated core. >>>>>> After the v5.10, the real time tasks missed their <10us latency >>>>>> when something prints on the screen (fbcon or printk) >>>>> >>>>> I'd like to hear the opinion of the RT-devs on this patch. >>>>> Because AFAIK we never did such a workaround in other drivers. >>>>> And AFAIK printk is a PITA anyway. >>>> >>>> Most other drivers uses DMA, which means this workaround can't apply >>>> to them. >>>> >>>>> >>>>> IMHO if that RT system cannot handle differences in framebuffer >>>>> caching, it's under-powered. It's just a matter of time until >>>>> something else changes and the problem returns. And (honest >>>>> question) as it's an x86-64, how do they handle System >>>>> Management Mode? >>>> >>>> I think it's not a big news, that the Matrox G200 from 1999 is >>>> under-powered. >>>> I was also a bit surprised that flushing the cache would have such >>>> effect on latency. The tests we are doing can run 24h with the >>>> workaround, without any interrupt taking more than 10us. Without the >>>> workaround, every ~30s the interrupt failed its 10us target. 
>>>> >>>>> >>>>>> >>>>>> The regression has been bisected to 2 commits: >>>>>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages") >>>>>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail") >>>>>> >>>>>> The first one changed the system memory framebuffer from Write-Combine >>>>>> to the default caching. >>>>>> Before the second commit, the mgag200 driver used to unmap the >>>>>> framebuffer after each frame, which implicitly does a cache flush. >>>>>> Both regressions are fixed by the following patch, which forces a >>>>>> cache flush after each frame, reverting to almost v5.9 behavior. >>>>> >>>>> With that second commit, we essentially never unmap an active >>>>> framebuffer console. But with commit >>>>> >>>>> 359c6649cd9a ("drm/gem: Implement shadow-plane {begin, >>>>> end}_fb_access with vmap") >>>>> >>>>> we now again unmap the console framebuffer after the pageflip happened. >>>>> >>>>> So how does the latest kernel behave wrt to the problem? >>>> >>>> The regression was found when upgrading the server from v5.4 to >>>> v5.14, so we didn't test with later kernels. >>>> We will test with v6.3 (which should have 359c6649cd9a ) and see >>>> what it gives. >>> >>> I don't have a clear explanation, but testing with v6.3, and forcing the >>> Write Combine, doesn't fix the latency issue. So forcing the cache flush >>> is still needed. >>> >>> Also, on some systems, they use "isolated cpu" to handle RT task, but >>> with a standard kernel (so without the CONFIG_PREEMPT_RT). >>> So I'm wondering if we can use a kernel module parameter for this, >>> so that users that wants to achieve low latency, can opt-in ? >>> >>> something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ? >> >> Hi, >> >> I have now access to a server that reproduce the issue, and I was able to >> test different workarounds. >> >> So it is definitely related to the "Write Combine" mode of the mga internal >> RAM. 
If I comment the two lines to enable wc: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/mgag200/mgag200_drv.c#L150, >> then the latency is <10us (but the performance is worse, from 20ms to 87ms >> to draw a full frame). > > Ok this is very strange, but at least it starts to make sense. Apparently > if we stream a _lot_ of writes from wb to wc memory on a cpu that results > in high latencies on other cpus. And the only way to fix that is by > artificially making the wb source suffer from cache misses by flushing > them out. > >> I also tried to flush the vram using: >> drm_clflush_virt_range(mdev->vram + clip->y1 * fb->pitches[0], >> drm_rect_height(clip) * fb->pitches[0]); >> And that lowers the latency to ~20us, but it's not enough. >> >> I tried "sfence" which I thought would flush the WC buffers of the CPU, but >> that has no effect in practice. >> >> I think I can send a new patch, to not map the VRAM as Write Combine, either >> if CONFIG_PREEMPT_RT is set or if a module parameter is set. >> What do you think is the best approach ? > > I think an mgag200 module option like Dave suggested is best. > > Plus the entire above debug story in the commit message, especially the > things you've figured out in your latest testing (apologies for missing > your mail from Dec, pls ping again if things get dropped like that) kinda > explains what's going on. > > Still doesn't make much sense that a cpu doing a lot of wb->wc transfers > can hurt other cores like this, but at least that seems technically > plausible. > > Also please link to this thread for all the details on test setup, I think > the above is enough as a summary for the commit message. But if you want > you can include all the details below too. Thanks, Let me send a new patch, with this change. For the module parameter, I propose to name it "writecombine" and default to 1, so if you want low latency, you need to add mgag200.writecombine=0 to the kernel command line. 
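A module parameter along the lines proposed here would look roughly like the sketch below. This is hypothetical illustration, not the patch that was eventually merged: the helper name `mgag200_map_vram` and its wiring into the driver are invented for the example; only `module_param`, `MODULE_PARM_DESC`, `ioremap_wc()` and `ioremap()` are real kernel interfaces.

```c
/*
 * Hypothetical sketch of the proposed opt-out knob; the merged mgag200
 * code may differ.
 */
static bool writecombine = true;
module_param(writecombine, bool, 0444);
MODULE_PARM_DESC(writecombine,
		 "Map VRAM write-combined (default: 1). Boot with "
		 "mgag200.writecombine=0 to trade drawing performance "
		 "for lower worst-case latency on RT systems.");

static void __iomem *mgag200_map_vram(resource_size_t start,
				      resource_size_t len)
{
	/*
	 * WC makes full-frame draws much faster (~20ms vs ~87ms in the
	 * numbers above), but the streamed WB->WC writes were observed
	 * to cause ~40us latency spikes on other cores.
	 */
	if (writecombine)
		return ioremap_wc(start, len);
	return ioremap(start, len);
}
```

With this shape, RT users opt out via `mgag200.writecombine=0` on the kernel command line, and everyone else keeps the faster write-combined default.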
Best regards, -- Jocelyn > > Cheers, Sima > >> My tests setup: >> >> I either flood the console with "cat /dev/urandom | base64" from the BMC, or >> write to fb0 with "while true; do dd if=/dev/urandom of=/dev/fb0 ; done" >> >> then I run: >> >> cd /sys/kernel/debug/tracing >> echo 0 > tracing_on >> echo 950000 > hwlat_detector/width >> echo hwlat > current_tracer >> echo 10 > tracing_thresh >> echo 1 > tracing_on >> sleep 300 >> cat trace >> echo 0 > tracing_on >> echo nop > current_tracer >> >> >> Test Results: >> >> V6.6 (~40us latency) >> >> # tracer: hwlat >> # >> # entries-in-buffer/entries-written: 6/6 #P:56 >> # >> # _-----=> irqs-off/BH-disabled >> # / _----=> need-resched >> # | / _---=> hardirq/softirq >> # || / _--=> preempt-depth >> # ||| / _-=> migrate-disable >> # |||| / delay >> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION >> # | | | ||||| | | >> <...>-8795 [001] dn... 613.749741: #1 inner/outer(us): >> 41/41 ts:1702045267.016266287 count:9976 >> <...>-8804 [000] dn... 675.201727: #2 inner/outer(us): >> 40/40 ts:1702045328.469859973 count:6506 >> <...>-8804 [000] dn... 731.435258: #3 inner/outer(us): >> 40/41 ts:1702045384.704861049 count:9578 >> <...>-8804 [000] d.... 787.684533: #4 inner/outer(us): >> 41/41 ts:1702045440.955603974 count:22591 >> <...>-8804 [000] d.... 844.303041: #5 inner/outer(us): >> 41/41 ts:1702045497.575594006 count:33324 >> <...>-8804 [000] d.... 900.494844: #6 inner/outer(us): >> 40/40 ts:1702045553.768864888 count:1924 >> >> >> V6.6 + clfush mdev->vram (~20us latency) >> >> # tracer: hwlat >> # >> # entries-in-buffer/entries-written: 3/3 #P:56 >> # >> # _-----=> irqs-off/BH-disabled >> # / _----=> need-resched >> # | / _---=> hardirq/softirq >> # || / _--=> preempt-depth >> # ||| / _-=> migrate-disable >> # |||| / delay >> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION >> # | | | ||||| | | >> <...>-8983 [001] d.... 534.595415: #1 inner/outer(us): >> 21/20 ts:1702024432.844243126 count:3018 >> <...>-8983 [000] dn... 
758.560453: #2 inner/outer(us): >> 21/21 ts:1702024656.824367474 count:22238 >> <...>-8983 [000] dn... 815.117057: #3 inner/outer(us): >> 21/21 ts:1702024713.373220580 count:26455 >> >> >> V6.6 + no Write Combine for VRAM (<10us latency) >> >> # tracer: hwlat >> # >> # entries-in-buffer/entries-written: 0/0 #P:56 >> # >> # _-----=> irqs-off/BH-disabled >> # / _----=> need-resched >> # | / _---=> hardirq/softirq >> # || / _--=> preempt-depth >> # ||| / _-=> migrate-disable >> # |||| / delay >> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION >> # | | | ||||| | | >> >> >> Best regards, >> >> -- >> >> Jocelyn >> > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2023-11-06 10:46 ` Jocelyn Falempe 2023-12-11 9:31 ` Jocelyn Falempe @ 2024-02-05 7:02 ` Dave Airlie 2024-02-06 13:30 ` Daniel Vetter 1 sibling, 1 reply; 9+ messages in thread From: Dave Airlie @ 2024-02-05 7:02 UTC (permalink / raw) To: Jocelyn Falempe Cc: Thomas Zimmermann, dri-devel, airlied, daniel, Linux Kernel Mailing List On Mon, 6 Nov 2023 at 20:47, Jocelyn Falempe <jfalempe@redhat.com> wrote: > > On 23/10/2023 10:30, Jocelyn Falempe wrote: > > On 20/10/2023 14:06, Thomas Zimmermann wrote: > >> (cc'ing lkml for feedback) > >> > >> Hi Jocelyn > >> > >> Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: > >>> We found a regression in v5.10 on real-time server, using the > >>> rt-kernel and the mgag200 driver. It's some really specialized > >>> workload, with <10us latency expectation on isolated core. > >>> After the v5.10, the real time tasks missed their <10us latency > >>> when something prints on the screen (fbcon or printk) > >> > >> I'd like to hear the opinion of the RT-devs on this patch. Because > >> AFAIK we never did such a workaround in other drivers. And AFAIK > >> printk is a PITA anyway. > > > > Most other drivers uses DMA, which means this workaround can't apply to > > them. > > > >> > >> IMHO if that RT system cannot handle differences in framebuffer > >> caching, it's under-powered. It's just a matter of time until > >> something else changes and the problem returns. And (honest question) > >> as it's an x86-64, how do they handle System Management Mode? > > > > I think it's not a big news, that the Matrox G200 from 1999 is > > under-powered. > > I was also a bit surprised that flushing the cache would have such > > effect on latency. The tests we are doing can run 24h with the > > workaround, without any interrupt taking more than 10us. Without the > > workaround, every ~30s the interrupt failed its 10us target. 
> > > >> > >>> > >>> The regression has been bisected to 2 commits: > >>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages") > >>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail") > >>> > >>> The first one changed the system memory framebuffer from Write-Combine > >>> to the default caching. > >>> Before the second commit, the mgag200 driver used to unmap the > >>> framebuffer after each frame, which implicitly does a cache flush. > >>> Both regressions are fixed by the following patch, which forces a > >>> cache flush after each frame, reverting to almost v5.9 behavior. > >> > >> With that second commit, we essentially never unmap an active > >> framebuffer console. But with commit > >> > >> 359c6649cd9a ("drm/gem: Implement shadow-plane {begin, end}_fb_access > >> with vmap") > >> > >> we now again unmap the console framebuffer after the pageflip happened. > >> > >> So how does the latest kernel behave wrt to the problem? > > > > The regression was found when upgrading the server from v5.4 to v5.14, > > so we didn't test with later kernels. > > We will test with v6.3 (which should have 359c6649cd9a ) and see what it > > gives. > > I don't have a clear explanation, but testing with v6.3, and forcing the > Write Combine, doesn't fix the latency issue. So forcing the cache flush > is still needed. > > Also, on some systems, they use "isolated cpu" to handle RT task, but > with a standard kernel (so without the CONFIG_PREEMPT_RT). > So I'm wondering if we can use a kernel module parameter for this, > so that users that wants to achieve low latency, can opt-in ? > > something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ? I think we should either add a config option or command line parameter here. 
I don't think adding nopat to the kernel command line is a good suggestion in the long run; servers often have other cards plugged into them, like nvidia gpus or rdma etc, and you don't want to cripple them because you want reduced latency on the crappy on-board. I'd rather we put the default back to what it used to be, which was flush the cache. I'm not sure why we have any objection to doing that: it used to work, it was clearly fine in operation, why undo it? Dave. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/mgag200: Flush the cache to improve latency 2024-02-05 7:02 ` Dave Airlie @ 2024-02-06 13:30 ` Daniel Vetter 0 siblings, 0 replies; 9+ messages in thread From: Daniel Vetter @ 2024-02-06 13:30 UTC (permalink / raw) To: Dave Airlie Cc: Jocelyn Falempe, Thomas Zimmermann, dri-devel, airlied, daniel, Linux Kernel Mailing List On Mon, Feb 05, 2024 at 05:02:58PM +1000, Dave Airlie wrote: > On Mon, 6 Nov 2023 at 20:47, Jocelyn Falempe <jfalempe@redhat.com> wrote: > > > > On 23/10/2023 10:30, Jocelyn Falempe wrote: > > > On 20/10/2023 14:06, Thomas Zimmermann wrote: > > >> (cc'ing lkml for feedback) > > >> > > >> Hi Jocelyn > > >> > > >> Am 19.10.23 um 15:55 schrieb Jocelyn Falempe: > > >>> We found a regression in v5.10 on real-time server, using the > > >>> rt-kernel and the mgag200 driver. It's some really specialized > > >>> workload, with <10us latency expectation on isolated core. > > >>> After the v5.10, the real time tasks missed their <10us latency > > >>> when something prints on the screen (fbcon or printk) > > >> > > >> I'd like to hear the opinion of the RT-devs on this patch. Because > > >> AFAIK we never did such a workaround in other drivers. And AFAIK > > >> printk is a PITA anyway. > > > > > > Most other drivers uses DMA, which means this workaround can't apply to > > > them. > > > > > >> > > >> IMHO if that RT system cannot handle differences in framebuffer > > >> caching, it's under-powered. It's just a matter of time until > > >> something else changes and the problem returns. And (honest question) > > >> as it's an x86-64, how do they handle System Management Mode? > > > > > > I think it's not a big news, that the Matrox G200 from 1999 is > > > under-powered. > > > I was also a bit surprised that flushing the cache would have such > > > effect on latency. The tests we are doing can run 24h with the > > > workaround, without any interrupt taking more than 10us. 
Without the > > > workaround, every ~30s the interrupt failed its 10us target. > > > > > >> > > >>> > > >>> The regression has been bisected to 2 commits: > > >>> 0b34d58b6c32 ("drm/mgag200: Enable caching for SHMEM pages") > > >>> 4862ffaec523 ("drm/mgag200: Move vmap out of commit tail") > > >>> > > >>> The first one changed the system memory framebuffer from Write-Combine > > >>> to the default caching. > > >>> Before the second commit, the mgag200 driver used to unmap the > > >>> framebuffer after each frame, which implicitly does a cache flush. > > >>> Both regressions are fixed by the following patch, which forces a > > >>> cache flush after each frame, reverting to almost v5.9 behavior. > > >> > > >> With that second commit, we essentially never unmap an active > > >> framebuffer console. But with commit > > >> > > >> 359c6649cd9a ("drm/gem: Implement shadow-plane {begin, end}_fb_access > > >> with vmap") > > >> > > >> we now again unmap the console framebuffer after the pageflip happened. > > >> > > >> So how does the latest kernel behave wrt to the problem? > > > > > > The regression was found when upgrading the server from v5.4 to v5.14, > > > so we didn't test with later kernels. > > > We will test with v6.3 (which should have 359c6649cd9a ) and see what it > > > gives. > > > > I don't have a clear explanation, but testing with v6.3, and forcing the > > Write Combine, doesn't fix the latency issue. So forcing the cache flush > > is still needed. > > > > Also, on some systems, they use "isolated cpu" to handle RT task, but > > with a standard kernel (so without the CONFIG_PREEMPT_RT). > > So I'm wondering if we can use a kernel module parameter for this, > > so that users that wants to achieve low latency, can opt-in ? > > > > something like mgag200.force_cache_flush=1 or mgag200.low_latency=1 ? > > I think we should either add a config option or command line parameter here. 
Yeah I think the situation is cursed enough that we need that, and module option for mgag200 sounds like the most reasonable way. > I'd don't think adding nopat to the kernel command line is a good > suggestion in the long run, servers often have other cards plugged > into them like nvidia gpus or rdma etc, you don't want to cripple them > because you want reduced latency on the crappy on-board. > > I'd rather we put the default back to what it used to be, which was > flush the cache though, I'm not sure why we have any objection to > doing that, it used to work, it was clearly fine in operation, why > undo it? Uh that default is a few patches back and I think it's not great if we change that, since it means all drivers using shadow buffers for fdbev will again suffer from rendering fbcon into a wc instead of wb buffer. But Jocelyn has meanwhile dug out more info in another reply, I'll reply there. Cheers, Sima -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-02-06 14:08 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-10-19 13:55 [PATCH] drm/mgag200: Flush the cache to improve latency Jocelyn Falempe 2023-10-20 12:06 ` Thomas Zimmermann 2023-10-23 8:30 ` Jocelyn Falempe 2023-11-06 10:46 ` Jocelyn Falempe 2023-12-11 9:31 ` Jocelyn Falempe 2024-02-06 13:33 ` Daniel Vetter 2024-02-06 13:51 ` Jocelyn Falempe 2024-02-05 7:02 ` Dave Airlie 2024-02-06 13:30 ` Daniel Vetter