Re: fbdev: Garbage collect fbdev scrolling acceleration

From: Daniel Vetter <daniel@ffwll.ch>
To: Sven Schnelle <svens@stackframe.org>
Cc: Daniel Vetter <daniel@ffwll.ch>, Helge Deller <deller@gmx.de>,
	linux-fbdev@vger.kernel.org,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	dri-devel@lists.freedesktop.org,
	Thomas Zimmermann <tzimmermann@suse.de>,
	Hamza Mahfooz <someguy@effective-light.com>
Subject: Re: fbdev: Garbage collect fbdev scrolling acceleration
Date: Wed, 19 Jan 2022 17:21:49 +0100
Message-ID: <Yeg6nYZX0/0UUd/N@phenom.ffwll.local>
In-Reply-To: <87y23bitvz.fsf@x1.stackframe.org>

On Wed, Jan 19, 2022 at 05:15:44PM +0100, Sven Schnelle wrote:
> Hi Daniel,
> 
> Daniel Vetter <daniel@ffwll.ch> writes:
> 
> > On Thu, Jan 13, 2022 at 10:46:03PM +0100, Sven Schnelle wrote:
> >> Helge Deller <deller@gmx.de> writes:
> >> > Maybe on fast new x86 boxes the performance difference isn't huge,
> >> > but for all old systems, or when emulated in qemu, this makes
> >> > a big difference.
> >> >
> >> > Helge
> >> 
> >> I second that. For most people, the framebuffer isn't important as
> >> they're mostly interested in getting to X11/wayland as fast as possible.
> >> But for systems like servers without X11 it's nice to have a fast
> >> console.
> >
> > Fast console howto:
> > - shadow buffer in cached memory
> > - timer based upload of changed areas to the real framebuffer
> >
> > This one is actually fast, instead of trying to use hw bltcopy and having
> > the most terrible fallback path if that's gone. Yes, the drm fbdev helpers
> > have this (but it's not enabled on most drivers because very, very few
> > people care).
> 
> Hmm... Take my laptop with a 4k (3840x2160) screen as an example:
> 
> Let's say on average half of every line is filled with text.
> 
> So 3840/2*2160 pixels that change = 4147200 pixels. Every pixel takes 4
> bytes = 16,588,800 bytes per timer interrupt. In another mail updating on
> vsync was mentioned, so multiply that by 60 and get ~995MB per second. And
> even if you only update the screen 4 times per second, that would be ~66MB
> of data per second. I'm likely missing something here.

Since you say 4k, it's a modern box, so you have on the order of 10GB/s of
write bandwidth and around 100MB/s of read bandwidth, both from the cpu.
It all adds up. It's that uncached read which kills you and means dmesg
takes seconds to display.
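
To put rough numbers on that, here is a back-of-the-envelope sketch using
your half-screen figure and the two order-of-magnitude bandwidths above.
The helper names are made up for illustration, and the units are assumed
to be decimal (1GB = 10^9 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes touched when width_px x height_px pixels change. */
static uint64_t update_bytes(uint64_t width_px, uint64_t height_px,
			     uint64_t bytes_per_px)
{
	return width_px * height_px * bytes_per_px;
}

/* Microseconds needed to move `bytes` at `bytes_per_sec` (rounded down). */
static uint64_t transfer_usecs(uint64_t bytes, uint64_t bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return bytes * 1000000ull / bytes_per_sec;
}
```

With update_bytes(3840 / 2, 2160, 4) = 16588800 bytes, writing at 10GB/s
takes about 1.7ms, while reading the same amount back at 100MB/s takes
about 166ms.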

Also, since this is 4k, going by sales volume we're talking integrated
graphics, so whether it's the gpu or the cpu that's doing the memcpy, it's
the same memory bw budget you're burning down. And at that point doing less
copying (which the shadow buffer thing will do compared to fbcon accelerated
scrolling for every line) is the win.
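
A minimal userspace sketch of the shadow buffer idea, assuming a single
dirty scanline range is tracked. This is not the drm fbdev helper code;
struct shadow_fb and the shadow_*() functions are hypothetical names, and
`real` is just a stand-in for the hardware framebuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model; none of these names are kernel API. */
struct shadow_fb {
	uint8_t *shadow;	/* cached system memory, all drawing goes here */
	uint8_t *real;		/* stand-in for the uncached framebuffer */
	size_t pitch;		/* bytes per scanline */
	size_t height;		/* number of scanlines */
	size_t dirty_y1;	/* first dirty scanline (inclusive) */
	size_t dirty_y2;	/* end of dirty range (exclusive) */
};

/* An empty range (dirty_y1 >= dirty_y2) means nothing needs uploading. */
static void shadow_mark_clean(struct shadow_fb *fb)
{
	fb->dirty_y1 = fb->height;
	fb->dirty_y2 = 0;
}

/* Draw into the shadow only and grow the dirty range over [y, y + rows). */
static void shadow_write_rows(struct shadow_fb *fb, size_t y, size_t rows,
			      const uint8_t *src)
{
	assert(rows <= fb->height && y <= fb->height - rows);
	if (rows == 0)
		return;
	memcpy(fb->shadow + y * fb->pitch, src, rows * fb->pitch);
	if (y < fb->dirty_y1)
		fb->dirty_y1 = y;
	if (y + rows > fb->dirty_y2)
		fb->dirty_y2 = y + rows;
}

/*
 * Meant to be called from a timer: upload only the dirty scanlines.
 * `real` is written but never read. Returns the number of bytes uploaded.
 */
static size_t shadow_flush(struct shadow_fb *fb)
{
	size_t offset, bytes;

	if (fb->dirty_y1 >= fb->dirty_y2)
		return 0;
	offset = fb->dirty_y1 * fb->pitch;
	bytes = (fb->dirty_y2 - fb->dirty_y1) * fb->pitch;
	memcpy(fb->real + offset, fb->shadow + offset, bytes);
	shadow_mark_clean(fb);
	return bytes;
}
```

Any number of shadow_write_rows() calls between two timer ticks collapse
into one upload per shadow_flush(), and scrolling only ever reads from
cached memory.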

And since max and usual resolutions have pretty much scaled down along with
pcie or memory bandwidth as you go back over roughly the last 2-3 decades,
this all works just as well on old stuff.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch