Re: race condition in display device caused by run_on_cpu() dropping the iothread lock

From: Gerd Hoffmann <kraxel@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: "QEMU Developers" <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: race condition in display device caused by run_on_cpu() dropping the iothread lock
Date: Mon, 15 Aug 2022 13:22:39 +0200
Message-ID: <20220815112239.37xm3zwbe5gd7trz@sirius.home.kraxel.org>
In-Reply-To: <CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com>

On Mon, Aug 01, 2022 at 02:23:55PM +0100, Peter Maydell wrote:
> I've been debugging a segfault in the raspi3b display device, and I've
> tracked it down to a race condition, but I'm not sure what the right
> way to fix it is...
> 
> The race is that a vCPU thread is handling a guest register write that
> says "resize the framebuffer", which it implements by calling
> qemu_console_resize().

[ back online after vacation ]

Easiest is probably to not instantly resize the display surface but
let the update handler do that on the next display refresh.

Many display devices do that anyway, because often multiple register
updates are needed to perform a resize and you don't want your ui
window to run through all the temporary states ...
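
A rough sketch of that deferred-resize pattern, as a standalone toy
model rather than actual QEMU code.  ToyFbState and the toy_fb_*
functions are made-up names for illustration; in a real device the
update handler would be the place that calls qemu_console_resize(),
and the state would be accessed under the usual locking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device state; the names are illustrative, not QEMU's. */
typedef struct ToyFbState {
    uint32_t reg_width, reg_height;   /* latest guest-programmed size */
    uint32_t surf_width, surf_height; /* size of the current surface */
    bool resize_pending;              /* set by register writes */
    unsigned resize_count;            /* how often we really resized */
} ToyFbState;

/* Register write path (vCPU thread): only record the new value. */
static void toy_fb_write_width(ToyFbState *s, uint32_t w)
{
    s->reg_width = w;
    s->resize_pending = true;
}

static void toy_fb_write_height(ToyFbState *s, uint32_t h)
{
    s->reg_height = h;
    s->resize_pending = true;
}

/*
 * Display refresh path: apply the pending resize once, using the final
 * register values.  Returns true if the surface size changed.
 */
static bool toy_fb_update(ToyFbState *s)
{
    if (!s->resize_pending) {
        return false;
    }
    s->resize_pending = false;
    if (s->reg_width == s->surf_width && s->reg_height == s->surf_height) {
        return false;
    }
    /* A real device would call qemu_console_resize() at this point. */
    s->surf_width = s->reg_width;
    s->surf_height = s->reg_height;
    s->resize_count++;
    return true;
}
```

With this shape, a guest that writes width and then height causes a
single resize on the next refresh, and the intermediate 800x480 state
never reaches the ui.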

Alternative: The DisplaySurface is backed by pixman images, which are
reference counted.  Some qemu code which depends on the backing store
staying around while not holding the iolock works with the pixman image
directly, because it can just take a reference to keep the image from
being freed while it is in use.
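
To illustrate the idea, here is a minimal standalone model of the
take-a-reference pattern.  ToyImage and the toy_image_* helpers are
invented for this sketch; with pixman the corresponding calls are
pixman_image_ref() and pixman_image_unref().  The toy counter is not
thread-safe, so assume the reference is taken while the lock is still
held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted image. */
typedef struct ToyImage {
    int refcount;
    unsigned char *data;
} ToyImage;

/* Allocate an image with one reference, owned by the caller. */
static ToyImage *toy_image_new(size_t size)
{
    ToyImage *img = malloc(sizeof(*img));
    if (!img) {
        return NULL;
    }
    img->data = calloc(1, size);
    if (!img->data) {
        free(img);
        return NULL;
    }
    img->refcount = 1;
    return img;
}

/* Take an extra reference so the image outlives its original owner. */
static ToyImage *toy_image_ref(ToyImage *img)
{
    img->refcount++;
    return img;
}

/* Drop a reference; returns true if this freed the image. */
static bool toy_image_unref(ToyImage *img)
{
    assert(img->refcount > 0);
    if (--img->refcount > 0) {
        return false;
    }
    free(img->data);
    free(img);
    return true;
}
```

A reader that took its own reference can keep using the data after
the surface owner has dropped its reference (for example on resize);
the memory is only freed when the last reference goes away.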

>  * memory_region_snapshot_and_clear_dirty() ends up calling run_on_cpu(),
>    which briefly drops the iothread lock.

Oh.  Is that new?

> How is this intended to work? I feel like if run_on_cpu() silently
> drops the iothread lock this probably invalidates a lot of assumptions
> that QEMU code makes, especially in this kind of setup where
> the code making the assumptions is several layers in the callstack
> above whatever it is that ends up calling run_on_cpu()...

Indeed.  The display update code paths using dirty bitmap snapshots
certainly don't expect that.

take care,
  Gerd