* socket chardevs: data loss when other end closes connection?
From: Peter Maydell @ 2022-10-18 17:55 UTC
  To: QEMU Developers
  Cc: Marc-André Lureau, Paolo Bonzini, Philippe Mathieu-Daudé

I've been looking at a (long-standing) bug where an avocado test
intermittently fails.

This happens because at the avocado end we write "halt\r" to the
serial console, which is wired up to a Unix socket; but at the UART
model we only ever see the 'h' character and no further data.  As far
as I can tell this happens because Avocado closes the socket and the
QEMU socket chardev layer loses the last few characters of data that
the guest hasn't yet read at that point.

This is what seems to me to be going on:
 * Avocado writes the data ('halt\r') and closes the socket
   pretty much immediately afterwards
 * At the glib layer, the socket is polled, and it gets G_IO_IN
   and G_IO_HUP, indicating "readable, and also closed"
 * glib's source dispatch mechanism first calls tcp_chr_read()
   to handle the G_IO_IN part
 * tcp_chr_read() reads a single byte (the 'h'), because
   SocketChardev::max_size is 1 (which in turn is because the
   device model's can_write function returned 1 to say that's
   all it can accept for now). So there's still data to be
   read in future
 * glib now calls tcp_chr_hup() because of the G_IO_HUP (as part
   of the same handle-all-the-sources loop)
 * tcp_chr_hup() calls tcp_chr_disconnect(), which basically
   frees everything, tells the chardev backend that the connection
   just closed, etc
 * the data remaining in the socket to be read after the 'h'
   is never read
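
The same ordering can be shown with plain glib outside QEMU entirely.
This is only my sketch of what I think is happening -- a socketpair
standing in for the chardev's Unix socket, and two watches mirroring
the chardev's split between its read and hangup handlers:

#include <glib.h>
#include <sys/socket.h>
#include <unistd.h>

static gboolean on_read(GIOChannel *ch, GIOCondition cond, gpointer data)
{
    char buf[1];
    /* mimic SocketChardev::max_size == 1: take one byte per dispatch */
    (void)read(g_io_channel_unix_get_fd(ch), buf, sizeof(buf));
    return TRUE;            /* keep the watch: more data is buffered */
}

static gboolean on_hup(GIOChannel *ch, GIOCondition cond, gpointer data)
{
    /* runs in the same main-loop pass as on_read(), tearing things
     * down while "alt\r" is still unread in the socket */
    g_main_loop_quit(data);
    return FALSE;
}

int main(void)
{
    int fds[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    write(fds[1], "halt\r", 5);
    close(fds[1]);          /* peer writes, then closes immediately */

    GMainLoop *loop = g_main_loop_new(NULL, FALSE);
    GIOChannel *ch = g_io_channel_unix_new(fds[0]);
    g_io_add_watch(ch, G_IO_IN, on_read, NULL);
    g_io_add_watch(ch, G_IO_HUP, on_hup, loop);  /* the second watch */
    g_main_loop_run(loop);  /* only the 'h' is ever delivered */
    return 0;
}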

How is this intended to work? I guess the socket ought to go
into some kind of "disconnecting" state, but not actually do
a tcp_chr_disconnect() until all the data has been read via
tcp_chr_read() and it's finally got an EOF indication back from
tcp_chr_recv() ?
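
Purely to illustrate what I mean by a "disconnecting" state (these
names are made up, they don't exist in the code):

typedef enum {
    CHR_CONNECTED,       /* normal operation */
    CHR_DISCONNECTING,   /* G_IO_HUP seen; keep draining reads */
    CHR_DISCONNECTED,    /* tcp_chr_recv() hit EOF; teardown done */
} ChrConnState;

/* G_IO_HUP would move CONNECTED -> DISCONNECTING; only an EOF from
 * tcp_chr_recv() would trigger the actual tcp_chr_disconnect(). */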

(The particular avocado test that intermittently fails is:
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
 -- when this happens, obviously the guest doesn't ever actually
shut down and exit, so the test harness times it out after a while.)

thanks
-- PMM



* Re: socket chardevs: data loss when other end closes connection?
From: Daniel P. Berrangé @ 2022-10-18 19:21 UTC
  To: Peter Maydell
  Cc: QEMU Developers, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Tue, Oct 18, 2022 at 06:55:08PM +0100, Peter Maydell wrote:
> I've been looking at a (long-standing) bug where an avocado test
> intermittently fails.
> 
> This happens because at the avocado end we write "halt\r" to the
> serial console, which is wired up to a Unix socket; but at the UART
> model we only ever see the 'h' character and no further data.  As far
> as I can tell this happens because Avocado closes the socket and the
> QEMU socket chardev layer loses the last few characters of data that
> the guest hasn't yet read at that point.
> 
> This is what seems to me to be going on:
>  * Avocado writes the data ('halt\r') and closes the socket
>    pretty much immediately afterwards
>  * At the glib layer, the socket is polled, and it gets G_IO_IN
>    and G_IO_HUP, indicating "readable, and also closed"
>  * glib's source dispatch mechanism first calls tcp_chr_read()
>    to handle the G_IO_IN part
>  * tcp_chr_read() reads a single byte (the 'h'), because
>    SocketChardev::max_size is 1 (which in turn is because the
>    device model's can_write function returned 1 to say that's
>    all it can accept for now). So there's still data to be
>    read in future
>  * glib now calls tcp_chr_hup() because of the G_IO_HUP (as part
>    of the same handle-all-the-sources loop)
>  * tcp_chr_hup() calls tcp_chr_disconnect(), which basically
>    frees everything, tells the chardev backend that the connection
>    just closed, etc
>  * the data remaining in the socket to be read after the 'h'
>    is never read
> 
> How is this intended to work? I guess the socket ought to go
> into some kind of "disconnecting" state, but not actually do
> a tcp_chr_disconnect() until all the data has been read via
> tcp_chr_read() and it's finally got an EOF indication back from
> tcp_chr_recv() ?

Right, this is basically broken by (lack of) design right now.

The main problem here is that we're watching the socket twice.
One set of callbacks added with io_add_watch_poll, and then
a second callback added with qio_channel_create_watch just for
G_IO_HUP.

We need there to be only 1 callback, and when that callback
gets G_IO_IN, it should *ignore* G_IO_HUP until tcp_chr_recv
returns 0 to indicate EOF. This would cause tcp_chr_read to
be invoked repeatedly with G_IO_IN | G_IO_HUP, as we read
"halt\r" one byte at a time.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: socket chardevs: data loss when other end closes connection?
From: Peter Maydell @ 2022-10-19 16:26 UTC
  To: Daniel P. Berrangé
  Cc: QEMU Developers, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Tue, 18 Oct 2022 at 20:21, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Oct 18, 2022 at 06:55:08PM +0100, Peter Maydell wrote:
> > How is this intended to work? I guess the socket ought to go
> > into some kind of "disconnecting" state, but not actually do
> > a tcp_chr_disconnect() until all the data has been read via
> > tcp_chr_read() and it's finally got an EOF indication back from
> > tcp_chr_recv() ?
>
> Right, this is basically broken by (lack of) design right now.
>
> The main problem here is that we're watching the socket twice.
> One set of callbacks added with io_add_watch_poll, and then
> > a second callback added with qio_channel_create_watch just for
> G_IO_HUP.
>
> We need there to be only 1 callback, and when that callback
> > gets G_IO_IN, it should *ignore* G_IO_HUP until tcp_chr_recv
> returns 0 to indicate EOF. This would cause tcp_chr_read to
> be invoked repeatedly with G_IO_IN | G_IO_HUP, as we read
> "halt\r" one byte at a time.

Makes sense.

I've filed https://gitlab.com/qemu-project/qemu/-/issues/1264 to
track this socket chardev bug.

It did occur to me that there's a potential complication with
the 'server' mode of this chardev: does it need to cope with
a new connection coming into the server socket while the old
fd is still hanging around in this "waiting for the guest to
read it" state? Currently tcp_chr_disconnect_locked() is where
we restart listening for new connections, so QEMU wouldn't
accept any new connection until the guest had got round to
completely draining the data from the old one.

-- PMM



* Re: socket chardevs: data loss when other end closes connection?
From: Daniel P. Berrangé @ 2022-10-19 16:29 UTC
  To: Peter Maydell
  Cc: QEMU Developers, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Wed, Oct 19, 2022 at 05:26:28PM +0100, Peter Maydell wrote:
> On Tue, 18 Oct 2022 at 20:21, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 06:55:08PM +0100, Peter Maydell wrote:
> > > How is this intended to work? I guess the socket ought to go
> > > into some kind of "disconnecting" state, but not actually do
> > > a tcp_chr_disconnect() until all the data has been read via
> > > tcp_chr_read() and it's finally got an EOF indication back from
> > > tcp_chr_recv() ?
> >
> > Right, this is basically broken by (lack of) design right now.
> >
> > The main problem here is that we're watching the socket twice.
> > One set of callbacks added with io_add_watch_poll, and then
> > > a second callback added with qio_channel_create_watch just for
> > G_IO_HUP.
> >
> > We need there to be only 1 callback, and when that callback
> > > gets G_IO_IN, it should *ignore* G_IO_HUP until tcp_chr_recv
> > returns 0 to indicate EOF. This would cause tcp_chr_read to
> > be invoked repeatedly with G_IO_IN | G_IO_HUP, as we read
> > "halt\r" one byte at a time.
> 
> Makes sense.
> 
> I've filed https://gitlab.com/qemu-project/qemu/-/issues/1264 to
> track this socket chardev bug.
> 
> It did occur to me that there's a potential complication with
> the 'server' mode of this chardev: does it need to cope with
> a new connection coming into the server socket while the old
> fd is still hanging around in this "waiting for the guest to
> read it" state? Currently tcp_chr_disconnect_locked() is where
> we restart listening for new connections, so QEMU wouldn't
> accept any new connection until the guest had got round to
> completely draining the data from the old one.

That's fine IMHO. We never actually stop listening at a socket
level, we just stop trying to accept(). So any new client will
get queued until we've drained data, then accept()d and its
new data handled.
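
i.e. the kernel's listen backlog does the queueing for us. A
standalone demo of that point (not QEMU code): connect() completes
against the backlog even though the server never calls accept().

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strcpy(addr.sun_path, "/tmp/backlog-demo.sock");
    unlink(addr.sun_path);

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 1);                 /* backlog holds the new client */

    int cli = socket(AF_UNIX, SOCK_STREAM, 0);
    int r = connect(cli, (struct sockaddr *)&addr, sizeof(addr));
    printf("connect() with no accept(): %s\n", r == 0 ? "ok" : "failed");
    /* only once the old connection is drained would the server
     * get round to accept()ing this one */
    return 0;
}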

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



