socket chardevs: data loss when other end closes connection?

* socket chardevs: data loss when other end closes connection?
@ 2022-10-18 17:55 Peter Maydell
  2022-10-18 19:21 ` Daniel P. Berrangé
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Maydell @ 2022-10-18 17:55 UTC (permalink / raw)
  To: QEMU Developers
  Cc: Marc-André Lureau, Paolo Bonzini, Philippe Mathieu-Daudé

I've been looking at a (long-standing) bug where an avocado test
intermittently fails.

This happens because at the avocado end we write "halt\r" to the
serial console, which is wired up to a Unix socket; but at the UART
model we only ever see the 'h' character and no further data.  As far
as I can tell this happens because Avocado closes the socket and the
QEMU socket chardev layer loses the last few characters of data that
the guest hasn't yet read at that point.

This is what seems to me to be going on:
 * Avocado writes the data ('halt\r') and closes the socket
   pretty much immediately afterwards
 * At the glib layer, the socket is polled, and it gets G_IO_IN
   and G_IO_HUP, indicating "readable, and also closed"
 * glib's source dispatch mechanism first calls tcp_chr_read()
   to handle the G_IO_IN part
 * tcp_chr_read() reads a single byte (the 'h'), because
   SocketChardev::max_size is 1 (which in turn is because the
   device model's can_write function returned 1 to say that's
   all it can accept for now). So there's still data to be
   read in future
 * glib now calls tcp_chr_hup() because of the G_IO_HUP (as part
   of the same handle-all-the-sources loop)
 * tcp_chr_hup() calls tcp_chr_disconnect(), which basically
   frees everything, tells the chardev backend that the connection
   just closed, etc
 * the data remaining in the socket to be read after the 'h'
   is never read

How is this intended to work? I guess the socket ought to go
into some kind of "disconnecting" state, but not actually do
a tcp_chr_disconnect() until all the data has been read via
tcp_chr_read() and it's finally got an EOF indication back from
tcp_chr_recv() ?

(The particular avocado test that intermittently fails is:
tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
 -- when this happens, obviously the guest doesn't ever actually
shut down and exit, so the test harness times it out after a while.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 4+ messages in thread