On Fri, Jul 31, 2020 at 02:20:24PM -0400, Jagannathan Raman wrote: > @@ -343,3 +349,49 @@ static void probe_pci_info(PCIDevice *dev, Error **errp) > } > } > } > + > +static void hb_msg(PCIProxyDev *dev) > +{ > + DeviceState *ds = DEVICE(dev); > + Error *local_err = NULL; > + MPQemuMsg msg = { 0 }; > + > + msg.cmd = PROXY_PING; > + msg.bytestream = 0; > + msg.size = 0; > + > + (void)mpqemu_msg_send_and_await_reply(&msg, dev->ioc, &local_err); > + if (local_err) { > + error_report_err(local_err); > + qio_channel_close(dev->ioc, &local_err); > + error_setg(&error_fatal, "Lost contact with device %s", ds->id); > + } > +} Here is my feedback from the last revision. Was this addressed? This patch seems incomplete since no action is taken when the device fails to respond. vCPU threads that access the device will still get stuck. The simplest way to make this useful is to close the connection when a timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket should perform connection cleanup. At that point there are a few choices: 1. Stop guest execution and wait for the host admin to restore the mplink so execution can resume. This is similar to how -drive rerror=stop pauses the guest when a disk I/O error is encountered. 2. Stop guest execution but defer it until this stale device is actually accessed. This maximizes guest uptime. Guests that rarely access the device may not notice at all. 3. Return 0 from MemoryRegion read operations and ignore writes. The guest continues executing but the device is broken. This is risky because device drivers inside the guest may not be ready to deal with this. The result could be data loss or corruption. 4. Raise a bus-level event. Maybe PCI error reporting can be used to offline the device. 5. Terminate the guest with an error message. 6. ? Until the heartbeat is fully implemented and tested I suggest dropping it from this patch series. Remember the G_IO_HUP will happen anyway if the remote device process terminates.