On Fri, Jul 31, 2020 at 02:20:24PM -0400, Jagannathan Raman wrote:
> @@ -343,3 +349,49 @@ static void probe_pci_info(PCIDevice *dev, Error **errp)
>          }
>      }
>  }
> +
> +static void hb_msg(PCIProxyDev *dev)
> +{
> +    DeviceState *ds = DEVICE(dev);
> +    Error *local_err = NULL;
> +    MPQemuMsg msg = { 0 };
> +
> +    msg.cmd = PROXY_PING;
> +    msg.bytestream = 0;
> +    msg.size = 0;
> +
> +    (void)mpqemu_msg_send_and_await_reply(&msg, dev->ioc, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        qio_channel_close(dev->ioc, &local_err);
> +        error_setg(&error_fatal, "Lost contact with device %s", ds->id);
> +    }
> +}

Here is my feedback from the last revision. Was this addressed?

  This patch seems incomplete since no action is taken when the device
  fails to respond. vCPU threads that access the device will still get
  stuck.

  The simplest way to make this useful is to close the connection when a
  timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket
  should perform connection cleanup. At that point there are a few
  choices:

  1. Stop guest execution and wait for the host admin to restore the
     mplink so execution can resume. This is similar to how -drive
     rerror=stop pauses the guest when a disk I/O error is encountered.

  2. Stop guest execution but defer it until this stale device is actually
     accessed. This maximizes guest uptime. Guests that rarely access the
     device may not notice at all.

  3. Return 0 from MemoryRegion read operations and ignore writes. The
     guest continues executing but the device is broken. This is risky
     because device drivers inside the guest may not be ready to deal with
     this. The result could be data loss or corruption.

  4. Raise a bus-level event. Maybe PCI error reporting can be used to
     offline the device.

  5. Terminate the guest with an error message.

  6. ?

  Until the heartbeat is fully implemented and tested I suggest dropping
  it from this patch series. Remember the G_IO_HUP will happen anyway if
  the remote device process terminates.