* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
@ 2019-04-06 15:16 ` Dan Streetman
  2019-04-11 20:54 ` Dan Streetman
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-06 15:16 UTC (permalink / raw)
  To: qemu-devel

** Also affects: qemu (Ubuntu Trusty)
   Importance: Undecided
       Status: New

** Changed in: qemu (Ubuntu Trusty)
       Status: New => In Progress

** Changed in: qemu (Ubuntu Trusty)
   Importance: Undecided => Medium

** Changed in: qemu (Ubuntu Trusty)
     Assignee: (unassigned) => Dan Streetman (ddstreet)

** Also affects: qemu
   Importance: Undecided
       Status: New

** Changed in: qemu
       Status: New => In Progress

** Changed in: qemu
     Assignee: (unassigned) => Dan Streetman (ddstreet)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Trusty:
  In Progress
Status in qemu source package in Xenial:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Cosmic:
  In Progress
Status in qemu source package in Disco:
  In Progress

Bug description:
  [impact]

  On shutdown of a guest, there is a race condition that results in qemu
  crashing instead of shutting down normally.  The backtrace looks
  similar to this (depending on the specific version of qemu, of course;
  this one is taken from qemu 2.5):

  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683

  [test case]

  Unfortunately, since this is a race condition, it is very hard to
  reproduce on demand; it depends heavily on the overall configuration
  of the guest as well as how exactly it is shut down - specifically,
  its vhost-user net device must be closed from the host side at a
  specific time during qemu shutdown.

  A user with such a setup has reported that they can reproduce this
  reliably, but the configuration is too complex for me to replicate, so
  I have relied on their reproduction and testing to debug the issue and
  craft the patch for it.

  [regression potential]

  The change adds flags to prevent repeated calls to both
  vhost_net_stop() and vhost_net_cleanup() (really, it prevents repeated
  calls to vhost_dev_cleanup()).  Any regression would be seen when
  stopping and/or cleaning up a vhost net.  Regressions might include
  failure to hot-remove a vhost net from a guest, failure to clean up
  (i.e. a memory leak), or crashes while stopping or cleaning up a vhost
  net.

  [other info]

  This was originally seen in qemu 2.5 - specifically, the UCA version
  in trusty-mitaka (which uses the xenial qemu codebase).  However, it
  appears to still apply upstream, and I am sending a patch to the qemu
  list for upstream as well.

  The specific race condition for this (in the qemu 2.5 code version)
  is:

  As shown in the backtrace above, thread A starts shutting down qemu,
  e.g.:

  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status

  In this function, the code reaches an if-else check on
  (!n->vhost_started), which is false (i.e. vhost_started is true), and
  enters the else block, which calls vhost_net_stop() and then sets
  n->vhost_started to false.

  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost-user net chardev handler with a CHR_EVENT_CLOSED event and
  calls:

  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status

  Notice thread B has now reached the same function that thread A is in;
  since the checks in the function have not changed, thread B follows
  the same path that thread A followed, and also enters vhost_net_stop().

  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to clean things up as well, it
  segfaults as shown in the backtrace.
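
  To make the window concrete, here is a tiny self-contained model of
  the interleaving (toy code, not qemu; the vhost-user close event is
  simulated as a re-entrant call while the first stop is still running):

  #include <stdbool.h>
  #include <stdio.h>

  static bool vhost_started = true;
  static int teardown_calls;

  static void model_status_change(const char *who);

  static void model_vhost_net_stop(const char *who)
  {
      teardown_calls++;
      printf("%s: teardown call #%d\n", who, teardown_calls);
      if (teardown_calls == 1) {
          /* the vhost-user chardev close event arrives while the first
           * teardown is still in progress */
          model_status_change("thread B (CHR_EVENT_CLOSED)");
      }
  }

  static void model_status_change(const char *who)
  {
      if (!vhost_started) {
          return;                    /* nothing to stop */
      }
      model_vhost_net_stop(who);     /* the second caller still gets here */
      vhost_started = false;         /* cleared only after the stop returns */
  }

  int main(void)
  {
      model_status_change("thread A (vm_stop)");
      return 0;                      /* output shows two teardown calls */
  }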

  Avoiding this duplicate call to vhost_net_stop() is necessary, but not
  sufficient - let's continue to look at what thread B does after its
  call to qmp_set_link() returns:

  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup

  However, in main() qemu registers net_cleanup() via atexit(), which
  does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup

  and the duplicate vhost_dev_cleanup() call fails assertions, since
  things were already cleaned up.
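
  The effect can be modeled in a few lines (toy code, not qemu; it only
  shows why a second, unguarded run of the same cleanup aborts):

  #include <assert.h>
  #include <stdbool.h>
  #include <stdlib.h>

  /* Toy model, not qemu code: the chardev-close path and the
   * atexit-registered net_cleanup path both end in the same cleanup
   * function, and the second run trips an assertion because the state
   * has already been torn down. */
  static bool dev_initialized = true;

  static void model_vhost_dev_cleanup(void)
  {
      assert(dev_initialized);          /* fails on the second call */
      dev_initialized = false;
  }

  int main(void)
  {
      atexit(model_vhost_dev_cleanup);  /* stands in for net_cleanup */
      model_vhost_dev_cleanup();        /* stands in for the close-event path */
      return 0;                         /* the atexit run asserts */
  }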

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
  2019-04-06 15:16 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
@ 2019-04-11 20:54 ` Dan Streetman
  2019-04-11 21:28 ` Dan Streetman
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-11 20:54 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

  [impact]
  
  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):
  
  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683
  
  [test case]
  
  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall configuration
  of the guest as well as how exactly it's shut down - specifically, its
  vhost user net must be closed from the host side at a specific time
  during qemu shutdown.
  
  I have someone with such a setup who has reported to me their setup is
  able to reproduce this reliably, but the config is too complex for me to
  reproduce so I have relied on their reproduction and testing to debug
  and craft the patch for this.
  
  [regression potential]
  
  the change adds flags to prevent repeated calls to both vhost_net_stop()
  and vhost_net_cleanup() (really, prevents repeated calls to
- vhost_dev_cleanup()).  Any regression would be seen when stopping and/or
- cleaning up a vhost net.  Regressions might include failure to hot-
- remove a vhost net from a guest, or failure to cleanup (i.e. mem leak),
- or crashes during cleanup or stopping a vhost net.
+ vhost_dev_cleanup(), but vhost_net_cleanup() does nothing else).  Any
+ regression would be seen when stopping and/or cleaning up a vhost net.
+ Regressions might include failure to hot-remove a vhost net from a
+ guest, or failure to cleanup (i.e. mem leak), or crashes during cleanup
+ or stopping a vhost net.
+ 
+ However, the flags are very unintrusive, and only in the shutdown path
+ (of a vhost_dev or vhost_net), and are unlikely to cause any
+ regressions.
  
  [other info]
  
  this was originally seen in the 2.5 version of qemu - specifically, the
  UCA version in trusty-mitaka (which uses the xenial qemu codebase).
  However, this appears to still apply upstream, and I am sending a patch
  to the qemu list to patch upstream as well.
  
  The specific race condition for this (in the qemu 2.5 code version) is:
  
  as shown in above bt, thread A starts shutting down qemu, e.g.:
  
  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status
  
  in this function, code gets to an if-else check for (!n->vhost_started),
  which is false (i.e. vhost_started is true) and enters the else code
  block, which calls vhost_net_stop() and then sets n->vhost_started to
  false.
  
  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a user event and calls:
  
  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status
  
  notice thread B has now reached the same function that thread A is in;
  since the checks in the function have not changed, thread B follows the
  same path that thread A followed, and enters vhost_net_stop().
  
  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to also clean up things, it
  segfaults as the shown in the bt.
  
  Avoiding only this duplicate call to vhost_net_stop() is required, but
  not enough - let's continue to look at what thread B does after its call
  to qmp_set_link() returns:
  
  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup
  
  However, in main() qemu registers atexit(net_cleanup()), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup
  
  and the duplicate vhost_dev_cleanup fails assertions since things were
  already cleaned up.


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
  2019-04-06 15:16 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
  2019-04-11 20:54 ` Dan Streetman
@ 2019-04-11 21:28 ` Dan Streetman
  2019-04-15 19:26 ` Dan Streetman
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-11 21:28 UTC (permalink / raw)
  To: qemu-devel

Test builds are available at https://launchpad.net/~ddstreet/+archive/ubuntu/lp1823458


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (2 preceding siblings ...)
  2019-04-11 21:28 ` Dan Streetman
@ 2019-04-15 19:26 ` Dan Streetman
  2019-04-23  9:12 ` Dan Streetman
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-15 19:26 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

  [impact]
  
  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):
  
  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683
  
  [test case]
  
  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall configuration
  of the guest as well as how exactly it's shut down - specifically, its
  vhost user net must be closed from the host side at a specific time
  during qemu shutdown.
  
  I have someone with such a setup who has reported to me their setup is
  able to reproduce this reliably, but the config is too complex for me to
  reproduce so I have relied on their reproduction and testing to debug
  and craft the patch for this.
  
  [regression potential]
  
- the change adds flags to prevent repeated calls to both vhost_net_stop()
- and vhost_net_cleanup() (really, prevents repeated calls to
- vhost_dev_cleanup(), but vhost_net_cleanup() does nothing else).  Any
- regression would be seen when stopping and/or cleaning up a vhost net.
- Regressions might include failure to hot-remove a vhost net from a
- guest, or failure to cleanup (i.e. mem leak), or crashes during cleanup
- or stopping a vhost net.
- 
- However, the flags are very unintrusive, and only in the shutdown path
- (of a vhost_dev or vhost_net), and are unlikely to cause any
- regressions.
+ the change adds a flag to prevent repeated calls to vhost_net_stop().
+ This also prevents any calls to vhost_net_cleanup() from
+ net_vhost_user_event().  Any regression would be seen when stopping
+ and/or cleaning up a vhost net.  Regressions might include failure to
+ hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
+ leak), or crashes during cleanup or stopping a vhost net.
  
  [other info]
  
  this was originally seen in the 2.5 version of qemu - specifically, the
  UCA version in trusty-mitaka (which uses the xenial qemu codebase).
  However, this appears to still apply upstream, and I am sending a patch
  to the qemu list to patch upstream as well.
  
  The specific race condition for this (in the qemu 2.5 code version) is:
  
  as shown in above bt, thread A starts shutting down qemu, e.g.:
  
  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status
  
  in this function, code gets to an if-else check for (!n->vhost_started),
  which is false (i.e. vhost_started is true) and enters the else code
  block, which calls vhost_net_stop() and then sets n->vhost_started to
  false.
  
  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a user event and calls:
  
  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status
  
  notice thread B has now reached the same function that thread A is in;
  since the checks in the function have not changed, thread B follows the
  same path that thread A followed, and enters vhost_net_stop().
  
  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to also clean up things, it
  segfaults as the shown in the bt.
  
  Avoiding only this duplicate call to vhost_net_stop() is required, but
  not enough - let's continue to look at what thread B does after its call
  to qmp_set_link() returns:
  
  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup
  
  However, in main() qemu registers atexit(net_cleanup()), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup
  
  and the duplicate vhost_dev_cleanup fails assertions since things were
- already cleaned up.
+ already cleaned up.  Additionally, if thread B's call to
+ vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
+ then that will call vhost_dev_stop() and vhost_disable_notifiers() which
+ both try to access things that have been freed/cleared/disabled by
+ vhost_dev_cleanup().


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (3 preceding siblings ...)
  2019-04-15 19:26 ` Dan Streetman
@ 2019-04-23  9:12 ` Dan Streetman
  2019-04-23  9:50 ` Dan Streetman
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-23  9:12 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu (Ubuntu Disco)
       Status: In Progress => Fix Released

** Changed in: qemu (Ubuntu)
       Status: In Progress => Fix Released

** Changed in: qemu
       Status: In Progress => Fix Released

** Changed in: qemu (Ubuntu Cosmic)
       Status: In Progress => Fix Released

** Changed in: qemu (Ubuntu Bionic)
       Status: In Progress => Fix Released

** Changed in: qemu (Ubuntu Trusty)
       Status: In Progress => Won't Fix

** Changed in: qemu (Ubuntu Disco)
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Changed in: qemu (Ubuntu Cosmic)
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Changed in: qemu (Ubuntu Bionic)
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Changed in: qemu
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Changed in: qemu (Ubuntu)
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Changed in: qemu (Ubuntu Trusty)
     Assignee: Dan Streetman (ddstreet) => (unassigned)

** Description changed:

  [impact]
  
  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):
  
  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683
  
  [test case]
  
  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall configuration
  of the guest as well as how exactly it's shut down - specifically, its
  vhost user net must be closed from the host side at a specific time
  during qemu shutdown.
  
  I have someone with such a setup who has reported to me their setup is
  able to reproduce this reliably, but the config is too complex for me to
  reproduce so I have relied on their reproduction and testing to debug
  and craft the patch for this.
  
  [regression potential]
  
  the change adds a flag to prevent repeated calls to vhost_net_stop().
  This also prevents any calls to vhost_net_cleanup() from
  net_vhost_user_event().  Any regression would be seen when stopping
  and/or cleaning up a vhost net.  Regressions might include failure to
  hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
  leak), or crashes during cleanup or stopping a vhost net.
  
  [other info]
  
  this was originally seen in the 2.5 version of qemu - specifically, the
  UCA version in trusty-mitaka (which uses the xenial qemu codebase).
- However, this appears to still apply upstream, and I am sending a patch
- to the qemu list to patch upstream as well.
+ 
+ After discussion upstream, it appears this was fixed upstream by commit
+ e7c83a885f8, which is included starting in version 2.9.  However, this
+ commit depends on at least commit 5345fdb4467, and likely more other
+ previous commits, which make widespread code changes and are unsuitable
+ to backport.  Therefore this seems like it should be specifically worked
+ around in the Xenial qemu codebase.
+ 
  
  The specific race condition for this (in the qemu 2.5 code version) is:
  
  as shown in above bt, thread A starts shutting down qemu, e.g.:
  
  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status
  
  in this function, code gets to an if-else check for (!n->vhost_started),
  which is false (i.e. vhost_started is true) and enters the else code
  block, which calls vhost_net_stop() and then sets n->vhost_started to
  false.
  
  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a user event and calls:
  
  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status
  
  notice thread B has now reached the same function that thread A is in;
  since the checks in the function have not changed, thread B follows the
  same path that thread A followed, and enters vhost_net_stop().
  
  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to also clean up things, it
  segfaults as the shown in the bt.
  
  Avoiding only this duplicate call to vhost_net_stop() is required, but
  not enough - let's continue to look at what thread B does after its call
  to qmp_set_link() returns:
  
  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup
  
  However, in main() qemu registers atexit(net_cleanup()), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup
  
  and the duplicate vhost_dev_cleanup fails assertions since things were
  already cleaned up.  Additionally, if thread B's call to
  vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
  then that will call vhost_dev_stop() and vhost_disable_notifiers() which
  both try to access things that have been freed/cleared/disabled by
  vhost_dev_cleanup().

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  In Progress
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

Bug description:
  [impact]

  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):

  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683

  [test case]

  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall
  configuration of the guest as well as how exactly it's shut down -
  specifically, its vhost user net must be closed from the host side at
  a specific time during qemu shutdown.

  I have someone with such a setup who has reported to me their setup is
  able to reproduce this reliably, but the config is too complex for me
  to reproduce so I have relied on their reproduction and testing to
  debug and craft the patch for this.

  [regression potential]

  the change adds a flag to prevent repeated calls to vhost_net_stop().
  This also prevents any calls to vhost_net_cleanup() from
  net_vhost_user_event().  Any regression would be seen when stopping
  and/or cleaning up a vhost net.  Regressions might include failure to
  hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
  leak), or crashes during cleanup or stopping a vhost net.

  [other info]

  this was originally seen in the 2.5 version of qemu - specifically,
  the UCA version in trusty-mitaka (which uses the xenial qemu
  codebase).

  After discussion upstream, it appears this was fixed there by
  commit e7c83a885f8, which is included starting in version 2.9.
  However, that commit depends on at least commit 5345fdb4467, and
  likely on other earlier commits, which make widespread code changes
  and are unsuitable to backport.  Therefore this seems like it should
  be specifically worked around in the Xenial qemu codebase.

  
  The specific race condition for this (in the qemu 2.5 code version) is:

  As shown in the bt above, thread A starts shutting down qemu, e.g.:

  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status

  In this function, the code reaches an if-else check on
  (!n->vhost_started); since this is false (i.e. vhost_started is true),
  it enters the else block, which calls vhost_net_stop() and only then
  sets n->vhost_started to false.
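
  To make the window concrete, here is a condensed paraphrase of the
  2.5-era logic in virtio_net_vhost_status() (hw/net/virtio-net.c).
  It is reconstructed from the description above rather than copied
  from the source, so details are trimmed and may differ slightly:

    static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
    {
        VirtIODevice *vdev = VIRTIO_DEVICE(n);
        int queues = n->multiqueue ? n->max_queues : 1;

        /* ... early returns based on status, vhost peer, etc. ... */

        if (!n->vhost_started) {
            /* start path: mark started, then start vhost */
            n->vhost_started = 1;
            if (vhost_net_start(vdev, n->nic->ncs, queues) < 0) {
                n->vhost_started = 0;
            }
        } else {
            /*
             * Stop path: n->vhost_started remains true for the whole
             * duration of vhost_net_stop(), so a second thread entering
             * this function in that window takes this same branch and
             * calls vhost_net_stop() again.
             */
            vhost_net_stop(vdev, n->nic->ncs, queues);
            n->vhost_started = 0;
        }
    }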

  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a CHR_EVENT_CLOSED event and calls:

  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status

  Notice that thread B has now reached the same function thread A is in;
  since the conditions checked in that function have not yet changed,
  thread B follows the same path thread A took and also enters
  vhost_net_stop().

  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to clean things up as well, it
  segfaults as shown in the bt.

  Avoiding this duplicate call to vhost_net_stop() is necessary but not
  sufficient; let's continue and look at what thread B does after its
  call to qmp_set_link() returns:

  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup

  However, in main() qemu registers atexit(net_cleanup), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup

  and the duplicate vhost_dev_cleanup() fails assertions since things
  were already cleaned up.  Additionally, if thread B's call to
  vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
  thread A will go on to call vhost_dev_stop() and
  vhost_disable_notifiers(), both of which try to access state that has
  already been freed/cleared/disabled by vhost_dev_cleanup().
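
  The end result can be boiled down to a tiny self-contained program
  (not QEMU code): a cleanup routine that asserts on its own
  preconditions, as the description above says vhost_dev_cleanup()
  does, will abort when both an event-handler path and the atexit()
  hook run it:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    static struct { int *state; } dev;

    static void dev_cleanup(void)
    {
        assert(dev.state != NULL);   /* the second call trips this */
        free(dev.state);
        dev.state = NULL;
        printf("cleaned up\n");
    }

    static void net_cleanup_at_exit(void)
    {
        dev_cleanup();               /* mirrors the atexit(net_cleanup) path */
    }

    int main(void)
    {
        dev.state = calloc(1, sizeof(*dev.state));
        atexit(net_cleanup_at_exit);

        dev_cleanup();               /* mirrors thread B's cleanup path via
                                      * net_vhost_user_event() */
        return 0;                    /* the atexit hook now asserts and aborts */
    }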

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (4 preceding siblings ...)
  2019-04-23  9:12 ` Dan Streetman
@ 2019-04-23  9:50 ` Dan Streetman
  2019-04-23 10:21 ` Launchpad Bug Tracker
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-23  9:50 UTC (permalink / raw)
  To: qemu-devel

Note: as mentioned in the description, it appears this was fixed upstream
by commit e7c83a885f8, which is included starting in version 2.9.
However, that commit depends on at least commit 5345fdb4467, and likely
on other earlier commits, which make widespread code changes and are
unsuitable to backport. Therefore this seems like it should be
specifically worked around in the Xenial qemu codebase.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  In Progress
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (5 preceding siblings ...)
  2019-04-23  9:50 ` Dan Streetman
@ 2019-04-23 10:21 ` Launchpad Bug Tracker
  2019-04-24 13:59 ` Robie Basak
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Launchpad Bug Tracker @ 2019-04-23 10:21 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~ddstreet/ubuntu/+source/qemu/+git/qemu/+merge/366392

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  In Progress
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (6 preceding siblings ...)
  2019-04-23 10:21 ` Launchpad Bug Tracker
@ 2019-04-24 13:59 ` Robie Basak
  2019-04-24 14:00 ` [Qemu-devel] [Bug 1823458] Please test proposed package Robie Basak
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Robie Basak @ 2019-04-24 13:59 UTC (permalink / raw)
  To: qemu-devel

SRU note: please see Christian's MP review (linked from this bug) for
some advice on additional care during SRU verification.

** Changed in: qemu (Ubuntu Xenial)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-xenial

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Please test proposed package
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (7 preceding siblings ...)
  2019-04-24 13:59 ` Robie Basak
@ 2019-04-24 14:00 ` Robie Basak
  2019-04-24 15:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Robie Basak @ 2019-04-24 14:00 UTC (permalink / raw)
  To: qemu-devel

Hello Dan, or anyone else affected,

Accepted qemu into xenial-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-
5ubuntu10.37 in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested and change the tag from
verification-needed-xenial to verification-done-xenial. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-xenial. In either case, details of your
testing will help us make a better decision.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (8 preceding siblings ...)
  2019-04-24 14:00 ` [Qemu-devel] [Bug 1823458] Please test proposed package Robie Basak
@ 2019-04-24 15:40 ` Corey Bryant
  2019-04-24 16:03 ` Corey Bryant
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-04-24 15:40 UTC (permalink / raw)
  To: qemu-devel

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/mitaka
   Importance: Undecided
       Status: New

** Changed in: cloud-archive/mitaka
   Importance: Undecided => Medium

** Changed in: cloud-archive/mitaka
       Status: New => Triaged

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (9 preceding siblings ...)
  2019-04-24 15:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
@ 2019-04-24 16:03 ` Corey Bryant
  2019-04-24 16:39 ` Dan Streetman
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-04-24 16:03 UTC (permalink / raw)
  To: qemu-devel

** Also affects: cloud-archive/ocata
   Importance: Undecided
       Status: New

** Changed in: cloud-archive/ocata
   Importance: Undecided => Medium

** Changed in: cloud-archive/ocata
       Status: New => Triaged

** Changed in: cloud-archive
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (10 preceding siblings ...)
  2019-04-24 16:03 ` Corey Bryant
@ 2019-04-24 16:39 ` Dan Streetman
  2019-04-24 16:40 ` Dan Streetman
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-24 16:39 UTC (permalink / raw)
  To: qemu-devel

** Patch added: "lp1823458-ocata.debdiff"
   https://bugs.launchpad.net/cloud-archive/+bug/1823458/+attachment/5258683/+files/lp1823458-ocata.debdiff

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

Bug description:
  [impact]

  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):

  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683

  [test case]

  Unfortunately, since this is a race condition, it is very hard to
  reproduce on demand; it depends very much on the overall configuration
  of the guest as well as on how exactly it is shut down - specifically,
  its vhost-user net device must be closed from the host side at a
  specific point during qemu shutdown.

  Someone with such a setup has reported to me that they can reproduce
  this reliably, but the configuration is too complex for me to recreate,
  so I have relied on their reproduction and testing to debug the issue
  and craft the patch.

  [regression potential]

  The change adds a flag to prevent repeated calls to vhost_net_stop().
  It also prevents any calls to vhost_net_cleanup() from
  net_vhost_user_event().  Any regression would show up when stopping
  and/or cleaning up a vhost net.  Possible regressions include failure
  to hot-remove a vhost net from a guest, failure to clean up (i.e. a
  memory leak), or crashes while stopping or cleaning up a vhost net.

  [other info]

  This was originally seen in qemu 2.5 - specifically, the UCA version
  in trusty-mitaka (which uses the xenial qemu codebase).

  After discussion upstream, it appears this was fixed upstream by
  commit e7c83a885f8, which is included starting with version 2.9.
  However, that commit depends on at least commit 5345fdb4467, and
  likely on other earlier commits as well, which make widespread code
  changes and are unsuitable to backport.  Therefore this should be
  worked around specifically in the Xenial qemu codebase.

  
  The specific race condition (in the qemu 2.5 codebase) is as follows.

  As shown in the bt above, thread A starts shutting down qemu, e.g.:

  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status

  In this function, the code reaches an if-else check on
  (!n->vhost_started); the condition is false (i.e. vhost_started is
  true), so it enters the else block, which calls vhost_net_stop() and
  then sets n->vhost_started to false.
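
  (For illustration only: a minimal, self-contained model of the check
  described above, not the actual qemu code - all fake_* names are
  hypothetical.  It shows that the stop path is chosen purely from the
  vhost_started flag, and the flag is only cleared after the stop call
  returns, so a second caller racing in before that point takes the same
  stop path again.)

  #include <stdbool.h>
  #include <stdio.h>

  struct fake_net {
      bool vhost_started;
  };

  static void fake_vhost_net_stop(struct fake_net *n)
  {
      /* long-running teardown; in qemu, the other path can run the same
       * status function while this is still in progress */
      printf("stopping (vhost_started=%d)\n", n->vhost_started);
  }

  static void fake_virtio_net_vhost_status(struct fake_net *n)
  {
      if (!n->vhost_started) {
          /* start path - not relevant for this bug */
      } else {
          fake_vhost_net_stop(n);   /* thread A and thread B both get here */
          n->vhost_started = false; /* cleared only after the stop returns */
      }
  }

  int main(void)
  {
      struct fake_net n = { .vhost_started = true };
      fake_virtio_net_vhost_status(&n);
      return 0;
  }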

  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a user event and calls:

  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status

  Notice that thread B has now reached the same function thread A is in;
  since the state checked by the function has not changed yet, thread B
  follows the same path thread A took and also enters vhost_net_stop().

  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to clean things up as well, it
  segfaults as shown in the bt (there, the char device state has already
  been cleared, so qemu_chr_fe_write_all() is passed s=0x0 and ends up
  locking a NULL mutex).

  Avoiding this duplicate call to vhost_net_stop() is necessary, but not
  sufficient - consider what thread B does after its call to
  qmp_set_link() returns:

  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup

  However, in main() qemu registers net_cleanup() via atexit(), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup

  and this duplicate vhost_dev_cleanup() fails assertions because things
  have already been cleaned up.  Additionally, if thread B's call to
  vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
  then thread A will go on to call vhost_dev_stop() and
  vhost_disable_notifiers(), both of which try to access things that have
  already been freed/cleared/disabled by vhost_dev_cleanup().
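
  (For illustration only: a minimal sketch of the guard-flag idea
  described under [regression potential] above - this is not the actual
  Xenial patch, and all fake_* names are hypothetical.  The point is that
  the first caller marks the device as stopped, so the second path
  through - e.g. the CHR_EVENT_CLOSED handler - becomes a no-op instead
  of repeating the teardown.)

  #include <stdbool.h>
  #include <stdio.h>

  struct fake_vhost_net {
      bool stopped;   /* set once the first stop/cleanup has run */
  };

  static void fake_teardown(struct fake_vhost_net *net)
  {
      printf("tearing down vhost net\n");  /* real stop/cleanup would go here */
  }

  static void fake_vhost_net_stop_once(struct fake_vhost_net *net)
  {
      if (net->stopped) {
          return;            /* second caller (the racing path) bails out */
      }
      net->stopped = true;   /* claim the teardown before doing it */
      fake_teardown(net);
  }

  int main(void)
  {
      struct fake_vhost_net net = { .stopped = false };
      fake_vhost_net_stop_once(&net);   /* shutdown path */
      fake_vhost_net_stop_once(&net);   /* e.g. CHR_EVENT_CLOSED path: no-op */
      return 0;
  }

  Whether a plain flag like this is sufficient depends on how the two
  paths are serialized in the real code; the sketch only shows the shape
  of the guard, not the actual patch.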

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1823458/+subscriptions

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (11 preceding siblings ...)
  2019-04-24 16:39 ` Dan Streetman
@ 2019-04-24 16:40 ` Dan Streetman
  2019-04-24 20:53 ` [Qemu-devel] [Bug 1823458] Please test proposed package Corey Bryant
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-24 16:40 UTC (permalink / raw)
  To: qemu-devel

UCA: workaround patches are needed for mitaka and ocata.  Mitaka can pull
from the Xenial build as usual, and a debdiff for Ocata is attached.  The
other UCA releases carry qemu later than 2.9 and so already have the
upstream fix mentioned in the description.

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Please test proposed package
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (12 preceding siblings ...)
  2019-04-24 16:40 ` Dan Streetman
@ 2019-04-24 20:53 ` Corey Bryant
  2019-04-24 20:54 ` Corey Bryant
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-04-24 20:53 UTC (permalink / raw)
  To: qemu-devel

Hello Dan, or anyone else affected,

Accepted qemu into mitaka-proposed. The package will build now and be
available in the Ubuntu Cloud Archive in a few hours, and then in the
-proposed repository.

Please help us by testing this new package. To enable the -proposed
repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, and change the tag
from verification-mitaka-needed to verification-mitaka-done. If it does
not fix the bug for you, please add a comment stating that, and change
the tag to verification-mitaka-failed. In either case, details of your
testing will help us make a better decision.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance!

** Changed in: cloud-archive/mitaka
       Status: Triaged => Fix Committed

** Tags added: verification-mitaka-needed

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Please test proposed package
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (13 preceding siblings ...)
  2019-04-24 20:53 ` [Qemu-devel] [Bug 1823458] Please test proposed package Corey Bryant
@ 2019-04-24 20:54 ` Corey Bryant
  2019-04-30 10:06 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-04-24 20:54 UTC (permalink / raw)
  To: qemu-devel

Hello Dan, or anyone else affected,

Accepted qemu into ocata-proposed. The package will build now and be
available in the Ubuntu Cloud Archive in a few hours, and then in the
-proposed repository.

Please help us by testing this new package. To enable the -proposed
repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, and change the tag
from verification-ocata-needed to verification-ocata-done. If it does
not fix the bug for you, please add a comment stating that, and change
the tag to verification-ocata-failed. In either case, details of your
testing will help us make a better decision.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance!

** Changed in: cloud-archive/ocata
       Status: Triaged => Fix Committed

** Tags added: verification-ocata-needed

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (14 preceding siblings ...)
  2019-04-24 20:54 ` Corey Bryant
@ 2019-04-30 10:06 ` Dan Streetman
  2019-04-30 10:07 ` Dan Streetman
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-30 10:06 UTC (permalink / raw)
  To: qemu-devel

This has been verified by the original reporter to fix the problem of
qemu crashing.

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (15 preceding siblings ...)
  2019-04-30 10:06 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
@ 2019-04-30 10:07 ` Dan Streetman
  2019-05-06  9:01 ` Łukasz Zemczak
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-04-30 10:07 UTC (permalink / raw)
  To: qemu-devel

clarification: by "original reporter" I mean the customer of Canonical,
reporting the problem to us.

** Tags removed: verification-mitaka-needed verification-needed verification-needed-xenial verification-ocata-needed
** Tags added: verification-done verification-done-xenial verification-mitaka-done verification-ocata-done

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (16 preceding siblings ...)
  2019-04-30 10:07 ` Dan Streetman
@ 2019-05-06  9:01 ` Łukasz Zemczak
  2019-05-07 19:11 ` Brian Murray
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Łukasz Zemczak @ 2019-05-06  9:01 UTC (permalink / raw)
  To: qemu-devel

Since this SRU is rather hard to verify, and given Christian's
suggestion to let the SRU age a bit longer, I would still wait a few
days before releasing.

Dan (or anyone else involved) - could you perform some safety checks
with this package against the use cases mentioned in the regression
potential section? That would make me feel much safer about releasing
this, knowing it was tested by more than just one person.

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Committed
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (17 preceding siblings ...)
  2019-05-06  9:01 ` Łukasz Zemczak
@ 2019-05-07 19:11 ` Brian Murray
  2019-05-07 19:26 ` Dan Streetman
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Brian Murray @ 2019-05-07 19:11 UTC (permalink / raw)
  To: qemu-devel

I'm setting this to Incomplete per sil2100's last comment.

** Changed in: qemu (Ubuntu Xenial)
       Status: Fix Committed => Incomplete

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Incomplete
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

Bug description:
  [impact]

  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):

  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683

  [test case]

  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall
  configuration of the guest as well as how exactly it's shut down -
  specifically, its vhost user net must be closed from the host side at
  a specific time during qemu shutdown.

  I have someone with such a setup who has reported to me their setup is
  able to reproduce this reliably, but the config is too complex for me
  to reproduce so I have relied on their reproduction and testing to
  debug and craft the patch for this.

  [regression potential]

  the change adds a flag to prevent repeated calls to vhost_net_stop().
  This also prevents any calls to vhost_net_cleanup() from
  net_vhost_user_event().  Any regression would be seen when stopping
  and/or cleaning up a vhost net.  Regressions might include failure to
  hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
  leak), or crashes during cleanup or stopping a vhost net.

  [other info]

  this was originally seen in the 2.5 version of qemu - specifically,
  the UCA version in trusty-mitaka (which uses the xenial qemu
  codebase).

  After discussion upstream, it appears this was fixed upstream by
  commit e7c83a885f8, which is included starting in version 2.9.
  However, this commit depends on at least commit 5345fdb4467, and
  likely more other previous commits, which make widespread code changes
  and are unsuitable to backport.  Therefore this seems like it should
  be specifically worked around in the Xenial qemu codebase.

  
  The specific race condition for this (in the qemu 2.5 code version) is:

  as shown in above bt, thread A starts shutting down qemu, e.g.:

  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status

  in this function, code gets to an if-else check for
  (!n->vhost_started), which is false (i.e. vhost_started is true) and
  enters the else code block, which calls vhost_net_stop() and then sets
  n->vhost_started to false.

  While thread A is inside vhost_net_stop(), thread B is triggered by
  the vhost net chr handler with a user event and calls:

  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status

  notice thread B has now reached the same function that thread A is in;
  since the checks in the function have not changed, thread B follows
  the same path that thread A followed, and enters vhost_net_stop().

  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to also clean up things, it
  segfaults as the shown in the bt.

  Avoiding only this duplicate call to vhost_net_stop() is required, but
  not enough - let's continue to look at what thread B does after its
  call to qmp_set_link() returns:

  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup

  However, in main() qemu registers atexit(net_cleanup()), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup

  and the duplicate vhost_dev_cleanup fails assertions since things were
  already cleaned up.  Additionally, if thread B's call to
  vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
  then that will call vhost_dev_stop() and vhost_disable_notifiers()
  which both try to access things that have been freed/cleared/disabled
  by vhost_dev_cleanup().

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1823458/+subscriptions


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (18 preceding siblings ...)
  2019-05-07 19:11 ` Brian Murray
@ 2019-05-07 19:26 ` Dan Streetman
  2019-05-08  6:49   ` Christian Ehrhardt 
  2019-05-10 14:40 ` Dan Streetman
                   ` (8 subsequent siblings)
  28 siblings, 1 reply; 30+ messages in thread
From: Dan Streetman @ 2019-05-07 19:26 UTC (permalink / raw)
  To: qemu-devel

@sil2100 yes I agree, let's wait longer before releasing.  We have the
Canonical customer performing testing with the package, and we can run
some additional sanity checks as well.  The config coming from the
customer is an openstack setup using OVS, so that's what we will set up
and perform sanity testing on.

@cpaelzer, if you have any suggestions for specific tests/configurations
that might be good to test the specific code changed here, please let me
know.

@bdmurray, sure, let's leave it set to incomplete while we're regression
testing ;-)


* Re: [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
  2019-05-07 19:26 ` Dan Streetman
@ 2019-05-08  6:49   ` Christian Ehrhardt 
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Ehrhardt  @ 2019-05-08  6:49 UTC (permalink / raw)
  To: qemu-devel

> @cpaelzer, if you have any suggestions for specific tests/configurations
> that might be good to test the specific code changed here, please let me
> know.

I have run the few tests that cover that area on PPAs in the past already.
Unfortunately this is a very specific path and I don't have many more
tests for it.

If anything comes to mind it would be loops of attaching/detaching
extra interfaces to guests and trying some traffic on them.
And every now and then, in between, suspend/resume or shutdown/start
the guest again.
Like:
repeat forever
   start or resume
        repeat ~20 times
          add network device
          check network device to work
   shutdown or suspend
This should cover a lot of paths that your change might have affected.
/me hopes that indents will be retained by LP


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (19 preceding siblings ...)
  2019-05-07 19:26 ` Dan Streetman
@ 2019-05-10 14:40 ` Dan Streetman
  2019-05-13 10:54 ` Łukasz Zemczak
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-05-10 14:40 UTC (permalink / raw)
  To: qemu-devel

On a Xenial DPDK setup with the proposed qemu version (1:2.5+dfsg-
5ubuntu10.37), I created a VM and attached a vhost-user interface to it
using this xml:

$ cat vm3-iface2.xml 
  <interface type='vhostuser'>
    <mac address='52:54:00:c3:37:7e'/>
    <source type='unix' path='/run/openvswitch/vhu2' mode='client'/>
    <model type='virtio'/>
    <driver name='vhost'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
  </interface>

the OVS interface was created with:
# ovs-vsctl add-port br1 vhu2 -- set Interface vhu2 type=dpdkvhostuser

The interface was added to the vm with:
$ virsh attach-device vm3 vm3-iface2.xml --live

and detached with:
$ virsh detach-device vm3 vm3-iface2.xml --live

Inside the guest, the vhost-user interface was configured with DHCP, and
a ping started to the DHCP server, 10.0.2.2:

ubuntu@vm3:~$ ping 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
64 bytes from 10.0.2.2: icmp_seq=1 ttl=255 time=0.122 ms
64 bytes from 10.0.2.2: icmp_seq=2 ttl=255 time=0.107 ms
64 bytes from 10.0.2.2: icmp_seq=3 ttl=255 time=0.112 ms
64 bytes from 10.0.2.2: icmp_seq=4 ttl=255 time=0.110 ms
From 10.198.200.1 icmp_seq=5 Destination Port Unreachable
From 10.198.200.1 icmp_seq=6 Destination Port Unreachable
From 10.198.200.1 icmp_seq=7 Destination Port Unreachable
From 10.198.200.1 icmp_seq=8 Destination Port Unreachable
64 bytes from 10.0.2.2: icmp_seq=9 ttl=255 time=0.255 ms
From 10.198.200.1 icmp_seq=10 Destination Port Unreachable
From 10.198.200.1 icmp_seq=11 Destination Port Unreachable
From 10.198.200.1 icmp_seq=12 Destination Port Unreachable
From 10.198.200.1 icmp_seq=13 Destination Port Unreachable
From 10.198.200.1 icmp_seq=14 Destination Port Unreachable
From 10.198.200.1 icmp_seq=15 Destination Port Unreachable
64 bytes from 10.0.2.2: icmp_seq=16 ttl=255 time=0.277 ms
64 bytes from 10.0.2.2: icmp_seq=17 ttl=255 time=0.127 ms
64 bytes from 10.0.2.2: icmp_seq=18 ttl=255 time=0.104 ms


Each change in ping behavior (working vs. not working) corresponded to
attaching or detaching the vhost-user interface; repeated
attaches/detaches were made with no problems, and ping correctly worked
or did not work depending on whether the interface was attached.

The guest was suspended, I waited for 5 seconds, and then resumed the
guest, with no problems.

The guest was shut down with the interface attached with no problem or
crash; the guest was also shut down with the interface detached (after
attaching and detaching several times) with no problem.

All this was repeated several times with no problems seen.

I believe this covers regression testing for the area of code the patch
touches, so marking this as fix committed again; this should be ready
for release.

** Changed in: qemu (Ubuntu Xenial)
       Status: Incomplete => Fix Committed


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (20 preceding siblings ...)
  2019-05-10 14:40 ` Dan Streetman
@ 2019-05-13 10:54 ` Łukasz Zemczak
  2019-05-13 10:54 ` [Qemu-devel] [Bug 1823458] Update Released Łukasz Zemczak
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Łukasz Zemczak @ 2019-05-13 10:54 UTC (permalink / raw)
  To: qemu-devel

This is even more than what I wanted, thanks!


* [Qemu-devel] [Bug 1823458] Update Released
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (21 preceding siblings ...)
  2019-05-13 10:54 ` Łukasz Zemczak
@ 2019-05-13 10:54 ` Łukasz Zemczak
  2019-05-13 11:04 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Launchpad Bug Tracker
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Łukasz Zemczak @ 2019-05-13 10:54 UTC (permalink / raw)
  To: qemu-devel

The verification of the Stable Release Update for qemu has completed
successfully and the package has now been released to -updates.
Subsequently, the Ubuntu Stable Release Updates Team is being
unsubscribed and will not receive messages about this bug report.  In
the event that you encounter a regression using the package from
-updates please report a new bug using ubuntu-bug and tag the bug report
regression-update so we can easily find any regressions.


* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (22 preceding siblings ...)
  2019-05-13 10:54 ` [Qemu-devel] [Bug 1823458] Update Released Łukasz Zemczak
@ 2019-05-13 11:04 ` Launchpad Bug Tracker
  2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Launchpad Bug Tracker @ 2019-05-13 11:04 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.37

---------------
qemu (1:2.5+dfsg-5ubuntu10.37) xenial; urgency=medium

  * d/p/lp1823458/add-VirtIONet-vhost_stopped-flag-to-prevent-multiple.patch,
    d/p/lp1823458/do-not-call-vhost_net_cleanup-on-running-net-from-ch.patch:
    - Prevent crash due to race condition on shutdown;
      this is fixed differently upstream (starting in Bionic), but
      the change is too large to backport into Xenial.  These two very
      small patches work around the problem in an unintrusive way.
      (LP: #1823458)

 -- Dan Streetman <ddstreet@canonical.com>  Tue, 23 Apr 2019 05:19:55 -0400

** Changed in: qemu (Ubuntu Xenial)
       Status: Fix Committed => Fix Released
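
The two patches themselves are not quoted in this thread.  Going by their
names and by the bug's [regression potential] description, the idea is a
vhost_stopped guard flag plus skipping the chr-handler-driven cleanup,
roughly along the lines of the following sketch (invented *_model names,
locking ignored; this illustrates the approach, not the actual patch
content):

  #include <stdbool.h>

  typedef struct {
      bool vhost_started;
      bool vhost_stopped;   /* set once vhost_net_stop() has been entered */
  } VirtIONetModel;

  static void vhost_net_stop_model(VirtIONetModel *n)
  {
      (void)n;   /* real teardown elided */
  }

  static void virtio_net_vhost_status_model(VirtIONetModel *n, bool want_running)
  {
      if (n->vhost_started && !want_running) {
          if (n->vhost_stopped) {
              return;          /* already stopping/stopped: never stop twice */
          }
          n->vhost_stopped = true;
          vhost_net_stop_model(n);
          n->vhost_started = false;
      }
  }

  static void net_vhost_user_event_closed_model(VirtIONetModel *n)
  {
      virtio_net_vhost_status_model(n, false);
      /*
       * Second half of the workaround: do NOT call vhost_net_cleanup()
       * from the chr CLOSED handler; leave cleanup to the atexit-driven
       * net_cleanup() path so it only ever runs once.
       */
  }

  int main(void)
  {
      VirtIONetModel n = { .vhost_started = true, .vhost_stopped = false };
      virtio_net_vhost_status_model(&n, false);   /* shutdown path */
      net_vhost_user_event_closed_model(&n);      /* CLOSED event: now a no-op */
      return 0;
  }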

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458

Title:
  race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
  crashes qemu

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Trusty:
  Won't Fix
Status in qemu source package in Xenial:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Cosmic:
  Fix Released
Status in qemu source package in Disco:
  Fix Released

Bug description:
  [impact]

  on shutdown of a guest, there is a race condition that results in qemu
  crashing instead of normally shutting down.  The bt looks similar to
  this (depending on the specific version of qemu, of course; this is
  taken from 2.5 version of qemu):

  (gdb) bt
  #0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
  #1  0x00005636c0bc4389 in qemu_mutex_lock (mutex=mutex@entry=0x0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
  #2  0x00005636c0988130 in qemu_chr_fe_write_all (s=s@entry=0x0, buf=buf@entry=0x7ffe65c086a0 "\v", len=len@entry=20) at /build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
  #3  0x00005636c08f3483 in vhost_user_write (msg=msg@entry=0x7ffe65c086a0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
      at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
  #4  0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70, ring=0x7ffe65c087e0) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
  #5  0x00005636c08efff0 in vhost_virtqueue_stop (dev=dev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338, vq=0x5636c1bf6d00, idx=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
  #6  0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@entry=0x5636c1bf6b70, vdev=vdev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
  #7  0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70, dev=dev@entry=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
  #8  0x00005636c08dbe5b in vhost_net_stop (dev=dev@entry=0x5636c2853338, ncs=0x5636c209d110, total_queues=total_queues@entry=1) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
  #9  0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a', n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
  #10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
  #11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338, val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
  #12 0x00005636c098fed2 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
  #13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
  #14 vm_stop (state=RUN_STATE_SHUTDOWN) at /build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
  #15 0x00005636c085d240 in main_loop_should_exit () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
  #16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
  #17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683

  [test case]

  unfortunately since this is a race condition, it's very hard to
  arbitrarily reproduce; it depends very much on the overall
  configuration of the guest as well as how exactly it's shut down -
  specifically, its vhost user net must be closed from the host side at
  a specific time during qemu shutdown.

  A user with such a setup has reported that they can reproduce this
  reliably, but the configuration is too complex for me to replicate, so
  I have relied on their reproduction and testing to debug and craft the
  patch for this.

  [regression potential]

  The change adds a flag to prevent repeated calls to vhost_net_stop().
  This also prevents any calls to vhost_net_cleanup() from
  net_vhost_user_event().  Any regression would be seen when stopping
  and/or cleaning up a vhost net.  Regressions might include failure to
  hot-remove a vhost net from a guest, failure to clean up (i.e. a
  memory leak), or crashes while stopping or cleaning up a vhost net.
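
  As a rough illustration of that flag-based workaround, here is a
  minimal self-contained sketch (the vhost_stopped name matches the
  Xenial patch quoted later in this thread; everything else - the
  struct, helper names and printf stand-ins - is a simplified assumption,
  not the actual qemu code):

  #include <stdbool.h>
  #include <stdio.h>

  struct virtio_net_sketch {
      bool vhost_started;
      bool vhost_stopped;    /* set once vhost_net_stop() has been called */
  };

  static void vhost_net_stop_sketch(struct virtio_net_sketch *n)
  {
      /* stands in for the real, slow stop path */
      printf("vhost_net_stop (started=%d)\n", n->vhost_started);
  }

  static void stop_vhost_once(struct virtio_net_sketch *n)
  {
      if (n->vhost_stopped) {
          return;                /* duplicate caller (e.g. chr handler) is ignored */
      }
      n->vhost_stopped = true;   /* set before the slow stop, closing the window */
      vhost_net_stop_sketch(n);
      n->vhost_started = false;
  }

  int main(void)
  {
      struct virtio_net_sketch n = { .vhost_started = true, .vhost_stopped = false };
      stop_vhost_once(&n);       /* normal shutdown path */
      stop_vhost_once(&n);       /* duplicate call is now a no-op */
      return 0;
  }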

  [other info]

  This was originally seen in the 2.5 version of qemu - specifically,
  the UCA version in trusty-mitaka (which uses the xenial qemu
  codebase).

  After discussion upstream, it appears this was fixed upstream by
  commit e7c83a885f8, which is included starting in version 2.9.
  However, that commit depends on at least commit 5345fdb4467, and
  likely on other earlier commits as well, which make widespread code
  changes and are unsuitable to backport.  Therefore this seems like it
  should be specifically worked around in the Xenial qemu codebase.

  
  The specific race condition for this (in the qemu 2.5 code version) is:

  As shown in the bt above, thread A starts shutting down qemu, e.g.:

  vm_stop->do_vm_stop->vm_state_notify
    virtio_set_status
      virtio_net_set_status
        virtio_net_vhost_status

  In this function, the code reaches an if-else check on
  (!n->vhost_started); since vhost_started is true, the condition is
  false, so it enters the else block, which calls vhost_net_stop() and
  only then sets n->vhost_started to false.
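
  To make that window concrete, here is a small self-contained sketch of
  the control flow just described (simplified for illustration; it is
  not the actual virtio-net code):

  #include <stdio.h>

  struct netdev_sketch { int vhost_started; };

  static void sketch_vhost_net_stop(struct netdev_sketch *n)
  {
      /* the real stop path is slow and can dispatch chr events, which
       * is where thread B re-enters the status handling */
      printf("stopping vhost (started=%d)\n", n->vhost_started);
  }

  static void sketch_vhost_status_down(struct netdev_sketch *n)
  {
      if (n->vhost_started) {
          sketch_vhost_net_stop(n);  /* thread B can take this same branch here */
          n->vhost_started = 0;      /* cleared only after the stop completes */
      }
  }

  int main(void)
  {
      struct netdev_sketch n = { .vhost_started = 1 };
      sketch_vhost_status_down(&n);
      return 0;
  }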

  While thread A is inside vhost_net_stop(), thread B is triggered by a
  CHR_EVENT_CLOSED event on the vhost-user net chr handler and calls:

  net_vhost_user_event
    qmp_set_link (from case CHR_EVENT_CLOSED)
      virtio_net_set_link_status (via ->link_status_changed)
        virtio_net_set_status
          virtio_net_vhost_status

  Notice that thread B has now reached the same function that thread A
  is in; since the checks in the function have not changed, thread B
  follows the same path that thread A took, and also enters
  vhost_net_stop().

  Since thread A has already shut down and cleaned up some of the
  internals, once thread B starts trying to clean things up as well, it
  segfaults as shown in the bt above.

  Avoiding this duplicate call to vhost_net_stop() is necessary but not
  sufficient - let's continue to look at what thread B does after its
  call to qmp_set_link() returns:

  net_vhost_user_event
    vhost_user_stop
      vhost_net_cleanup
        vhost_dev_cleanup

  However, in main() qemu registers atexit(net_cleanup()), which does:
  net_cleanup
    qemu_del_nic (or qemu_del_net_client, depending on ->type)
      qemu_cleanup_net_client
        vhost_user_cleanup (via ->cleanup)
          vhost_net_cleanup
            vhost_dev_cleanup

  and the duplicate vhost_dev_cleanup() fails assertions since things
  were already cleaned up.  Additionally, if thread B's call to
  vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
  then thread A goes on to call vhost_dev_stop() and
  vhost_disable_notifiers(), both of which try to access things that
  have been freed/cleared/disabled by vhost_dev_cleanup().
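
  For completeness, a minimal sketch of the second workaround - skipping
  vhost_net_cleanup() from the chr CLOSED handler while the net is still
  in use and leaving cleanup to the normal net_cleanup() path (the
  struct and helper names here are simplified assumptions, not the
  actual qemu code):

  #include <stdbool.h>
  #include <stdio.h>

  struct vhost_net_sketch { bool running; };

  static void sketch_vhost_net_cleanup(struct vhost_net_sketch *net)
  {
      printf("vhost_net_cleanup (running=%d)\n", net->running);
  }

  static void sketch_chr_closed(struct vhost_net_sketch *net)
  {
      if (net->running) {
          /* still owned by the stop path; the atexit net_cleanup
           * handler will clean it up later */
          return;
      }
      sketch_vhost_net_cleanup(net);
  }

  int main(void)
  {
      struct vhost_net_sketch net = { .running = true };
      sketch_chr_closed(&net);   /* skipped while the net is running */
      net.running = false;
      sketch_chr_closed(&net);   /* safe to clean up now */
      return 0;
  }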

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1823458/+subscriptions


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Update Released
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (23 preceding siblings ...)
  2019-05-13 11:04 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Launchpad Bug Tracker
@ 2019-05-13 18:37 ` Corey Bryant
  2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-05-13 18:37 UTC (permalink / raw)
  To: qemu-devel

The verification of the Stable Release Update for qemu has completed
successfully and the package has now been released to -updates. In the
event that you encounter a regression using the package from -updates,
please report a new bug using ubuntu-bug and tag the bug report
regression-update so we can easily find any regressions.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (24 preceding siblings ...)
  2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
@ 2019-05-13 18:37 ` Corey Bryant
  2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-05-13 18:37 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:2.8+dfsg-3ubuntu2.9~cloud5.1
---------------

 qemu (1:2.8+dfsg-3ubuntu2.9~cloud5.1) xenial-ocata; urgency=medium
 .
   * d/p/lp1823458/add-VirtIONet-vhost_stopped-flag-to-prevent-multiple.patch,
     d/p/lp1823458/do-not-call-vhost_net_cleanup-on-running-net-from-ch.patch:
     - Prevent crash due to race condition on shutdown;
       this is fixed differently upstream (starting in Bionic), but
       the change is too large to backport into Xenial.  These two very
       small patches work around the problem in an unintrusive way.
       (LP: #1823458)


** Changed in: cloud-archive/ocata
       Status: Fix Committed => Fix Released


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Update Released
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (25 preceding siblings ...)
  2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
@ 2019-05-13 18:40 ` Corey Bryant
  2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
  2019-05-16 12:34 ` Dan Streetman
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-05-13 18:40 UTC (permalink / raw)
  To: qemu-devel

The verification of the Stable Release Update for qemu has completed
successfully and the package has now been released to -updates. In the
event that you encounter a regression using the package from -updates,
please report a new bug using ubuntu-bug and tag the bug report
regression-update so we can easily find any regressions.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (26 preceding siblings ...)
  2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
@ 2019-05-13 18:40 ` Corey Bryant
  2019-05-16 12:34 ` Dan Streetman
  28 siblings, 0 replies; 30+ messages in thread
From: Corey Bryant @ 2019-05-13 18:40 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.37~cloud0
---------------

 qemu (1:2.5+dfsg-5ubuntu10.37~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 qemu (1:2.5+dfsg-5ubuntu10.37) xenial; urgency=medium
 .
   * d/p/lp1823458/add-VirtIONet-vhost_stopped-flag-to-prevent-multiple.patch,
     d/p/lp1823458/do-not-call-vhost_net_cleanup-on-running-net-from-ch.patch:
     - Prevent crash due to race condition on shutdown;
       this is fixed differently upstream (starting in Bionic), but
       the change is too large to backport into Xenial.  These two very
       small patches work around the problem in an unintrusive way.
       (LP: #1823458)


** Changed in: cloud-archive/mitaka
       Status: Fix Committed => Fix Released


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
       [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
                   ` (27 preceding siblings ...)
  2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
@ 2019-05-16 12:34 ` Dan Streetman
  28 siblings, 0 replies; 30+ messages in thread
From: Dan Streetman @ 2019-05-16 12:34 UTC (permalink / raw)
  To: qemu-devel

See bug 1829245 for a regression introduced by this patch; these patches
will be reverted from xenial and then re-uploaded along with a patch
for the regression in bug 1829380.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-05-16 13:07 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <155455149397.14414.11595397789908732027.malonedeb@gac.canonical.com>
2019-04-06 15:16 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
2019-04-11 20:54 ` Dan Streetman
2019-04-11 21:28 ` Dan Streetman
2019-04-15 19:26 ` Dan Streetman
2019-04-23  9:12 ` Dan Streetman
2019-04-23  9:50 ` Dan Streetman
2019-04-23 10:21 ` Launchpad Bug Tracker
2019-04-24 13:59 ` Robie Basak
2019-04-24 14:00 ` [Qemu-devel] [Bug 1823458] Please test proposed package Robie Basak
2019-04-24 15:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
2019-04-24 16:03 ` Corey Bryant
2019-04-24 16:39 ` Dan Streetman
2019-04-24 16:40 ` Dan Streetman
2019-04-24 20:53 ` [Qemu-devel] [Bug 1823458] Please test proposed package Corey Bryant
2019-04-24 20:54 ` Corey Bryant
2019-04-30 10:06 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Dan Streetman
2019-04-30 10:07 ` Dan Streetman
2019-05-06  9:01 ` Łukasz Zemczak
2019-05-07 19:11 ` Brian Murray
2019-05-07 19:26 ` Dan Streetman
2019-05-08  6:49   ` Christian Ehrhardt 
2019-05-10 14:40 ` Dan Streetman
2019-05-13 10:54 ` Łukasz Zemczak
2019-05-13 10:54 ` [Qemu-devel] [Bug 1823458] Update Released Łukasz Zemczak
2019-05-13 11:04 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Launchpad Bug Tracker
2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
2019-05-13 18:37 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Update Released Corey Bryant
2019-05-13 18:40 ` [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu Corey Bryant
2019-05-16 12:34 ` Dan Streetman
