* 9pfs: Twalk crash
@ 2021-08-30 15:55 Christian Schoenebeck
  2021-08-31 10:57 ` Greg Kurz
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Schoenebeck @ 2021-08-30 15:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

Apparently commit 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 '9pfs: reduce 
latency of Twalk' has introduced occasional crashes.

My first impression after looking at the backtrace: the patch itself is probably 
not causing this, but rather unmasked a pre-existing issue (i.e. it increased the 
chance of the issue being triggered).

The crash happens because 'elem' is NULL in virtio_pdu_vunmarshal() (frame 0).

bt taken with HEAD being 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);
[Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
(gdb) bt full
#0  0x000055a934dfb9a7 in virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
        s = 0x55a93717b4b8
        v = 0x55a93717aee0
        elem = 0x0
        ret = <optimized out>
#1  0x000055a934bf35e8 in pdu_unmarshal (pdu=pdu@entry=0x55a93717cde8, offset=offset@entry=7, fmt=fmt@entry=0x55a9352766d1 "ddw") at ../hw/9pfs/9p.c:71
        ret = <optimized out>
        ap = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7f38a9ad9db0, reg_save_area = 0x7f38a9ad9cf0}}
#2  0x000055a934bf68db in v9fs_walk (opaque=0x55a93717cde8) at ../hw/9pfs/9p.c:1720
        name_idx = <optimized out>
        qids = 0x0
        i = <optimized out>
        err = 0
        dpath = {size = 0, data = 0x0}
        path = {size = 0, data = 0x0}
        pathes = 0x0
        nwnames = 1
        stbuf = {st_dev = 2050, st_ino = 1199848, st_nlink = 1, st_mode = 41471, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 13, st_blksize = 4096, st_blocks = 16, s}
        fidst = {st_dev = 2050, st_ino = 1198183, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 12288, st_blksize = 4096, st_blocks = 32}
        stbufs = 0x0
        offset = 7
        fid = 299
        newfid = 687
        wnames = 0x0
        fidp = <optimized out>
        newfidp = 0x0
        pdu = 0x55a93717cde8
        s = 0x55a93717b4b8
        qid = {type = 2 '\002', version = 1556732739, path = 2399697}
#3  0x000055a93505760b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173







* Re: 9pfs: Twalk crash
  2021-08-30 15:55 9pfs: Twalk crash Christian Schoenebeck
@ 2021-08-31 10:57 ` Greg Kurz
  2021-08-31 15:00   ` Christian Schoenebeck
  0 siblings, 1 reply; 9+ messages in thread
From: Greg Kurz @ 2021-08-31 10:57 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Mon, 30 Aug 2021 17:55:04 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> Apparently commit 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 '9pfs: reduce 
> latency of Twalk' has introduced occasional crashes.
> 
> My first impression after looking at the backtrace: looks like the patch 
> itself is probably not causing this, but rather unmasked this issue (i.e. 
> increased the chance to be triggered).
> 
> The crash is because of 'elem' is NULL in virtio_pdu_vunmarshal() (frame 0).
> 

Ouch... this certainly isn't expected to happen :-\

elem is popped out of the vq in handle_9p_output():

        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
        if (!elem) {
            goto out_free_pdu;
        }
[...]
        v->elems[pdu->idx] = elem;

and cleared on PDU completion in virtio_9p_push_and_notify():

    v->elems[pdu->idx] = NULL;


I can't think of a way where push_and_notify() could collide
with handle_output()... both are supposed to be run sequentially
by the main event loop. Maybe activate some traces?

> bt taken with HEAD being 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, fmt=0x55a9352766d1 
> "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
> 146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, 
> fmt, ap);
> [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> (gdb) bt full
> #0  0x000055a934dfb9a7 in virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, 
> fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:
> 146
>         s = 0x55a93717b4b8
>         v = 0x55a93717aee0
>         elem = 0x0

So this is v->elems[pdu->idx]... what's the value of pdu->idx ?

>         ret = <optimized out>
> #1  0x000055a934bf35e8 in pdu_unmarshal (pdu=pdu@entry=0x55a93717cde8, 
> offset=offset@entry=7, fmt=fmt@entry=0x55a9352766d1 "ddw") at ../hw/9pfs/9p.c:
> 71
>         ret = <optimized out>
>         ap = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 
> 0x7f38a9ad9db0, reg_save_area = 0x7f38a9ad9cf0}}
> #2  0x000055a934bf68db in v9fs_walk (opaque=0x55a93717cde8) at ../hw/9pfs/
> 9p.c:1720
>         name_idx = <optimized out>
>         qids = 0x0
>         i = <optimized out>
>         err = 0
>         dpath = {size = 0, data = 0x0}
>         path = {size = 0, data = 0x0}
>         pathes = 0x0
>         nwnames = 1
>         stbuf = 
>             {st_dev = 2050, st_ino = 1199848, st_nlink = 1, st_mode = 41471, 
> st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 13, st_blksize = 
> 4096, st_blocks = 16, s}
>         fidst = 
>             {st_dev = 2050, st_ino = 1198183, st_nlink = 3, st_mode = 16877, 
> st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 12288, st_blksize = 
> 4096, st_blocks = 32}
>         stbufs = 0x0
>         offset = 7
>         fid = 299
>         newfid = 687
>         wnames = 0x0
>         fidp = <optimized out>
>         newfidp = 0x0
>         pdu = 0x55a93717cde8
>         s = 0x55a93717b4b8
>         qid = {type = 2 '\002', version = 1556732739, path = 2399697}
> #3  0x000055a93505760b in coroutine_trampoline (i0=<optimized out>, 
> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> 
> 
> 
> 




* Re: 9pfs: Twalk crash
  2021-08-31 10:57 ` Greg Kurz
@ 2021-08-31 15:00   ` Christian Schoenebeck
  2021-08-31 17:04     ` Greg Kurz
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Schoenebeck @ 2021-08-31 15:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Tuesday, 31 August 2021 12:57:49 CEST Greg Kurz wrote:
> On Mon, 30 Aug 2021 17:55:04 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > Apparently commit 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 '9pfs: reduce
> > latency of Twalk' has introduced occasional crashes.
> > 
> > My first impression after looking at the backtrace: looks like the patch
> > itself is probably not causing this, but rather unmasked this issue (i.e.
> > increased the chance to be triggered).
> > 
> > The crash is because of 'elem' is NULL in virtio_pdu_vunmarshal() (frame
> > 0).
> Ouch... this certainly isn't expected to happen :-\
> 
> elem is popped out the vq in handle_9p_output():
> 
>         elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>         if (!elem) {
>             goto out_free_pdu;
>         }
> [...]
>         v->elems[pdu->idx] = elem;
> 
> and cleared on PDU completion in virtio_9p_push_and_notify() :
> 
>     v->elems[pdu->idx] = NULL;
> 
> 
> I can't think of a way where push_and_notify() could collide
> with handle_output()... both are supposed to be run sequentially
> by the main event loop.

Ok, I made a quick "is same thread" assertion check:

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 54ee93b71f..bb6ebd16aa 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -28,12 +28,25 @@
 #include "qemu/module.h"
 #include "sysemu/qtest.h"
 
+static void assert_thread(void) {
+    static QemuThread thread;
+    static bool calledBefore;
+    if (!calledBefore) {
+        calledBefore = 1;
+        qemu_thread_get_self(&thread);
+        return;
+    }
+    assert(qemu_thread_is_self(&thread));
+}
+
 static void virtio_9p_push_and_notify(V9fsPDU *pdu)
 {
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
     VirtQueueElement *elem = v->elems[pdu->idx];
 
+    assert_thread();
+
     /* push onto queue and notify */
     virtqueue_push(v->vq, elem, pdu->size);
     g_free(elem);
@@ -51,6 +64,8 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
     ssize_t len;
     VirtQueueElement *elem;
 
+    assert_thread();
+
     while ((pdu = pdu_alloc(s))) {
         P9MsgHeader out;
 
@@ -125,6 +140,8 @@ static ssize_t virtio_pdu_vmarshal(V9fsPDU *pdu, size_t offset,
     VirtQueueElement *elem = v->elems[pdu->idx];
     ssize_t ret;
 
+    assert_thread();
+
     ret = v9fs_iov_vmarshal(elem->in_sg, elem->in_num, offset, 1, fmt, ap);
     if (ret < 0) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
@@ -143,6 +160,8 @@ static ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t offset,
     VirtQueueElement *elem = v->elems[pdu->idx];
     ssize_t ret;
 
+    assert_thread();
+
     ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);
     if (ret < 0) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
@@ -160,6 +179,8 @@ static void virtio_init_in_iov_from_pdu(V9fsPDU *pdu, struct iovec **piov,
     VirtQueueElement *elem = v->elems[pdu->idx];
     size_t buf_size = iov_size(elem->in_sg, elem->in_num);
 
+    assert_thread();
+
     if (buf_size < size) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
 
@@ -180,6 +201,8 @@ static void virtio_init_out_iov_from_pdu(V9fsPDU *pdu, struct iovec **piov,
     VirtQueueElement *elem = v->elems[pdu->idx];
     size_t buf_size = iov_size(elem->out_sg, elem->out_num);
 
+    assert_thread();
+
     if (buf_size < size) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
 
And it triggered. However, I am not sure whether some of the functions I
added assertions to above are actually allowed to be executed on a different
thread than the main thread:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
(gdb) bt
#0  0x00007fd3fe6db7bb in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fd3fe6c6535 in __GI_abort () at abort.c:79
#2  0x00007fd3fe6c640f in __assert_fail_base (fmt=0x7fd3fe828ee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x563569935704 "qemu_thread_is_self(&thread)", file=0x5635699356e6 2
#3  0x00007fd3fe6d4102 in __GI___assert_fail (assertion=assertion@entry=0x563569935704 "qemu_thread_is_self(&thread)", file=file@entry=0x5635699356e6 "../hw/9pfs/virtio-9p-device.c", line1
#4  0x00005635695c98aa in assert_thread () at ../hw/9pfs/virtio-9p-device.c:39
#5  0x00005635695c98aa in assert_thread () at ../hw/9pfs/virtio-9p-device.c:31
#6  0x00005635695c9a4b in virtio_9p_push_and_notify (pdu=0x56356bc38f38) at ../hw/9pfs/virtio-9p-device.c:48
#7  0x00005635693c2f28 in pdu_complete (pdu=pdu@entry=0x56356bc38f38, len=<optimized out>, len@entry=-4) at ../hw/9pfs/9p.c:1059
#8  0x00005635693c49e0 in v9fs_walk (opaque=0x56356bc38f38) at ../hw/9pfs/9p.c:1868
#9  0x00005635698256bb in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#10 0x00007fd3fe6f0b50 in __correctly_grouped_prefixwc (begin=0x2 <error: Cannot access memory at address 0x2>, end=0x0, thousands=0 L'\000', grouping=0x7fd3fe6db7bb <__GI_raise+267> "H\213\214$\b\1
#11 0x0000000000000000 in  ()
(gdb)

Also, it seems there is no official qemu_main_thread() function yet. That might
be useful in general.
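
Such a helper could presumably be a thin wrapper around the QemuThread
primitives used above; a hypothetical sketch (qemu_record_main_thread() and
qemu_is_main_thread() are made-up names, not existing QEMU API):

#include <stdbool.h>
#include "qemu/thread.h"   /* QemuThread, qemu_thread_get_self(), ... */

static QemuThread main_thread;

/* hypothetical: to be called exactly once, early, from the main thread */
void qemu_record_main_thread(void)
{
    qemu_thread_get_self(&main_thread);
}

/* hypothetical: true iff the caller runs on the recorded main thread */
bool qemu_is_main_thread(void)
{
    return qemu_thread_is_self(&main_thread);
}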

> Maybe active some traces ?

I need to read up on how to use traces, then I'll check that as well. But
ensuring thread sanity in the way proposed above would probably make sense
first.

I also tried a little retry hack as a test, just in case this was some sort of
"not ready yet" issue, because I saw Philippe was working on a virtio "queue
not ready" issue lately:
https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04460.html
https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04471.html

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 54ee93b71f..ba332c27b6 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -141,6 +141,19 @@ static ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t offset,
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
     VirtQueueElement *elem = v->elems[pdu->idx];
+    for (int i = 0; !elem; ++i) {
+        bool in_coroutine = qemu_in_coroutine();
+        printf("TRIGGERED %d (in_coroutine=%d)!\n", i, in_coroutine);
+        fflush(stdout);
+        if (in_coroutine) {
+            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
+        } else {
+            g_usleep(100);
+        }
+        s = pdu->s;
+        v = container_of(s, V9fsVirtioState, state);
+        elem = v->elems[pdu->idx];
+    }
     ssize_t ret;
 
     ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);

But that did not bring much:

TRIGGERED 0 (in_coroutine=1)!
TRIGGERED 1 (in_coroutine=1)!
TRIGGERED 2 (in_coroutine=1)!
...
TRIGGERED 43 (in_coroutine=1)!
TRIGGERED 44 (in_coroutine=1)!
qemu_aio_coroutine_enter: Co-routine was already scheduled in 'qemu_co_sleep_ns'
Aborted (core dumped)

> > bt taken with HEAD being 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4:
> > 
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7,
> > fmt=0x55a9352766d1
> > "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
> > 146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset,
> > 1, fmt, ap);
> > [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> > (gdb) bt full
> > #0  0x000055a934dfb9a7 in virtio_pdu_vunmarshal (pdu=0x55a93717cde8,
> > offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at
> > ../hw/9pfs/virtio-9p-device.c: 146
> > 
> >         s = 0x55a93717b4b8
> >         v = 0x55a93717aee0
> >         elem = 0x0
> 
> So this is v->elems[pdu->idx]... what's the value of pdu->idx ?

In that originally posted core dump it was 113:

#0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);
[Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
(gdb) p pdu->idx
$1 = 113

> >         ret = <optimized out>
> > 
> > #1  0x000055a934bf35e8 in pdu_unmarshal (pdu=pdu@entry=0x55a93717cde8,
> > offset=offset@entry=7, fmt=fmt@entry=0x55a9352766d1 "ddw") at
> > ../hw/9pfs/9p.c: 71
> > 
> >         ret = <optimized out>
> >         ap = {{gp_offset = 24, fp_offset = 48, overflow_arg_area =
> > 
> > 0x7f38a9ad9db0, reg_save_area = 0x7f38a9ad9cf0}}
> > #2  0x000055a934bf68db in v9fs_walk (opaque=0x55a93717cde8) at ../hw/9pfs/
> > 9p.c:1720
> > 
> >         name_idx = <optimized out>
> >         qids = 0x0
> >         i = <optimized out>
> >         err = 0
> >         dpath = {size = 0, data = 0x0}
> >         path = {size = 0, data = 0x0}
> >         pathes = 0x0
> >         nwnames = 1
> >         stbuf =
> >         
> >             {st_dev = 2050, st_ino = 1199848, st_nlink = 1, st_mode =
> >             41471,
> > 
> > st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 13, st_blksize
> > =
> > 4096, st_blocks = 16, s}
> > 
> >         fidst =
> >         
> >             {st_dev = 2050, st_ino = 1198183, st_nlink = 3, st_mode =
> >             16877,
> > 
> > st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 12288,
> > st_blksize = 4096, st_blocks = 32}
> > 
> >         stbufs = 0x0
> >         offset = 7
> >         fid = 299
> >         newfid = 687
> >         wnames = 0x0
> >         fidp = <optimized out>
> >         newfidp = 0x0
> >         pdu = 0x55a93717cde8
> >         s = 0x55a93717b4b8
> >         qid = {type = 2 '\002', version = 1556732739, path = 2399697}
> > 
> > #3  0x000055a93505760b in coroutine_trampoline (i0=<optimized out>,
> > i1=<optimized out>) at ../util/coroutine-ucontext.c:173





* Re: 9pfs: Twalk crash
  2021-08-31 15:00   ` Christian Schoenebeck
@ 2021-08-31 17:04     ` Greg Kurz
  2021-09-01 12:49       ` Christian Schoenebeck
  0 siblings, 1 reply; 9+ messages in thread
From: Greg Kurz @ 2021-08-31 17:04 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 31 Aug 2021 17:00:32 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Dienstag, 31. August 2021 12:57:49 CEST Greg Kurz wrote:
> > On Mon, 30 Aug 2021 17:55:04 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > Apparently commit 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 '9pfs: reduce
> > > latency of Twalk' has introduced occasional crashes.
> > > 
> > > My first impression after looking at the backtrace: looks like the patch
> > > itself is probably not causing this, but rather unmasked this issue (i.e.
> > > increased the chance to be triggered).
> > > 
> > > The crash is because of 'elem' is NULL in virtio_pdu_vunmarshal() (frame
> > > 0).
> > Ouch... this certainly isn't expected to happen :-\
> > 
> > elem is popped out the vq in handle_9p_output():
> > 
> >         elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> >         if (!elem) {
> >             goto out_free_pdu;
> >         }
> > [...]
> >         v->elems[pdu->idx] = elem;
> > 
> > and cleared on PDU completion in virtio_9p_push_and_notify() :
> > 
> >     v->elems[pdu->idx] = NULL;
> > 
> > 
> > I can't think of a way where push_and_notify() could collide
> > with handle_output()... both are supposed to be run sequentially
> > by the main event loop.
> 
> Ok, I made a quick "is same thread" assertion check:
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..bb6ebd16aa 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -28,12 +28,25 @@
>  #include "qemu/module.h"
>  #include "sysemu/qtest.h"
>  
> +static void assert_thread(void) {
> +    static QemuThread thread;
> +    static bool calledBefore;
> +    if (!calledBefore) {
> +        calledBefore = 1;
> +        qemu_thread_get_self(&thread);
> +        return;
> +    }
> +    assert(qemu_thread_is_self(&thread));
> +}
> +
>  static void virtio_9p_push_and_notify(V9fsPDU *pdu)
>  {
>      V9fsState *s = pdu->s;
>      V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
>      VirtQueueElement *elem = v->elems[pdu->idx];
>  
> +    assert_thread();
> +
>      /* push onto queue and notify */
>      virtqueue_push(v->vq, elem, pdu->size);
>      g_free(elem);
> @@ -51,6 +64,8 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue 
> *vq)
>      ssize_t len;
>      VirtQueueElement *elem;
>  
> +    assert_thread();
> +
>      while ((pdu = pdu_alloc(s))) {
>          P9MsgHeader out;
>  
> @@ -125,6 +140,8 @@ static ssize_t virtio_pdu_vmarshal(V9fsPDU *pdu, size_t 
> offset,
>      VirtQueueElement *elem = v->elems[pdu->idx];
>      ssize_t ret;
>  
> +    assert_thread();
> +
>      ret = v9fs_iov_vmarshal(elem->in_sg, elem->in_num, offset, 1, fmt, ap);
>      if (ret < 0) {
>          VirtIODevice *vdev = VIRTIO_DEVICE(v);
> @@ -143,6 +160,8 @@ static ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t 
> offset,
>      VirtQueueElement *elem = v->elems[pdu->idx];
>      ssize_t ret;
>  
> +    assert_thread();
> +
>      ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, 
> ap);
>      if (ret < 0) {
>          VirtIODevice *vdev = VIRTIO_DEVICE(v);
> @@ -160,6 +179,8 @@ static void virtio_init_in_iov_from_pdu(V9fsPDU *pdu, 
> struct iovec **piov,
>      VirtQueueElement *elem = v->elems[pdu->idx];
>      size_t buf_size = iov_size(elem->in_sg, elem->in_num);
>  
> +    assert_thread();
> +
>      if (buf_size < size) {
>          VirtIODevice *vdev = VIRTIO_DEVICE(v);
>  
> @@ -180,6 +201,8 @@ static void virtio_init_out_iov_from_pdu(V9fsPDU *pdu, 
> struct iovec **piov,
>      VirtQueueElement *elem = v->elems[pdu->idx];
>      size_t buf_size = iov_size(elem->out_sg, elem->out_num);
>  
> +    assert_thread();
> +
>      if (buf_size < size) {
>          VirtIODevice *vdev = VIRTIO_DEVICE(v);
>  
> And it triggered, however I am not sure if some of those functions I asserted 
> above are indeed allowed to be executed on a different thread than main 
> thread:
> 
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]

Based on the thread number, it seems that the signal was raised by
the main event thread...

> (gdb) bt
> #0  0x00007fd3fe6db7bb in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/
> sysv/linux/raise.c:50
> #1  0x00007fd3fe6c6535 in __GI_abort () at abort.c:79
> #2  0x00007fd3fe6c640f in __assert_fail_base
>     (fmt=0x7fd3fe828ee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
> assertion=0x563569935704 "qemu_thread_is_self(&thread)", file=0x5635699356e6 2
> #3  0x00007fd3fe6d4102 in __GI___assert_fail
>     (assertion=assertion@entry=0x563569935704 "qemu_thread_is_self(&thread)", 
> file=file@entry=0x5635699356e6 "../hw/9pfs/virtio-9p-device.c", line1

What is the content of 'thread' then, if it isn't pthread_self()?

> #4  0x00005635695c98aa in assert_thread () at ../hw/9pfs/virtio-9p-device.c:39
> #5  0x00005635695c98aa in assert_thread () at ../hw/9pfs/virtio-9p-device.c:31
> #6  0x00005635695c9a4b in virtio_9p_push_and_notify (pdu=0x56356bc38f38) at 
> ../hw/9pfs/virtio-9p-device.c:48
> #7  0x00005635693c2f28 in pdu_complete (pdu=pdu@entry=0x56356bc38f38, 
> len=<optimized out>, len@entry=-4) at ../hw/9pfs/9p.c:1059
> #8  0x00005635693c49e0 in v9fs_walk (opaque=0x56356bc38f38) at ../hw/9pfs/
> 9p.c:1868
> #9  0x00005635698256bb in coroutine_trampoline (i0=<optimized out>, 
> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> #10 0x00007fd3fe6f0b50 in __correctly_grouped_prefixwc
>     (begin=0x2 <error: Cannot access memory at address 0x2>, end=0x0, 
> thousands=0 L'\000', grouping=0x7fd3fe6db7bb <__GI_raise+267> "H\213\214$\b\1
> #11 0x0000000000000000 in  ()
> (gdb)
> 
> Also there is no official qemu_main_thread() function it seems yet. That might 
> be useful in general.
> 
> > Maybe active some traces ?
> 
> I need to read up on how to use traces, then I'll check that as well. But 
> probably ensuring thread sanity in a way proposed above would probably make 
> sense first.
> 

Yeah, and we don't have traces at the virtio-9p device level...

Anyway, have a look at hw/9pfs/trace-events to see the existing traces
and then just add "-trace v9fs_blabla" to the QEMU command line. You
can add several -trace options, and I think you can use wildcards, e.g.
"-trace v9fs_*".

> I also tried a little retry hack as a test, just in case this was some sort of 
> "not ready yet" issue, because I saw Philippe was working on some virtio queue 
> not ready issue lately:
> https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04460.html
> https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04471.html
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..ba332c27b6 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -141,6 +141,19 @@ static ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t 
> offset,
>      V9fsState *s = pdu->s;
>      V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
>      VirtQueueElement *elem = v->elems[pdu->idx];
> +    for (int i = 0; !elem; ++i) {
> +        bool in_coroutine = qemu_in_coroutine();
> +        printf("TRIGGERED %d (in_coroutine=%d)!\n", i, in_coroutine);
> +        fflush(stdout);
> +        if (in_coroutine) {
> +            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> +        } else {
> +            g_usleep(100);
> +        }
> +        s = pdu->s;
> +        v = container_of(s, V9fsVirtioState, state);
> +        elem = v->elems[pdu->idx];
> +    }
>      ssize_t ret;
>  
>      ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, 
> ap);
> 
> But that did not bring much:
> 
> TRIGGERED 0 (in_coroutine=1)!

Well, it proves that virtio_pdu_vunmarshal() is indeed passed
a PDU without an associated queue element, which is really a
no-go...

> TRIGGERED 1 (in_coroutine=1)!
> TRIGGERED 2 (in_coroutine=1)!
> ...
> TRIGGERED 43 (in_coroutine=1)!
> TRIGGERED 44 (in_coroutine=1)!
> qemu_aio_coroutine_enter: Co-routine was already scheduled in 
> 'qemu_co_sleep_ns'
> Aborted (core dumped)
> 

Strange...

> > > bt taken with HEAD being 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4:
> > > 
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> > > #0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7,
> > > fmt=0x55a9352766d1
> > > "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
> > > 146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset,
> > > 1, fmt, ap);
> > > [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> > > (gdb) bt full
> > > #0  0x000055a934dfb9a7 in virtio_pdu_vunmarshal (pdu=0x55a93717cde8,
> > > offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at
> > > ../hw/9pfs/virtio-9p-device.c: 146
> > > 
> > >         s = 0x55a93717b4b8
> > >         v = 0x55a93717aee0
> > >         elem = 0x0
> > 
> > So this is v->elems[pdu->idx]... what's the value of pdu->idx ?
> 
> In that originally posted core dump it was 113:
> 
> #0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7, fmt=0x55a9352766d1 
> "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
> 146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, 
> fmt, ap);
> [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> (gdb) p pdu->idx
> $1 = 113
> 

Ok, so it is a valid index (not over MAX_REQ). So it would mean
that someone cleared the slot behind our back?

> > >         ret = <optimized out>
> > > 
> > > #1  0x000055a934bf35e8 in pdu_unmarshal (pdu=pdu@entry=0x55a93717cde8,
> > > offset=offset@entry=7, fmt=fmt@entry=0x55a9352766d1 "ddw") at
> > > ../hw/9pfs/9p.c: 71
> > > 
> > >         ret = <optimized out>
> > >         ap = {{gp_offset = 24, fp_offset = 48, overflow_arg_area =
> > > 
> > > 0x7f38a9ad9db0, reg_save_area = 0x7f38a9ad9cf0}}
> > > #2  0x000055a934bf68db in v9fs_walk (opaque=0x55a93717cde8) at ../hw/9pfs/
> > > 9p.c:1720
> > > 
> > >         name_idx = <optimized out>
> > >         qids = 0x0
> > >         i = <optimized out>
> > >         err = 0
> > >         dpath = {size = 0, data = 0x0}
> > >         path = {size = 0, data = 0x0}
> > >         pathes = 0x0
> > >         nwnames = 1
> > >         stbuf =
> > >         
> > >             {st_dev = 2050, st_ino = 1199848, st_nlink = 1, st_mode =
> > >             41471,
> > > 
> > > st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 13, st_blksize
> > > =
> > > 4096, st_blocks = 16, s}
> > > 
> > >         fidst =
> > >         
> > >             {st_dev = 2050, st_ino = 1198183, st_nlink = 3, st_mode =
> > >             16877,
> > > 
> > > st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 12288,
> > > st_blksize = 4096, st_blocks = 32}
> > > 
> > >         stbufs = 0x0
> > >         offset = 7
> > >         fid = 299
> > >         newfid = 687
> > >         wnames = 0x0
> > >         fidp = <optimized out>
> > >         newfidp = 0x0
> > >         pdu = 0x55a93717cde8
> > >         s = 0x55a93717b4b8
> > >         qid = {type = 2 '\002', version = 1556732739, path = 2399697}
> > > 
> > > #3  0x000055a93505760b in coroutine_trampoline (i0=<optimized out>,
> > > i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> 
> 




* Re: 9pfs: Twalk crash
  2021-08-31 17:04     ` Greg Kurz
@ 2021-09-01 12:49       ` Christian Schoenebeck
  2021-09-01 14:21         ` Christian Schoenebeck
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Schoenebeck @ 2021-09-01 12:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Tuesday, 31 August 2021 19:04:54 CEST Greg Kurz wrote:
> > Ok, I made a quick "is same thread" assertion check:
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index 54ee93b71f..bb6ebd16aa 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -28,12 +28,25 @@
> > 
> >  #include "qemu/module.h"
> >  #include "sysemu/qtest.h"
> > 
> > +static void assert_thread(void) {
> > +    static QemuThread thread;
> > +    static bool calledBefore;
> > +    if (!calledBefore) {
> > +        calledBefore = 1;
> > +        qemu_thread_get_self(&thread);
> > +        return;
> > +    }
> > +    assert(qemu_thread_is_self(&thread));
> > +}
> > +
[...]
> > And it triggered, however I am not sure if some of those functions I
> > asserted above are indeed allowed to be executed on a different thread
> > than main thread:
> > 
> > Program terminated with signal SIGABRT, Aborted.
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
> 
> Based in the thread number, it seems that the signal was raised by
> the main event thread...

No, it actually was not the main thread; gdb's "current thread is 1" output is
misleading.

Following the thread id trace, I extended the thread assertion checks over to 
v9fs_walk() as well, like this:

static void coroutine_fn v9fs_walk(void *opaque)
{
    ...
    assert_thread();
    v9fs_co_run_in_worker({
        ...
    });
    assert_thread();
    ...
}

and made sure that the reference thread id being compared against is really the
main thread.

And what happens here is: before v9fs_co_run_in_worker() is entered,
v9fs_walk() runs on the main thread, but after returning from
v9fs_co_run_in_worker() it runs on a different thread for some reason, not on
the main thread as would be expected at that point.

This is my test patch:

----

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 2815257f42..fa706c10e6 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -1717,6 +1717,8 @@ static void coroutine_fn v9fs_walk(void *opaque)
     V9fsState *s = pdu->s;
     V9fsQID qid;
 
+    assert_thread();
+
     err = pdu_unmarshal(pdu, offset, "ddw", &fid, &newfid, &nwnames);
     if (err < 0) {
         pdu_complete(pdu, err);
@@ -1724,10 +1726,15 @@ static void coroutine_fn v9fs_walk(void *opaque)
     }
     offset += err;
 
+    assert_thread();
+
     trace_v9fs_walk(pdu->tag, pdu->id, fid, newfid, nwnames);
 
+    assert_thread();
+
     if (nwnames > P9_MAXWELEM) {
         err = -EINVAL;
+        assert_thread();
         goto out_nofid;
     }
     if (nwnames) {
@@ -1738,10 +1745,12 @@ static void coroutine_fn v9fs_walk(void *opaque)
         for (i = 0; i < nwnames; i++) {
             err = pdu_unmarshal(pdu, offset, "s", &wnames[i]);
             if (err < 0) {
+                assert_thread();
                 goto out_nofid;
             }
             if (name_is_illegal(wnames[i].data)) {
                 err = -ENOENT;
+                assert_thread();
                 goto out_nofid;
             }
             offset += err;
@@ -1750,6 +1759,7 @@ static void coroutine_fn v9fs_walk(void *opaque)
     fidp = get_fid(pdu, fid);
     if (fidp == NULL) {
         err = -ENOENT;
+        assert_thread();
         goto out_nofid;
     }
 
@@ -1767,6 +1777,7 @@ static void coroutine_fn v9fs_walk(void *opaque)
      * Twalk client request) as small as possible, run all the required fs
      * driver code altogether inside the following block.
      */
+    assert_thread();
     v9fs_co_run_in_worker({
         if (v9fs_request_cancelled(pdu)) {
             err = -EINTR;
@@ -1807,6 +1818,7 @@ static void coroutine_fn v9fs_walk(void *opaque)
             }
         }
     });
+    assert_thread();
     /*
      * Handle all the rest of this Twalk request on main thread ...
      */
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 54ee93b71f..abf8a33b3b 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -27,6 +27,21 @@
 #include "qemu/iov.h"
 #include "qemu/module.h"
 #include "sysemu/qtest.h"
+#include <sys/types.h>
+#include <sys/syscall.h>
+
+static pid_t main_thread_id = -1;
+
+void assert_thread(void) {
+    assert(main_thread_id != -1);
+    pid_t tid = syscall(__NR_gettid);
+    if (tid != main_thread_id) {
+        printf("Unexpected thread id %d (should be %d) - process id is %d\n",
+               tid, main_thread_id, getpid());
+        fflush(stdout);
+    }
+    assert(tid == main_thread_id);
+}
 
 static void virtio_9p_push_and_notify(V9fsPDU *pdu)
 {
@@ -34,6 +49,8 @@ static void virtio_9p_push_and_notify(V9fsPDU *pdu)
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
     VirtQueueElement *elem = v->elems[pdu->idx];
 
+    assert_thread();
+
     /* push onto queue and notify */
     virtqueue_push(v->vq, elem, pdu->size);
     g_free(elem);
@@ -51,6 +68,8 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
     ssize_t len;
     VirtQueueElement *elem;
 
+    assert_thread();
+
     while ((pdu = pdu_alloc(s))) {
         P9MsgHeader out;
 
@@ -125,6 +144,8 @@ static ssize_t virtio_pdu_vmarshal(V9fsPDU *pdu, size_t offset,
     VirtQueueElement *elem = v->elems[pdu->idx];
     ssize_t ret;
 
+    assert_thread();
+
     ret = v9fs_iov_vmarshal(elem->in_sg, elem->in_num, offset, 1, fmt, ap);
     if (ret < 0) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
@@ -143,6 +164,8 @@ static ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t offset,
     VirtQueueElement *elem = v->elems[pdu->idx];
     ssize_t ret;
 
+    assert_thread();
+
     ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);
     if (ret < 0) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
@@ -160,6 +183,8 @@ static void virtio_init_in_iov_from_pdu(V9fsPDU *pdu, struct iovec **piov,
     VirtQueueElement *elem = v->elems[pdu->idx];
     size_t buf_size = iov_size(elem->in_sg, elem->in_num);
 
+    assert_thread();
+
     if (buf_size < size) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
 
@@ -180,6 +205,8 @@ static void virtio_init_out_iov_from_pdu(V9fsPDU *pdu, struct iovec **piov,
     VirtQueueElement *elem = v->elems[pdu->idx];
     size_t buf_size = iov_size(elem->out_sg, elem->out_num);
 
+    assert_thread();
+
     if (buf_size < size) {
         VirtIODevice *vdev = VIRTIO_DEVICE(v);
 
@@ -251,6 +278,10 @@ static Property virtio_9p_properties[] = {
 
 static void virtio_9p_class_init(ObjectClass *klass, void *data)
 {
+    main_thread_id = syscall(__NR_gettid);
+    printf("Main thread ID is %d\n", main_thread_id);
+    fflush(stdout);
+
     DeviceClass *dc = DEVICE_CLASS(klass);
     VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
 
diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index 20fa118f3a..eb7d1293cb 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -17,4 +17,6 @@ struct V9fsVirtioState {
 #define TYPE_VIRTIO_9P "virtio-9p-device"
 OBJECT_DECLARE_SIMPLE_TYPE(V9fsVirtioState, VIRTIO_9P)
 
+void assert_thread(void);
+
 #endif


----

Output:

Unexpected thread id 10780 (should be 10754) - process id is 10754
qemu-system-x86_64: ../hw/9pfs/virtio-9p-device.c:43: assert_thread: Assertion `tid == main_thread_id' failed.
Aborted (core dumped)

So the thread ID 10754 captured at process startup matches the process ID,
hence 10754 was the main thread, and hence the assertion was triggered on a
thread that was not the main thread.

Backtrace:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fb4637fe700 (LWP 10780))]
(gdb) bt
#0  0x00007fb7ac6d87bb in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fb7ac6c3535 in __GI_abort () at abort.c:79
#2  0x00007fb7ac6c340f in __assert_fail_base (fmt=0x7fb7ac825ee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x562e6126072f "tid == main_thread_id", file=0x562e612606fc "../hw/2
#3  0x00007fb7ac6d1102 in __GI___assert_fail (assertion=assertion@entry=0x562e6126072f "tid == main_thread_id", file=file@entry=0x562e612606fc "../hw/9pfs/virtio-9p-device.c", line=line@e1
#4  0x0000562e60ef3965 in assert_thread () at ../hw/9pfs/virtio-9p-device.c:43
#5  0x0000562e60ef3965 in assert_thread () at ../hw/9pfs/virtio-9p-device.c:35
#6  0x0000562e60cef125 in v9fs_walk (opaque=0x562e6405c288) at ../hw/9pfs/9p.c:1821
#7  0x0000562e6114f79b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#8  0x00007fb7ac6edb50 in __correctly_grouped_prefixwc (begin=0x2 <error: Cannot access memory at address 0x2>, end=0x0, thousands=0 L'\000', grouping=0x7fb7ac6d87bb <__GI_raise+267> "H\213\214$\b\1
#9  0x0000000000000000 in  ()

And 9p.c:1821 is exactly:

    assert_thread();
    v9fs_co_run_in_worker({
        if (v9fs_request_cancelled(pdu)) {
            err = -EINTR;
            break;
        }
        err = s->ops->lstat(&s->ctx, &dpath, &fidst);
        if (err < 0) {
            err = -errno;
            break;
        }
        stbuf = fidst;
        for (name_idx = 0; name_idx < nwnames; name_idx++) {
            if (v9fs_request_cancelled(pdu)) {
                err = -EINTR;
                break;
            }
            if (!same_stat_id(&pdu->s->root_st, &stbuf) ||
                strcmp("..", wnames[name_idx].data))
            {
                err = s->ops->name_to_path(&s->ctx, &dpath,
                                        wnames[name_idx].data, &path);
                if (err < 0) {
                    err = -errno;
                    break;
                }
                if (v9fs_request_cancelled(pdu)) {
                    err = -EINTR;
                    break;
                }
                err = s->ops->lstat(&s->ctx, &path, &stbuf);
                if (err < 0) {
                    err = -errno;
                    break;
                }
                stbufs[name_idx] = stbuf;
                v9fs_path_copy(&dpath, &path);
                v9fs_path_copy(&pathes[name_idx], &path);
            }
        }
    });
    assert_thread(); //<--- HERE

> > > > [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> > > > (gdb) bt full
> > > > #0  0x000055a934dfb9a7 in virtio_pdu_vunmarshal (pdu=0x55a93717cde8,
> > > > offset=7, fmt=0x55a9352766d1 "ddw", ap=0x7f38a9ad9cd0) at
> > > > ../hw/9pfs/virtio-9p-device.c: 146
> > > > 
> > > >         s = 0x55a93717b4b8
> > > >         v = 0x55a93717aee0
> > > >         elem = 0x0
> > > 
> > > So this is v->elems[pdu->idx]... what's the value of pdu->idx ?
> > 
> > In that originally posted core dump it was 113:
> > 
> > #0  virtio_pdu_vunmarshal (pdu=0x55a93717cde8, offset=7,
> > fmt=0x55a9352766d1
> > "ddw", ap=0x7f38a9ad9cd0) at ../hw/9pfs/virtio-9p-device.c:146
> > 146         ret = v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset,
> > 1, fmt, ap);
> > [Current thread is 1 (Thread 0x7f3bddd2ac40 (LWP 7811))]
> > (gdb) p pdu->idx
> > $1 = 113
> 
> Ok, so it is a valid index (not over MAX_REQ). So it would mean
> that someone cleared the slot in our back ?

Yes, the slot is within valid boundaries. Assertion checks for that in
virtio-9p-device.c wouldn't hurt in the future, though.
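
Something along these lines, for instance (untested sketch, reusing only
names that already exist in virtio-9p-device.c; MAX_REQ bounds the elems
array):

static VirtQueueElement *virtio_9p_get_elem(V9fsVirtioState *v, V9fsPDU *pdu)
{
    VirtQueueElement *elem;

    assert(pdu->idx < MAX_REQ);    /* slot index within valid boundaries */
    elem = v->elems[pdu->idx];
    assert(elem != NULL);          /* slot not cleared behind our back */
    return elem;
}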

Best regards,
Christian Schoenebeck





* Re: 9pfs: Twalk crash
  2021-09-01 12:49       ` Christian Schoenebeck
@ 2021-09-01 14:21         ` Christian Schoenebeck
  2021-09-01 15:41           ` Greg Kurz
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Schoenebeck @ 2021-09-01 14:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Wednesday, 1 September 2021 14:49:37 CEST Christian Schoenebeck wrote:
> > > And it triggered, however I am not sure if some of those functions I
> > > asserted above are indeed allowed to be executed on a different thread
> > > than main thread:
> > > 
> > > Program terminated with signal SIGABRT, Aborted.
> > > #0  __GI_raise (sig=sig@entry=6) at
> > > ../sysdeps/unix/sysv/linux/raise.c:50
> > > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > > [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
> > 
> > Based in the thread number, it seems that the signal was raised by
> > the main event thread...
> 
> No, it was not main thread actually, gdb's "current thread is 1" output is
> misleading.
> 
> Following the thread id trace, I extended the thread assertion checks over
> to v9fs_walk() as well, like this:
> 
> static void coroutine_fn v9fs_walk(void *opaque)
> {
>     ...
>     assert_thread();
>     v9fs_co_run_in_worker({
>         ...
>     });
>     assert_thread();
>     ...
> }
> 
> and made sure the reference thread id to be compared is really the main
> thread.
> 
> And what happens here is before v9fs_co_run_in_worker() is entered,
> v9fs_walk() runs on main thread, but after returning from
> v9fs_co_run_in_worker() it runs on a different thread for some reason, not
> on main thread as it would be expected at that point.

Ok, I think I found the root cause: the block is break;-ing out too far. The 
following patch should fix it:

diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
index c51289903d..f83c7dda7b 100644
--- a/hw/9pfs/coth.h
+++ b/hw/9pfs/coth.h
@@ -51,7 +51,9 @@
          */                                                             \
         qemu_coroutine_yield();                                         \
         qemu_bh_delete(co_bh);                                          \
-        code_block;                                                     \
+        do {                                                            \
+            code_block;                                                 \
+        } while (0);                                                    \
         /* re-enter back to qemu thread */                              \
         qemu_coroutine_yield();                                         \
     } while (0)
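
To illustrate the pitfall outside of QEMU: without the inner loop, a break;
inside the caller's code block escapes into the macro's own do { ... } while (0)
and skips everything after code_block; in coth.h that is exactly the final
qemu_coroutine_yield() which re-enters the main thread. A minimal standalone
demo (my own sketch, not QEMU code):

#include <stdio.h>

#define RUN_BROKEN(code_block) do {            \
        code_block;                            \
        printf("hop back to main thread\n");   \
    } while (0)

#define RUN_FIXED(code_block) do {             \
        do { code_block; } while (0);          \
        printf("hop back to main thread\n");   \
    } while (0)

int main(void)
{
    /* break escapes into the macro's loop: "hop back" is skipped */
    RUN_BROKEN({ printf("request cancelled\n"); break; });
    /* inner do/while absorbs the break: "hop back" is printed */
    RUN_FIXED({ printf("request cancelled\n"); break; });
    return 0;
}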

I haven't triggered a crash with that patch, but due to the occasional nature 
of this issue I'll give it some more spins before officially proclaiming it my 
bug. :)

Best regards,
Christian Schoenebeck





* Re: 9pfs: Twalk crash
  2021-09-01 14:21         ` Christian Schoenebeck
@ 2021-09-01 15:41           ` Greg Kurz
  2021-09-01 16:07             ` Christian Schoenebeck
  0 siblings, 1 reply; 9+ messages in thread
From: Greg Kurz @ 2021-09-01 15:41 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Wed, 01 Sep 2021 16:21:06 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 1. September 2021 14:49:37 CEST Christian Schoenebeck wrote:
> > > > And it triggered, however I am not sure if some of those functions I
> > > > asserted above are indeed allowed to be executed on a different thread
> > > > than main thread:
> > > > 
> > > > Program terminated with signal SIGABRT, Aborted.
> > > > #0  __GI_raise (sig=sig@entry=6) at
> > > > ../sysdeps/unix/sysv/linux/raise.c:50
> > > > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > > > [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
> > > 
> > > Based in the thread number, it seems that the signal was raised by
> > > the main event thread...
> > 
> > No, it was not main thread actually, gdb's "current thread is 1" output is
> > misleading.
> > 
> > Following the thread id trace, I extended the thread assertion checks over
> > to v9fs_walk() as well, like this:
> > 
> > static void coroutine_fn v9fs_walk(void *opaque)
> > {
> >     ...
> >     assert_thread();
> >     v9fs_co_run_in_worker({
> >         ...
> >     });
> >     assert_thread();
> >     ...
> > }
> > 
> > and made sure the reference thread id to be compared is really the main
> > thread.
> > 
> > And what happens here is before v9fs_co_run_in_worker() is entered,
> > v9fs_walk() runs on main thread, but after returning from
> > v9fs_co_run_in_worker() it runs on a different thread for some reason, not
> > on main thread as it would be expected at that point.
> 
> Ok, I think I found the root cause: the block is break;-ing out too far. The 

That could explain the breakage indeed, since the block you've added
to v9fs_walk() embeds a bunch of break statements. AFAICT this block
breaks on errors... do you know which one?

> following patch should fix it:
> 
> diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> index c51289903d..f83c7dda7b 100644
> --- a/hw/9pfs/coth.h
> +++ b/hw/9pfs/coth.h
> @@ -51,7 +51,9 @@
>           */                                                             \
>          qemu_coroutine_yield();                                         \
>          qemu_bh_delete(co_bh);                                          \
> -        code_block;                                                     \
> +        do {                                                            \
> +            code_block;                                                 \
> +        } while (0);                                                    \

Good.

>          /* re-enter back to qemu thread */                              \
>          qemu_coroutine_yield();                                         \
>      } while (0)
> 
> I haven't triggered a crash with that patch, but due to the occasional nature 
> of this issue I'll give it some more spins before officially proclaiming it my 
> bug. :)

Well, this is a pre-existing limitation of v9fs_co_run_in_worker().
It wasn't documented as such and was not really obvious to detect when
you optimized TWALK. We've never hit it before because the other
v9fs_co_run_in_worker() users don't have break statements.

But, indeed, this caused a regression in 6.1, so this will need a Fixes:
tag and Cc: qemu-stable.
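
I.e. something like this in the commit message (abbreviated hash and subject
taken from the first mail of this thread; qemu-stable address from memory):

    Fixes: 8d6cb100731c ("9pfs: reduce latency of Twalk")
    Cc: qemu-stable@nongnu.org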

> 
> Best regards,
> Christian Schoenebeck
> 
> 




* Re: 9pfs: Twalk crash
  2021-09-01 15:41           ` Greg Kurz
@ 2021-09-01 16:07             ` Christian Schoenebeck
  2021-09-01 16:31               ` Greg Kurz
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Schoenebeck @ 2021-09-01 16:07 UTC (permalink / raw)
  To: Greg Kurz; +Cc: qemu-devel

On Wednesday, 1 September 2021 17:41:02 CEST Greg Kurz wrote:
> On Wed, 01 Sep 2021 16:21:06 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 1. September 2021 14:49:37 CEST Christian Schoenebeck wrote:
> > > > > And it triggered, however I am not sure if some of those functions I
> > > > > asserted above are indeed allowed to be executed on a different
> > > > > thread
> > > > > than main thread:
> > > > > 
> > > > > Program terminated with signal SIGABRT, Aborted.
> > > > > #0  __GI_raise (sig=sig@entry=6) at
> > > > > ../sysdeps/unix/sysv/linux/raise.c:50
> > > > > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or
> > > > > directory.
> > > > > [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
> > > > 
> > > > Based in the thread number, it seems that the signal was raised by
> > > > the main event thread...
> > > 
> > > No, it was not main thread actually, gdb's "current thread is 1" output
> > > is
> > > misleading.
> > > 
> > > Following the thread id trace, I extended the thread assertion checks
> > > over
> > > to v9fs_walk() as well, like this:
> > > 
> > > static void coroutine_fn v9fs_walk(void *opaque)
> > > {
> > > 
> > >     ...
> > >     assert_thread();
> > >     v9fs_co_run_in_worker({
> > >     
> > >         ...
> > >     
> > >     });
> > >     assert_thread();
> > >     ...
> > > 
> > > }
> > > 
> > > and made sure the reference thread id to be compared is really the main
> > > thread.
> > > 
> > > And what happens here is before v9fs_co_run_in_worker() is entered,
> > > v9fs_walk() runs on main thread, but after returning from
> > > v9fs_co_run_in_worker() it runs on a different thread for some reason,
> > > not
> > > on main thread as it would be expected at that point.
> > 
> > Ok, I think I found the root cause: the block is break;-ing out too far.
> > The
> That could explain the breakage indeed since the block you've added
> to v9fs_walk() embeds a bunch of break statements. AFAICT this block
> breaks on errors... do you know which one ?

Yes, I've verified that. In my case an interrupt of Twalk triggered this bug,
so it was exactly this path:

    v9fs_co_run_in_worker({
        if (v9fs_request_cancelled(pdu)) {
            ...
            break;
        }
        ...
    });

So it was really this break;-ing out too far that was the root cause of the
crash: the break in the cancellation path skipped the macro's final
qemu_coroutine_yield(), so the coroutine never hopped back to the main thread.

> > following patch should fix it:
> > 
> > diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> > index c51289903d..f83c7dda7b 100644
> > --- a/hw/9pfs/coth.h
> > +++ b/hw/9pfs/coth.h
> > @@ -51,7 +51,9 @@
> > 
> >           */                                                             \
> >          
> >          qemu_coroutine_yield();                                         \
> >          qemu_bh_delete(co_bh);                                          \
> > 
> > -        code_block;                                                     \
> > +        do {                                                            \
> > +            code_block;                                                 \
> > +        } while (0);                                                    \
> 
> Good.
> 
> >          /* re-enter back to qemu thread */                              \
> >          qemu_coroutine_yield();                                         \
> >      
> >      } while (0)
> > 
> > I haven't triggered a crash with that patch, but due to the occasional
> > nature of this issue I'll give it some more spins before officially
> > proclaiming it my bug. :)
> 
> Well, this is a pre-existing limitation with v9fs_co_run_in_worker().
> This wasn't documented as such and not really obvious to detect when
> you optimized TWALK. We've never hit it before because the other
> v9fs_co_run_in_worker() users don't have break statements.

Yes, I know, this was my bad.

> But, indeed, this caused a regression in 6.1 so this will need a Fixes:
> tag and Cc: qemu-stable.

Yep, I'm preparing a patch now.

Best regards,
Christian Schoenebeck





* Re: 9pfs: Twalk crash
  2021-09-01 16:07             ` Christian Schoenebeck
@ 2021-09-01 16:31               ` Greg Kurz
  0 siblings, 0 replies; 9+ messages in thread
From: Greg Kurz @ 2021-09-01 16:31 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Wed, 01 Sep 2021 18:07:39 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 1. September 2021 17:41:02 CEST Greg Kurz wrote:
> > On Wed, 01 Sep 2021 16:21:06 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Mittwoch, 1. September 2021 14:49:37 CEST Christian Schoenebeck wrote:
> > > > > > And it triggered, however I am not sure if some of those functions I
> > > > > > asserted above are indeed allowed to be executed on a different
> > > > > > thread
> > > > > > than main thread:
> > > > > > 
> > > > > > Program terminated with signal SIGABRT, Aborted.
> > > > > > #0  __GI_raise (sig=sig@entry=6) at
> > > > > > ../sysdeps/unix/sysv/linux/raise.c:50
> > > > > > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or
> > > > > > directory.
> > > > > > [Current thread is 1 (Thread 0x7fd0bcef1700 (LWP 6470))]
> > > > > 
> > > > > Based in the thread number, it seems that the signal was raised by
> > > > > the main event thread...
> > > > 
> > > > No, it was not main thread actually, gdb's "current thread is 1" output
> > > > is
> > > > misleading.
> > > > 
> > > > Following the thread id trace, I extended the thread assertion checks
> > > > over
> > > > to v9fs_walk() as well, like this:
> > > > 
> > > > static void coroutine_fn v9fs_walk(void *opaque)
> > > > {
> > > > 
> > > >     ...
> > > >     assert_thread();
> > > >     v9fs_co_run_in_worker({
> > > >     
> > > >         ...
> > > >     
> > > >     });
> > > >     assert_thread();
> > > >     ...
> > > > 
> > > > }
> > > > 
> > > > and made sure the reference thread id to be compared is really the main
> > > > thread.
> > > > 
> > > > And what happens here is before v9fs_co_run_in_worker() is entered,
> > > > v9fs_walk() runs on main thread, but after returning from
> > > > v9fs_co_run_in_worker() it runs on a different thread for some reason,
> > > > not
> > > > on main thread as it would be expected at that point.
> > > 
> > > Ok, I think I found the root cause: the block is break;-ing out too far.
> > > The
> > That could explain the breakage indeed since the block you've added
> > to v9fs_walk() embeds a bunch of break statements. AFAICT this block
> > breaks on errors... do you know which one ?
> 
> Yes, I've verified that. In my case an interrupt of Twalk triggered this bug. 
> so it was this path exactly:
> 
>     v9fs_co_run_in_worker({
>         if (v9fs_request_cancelled(pdu)) {
>             ...
>             break;
>         }
>         ...
>     });
> 
> so it was really this break;-ing too far being the root cause of the crash.
> 
> > > following patch should fix it:
> > > 
> > > diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> > > index c51289903d..f83c7dda7b 100644
> > > --- a/hw/9pfs/coth.h
> > > +++ b/hw/9pfs/coth.h
> > > @@ -51,7 +51,9 @@
> > > 
> > >           */                                                             \
> > >          
> > >          qemu_coroutine_yield();                                         \
> > >          qemu_bh_delete(co_bh);                                          \
> > > 
> > > -        code_block;                                                     \
> > > +        do {                                                            \
> > > +            code_block;                                                 \
> > > +        } while (0);                                                    \
> > 
> > Good.
> > 
> > >          /* re-enter back to qemu thread */                              \
> > >          qemu_coroutine_yield();                                         \
> > >      
> > >      } while (0)
> > > 
> > > I haven't triggered a crash with that patch, but due to the occasional
> > > nature of this issue I'll give it some more spins before officially
> > > proclaiming it my bug. :)
> > 
> > Well, this is a pre-existing limitation with v9fs_co_run_in_worker().
> > This wasn't documented as such and not really obvious to detect when
> > you optimized TWALK. We've never hit it before because the other
> > v9fs_co_run_in_worker() users don't have break statements.
> 
> Yes, I know, this was my bad.
> 

No, I mean the opposite actually. You shouldn't feel sorry for having
detected that this macro we're using everywhere has been badly broken from
the beginning... even at the cost of a regression we'll fix shortly :)

> > But, indeed, this caused a regression in 6.1 so this will need a Fixes:
> > tag and Cc: qemu-stable.
> 
> Yep, I'm preparing a patch now.
> 
> Best regards,
> Christian Schoenebeck
> 
> 



