qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Rafael David Tinoco <rafaeldtinoco@kernelpath.com>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
Date: Tue, 10 Sep 2019 14:16:48 -0000	[thread overview]
Message-ID: <156812500849.592.15082482903371865706.malone@wampee.canonical.com> (raw)
In-Reply-To: 154327283728.15443.11625169757714443608.malonedeb@soybean.canonical.com

Alright, here is what is happening:

Whenever program is stuck, thread #2 backtrace is this:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaaaabd41b0 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./util/qemu-thread-posix.c:438
#2  qemu_event_wait (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:442
#3  0x0000aaaaaabee03c in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:261
#4  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#5  0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#6  0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Meaning that code is waiting for a futex inside kernel.

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

The QemuEvent "rcu_call_ready_event->value" is set to INT_MAX and I
don't know why yet.

rcu_call_ready_event->value is only touched by:

qemu_event_init() -> bool init ? EV_SET : EV_FREE
qemu_event_reset() -> atomic_or(&ev->value, EV_FREE)
qemu_event_set() -> atomic_xchg(&ev->value, EV_SET)
qemu_event_wait() -> atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY)'

And there should be no 0x7fff value for "ev->value".

qemu_event_init() is the one initializing the global:

    static QemuEvent rcu_call_ready_event;

and it is called by "rcu_init_complete()" which is called by
"rcu_init()":

    static void __attribute__((__constructor__)) rcu_init(void)

a constructor function.

So, "fixing" this issue by:

    (gdb) print rcu_call_ready_event
    $8 = {value = 4294967295, initialized = true}
    
    (gdb) watch rcu_call_ready_event
    Hardware watchpoint 1: rcu_call_ready_event
    
    (gdb) set rcu_call_ready_event.initialized = 1
    
    (gdb) set rcu_call_ready_event.value = 0

and note that I added a watchpoint to rcu_call_ready_event global:

<HANG>

Thread 1 "qemu-img" received signal SIGINT, Interrupt.
(gdb) thread 2
[Switching to thread 2 (Thread 0xffffbec61d90 (LWP 33625))]

(gdb) bt
#0  0x0000aaaaaabd4110 in qemu_event_reset (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>)
#1  0x0000aaaaaabedff8 in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:255
#2  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#3  0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#4  0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) print rcu_call_ready_event
$9 = {value = 0, initialized = true}

You can see I advanced in the qemu_event_{reset,set,wait} logic.

    (gdb) disassemble /m 0x0000aaaaaabd4110
    Dump of assembler code for function qemu_event_reset:
    408     in ./util/qemu-thread-posix.c

    409     in ./util/qemu-thread-posix.c

    410     in ./util/qemu-thread-posix.c
    411     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40f0 <+0>:     ldrb    w1, [x0, #4]
       0x0000aaaaaabd40f4 <+4>:     cbz     w1, 0xaaaaaabd411c <qemu_event_reset+44>
       0x0000aaaaaabd411c <+44>:    stp     x29, x30, [sp, #-16]!
       0x0000aaaaaabd4120 <+48>:    adrp    x3, 0xaaaaaac20000
       0x0000aaaaaabd4124 <+52>:    add     x3, x3, #0x908
       0x0000aaaaaabd4128 <+56>:    mov     x29, sp
       0x0000aaaaaabd412c <+60>:    adrp    x1, 0xaaaaaac20000
       0x0000aaaaaabd4130 <+64>:    adrp    x0, 0xaaaaaac20000
       0x0000aaaaaabd4134 <+68>:    add     x3, x3, #0x290
       0x0000aaaaaabd4138 <+72>:    add     x1, x1, #0xc00
       0x0000aaaaaabd413c <+76>:    add     x0, x0, #0xd40
       0x0000aaaaaabd4140 <+80>:    mov     w2, #0x19b    // #411
       0x0000aaaaaabd4144 <+84>:    bl      0xaaaaaaaff190 <__assert_fail@plt>

    412     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40f8 <+8>:     ldr     w1, [x0]

    413     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40fc <+12>:    dmb     ishld

    414     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd4100 <+16>:    cbz     w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>:    ret
       0x0000aaaaaabd4108 <+24>:    ldaxr   w1, [x0]
       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>:    stlxr   w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>:    cbnz    w2, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4118 <+40>:    ret

And I'm currently inside the STLXR and LDAXR logic. To make sure my program counter is advancing, I added a breakpoint at 0x0000aaaaaabd4108, so CBNZ instruction would branch indefinitely into LDXAR instruction again, until the
LDAXR<->STLXR logic is satisfied (inside qemu_event_wait()).

(gdb) break *(0x0000aaaaaabd4108)
Breakpoint 2 at 0xaaaaaabd4108: file ./util/qemu-thread-posix.c, line 414.

which is basically this:

    if (value == EV_SET) {                        EV_SET == 0
        atomic_or(&ev->value, EV_FREE);           EV_FREE = 1
    }
    
and we can see that this logic being called one time after another:

(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414

(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414
    
(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414

EVEN when rcu_call_ready_event->value is already EV_SET (0):

(gdb) print rcu_call_ready_event
$11 = {value = 0, initialized = true}

(gdb) info break
Num     Type           Disp Enb Address            What
1       hw watchpoint  keep y                      rcu_call_ready_event
3       breakpoint     keep n   0x0000aaaaaabd4108 qemu-thread-posix.c:414
        breakpoint already hit 23 times
4       breakpoint     keep y   0x0000aaaaaabd4148 qemu-thread-posix.c:424

IF I enable only rcu_call_ready_event HW watchpoint, nothing is
triggered.

(gdb) watch *(rcu_call_ready_event->value)
Hardware watchpoint 6: *(rcu_call_ready_event->value)

not if I set it directly to QemuEvent->value...

    assert(ev->initialized);
    value = atomic_read(&ev->value);
    smp_mb_acquire();
    if (value == EV_SET) {
        atomic_or(&ev->value, EV_FREE);
    }

meaning that "value" and "ev->value" might have a diff value... is that
so ?

(gdb) print value
$14 = <optimized out>

can't say.. checking registers AND stack:

       0x0000aaaaaabd4100 <+16>:    cbz     w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>:    ret
       0x0000aaaaaabd4108 <+24>:    ldaxr   w1, [x0]
       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>:    stlxr   w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>:    cbnz    w2, 0xaaaaaabd4108 
       

x0             0xaaaaaac87ce8      187649986428136
x1             0x1                 1
x2             0x1                 1
x3             0x0                 0
x4             0xffffbec61e98      281473882398360
x5             0xffffbec61c90      281473882397840
x6             0xffffbec61c90      281473882397840
x7             0x1                 1
x8             0x65                101
x9             0x0                 0
x10            0x0                 0
x11            0x0                 0
x12            0xffffbec61d90      281473882398096
x13            0x0                 0
x14            0x0                 0
x15            0x2                 2
x16            0xffffbf67ccf0      281473892994288
x17            0xffffbf274938      281473888766264
x18            0x23f               575
x19            0x0                 0
x20            0xaaaaaac87ce8      187649986428136
x21            0x0                 0
x22            0xfffffffff5bf      281474976708031
x23            0xaaaaaac87ce0      187649986428128
x24            0xaaaaaac29000      187649986039808
x25            0xfffffffff658      281474976708184
x26            0x1000              4096
x27            0xffffbf28c000      281473888862208
x28            0xffffbec61d90      281473882398096
x29            0xffffbec61420      281473882395680
x30            0xaaaaaabedff8      187649985798136
sp             0xffffbec61420      0xffffbec61420
pc             0xaaaaaabd4110      0xaaaaaabd4110 <qemu_event_reset+32>
cpsr           0x0                 [ EL=0 ]
fpsr           0x0                 0
fpcr           0x0                 0

AND the ORR instruction is ALWAYS being executed against 0x1 (not 0x0,
which is what I just changed by changing .value):

(gdb) print value
$14 = <optimized out>

       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
       
#0x1 is being used instead of contents of "value" local variable (volatile).

I'll recompile QEMU flagging all those local "unsigned value" variables
as being volatile and check if optimization changes. Or even try to
disable optimizations.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


  parent reply	other threads:[~2019-09-10 14:26 UTC|newest]

Thread overview: 141+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
2018-11-26 23:54 ` dann frazier
2018-12-05 11:20 ` Alex Bennée
2019-04-15 12:59 ` 贞贵李
2019-04-15 12:59   ` 贞贵李
2019-04-15 14:37 ` 贞贵李
2019-04-15 14:37   ` 贞贵李
2019-04-15 22:25 ` dann frazier
2019-04-15 22:25   ` dann frazier
2019-04-15 23:37 ` dann frazier
2019-04-15 23:37   ` dann frazier
2019-04-16  8:16 ` 贞贵李
2019-04-16  8:16   ` 贞贵李
2019-04-16 13:32 ` 贞贵李
2019-04-16 13:32   ` 贞贵李
2019-04-23  1:29 ` 贞贵李
2019-04-23  1:29   ` 贞贵李
2019-06-05 16:16 ` dann frazier
2019-09-05 15:03 ` Rafael David Tinoco
2019-09-06 15:12 ` Rafael David Tinoco
2019-09-06 15:16 ` Rafael David Tinoco
2019-09-06 21:22 ` Rafael David Tinoco
2019-09-09 16:47 ` Rafael David Tinoco
2019-09-10  2:04 ` Rafael David Tinoco
2019-09-10 14:16 ` Rafael David Tinoco [this message]
2019-09-10 18:15 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-10 22:56 ` Rafael David Tinoco
2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
2019-09-11  2:15   ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11  7:05   ` [Qemu-devel] " Rafael David Tinoco
2019-09-11  7:05     ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
2019-09-11 14:48     ` Rafael David Tinoco
2019-09-11 19:09     ` Rafael David Tinoco
2019-09-11 19:09       ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-24 20:25       ` [Qemu-devel] " dann frazier
2019-09-24 20:25         ` [Bug 1805256] " dann frazier
2019-10-02  9:23         ` Jan Glauber
2019-10-02  9:23           ` Jan Glauber
2019-10-02  9:45           ` Paolo Bonzini
2019-10-02 11:05             ` Jan Glauber
2019-10-02 11:05               ` [Bug 1805256] " Jan Glauber
2019-10-02 13:20               ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
2019-10-02 14:58                 ` Torvald Riegel
2019-10-02 16:30                   ` Paolo Bonzini
2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
2019-10-07 14:36             ` Jan Glauber
2019-10-07 14:36               ` [Bug 1805256] " Jan Glauber
2019-10-07 14:44             ` dann frazier
2019-10-07 14:44               ` [Bug 1805256] " dann frazier
2019-10-07 14:58               ` Paolo Bonzini
2019-10-09  8:02                 ` Jan Glauber
2019-10-09  8:02                   ` [Bug 1805256] " Jan Glauber
2019-10-09  9:15                   ` Paolo Bonzini
2019-10-11  6:05                     ` Jan Glauber
2019-10-11  6:05                       ` [Bug 1805256] " Jan Glauber
2019-10-11  8:18                       ` Paolo Bonzini
2019-10-11  8:30                         ` Jan Glauber
2019-10-11  8:30                           ` [Bug 1805256] " Jan Glauber
2019-10-11 17:55                           ` dann frazier
2019-10-11 17:55                             ` dann frazier
2019-10-12  0:24                             ` [Bug 1805256] " no-reply
2019-10-12  0:49                             ` no-reply
2019-10-11 17:50                       ` dann frazier
2019-10-11 17:50                         ` [Bug 1805256] " dann frazier
2019-09-11  2:17 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-11 11:19 ` Rafael David Tinoco
2019-09-11 19:23 ` Rafael David Tinoco
2019-10-02 11:02 ` Jan Glauber
2019-10-03 12:28 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 21:35 ` dann frazier
2019-12-13 14:24 ` dann frazier
2019-12-17  1:34 ` Fred Kimmy
2019-12-17 19:17 ` dann frazier
2019-12-18  2:40 ` Rafael David Tinoco
2019-12-18  9:52 ` iveskim
2019-12-18 14:52 ` dann frazier
2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
2020-02-13  8:41 ` Ike Panhc
2020-02-13  8:42 ` Andrew Cloke
2020-02-13  9:20 ` Fred Kimmy
2020-04-15  2:47 ` Rafael David Tinoco
2020-05-04  7:24 ` Ike Panhc
2020-05-05  0:54 ` Ike Panhc
2020-05-05  1:22 ` Ying Fang
2020-05-05  6:15 ` Ike Panhc
2020-05-05 15:01 ` Ike Panhc
2020-05-05 18:48 ` Rafael David Tinoco
2020-05-05 23:55 ` dann frazier
2020-05-06 13:08 ` Rafael David Tinoco
2020-05-06 13:23 ` Rafael David Tinoco
2020-05-06 15:45 ` Ike Panhc
2020-05-06 16:42 ` dann frazier
2020-05-06 19:04 ` Launchpad Bug Tracker
2020-05-06 19:09 ` Philippe Mathieu-Daudé
2020-05-06 19:57   ` dann frazier
2020-05-06 20:11 ` Rafael David Tinoco
2020-05-06 21:10 ` Launchpad Bug Tracker
2020-05-06 21:44 ` Launchpad Bug Tracker
2020-05-07  3:37 ` Launchpad Bug Tracker
2020-05-07  7:00 ` Ike Panhc
2020-05-07 22:27 ` dann frazier
2020-05-14  8:05 ` Andrew Cloke
2020-05-27  4:55 ` Christian Ehrhardt 
2020-05-28 14:58 ` Christian Ehrhardt 
2020-05-29  7:55 ` Launchpad Bug Tracker
2020-05-29  8:01 ` Christian Ehrhardt 
2020-06-02 22:45 ` Brian Murray
2020-06-02 22:49 ` [Bug 1805256] Please test proposed package Brian Murray
2020-06-02 22:54 ` Brian Murray
2020-06-03  4:09 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7) Ubuntu SRU Bot
2020-06-03  6:35 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Ike Panhc
2020-06-03  8:40 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2) Ubuntu SRU Bot
2020-06-05  3:51 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Christian Ehrhardt 
2020-06-11  8:04 ` Andrew Cloke
2020-06-17  5:16 ` Christian Ehrhardt 
2020-06-18  9:23 ` Launchpad Bug Tracker
2020-06-18  9:23 ` [Bug 1805256] Update Released Łukasz Zemczak
2020-06-18  9:38 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Launchpad Bug Tracker
2020-06-18  9:39 ` Launchpad Bug Tracker
2020-06-18 10:27 ` Andrew Cloke
2020-06-30  6:54 ` Christian Ehrhardt 
2020-07-01  7:01 ` Ike Panhc
2020-07-12 13:16 ` Rafael David Tinoco
2020-07-13  3:59 ` Launchpad Bug Tracker
2020-07-13  4:12 ` Rafael David Tinoco
2020-07-15 15:31 ` dann frazier
2020-07-20 12:22 ` Rafael David Tinoco
2020-07-21 20:02 ` Rafael David Tinoco
2020-07-21 20:03 ` Rafael David Tinoco
2020-07-31 18:51 ` Rafael David Tinoco
2020-07-31 21:42 ` Rafael David Tinoco
2020-08-07  9:53 ` Timo Aaltonen
2020-08-07 14:41 ` [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30) Ubuntu SRU Bot
2020-08-07 20:13 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images dann frazier
2020-08-14 19:49 ` dann frazier
2020-08-19 16:36 ` Launchpad Bug Tracker
2020-08-19 17:16 ` Andrew Cloke

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=156812500849.592.15082482903371865706.malone@wampee.canonical.com \
    --to=rafaeldtinoco@kernelpath.com \
    --cc=1805256@bugs.launchpad.net \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).