From: Fam Zheng <fam@euphon.net>
To: Zhenyu Ye <yezhenyu2@huawei.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	xiexiangyou@huawei.com, armbru@redhat.com,
	Stefan Hajnoczi <stefanha@redhat.com>,
	pbonzini@redhat.com, mreitz@redhat.com
Subject: Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions
Date: Thu, 17 Sep 2020 10:10:56 +0000
Message-ID: <20200917101056.GA2423524@dev>
In-Reply-To: <7ad220bd-7ee5-1f66-b2e5-7dc57d72eb2e@huawei.com>

On 2020-09-17 15:36, Zhenyu Ye wrote:
> Hi Stefan,
> 
> On 2020/9/14 21:27, Stefan Hajnoczi wrote:
> >>
> >> Theoretically, everything running in an iothread is asynchronous. However,
> >> some 'asynchronous' actions are not entirely non-blocking, such as
> >> io_submit(), which can block when the iodepth is too big and the I/O
> >> pressure is too high.  If we issue QMP commands such as 'info block' at
> >> this time, they may cause a VM soft lockup.  This series makes these QMP
> >> commands safer.
> >>
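
Patch 1/2 in this series introduces aio_context_acquire_timeout() for
exactly this situation. As a minimal sketch of the idea in plain
pthreads (a hypothetical helper, not the actual QEMU implementation),
a timed lock acquisition could look like:

	#include <pthread.h>
	#include <stdbool.h>
	#include <time.h>

	/* Hypothetical sketch: try to take a lock, but give up after
	 * timeout_ms instead of blocking the monitor indefinitely. */
	static bool lock_with_timeout(pthread_mutex_t *lock, long timeout_ms)
	{
	    struct timespec deadline;

	    /* pthread_mutex_timedlock() expects an absolute CLOCK_REALTIME
	     * deadline, so compute "now + timeout_ms". */
	    clock_gettime(CLOCK_REALTIME, &deadline);
	    deadline.tv_sec  += timeout_ms / 1000;
	    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	    if (deadline.tv_nsec >= 1000000000L) {
	        deadline.tv_sec++;
	        deadline.tv_nsec -= 1000000000L;
	    }
	    /* Returns ETIMEDOUT if the lock is still contended. */
	    return pthread_mutex_timedlock(lock, &deadline) == 0;
	}

On timeout, the QMP command could then fail with an error instead of
wedging the main loop behind a stuck iothread.
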
> >> I reproduced the scenario as follows:
> >> 1. Create a VM with 4 disks, using an iothread.
> >> 2. Put load on the host CPU.  In my case, CPU usage exceeded 95%.
> >> 3. Stress all 4 disks in the VM at the same time.  I used fio with
> >> these parameters:
> >>
> >> 	 fio -rw=randrw -bs=1M -size=1G -iodepth=512 -ioengine=libaio -numjobs=4
> >>
> >> 4. Run block query commands, for example via virsh:
> >>
> >> 	virsh qemu-monitor-command [vm name] --hmp info block
> >>
> >> The VM then soft-locks up; the call trace is:
> >>
> >> [  192.311393] watchdog: BUG: soft lockup - CPU#1 stuck for 42s! [kworker/1:1:33]
> > 
> > Hi,
> > Sorry I haven't had time to investigate this myself yet.
> > 
> > Do you also have a QEMU backtrace when the hang occurs?
> > 
> > Let's find out if QEMU is stuck in the io_submit(2) syscall or whether
> > there's an issue in QEMU itself that causes the softlockup (for example,
> > aio_poll() with the global mutex held).
> 
> Sorry for the delay.
> 
> I traced io_submit() within the guest. The maximum execution time exceeds 16 seconds:
> 
> 	guest# perf trace -p `pidof -s fio` --duration 1
> 	    14.808 (3843.082 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	  3861.336 (11470.608 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 15341.998 (479.283 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 15831.380 (3704.997 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 19547.869 (3412.839 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 22966.953 (1080.854 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 24062.689 (6939.813 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 31019.219 (532.147 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 31556.610 (3469.920 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 35038.885 (9007.175 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 44053.578 (16006.405 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 60068.944 (3068.748 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 63138.474 (13012.499 ms): io_getevents(ctx_id: 281472924459008, min_nr: 511, nr: 511, events: 0x19a83150) = 511
> 	 76191.073 (2872.657 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	...
> 
> And on the host, the profile of sys_enter_io_submit() is:
> 
> 	Samples: 3K of event 'syscalls:sys_enter_io_submit', Event count (approx.): 3150
> 	Children      Self  Trace output
> 	-   66.70%    66.70%  ctx_id: 0xffff9c044000, nr: 0x00000001, iocbpp: 0xffff9f7fad28
> 	0xffffae7f871c
> 	0xffffae8a27c4
> 	qemu_thread_start
> 	iothread_run
> 	aio_poll
> 	aio_dispatch_ready_handlers
> 	aio_dispatch_handler
> 	virtio_queue_host_notifier_aio_read
> 	virtio_queue_notify_aio_vq
> 	virtio_blk_data_plane_handle_output
> 	virtio_blk_handle_vq
> 	blk_io_unplug
> 	bdrv_io_unplug
> 	bdrv_io_unplug
> 	raw_aio_unplug
> 	laio_io_unplug
> 	syscall
> 
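
The profile above looks like perf report output for the io_submit entry
tracepoint. For anyone reproducing this, something along these lines
should give an equivalent picture (the QEMU binary name and capture
window are assumptions on my side):

	# Record call graphs for every io_submit entry in the QEMU process:
	perf record -e syscalls:sys_enter_io_submit -g \
	    -p "$(pidof qemu-system-aarch64)" -- sleep 30
	perf report --stdio
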
> 
> When the hang occurs, QEMU is blocked at:
> 
> 	#0  0x0000ffff95762b64 in ?? () from target:/usr/lib64/libpthread.so.0
> 	#1  0x0000ffff9575bd88 in pthread_mutex_lock () from target:/usr/lib64/libpthread.so.0
> 	#2  0x0000aaaabb1f5948 in qemu_mutex_lock_impl (mutex=0xaaaacc8e1860,
> 	    file=0xaaaabb4e1bd0 "/Images/eillon/CODE/5-opensource/qemu/util/async.c", line=605)
> 	#3  0x0000aaaabb20acd4 in aio_context_acquire (ctx=0xaaaacc8e1800)
> 	#4  0x0000aaaabb105e90 in bdrv_query_image_info (bs=0xaaaacc934620,
> 	    p_info=0xaaaaccc41e18, errp=0xffffca669118)
> 	#5  0x0000aaaabb105968 in bdrv_block_device_info (blk=0xaaaacdca19f0, bs=0xaaaacc934620,
> 	    flat=false, errp=0xffffca6692b8)
> 	#6  0x0000aaaabb1063dc in bdrv_query_info (blk=0xaaaacdca19f0, p_info=0xaaaacd29c9a8,
> 	    errp=0xffffca6692b8)
> 	#7  0x0000aaaabb106c14 in qmp_query_block (errp=0x0)
> 	#8  0x0000aaaabacb8e6c in hmp_info_block (mon=0xffffca6693d0, qdict=0xaaaacd089790)
> 	#9  0x0000aaaabb0068f0 in handle_hmp_command (mon=0xffffca6693d0,
> 	    cmdline=0xaaaacc9013ca "")
> 	#10 0x0000aaaabaab5034 in qmp_human_monitor_command (
> 	    command_line=0xaaaacc9013c0 "info block", has_cpu_index=false, cpu_index=0,
> 	    errp=0xffffca6694d8)
> 	#11 0x0000aaaabb03c720 in qmp_marshal_human_monitor_command (args=0xffff70008070,
> 	    ret=0xffffca669570, errp=0xffffca669568) at qapi/qapi-commands-misc.c:653
> 	#12 0x0000aaaabb18fdbc in qmp_dispatch (cmds=0xaaaabbb197c0 <qmp_commands>,
> 	    request=0xffff70005cc0, allow_oob=false)
> 	#13 0x0000aaaabb003074 in monitor_qmp_dispatch (mon=0xaaaacc962320, req=0xffff70005cc0)
> 	#14 0x0000aaaabb003488 in monitor_qmp_bh_dispatcher (data=0x0)
> 	#15 0x0000aaaabb209d44 in aio_bh_call (bh=0xaaaacc5f0810)
> 	#16 0x0000aaaabb209e50 in aio_bh_poll (ctx=0xaaaacc5ef320)
> 	#17 0x0000aaaabb1f1494 in aio_dispatch (ctx=0xaaaacc5ef320)
> 	#18 0x0000aaaabb20a330 in aio_ctx_dispatch (source=0xaaaacc5ef320, callback=0x0,
> 	    user_data=0x0)
> 	#19 0x0000ffff95f00a7c in g_main_context_dispatch ()
> 	   from target:/usr/lib64/libglib-2.0.so.0
> 	#20 0x0000aaaabb2128e8 in glib_pollfds_poll ()
> 	#21 0x0000aaaabb212970 in os_host_main_loop_wait (timeout=64805420)
> 	#22 0x0000aaaabb212a90 in main_loop_wait (nonblocking=0)
> 	#23 0x0000aaaabaad63d0 in qemu_main_loop ()
> 	#24 0x0000aaaabb188b54 in main (argc=117, argv=0xffffca669a68, envp=0xffffca669e18)
> 
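
The frames above cover the main loop thread. A backtrace of all
threads, including the iothread presumably stuck in io_submit(), can
be grabbed non-interactively with something like this (the binary name
is an assumption):

	gdb -p "$(pidof qemu-system-aarch64)" \
	    -ex 'thread apply all bt' -ex detach -ex quit
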
> And the kernel-side backtrace of the QEMU process is:
> 	[<0>] __switch_to+0xdc/0x140
> 	[<0>] futex_wait_queue_me+0xd4/0x158
> 	[<0>] futex_wait+0xf4/0x230
> 	[<0>] do_futex+0x46c/0x608
> 	[<0>] __arm64_sys_futex+0x13c/0x188
> 	[<0>] el0_svc_common+0x80/0x1b8
> 	[<0>] el0_svc_handler+0x38/0x78
> 	[<0>] el0_svc+0x10/0x14
> 	[<0>] 0xffffffffffffffff

Hi Zhenyu,

Did you by any chance capture the kernel stack where io_submit is stuck?
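
If not, it should be retrievable from procfs while the hang is in
progress, along these lines (<tid> is the iothread's task ID, found
e.g. with "ps -eLf | grep qemu"; the binary name is an assumption):

	cat /proc/"$(pidof qemu-system-aarch64)"/task/<tid>/stack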

Thanks,
Fam


Thread overview: 28+ messages
2020-08-10 14:52 [PATCH v1 0/2] Add timeout mechanism to qmp actions Zhenyu Ye
2020-08-10 14:52 ` [PATCH v1 1/2] util: introduce aio_context_acquire_timeout Zhenyu Ye
2020-08-10 14:52 ` [PATCH v1 2/2] qmp: use aio_context_acquire_timeout replace aio_context_acquire Zhenyu Ye
2020-08-10 15:38 ` [PATCH v1 0/2] Add timeout mechanism to qmp actions Kevin Wolf
2020-08-11 13:54   ` Zhenyu Ye
2020-08-21 12:52     ` Stefan Hajnoczi
2020-09-14 13:27     ` Stefan Hajnoczi
2020-09-17  7:36       ` Zhenyu Ye
2020-09-17 10:10         ` Fam Zheng [this message]
2020-09-17 15:44         ` Stefan Hajnoczi
2020-09-17 16:01           ` Fam Zheng
2020-09-18 11:23             ` Zhenyu Ye
2020-09-18 14:06               ` Fam Zheng
2020-09-19  2:22                 ` Zhenyu Ye
2020-09-21 11:14                   ` Fam Zheng
2020-10-13 10:00                     ` Stefan Hajnoczi
2020-10-19 12:40                       ` Zhenyu Ye
2020-10-19 13:25                         ` Paolo Bonzini
2020-10-20  1:34                           ` Zhenyu Ye
2020-10-22 16:29                             ` Fam Zheng
2020-12-08 13:10                               ` Stefan Hajnoczi
2020-12-08 13:47                                 ` Glauber Costa
2020-12-14 16:33                                   ` Stefan Hajnoczi
2020-12-21 11:30                                     ` Zhenyu Ye
2020-09-14 14:42     ` Daniel P. Berrangé
2020-09-17  8:12       ` Zhenyu Ye
2020-08-12 13:51 ` Stefan Hajnoczi
2020-08-13  1:51   ` Zhenyu Ye
