From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Juan Quintela <quintela@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] postcopy migration hangs while loading virtio state
Date: Mon, 24 Apr 2017 15:35:17 +0100
Message-ID: <20170424143516.GD2075@work-vm>
In-Reply-To: <75a4c5ff-385e-31ac-5f86-883b082cd94e@de.ibm.com>

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >> David, Juan,
> >> 
> >> I can trigger a hang of postcopy migration (when I do it early) so
> >> that both sides are in paused state. Looks like one thread is still
> >> loading vmstates for virtio and this load accesses guest memory and
> >> triggers a userfault.
> > 
> > It's perfectly legal for the destination to cause a userfault as it
> > loads the virtio queue - the virtio queue should be getting loaded by
> > the main thread from the buffer while the 'listening' thread is
> > waiting for the incoming page data.
> > 
> > Can you turn on the following tracing please:
> > destination:
> >   postcopy_ram_fault_thread_request
> >   postcopy_place_page
> >   postcopy_place_page_zero
> > 
> > source:
> >   migrate_handle_rp_req_pages
> >   ram_save_queue_pages
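
(For reference, one way to enable those, assuming a tracetool-enabled build,
is to list the event names, one per line, in a file and pass it with -trace,
e.g. on the destination:

    echo postcopy_ram_fault_thread_request >  events.txt
    echo postcopy_place_page               >> events.txt
    echo postcopy_place_page_zero          >> events.txt
    qemu-system-s390x -trace events=events.txt ...

and similarly with the two source-side events on the source; the HMP
"trace-event <name> on" command should also work.)
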
> > 
> > You should see:
> >   virtio does the access
> >   userfault generates a fault
> >   postcopy_ram_fault_thread_request sends the request to the source
> > 
> > the source sees migrate_handle_rp_req_pages, which queues it with
> > ram_save_queue_pages
> > 
> > the destination sees the page arrive and postcopy_place_page or
> > postcopy_place_page_zero
> > 
> > some of that might disappear if the page was already on its way.
> 
> 
> the last events on the source are
> [..]
> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000

How do you see those on the source???

> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
> 

Is that a typo? I'm expecting those two 'len' fields to be the same?

> On the target I see lots of
> 
> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
> 
> So we have about 2 seconds of traffic going on after that request,
> I assume it's precopy related.
> 
> Looking back on the target history there was
> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> 
> In fact it seems to be the first and only request:
> 
> # cat /tmp/test0.trace | grep -v postcopy_place_page
> 
> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000

OK, does the HVA there correspond to the address that your virtio device is blocking on?
(or at least the start of the page)
Do you see a postcopy_place_page with host= equal to that HVA?
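
(e.g. something along the lines of

    grep 'host=0x3ff618d9000' /tmp/test0.trace

using the HVA from the fault message above, rounded down to the start of the
page if needed, should tell you whether that page was ever placed.)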

From the source backtrace, I think the source thinks it's sent everything
and is waiting for the return-path-thread to close.
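(A 'thread apply all bt' in gdb on the source side should confirm that; iirc
the migration thread ends up in qemu_thread_join via
await_return_path_close_on_source while it waits for the return path thread.)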

Dave

> > 
> > Dave
> > 
> >> Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)):
> >> #0  0x0000000001017130 in lduw_he_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:317
> >>     <---- according to the host kernel this thread hangs in handle_userfault
> >> #1  0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:359
> >> #2  0x0000000001025840 in address_space_lduw_internal_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at /root/qemu/memory_ldst.inc.c:284
> >> #3  0x00000000010259a6 in address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0) at /root/qemu/memory_ldst.inc.c:315
> >> #4  0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0, addr=2) at /root/qemu/memory_ldst.inc.c:334
> >> #5  0x0000000001116c10 in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0, pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166
> >> #6  0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at /root/qemu/hw/virtio/virtio.c:263
> >> #7  0x000000000111daea in virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at /root/qemu/hw/virtio/virtio.c:2168
> >> #8  0x000000000111d0cc in virtio_device_get (f=0x28353600, opaque=0x28275090, size=0, field=0x16adf38 <__compound_literal.0>) at /root/qemu/hw/virtio/virtio.c:1975
> >> #9  0x00000000012a7f50 in vmstate_load_state (f=0x28353600, vmsd=0x14db480 <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at /root/qemu/migration/vmstate.c:128
> >> #10 0x00000000010cbc08 in vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at /root/qemu/migration/savevm.c:732
> >> #11 0x00000000010ced6e in qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1890
> >> #12 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947
> >> #13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1707
> >> #14 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at /root/qemu/migration/savevm.c:1770
> >> #15 0x00000000010cf056 in qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1960
> >> #16 0x00000000010cf2b6 in qemu_loadvm_state (f=0x282c7d20) at /root/qemu/migration/savevm.c:2020
> >> #17 0x000000000129d234 in process_incoming_migration_co (opaque=0x282c7d20) at /root/qemu/migration/migration.c:404
> >> #18 0x00000000014b6ca2 in coroutine_trampoline (i0=0, i1=674533168) at /root/qemu/util/coroutine-ucontext.c:79
> >> #19 0x000003ff8de530ea in __makecontext_ret () from target:/lib64/libc.so.6
> >> 
> >> Thread 8 (Thread 0x3ff40b38910 (LWP 21165)):
> >> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
> >> #1  0x00000000014933ce in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
> >> #2  0x00000000014935d0 in qemu_event_wait (ev=0x17169b0 <mis_current+8>) at /root/qemu/util/qemu-thread-posix.c:399
> >> #3  0x00000000010ce094 in postcopy_ram_listen_thread (opaque=0x282c7d20) at /root/qemu/migration/savevm.c:1532
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 7 (Thread 0x3ff41338910 (LWP 21164)):
> >> #0  0x000003ff8defd72c in ?? () from target:/lib64/libc.so.6
> >> #1  0x00000000012af920 in postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at /root/qemu/migration/postcopy-ram.c:437
> >> #2  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #3  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 6 (Thread 0x3ff41b38910 (LWP 21163)):
> >> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at /root/qemu/util/qemu-thread-posix.c:133
> >> #2  0x00000000010c6080 in do_data_decompress (opaque=0x283023d8) at /root/qemu/migration/ram.c:2277
> >> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 5 (Thread 0x3ff42338910 (LWP 21162)):
> >> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at /root/qemu/util/qemu-thread-posix.c:133
> >> #2  0x00000000010c6080 in do_data_decompress (opaque=0x28302360) at /root/qemu/migration/ram.c:2277
> >> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 4 (Thread 0x3ff89287910 (LWP 21148)):
> >> #0  0x000003ff8e010abc in __lll_lock_wait () from target:/lib64/libpthread.so.0
> >> #1  0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from target:/lib64/libpthread.so.0
> >> #2  0x000003ff8e00d4ee in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #3  0x0000000001492d66 in qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688 <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133
> >> #4  0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at /root/qemu/cpus.c:1085
> >> #5  0x00000000010773aa in qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123
> >> #6  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #7  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)):
> >> #0  0x000003ff8defd832 in ppoll () from target:/lib64/libc.so.6
> >> #1  0x000000000148b092 in qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at /root/qemu/util/qemu-timer.c:322
> >> #2  0x000000000148f46a in aio_poll (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622
> >> #3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at /root/qemu/iothread.c:59
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 2 (Thread 0x3ff8afff910 (LWP 21142)):
> >> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
> >> #1  0x00000000014933ce in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
> >> #2  0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4 <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399
> >> #3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at /root/qemu/util/rcu.c:249
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> 
> >> Looks like this
> >> 
> >>   <qemu:arg value='-drive'/>
> >>   <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
> >>   <qemu:arg value='-device'/>
> >>   <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>
> >> 
> >> is necessary to trigger the hang; with num-queues=1 everything seems to
> >> work fine, so it might be a virtio issue after all?
> >> 
> >> Any idea?
> >> 
> > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

