* [Qemu-devel] postcopy migration hangs while loading virtio state
@ 2017-04-24 10:31 Christian Borntraeger
  2017-04-24 10:53 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-24 10:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Juan Quintela, Michael S. Tsirkin; +Cc: qemu-devel

David, Juan,

I can trigger a hang of postcopy migration (when I do it early) so that
both sides are in a paused state. It looks like one thread is still loading
the vmstate for virtio, and this load accesses guest memory and triggers
a userfault.

Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)):
#0  0x0000000001017130 in lduw_he_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:317
		<---- according to the host kernel this thread hangs in handle_userfault.
#1  0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:359
#2  0x0000000001025840 in address_space_lduw_internal_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at /root/qemu/memory_ldst.inc.c:284
#3  0x00000000010259a6 in address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0) at /root/qemu/memory_ldst.inc.c:315
#4  0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0, addr=2) at /root/qemu/memory_ldst.inc.c:334
#5  0x0000000001116c10 in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0, pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166
#6  0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at /root/qemu/hw/virtio/virtio.c:263
#7  0x000000000111daea in virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at /root/qemu/hw/virtio/virtio.c:2168
#8  0x000000000111d0cc in virtio_device_get (f=0x28353600, opaque=0x28275090, size=0, field=0x16adf38 <__compound_literal.0>) at /root/qemu/hw/virtio/virtio.c:1975
#9  0x00000000012a7f50 in vmstate_load_state (f=0x28353600, vmsd=0x14db480 <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at /root/qemu/migration/vmstate.c:128
#10 0x00000000010cbc08 in vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at /root/qemu/migration/savevm.c:732
#11 0x00000000010ced6e in qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1890
#12 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947
#13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1707
#14 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at /root/qemu/migration/savevm.c:1770
#15 0x00000000010cf056 in qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1960
#16 0x00000000010cf2b6 in qemu_loadvm_state (f=0x282c7d20) at /root/qemu/migration/savevm.c:2020
#17 0x000000000129d234 in process_incoming_migration_co (opaque=0x282c7d20) at /root/qemu/migration/migration.c:404
#18 0x00000000014b6ca2 in coroutine_trampoline (i0=0, i1=674533168) at /root/qemu/util/coroutine-ucontext.c:79
#19 0x000003ff8de530ea in __makecontext_ret () from target:/lib64/libc.so.6

Thread 8 (Thread 0x3ff40b38910 (LWP 21165)):
#0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
#1  0x00000000014933ce in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
#2  0x00000000014935d0 in qemu_event_wait (ev=0x17169b0 <mis_current+8>) at /root/qemu/util/qemu-thread-posix.c:399
#3  0x00000000010ce094 in postcopy_ram_listen_thread (opaque=0x282c7d20) at /root/qemu/migration/savevm.c:1532
#4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 7 (Thread 0x3ff41338910 (LWP 21164)):
#0  0x000003ff8defd72c in ?? () from target:/lib64/libc.so.6
#1  0x00000000012af920 in postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at /root/qemu/migration/postcopy-ram.c:437
#2  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#3  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 6 (Thread 0x3ff41b38910 (LWP 21163)):
#0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at /root/qemu/util/qemu-thread-posix.c:133
#2  0x00000000010c6080 in do_data_decompress (opaque=0x283023d8) at /root/qemu/migration/ram.c:2277
#3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 5 (Thread 0x3ff42338910 (LWP 21162)):
#0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at /root/qemu/util/qemu-thread-posix.c:133
#2  0x00000000010c6080 in do_data_decompress (opaque=0x28302360) at /root/qemu/migration/ram.c:2277
#3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 4 (Thread 0x3ff89287910 (LWP 21148)):
#0  0x000003ff8e010abc in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from target:/lib64/libpthread.so.0
#2  0x000003ff8e00d4ee in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#3  0x0000000001492d66 in qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688 <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133
#4  0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at /root/qemu/cpus.c:1085
#5  0x00000000010773aa in qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123
#6  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#7  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)):
#0  0x000003ff8defd832 in ppoll () from target:/lib64/libc.so.6
#1  0x000000000148b092 in qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at /root/qemu/util/qemu-timer.c:322
#2  0x000000000148f46a in aio_poll (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622
#3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at /root/qemu/iothread.c:59
#4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 2 (Thread 0x3ff8afff910 (LWP 21142)):
#0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
#1  0x00000000014933ce in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
#2  0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4 <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399
#3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at /root/qemu/util/rcu.c:249
#4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6


It looks like this
    <qemu:arg value='-drive'/>
    <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>

is necessary to trigger the hang; with num-queues=1 everything seems to work fine, so it might
be a virtio issue after all?

Any idea?


* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 10:31 [Qemu-devel] postcopy migration hangs while loading virtio state Christian Borntraeger
@ 2017-04-24 10:53 ` Dr. David Alan Gilbert
  2017-04-24 13:02   ` Christian Borntraeger
  2017-04-24 13:11   ` Christian Borntraeger
  0 siblings, 2 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-04-24 10:53 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> David, Juan,
> 
> I can trigger a hang of postcopy migration (when I do it early) so that
> both sides are in paused state. Looks like one thread is still loading
> vmstates for virtio and this load accesses guest memory and triggers
> a userfault. 

It's perfectly legal for the destination to cause a userfault as it loads the
virtio queue - the virtio queue should be being loaded by the main thread
from the buffer while the 'listening' thread is waiting for the incoming
page data.
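The arrangement above can be sketched as a toy model (plain Python threading, illustrative only, not QEMU code): the load thread blocks on the missing page at the same point where the real thread blocks in handle_userfault, and only resumes once the page is placed.

```python
import threading

# Toy model of the postcopy destination (illustrative, not QEMU code):
# the main/load thread reads the vring and blocks on a not-yet-migrated
# page; a second thread stands in for the fault/listen path that places it.
page_ready = threading.Event()
loaded = []

def load_virtio_queue():
    # Reading used->idx touches the missing page: the real thread blocks
    # in the kernel's handle_userfault at this point.
    page_ready.wait()
    loaded.append("used_idx")

def place_page():
    # The requested page arrives from the source and is placed, which
    # resolves the fault and lets the load thread continue.
    page_ready.set()

loader = threading.Thread(target=load_virtio_queue)
placer = threading.Thread(target=place_page)
loader.start(); placer.start()
loader.join(); placer.join()
print(loaded)
```

In the hang reported here, the equivalent of place_page() for the faulted address apparently never runs, so the load thread waits forever.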

Can you turn on the following tracing please:
  destination:
    postcopy_ram_fault_thread_request
    postcopy_place_page
    postcopy_place_page_zero

  source:
    migrate_handle_rp_req_pages
    ram_save_queue_pages

You should see:
    virtio does the access
    userfault generates a fault
    postcopy_ram_fault_thread_request sends the request to the source

    the source sees migrate_handle_rp_req_pages
    queues it with ram_save_queue_pages

    the destination sees the page arrive and
    postcopy_place_page or postcopy_place_page_zero

Some of that might disappear if the page was already on its way.

Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 10:53 ` Dr. David Alan Gilbert
@ 2017-04-24 13:02   ` Christian Borntraeger
  2017-04-24 14:35     ` Dr. David Alan Gilbert
  2017-04-24 13:11   ` Christian Borntraeger
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-24 13:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> David, Juan,
>> 
>> I can trigger a hang of postcopy migration (when I do it early) so
>> that both sides are in paused state. Looks like one thread is still
>> loading vmstates for virtio and this load accesses guest memory and
>> triggers a userfault.
> 
> It's perfectly legal for the destination to cause a userfault as it
> loads the virtio queue - the virtio queue should be being loaded by
> the main thread from the buffer while the 'listening' thread is
> waiting for the incoming page data.
> 
> Can you turn on the following tracing please:
>   destination:
>     postcopy_ram_fault_thread_request
>     postcopy_place_page
>     postcopy_place_page_zero
> 
>   source:
>     migrate_handle_rp_req_pages
>     ram_save_queue_pages
> 
> You should see:
>     virtio does the access
>     userfault generates a fault
>     postcopy_ram_fault_thread_request sends the request to the source
> 
>     the source sees migrate_handle_rp_req_pages
>     queues it with ram_save_queue_pages
> 
>     the destination sees the page arrive and
>     postcopy_place_page or postcopy_place_page_zero
> 
> some of that might disappear if the page was already on its way.


The last events on the source are:
[..]
58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100


On the target I see lots of:

39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000

So we have about 2 seconds of traffic going on after that request;
I assume it's precopy-related.

Looking back on the target history there was
39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000

In fact it seems to be the first and only request:

# cat /tmp/test0.trace | grep -v postcopy_place_page

39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
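A trace like this can be cross-checked mechanically. The following is a small hypothetical helper (not part of QEMU; the function name and the matching rule are my own): it flags fault-thread requests whose host address never shows up in a later postcopy_place_page / postcopy_place_page_zero event, which is exactly the signature of the stuck page above.

```python
import re

# Hypothetical helper (not part of QEMU): scan destination trace output
# and report fault-thread requests that were never satisfied by a matching
# postcopy_place_page / postcopy_place_page_zero at the same host address.
REQ_RE = re.compile(r'postcopy_ram_fault_thread_request Request for HVA=([0-9a-f]+)')
PLACE_RE = re.compile(r'postcopy_place_page(?:_zero)? host=0x([0-9a-f]+)')

def unsatisfied_requests(trace_lines):
    requested = []
    placed = set()
    for line in trace_lines:
        m = REQ_RE.search(line)
        if m:
            requested.append(int(m.group(1), 16))
            continue
        m = PLACE_RE.search(line)
        if m:
            placed.add(int(m.group(1), 16))
    # A request counts as satisfied once some page is placed at its address.
    return [hva for hva in requested if hva not in placed]

trace = [
    "39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000",
    "39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000",
    "39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000",
]
print([hex(h) for h in unsatisfied_requests(trace)])  # the stuck request
```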

> 
> Dave
> 
>> [...]


* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 10:53 ` Dr. David Alan Gilbert
  2017-04-24 13:02   ` Christian Borntraeger
@ 2017-04-24 13:11   ` Christian Borntraeger
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-24 13:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:

FWIW, the sending side calltrace looks like:

Thread 8 (Thread 0x3ff509aa910 (LWP 58594)):
#0  0x000003ff9d588ca4 in pthread_join () from target:/lib64/libpthread.so.0
#1  0x0000000001493d4a in qemu_thread_join (thread=0x16c7e78 <current_migration+216>) at /root/qemu/util/qemu-thread-posix.c:504
#2  0x00000000012a0d0e in await_return_path_close_on_source (ms=0x16c7da0 <current_migration>) at /root/qemu/migration/migration.c:1641
#3  0x00000000012a1312 in migration_completion (s=0x16c7da0 <current_migration>, current_active_state=5, old_vm_running=0x3ff509a7789, start_time=0x3ff509a7790)
    at /root/qemu/migration/migration.c:1860
#4  0x00000000012a17b2 in migration_thread (opaque=0x16c7da0 <current_migration>) at /root/qemu/migration/migration.c:1987
#5  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#6  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 7 (Thread 0x3ff4bfff910 (LWP 58593)):
#0  0x000003ff9d5915b6 in recvmsg () from target:/lib64/libpthread.so.0
#1  0x000000000141771a in qio_channel_socket_readv (ioc=0xc9986b0, iov=0x3ff4bffc170, niov=1, fds=0x0, nfds=0x0, errp=0x0) at /root/qemu/io/channel-socket.c:476
#2  0x0000000001411668 in qio_channel_readv_full (ioc=0xc9986b0, iov=0x3ff4bffc170, niov=1, fds=0x0, nfds=0x0, errp=0x0) at /root/qemu/io/channel.c:64
#3  0x000000000141190a in qio_channel_read (ioc=0xc9986b0, buf=0xca23888 "", buflen=32768, errp=0x0) at /root/qemu/io/channel.c:112
#4  0x00000000012ad384 in channel_get_buffer (opaque=0xc9986b0, buf=0xca23888 "", pos=49, size=32768) at /root/qemu/migration/qemu-file-channel.c:78
#5  0x00000000012abb64 in qemu_fill_buffer (f=0xca23850) at /root/qemu/migration/qemu-file.c:295
#6  0x00000000012ac7b2 in qemu_peek_byte (f=0xca23850, offset=0) at /root/qemu/migration/qemu-file.c:555
#7  0x00000000012ac82e in qemu_get_byte (f=0xca23850) at /root/qemu/migration/qemu-file.c:568
#8  0x00000000012acbfa in qemu_get_be16 (f=0xca23850) at /root/qemu/migration/qemu-file.c:640
#9  0x00000000012a0756 in source_return_path_thread (opaque=0x16c7da0 <current_migration>) at /root/qemu/migration/migration.c:1512
#10 0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#11 0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 6 (Thread 0x3ff511aa910 (LWP 58454)):
#0  0x000003ff9d58d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000000001492fce in qemu_cond_wait (cond=0xc9ca238, mutex=0xc9ca210) at /root/qemu/util/qemu-thread-posix.c:133
#2  0x00000000010c6080 in do_data_decompress (opaque=0xc9ca208) at /root/qemu/migration/ram.c:2277
#3  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#4  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 5 (Thread 0x3ff519aa910 (LWP 58453)):
#0  0x000003ff9d58d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000000001492fce in qemu_cond_wait (cond=0xc9ca1c0, mutex=0xc9ca198) at /root/qemu/util/qemu-thread-posix.c:133
#2  0x00000000010c6080 in do_data_decompress (opaque=0xc9ca190) at /root/qemu/migration/ram.c:2277
#3  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#4  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 4 (Thread 0x3ff98807910 (LWP 58440)):
#0  0x000003ff9d58d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000000001492fce in qemu_cond_wait (cond=0xc9e6930, mutex=0x17116b0 <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133
#2  0x000000000107720e in qemu_kvm_wait_io_event (cpu=0xc9a87d0) at /root/qemu/cpus.c:1085
#3  0x00000000010773aa in qemu_kvm_cpu_thread_fn (arg=0xc9a87d0) at /root/qemu/cpus.c:1123
#4  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 3 (Thread 0x3ff99d7f910 (LWP 58437)):
#0  0x000003ff9d47d832 in ppoll () from target:/lib64/libc.so.6
#1  0x000000000148b2fa in qemu_poll_ns (fds=0x3ff940008c0, nfds=1, timeout=-1) at /root/qemu/util/qemu-timer.c:322
#2  0x000000000148f6d2 in aio_poll (ctx=0xc9473c0, blocking=true) at /root/qemu/util/aio-posix.c:622
#3  0x00000000011a7b5e in iothread_run (opaque=0xc947230) at /root/qemu/iothread.c:59
#4  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 2 (Thread 0x3ff9a57f910 (LWP 58434)):
#0  0x000003ff9d484336 in syscall () from target:/lib64/libc.so.6
#1  0x0000000001493636 in qemu_futex_wait (f=0x1b2b9dc <rcu_call_ready_event>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
#2  0x0000000001493838 in qemu_event_wait (ev=0x1b2b9dc <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399
#3  0x00000000014b40a8 in call_rcu_thread (opaque=0x0) at /root/qemu/util/rcu.c:249
#4  0x000003ff9d587bc4 in start_thread () from target:/lib64/libpthread.so.0
#5  0x000003ff9d48a9f2 in thread_start () from target:/lib64/libc.so.6

Thread 1 (Thread 0x3ffb24c5f00 (LWP 58412)):
#0  0x000003ff9d47d832 in ppoll () from target:/lib64/libc.so.6
#1  0x000000000148b414 in qemu_poll_ns (fds=0xc98f6e0, nfds=6, timeout=999000000) at /root/qemu/util/qemu-timer.c:334
#2  0x000000000148ccca in os_host_main_loop_wait (timeout=999000000) at /root/qemu/util/main-loop.c:255
#3  0x000000000148ce0a in main_loop_wait (nonblocking=0) at /root/qemu/util/main-loop.c:517
#4  0x00000000011b3cfa in main_loop () at /root/qemu/vl.c:1898
#5  0x00000000011bc3fc in main (argc=91, argv=0x3ffdc87eed8, envp=0x3ffdc87f1b8) at /root/qemu/vl.c:4709

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 13:02   ` Christian Borntraeger
@ 2017-04-24 14:35     ` Dr. David Alan Gilbert
  2017-04-24 18:17       ` Christian Borntraeger
  0 siblings, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-04-24 14:35 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >> David, Juan,
> >> 
> >> I can trigger a hang of postcopy migration (when I do it early) so
> >> that both sides are in paused state. Looks like one thread is still
> >> loading vmstates for virtio and this load accesses guest memory and
> >> triggers a userfault.
> > 
> > It's perfectly legal for the destination to cause a userfault as it
> > loads the virtio queue - the virtio queue should be being loaded by
> > the main thread from the buffer while the 'listening' thread is
> > waiting for the incoming page data.
> > 
> > Can you turn on the following tracing please:
> >
> > destination:
> >   postcopy_ram_fault_thread_request
> >   postcopy_place_page
> >   postcopy_place_page_zero
> >
> > source:
> >   migrate_handle_rp_req_pages
> >   ram_save_queue_pages
> >
> > You should see:
> >   virtio does the access
> >   userfault generates a fault
> >   postcopy_ram_fault_thread_request sends the request to the source
> > 
> > the source sees migrate_handle_rp_req_pages queues it with
> > ram_save_queue_pages
> > 
> > the destination sees the page arrive and postcopy_place_page or
> > postcopy_place_page_zero
> > 
> > some of that might disappear if the page was already on its way.
> 
> 
> the last events on the source are
> [..]
> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000

How do you see those on the source???

> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
> 

Is that a typo? I'm expecting those two 'len' fields to be the same?

> On the target a see lots of
> 
> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
> 
> So we have about 2 seconds of traffic going on after that request,
> I assume it's precopy related.
> 
> Looking back on the target history there was
> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> 
> In fact it seems to be the first and only request:
> 
> # cat /tmp/test0.trace | grep -v postcopy_place_page
> 
> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000

OK, does the HVA there correspond to the address that your virtio device is blocking on?
(or at least the start of the page)
Do you see a postcopy_place_page with a host= the same HVA ?

From the source backtrace, I think the source thinks it's sent everything
and is waiting for the return-path-thread to close.
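
One way to answer that mechanically is to cross-reference the two trace events. A throwaway sketch (the sample lines are pasted from the traces quoted above; in practice you would read them from the trace file, e.g. /tmp/test0.trace):

```python
import re

# Sample trace lines (taken from this thread); normally read from the trace file.
trace = """\
39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
"""

# HVAs the fault thread asked for, vs. HVAs where a page was actually placed.
requested = set(re.findall(r'Request for HVA=([0-9a-f]+)', trace))
placed = set(re.findall(r'postcopy_place_page(?:_zero)? host=0x([0-9a-f]+)', trace))

for hva in sorted(requested):
    print(f"HVA 0x{hva}: {'placed' if hva in placed else 'never placed'}")
# prints: HVA 0x3ff618d9000: never placed
```

If a requested HVA never shows up as placed, the request was lost or ignored somewhere on the source side.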

Dave

> > 
> > Dave
> > 
> >> Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)):
> >> #0  0x0000000001017130 in lduw_he_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:317
> >>      <---- according to the host kernel this thread hangs in handle_userfault.
> >> #1  0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:359
> >> #2  0x0000000001025840 in address_space_lduw_internal_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at /root/qemu/memory_ldst.inc.c:284
> >> #3  0x00000000010259a6 in address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0) at /root/qemu/memory_ldst.inc.c:315
> >> #4  0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0, addr=2) at /root/qemu/memory_ldst.inc.c:334
> >> #5  0x0000000001116c10 in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0, pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166
> >> #6  0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at /root/qemu/hw/virtio/virtio.c:263
> >> #7  0x000000000111daea in virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at /root/qemu/hw/virtio/virtio.c:2168
> >> #8  0x000000000111d0cc in virtio_device_get (f=0x28353600, opaque=0x28275090, size=0, field=0x16adf38 <__compound_literal.0>) at /root/qemu/hw/virtio/virtio.c:1975
> >> #9  0x00000000012a7f50 in vmstate_load_state (f=0x28353600, vmsd=0x14db480 <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at /root/qemu/migration/vmstate.c:128
> >> #10 0x00000000010cbc08 in vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at /root/qemu/migration/savevm.c:732
> >> #11 0x00000000010ced6e in qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1890
> >> #12 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947
> >> #13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1707
> >> #14 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at /root/qemu/migration/savevm.c:1770
> >> #15 0x00000000010cf056 in qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1960
> >> #16 0x00000000010cf2b6 in qemu_loadvm_state (f=0x282c7d20) at /root/qemu/migration/savevm.c:2020
> >> #17 0x000000000129d234 in process_incoming_migration_co (opaque=0x282c7d20) at /root/qemu/migration/migration.c:404
> >> #18 0x00000000014b6ca2 in coroutine_trampoline (i0=0, i1=674533168) at /root/qemu/util/coroutine-ucontext.c:79
> >> #19 0x000003ff8de530ea in __makecontext_ret () from target:/lib64/libc.so.6
> >> 
> >> Thread 8 (Thread 0x3ff40b38910 (LWP 21165)):
> >> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
> >> #1  0x00000000014933ce in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
> >> #2  0x00000000014935d0 in qemu_event_wait (ev=0x17169b0 <mis_current+8>) at /root/qemu/util/qemu-thread-posix.c:399
> >> #3  0x00000000010ce094 in postcopy_ram_listen_thread (opaque=0x282c7d20) at /root/qemu/migration/savevm.c:1532
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 7 (Thread 0x3ff41338910 (LWP 21164)):
> >> #0  0x000003ff8defd72c in ?? () from target:/lib64/libc.so.6
> >> #1  0x00000000012af920 in postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at /root/qemu/migration/postcopy-ram.c:437
> >> #2  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #3  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 6 (Thread 0x3ff41b38910 (LWP 21163)):
> >> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at /root/qemu/util/qemu-thread-posix.c:133
> >> #2  0x00000000010c6080 in do_data_decompress (opaque=0x283023d8) at /root/qemu/migration/ram.c:2277
> >> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 5 (Thread 0x3ff42338910 (LWP 21162)):
> >> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at /root/qemu/util/qemu-thread-posix.c:133
> >> #2  0x00000000010c6080 in do_data_decompress (opaque=0x28302360) at /root/qemu/migration/ram.c:2277
> >> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 4 (Thread 0x3ff89287910 (LWP 21148)):
> >> #0  0x000003ff8e010abc in __lll_lock_wait () from target:/lib64/libpthread.so.0
> >> #1  0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from target:/lib64/libpthread.so.0
> >> #2  0x000003ff8e00d4ee in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
> >> #3  0x0000000001492d66 in qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688 <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133
> >> #4  0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at /root/qemu/cpus.c:1085
> >> #5  0x00000000010773aa in qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123
> >> #6  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #7  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)):
> >> #0  0x000003ff8defd832 in ppoll () from target:/lib64/libc.so.6
> >> #1  0x000000000148b092 in qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at /root/qemu/util/qemu-timer.c:322
> >> #2  0x000000000148f46a in aio_poll (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622
> >> #3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at /root/qemu/iothread.c:59
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> Thread 2 (Thread 0x3ff8afff910 (LWP 21142)):
> >> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
> >> #1  0x00000000014933ce in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
> >> #2  0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4 <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399
> >> #3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at /root/qemu/util/rcu.c:249
> >> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
> >> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >> 
> >> 
> >> Looks like this
> >>   <qemu:arg value='-drive'/>
> >>   <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
> >>   <qemu:arg value='-device'/>
> >>   <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>
> >> is necessary to trigger the hang; with num-queues=1 everything seems to work fine, so it might
> >> be a virtio issue after all?
> >> 
> >> Any idea?
> >> 
> > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 14:35     ` Dr. David Alan Gilbert
@ 2017-04-24 18:17       ` Christian Borntraeger
  2017-04-24 19:12         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-24 18:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> David, Juan,
>>>>
>>>> I can trigger a hang of postcopy migration (when I do it early) so
>>>> that both sides are in paused state. Looks like one thread is still
>>>> loading vmstates for virtio and this load accesses guest memory and
>>>> triggers a userfault.
>>>
>>> It's perfectly legal for the destination to cause a userfault as it
>>> loads the virtio queue - the virtio queue should be being loaded by
>>> the main thread from the buffer while the 'listening' thread is
>>> waiting for the incoming page data.
>>>
>>> Can you turn on the following tracing please:
>>>
>>> destination:
>>>   postcopy_ram_fault_thread_request
>>>   postcopy_place_page
>>>   postcopy_place_page_zero
>>>
>>> source:
>>>   migrate_handle_rp_req_pages
>>>   ram_save_queue_pages
>>>
>>> You should see:
>>>   virtio does the access
>>>   userfault generates a fault
>>>   postcopy_ram_fault_thread_request sends the request to the source
>>>
>>> the source sees migrate_handle_rp_req_pages queues it with
>>> ram_save_queue_pages
>>>
>>> the destination sees the page arrive and postcopy_place_page or
>>> postcopy_place_page_zero
>>>
>>> some of that might disappear if the page was already on its way.
>>
>>
>> the last events on the source are
>> [..]
>> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
>> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
> 
> How do you see those on the source???

It was the previous migration (I did it in a loop, back and forth).
The problem happens on the migration back.
> 
>> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
>> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
>>
> 
> Is that a typo? I'm expecting those two 'len' fields to be the same?

Yes, it's a cut-and-paste error that missed the last byte.


> 
>> On the target a see lots of
>>
>> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
>> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
>> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
>> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
>> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
>> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
>>
>> So we have about 2 seconds of traffic going on after that request,
>> I assume it's precopy related.
>>
>> Looking back on the target history there was
>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>
>> In fact it seems to be the first and only request:
>>
>> # cat /tmp/test0.trace | grep -v postcopy_place_page
>>
>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> 
> OK, does the HVA there correspond to the address that your virtio device is blocking on?

Yes, it is the same page.


> (or at least the start of the page)
> Do you see a postcopy_place_page with a host= the same HVA ?

no

> 
> From the source backtrace, I think the source thinks it's sent everything
> and is waiting for the return-path-thread to close.
> 
> Dave
> 
>>>
>>> Dave
>>>
>>>> [destination-side backtrace and device configuration snipped; quoted in full earlier in the thread]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 18:17       ` Christian Borntraeger
@ 2017-04-24 19:12         ` Dr. David Alan Gilbert
  2017-04-25 13:27           ` Christian Borntraeger
  2017-04-26  9:59           ` Christian Borntraeger
  0 siblings, 2 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-04-24 19:12 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> >>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >>>> David, Juan,
> >>>>
> >>>> I can trigger a hang of postcopy migration (when I do it early) so
> >>>> that both sides are in paused state. Looks like one thread is still
> >>>> loading vmstates for virtio and this load accesses guest memory and
> >>>> triggers a userfault.
> >>>
> >>> It's perfectly legal for the destination to cause a userfault as it
> >>> loads the virtio queue - the virtio queue should be being loaded by
> >>> the main thread from the buffer while the 'listening' thread is
> >>> waiting for the incoming page data.
> >>>
> >>> Can you turn on the following tracing please:
> >>>
> >>> destination:
> >>>   postcopy_ram_fault_thread_request
> >>>   postcopy_place_page
> >>>   postcopy_place_page_zero
> >>>
> >>> source:
> >>>   migrate_handle_rp_req_pages
> >>>   ram_save_queue_pages
> >>>
> >>> You should see:
> >>>   virtio does the access
> >>>   userfault generates a fault
> >>>   postcopy_ram_fault_thread_request sends the request to the source
> >>>
> >>> the source sees migrate_handle_rp_req_pages queues it with
> >>> ram_save_queue_pages
> >>>
> >>> the destination sees the page arrive and postcopy_place_page or
> >>> postcopy_place_page_zero
> >>>
> >>> some of that might disappear if the page was already on its way.
> >>
> >>
> >> the last events on the source are
> >> [..]
> >> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
> >> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
> > 
> > How do you see those on the source???
> 
> It was the previous migration (I did it in a loop, back and forth).
> The problem happens on the migration back.

OK, good.

> > 
> >> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
> >> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
> >>
> > 
> > Is that a typo? I'm expecting those two 'len' fields to be the same?
> 
> Yes, it's a cut-and-paste error that missed the last byte.

Good.

Ok, before digging further, is this a new bug or does it happen on older
QEMU?  Have you got a rune to reproduce it on x86?

> 
> > 
> >> On the target a see lots of
> >>
> >> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
> >> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
> >> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
> >> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
> >> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
> >> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
> >>
> >> So we have about 2 seconds of traffic going on after that request,
> >> I assume it's precopy related.
> >>
> >> Looking back on the target history there was
> >> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> >>
> >> In fact it seems to be the first and only request:
> >>
> >> # cat /tmp/test0.trace | grep -v postcopy_place_page
> >>
> >> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> > 
> > OK, does the HVA there correspond to the address that your virtio device is blocking on?
> 
> yes it is the same page.
> 
> 
> > (or at least the start of the page)
> > Do you see a postcopy_place_page with a host= the same HVA ?
> 
> no

Hmm, that's bad.
The flow is:
    Precopy
       (a) Send pages
    Switch to postcopy
       (b) Send discards for pages that have been changed after
           (a)
    Postcopy
       (c) Keep sending pages until we run out
       (d) But send pages we're asked for first

 (d) might be ignored if the source thinks the page was already sent or not dirty.
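
That flow can be caricatured in a few lines (a deliberately simplified sketch for reasoning about the failure modes, not QEMU's actual code):

```python
def migrate_ram(pages, dirtied_during_precopy, requests):
    """Simplified source-side flow; returns the order pages are sent."""
    order = []
    sent = set()
    # (a) precopy: send every page once
    for p in pages:
        order.append(p); sent.add(p)
    # (b) switch to postcopy: discard pages dirtied after they were sent,
    #     so they must be sent again
    for p in dirtied_during_precopy:
        sent.discard(p)
    # (c)/(d) postcopy: serve urgent requests first, then the remainder
    queue = list(requests) + [p for p in pages if p not in sent]
    for p in queue:
        if p in sent:          # (d) request ignored: believed already sent
            continue
        order.append(p); sent.add(p)
    return order

# A requested page that was dirtied gets re-sent; one that wasn't is skipped.
print(migrate_ram(["A", "B", "C"], dirtied_during_precopy=["B"], requests=["B", "C"]))
# prints ['A', 'B', 'C', 'B']
```

The hang corresponds to the case where the source skips the requested page at step (d) even though the destination never received it.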

So we need to figure out:
  1) If the source sent the pages during (a)
  2) If the source discarded it during (b)
  3) If it sent it again any time in c/d
  4) If it ignored the request from d


So please turn on the traces:
    get_queued_page_not_dirty                should help with (4)
    get_queued_page                          should help with (4)
    ram_discard_range                        should help with (2)
    loadvm_postcopy_ram_handle_discard       should help with (2)
    qemu_savevm_send_postcopy_ram_discard    should help with (2)

add near the top of ram_save_page in ram.c:
  fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);

   should help with 1, 3

So let's see if your page ever gets sent in ram_save_page, and whether it gets
discarded after it arrived but before the load hits the point where it hangs.
Another (slimmer) possibility is that the count of dirty pages is wrong,
so the source thinks it has finished too soon; but we'll only look at
that if all of the above doesn't help.
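
To digest the combined output afterwards, one option is to pull out every line that mentions the suspect offset and read them in order. A small sketch (the line formats here are assumptions based on the traces quoted in this thread and on the suggested fprintf; the sample log is hypothetical):

```python
def timeline(lines, target):
    """Keep only trace/stderr lines that mention the target page offset."""
    return [line.rstrip() for line in lines if target in line]

# Hypothetical mixed log; real input would be the source's stderr plus trace file.
log = [
    "ram_save_page: s390.ram:41d9000",
    "58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000",
    "58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 1000",
    "39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000",
]
for line in timeline(log, "41d9000"):
    print(line)   # save -> request -> queue, but no discard or re-send
```

Reading the filtered lines chronologically makes it easy to spot which of steps (1)-(4) went missing for that page.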

Dave

> > 
> > From the source backtrace, I think the source thinks it's sent everything
> > and is waiting for the return-path-thread to close.
> > 
> > Dave
> > 
> >>>
> >>> Dave
> >>>
> >>>> Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)): #0  0x0000000001017130
> >>>> in lduw_he_p (ptr=0x3ff498d9002) at
> >>>> /root/qemu/include/qemu/bswap.h:317 <---- according to the host
> >>>> kernel this threads hangs in handle_userfault. #1
> >>>> 0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at
> >>>> /root/qemu/include/qemu/bswap.h:359 #2  0x0000000001025840 in
> >>>> address_space_lduw_internal_cached (cache=0x283491d0, addr=2,
> >>>> attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at
> >>>> /root/qemu/memory_ldst.inc.c:284 #3  0x00000000010259a6 in
> >>>> address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=...,
> >>>> result=0x0) at /root/qemu/memory_ldst.inc.c:315 #4
> >>>> 0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0,
> >>>> addr=2) at /root/qemu/memory_ldst.inc.c:334 #5  0x0000000001116c10
> >>>> in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0,
> >>>> pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166 #6
> >>>> 0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at
> >>>> /root/qemu/hw/virtio/virtio.c:263 #7  0x000000000111daea in
> >>>> virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at
> >>>> /root/qemu/hw/virtio/virtio.c:2168 #8  0x000000000111d0cc in
> >>>> virtio_device_get (f=0x28353600, opaque=0x28275090, size=0,
> >>>> field=0x16adf38 <__compound_literal.0>) at
> >>>> /root/qemu/hw/virtio/virtio.c:1975 #9  0x00000000012a7f50 in
> >>>> vmstate_load_state (f=0x28353600, vmsd=0x14db480
> >>>> <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at
> >>>> /root/qemu/migration/vmstate.c:128 #10 0x00000000010cbc08 in
> >>>> vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at
> >>>> /root/qemu/migration/savevm.c:732 #11 0x00000000010ced6e in
> >>>> qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8
> >>>> <mis_current>) at /root/qemu/migration/savevm.c:1890 #12
> >>>> 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600,
> >>>> mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947 
> >>>> #13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8
> >>>> <mis_current>) at /root/qemu/migration/savevm.c:1707 #14
> >>>> 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at
> >>>> /root/qemu/migration/savevm.c:1770 #15 0x00000000010cf056 in
> >>>> qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>)
> >>>> at /root/qemu/migration/savevm.c:1960 #16 0x00000000010cf2b6 in
> >>>> qemu_loadvm_state (f=0x282c7d20) at
> >>>> /root/qemu/migration/savevm.c:2020 #17 0x000000000129d234 in
> >>>> process_incoming_migration_co (opaque=0x282c7d20) at
> >>>> /root/qemu/migration/migration.c:404 #18 0x00000000014b6ca2 in
> >>>> coroutine_trampoline (i0=0, i1=674533168) at
> >>>> /root/qemu/util/coroutine-ucontext.c:79 #19 0x000003ff8de530ea in
> >>>> __makecontext_ret () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 8 (Thread 0x3ff40b38910 (LWP 21165)): #0  0x000003ff8df04336
> >>>> in syscall () from target:/lib64/libc.so.6 #1  0x00000000014933ce
> >>>> in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at
> >>>> /root/qemu/include/qemu/futex.h:26 #2  0x00000000014935d0 in
> >>>> qemu_event_wait (ev=0x17169b0 <mis_current+8>) at
> >>>> /root/qemu/util/qemu-thread-posix.c:399 #3  0x00000000010ce094 in
> >>>> postcopy_ram_listen_thread (opaque=0x282c7d20) at
> >>>> /root/qemu/migration/savevm.c:1532 #4  0x000003ff8e007bc4 in
> >>>> start_thread () from target:/lib64/libpthread.so.0 #5
> >>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 7 (Thread 0x3ff41338910 (LWP 21164)): #0  0x000003ff8defd72c
> >>>> in ?? () from target:/lib64/libc.so.6 #1  0x00000000012af920 in
> >>>> postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at
> >>>> /root/qemu/migration/postcopy-ram.c:437 #2  0x000003ff8e007bc4 in
> >>>> start_thread () from target:/lib64/libpthread.so.0 #3
> >>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 6 (Thread 0x3ff41b38910 (LWP 21163)): #0  0x000003ff8e00d450
> >>>> in pthread_cond_wait@@GLIBC_2.3.2 () from
> >>>> target:/lib64/libpthread.so.0 #1  0x0000000001492d66 in
> >>>> qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at
> >>>> /root/qemu/util/qemu-thread-posix.c:133 #2  0x00000000010c6080 in
> >>>> do_data_decompress (opaque=0x283023d8) at
> >>>> /root/qemu/migration/ram.c:2277 #3  0x000003ff8e007bc4 in
> >>>> start_thread () from target:/lib64/libpthread.so.0 #4
> >>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 5 (Thread 0x3ff42338910 (LWP 21162)): #0  0x000003ff8e00d450
> >>>> in pthread_cond_wait@@GLIBC_2.3.2 () from
> >>>> target:/lib64/libpthread.so.0 #1  0x0000000001492d66 in
> >>>> qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at
> >>>> /root/qemu/util/qemu-thread-posix.c:133 #2  0x00000000010c6080 in
> >>>> do_data_decompress (opaque=0x28302360) at
> >>>> /root/qemu/migration/ram.c:2277 #3  0x000003ff8e007bc4 in
> >>>> start_thread () from target:/lib64/libpthread.so.0 #4
> >>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 4 (Thread 0x3ff89287910 (LWP 21148)): #0  0x000003ff8e010abc
> >>>> in __lll_lock_wait () from target:/lib64/libpthread.so.0 #1
> >>>> 0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from
> >>>> target:/lib64/libpthread.so.0 #2  0x000003ff8e00d4ee in
> >>>> pthread_cond_wait@@GLIBC_2.3.2 () from
> >>>> target:/lib64/libpthread.so.0 #3  0x0000000001492d66 in
> >>>> qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688
> >>>> <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133 #4
> >>>> 0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at
> >>>> /root/qemu/cpus.c:1085 #5  0x00000000010773aa in
> >>>> qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123 
> >>>> #6  0x000003ff8e007bc4 in start_thread () from
> >>>> target:/lib64/libpthread.so.0 #7  0x000003ff8df0a9f2 in
> >>>> thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)): #0  0x000003ff8defd832
> >>>> in ppoll () from target:/lib64/libc.so.6 #1  0x000000000148b092 in
> >>>> qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at
> >>>> /root/qemu/util/qemu-timer.c:322 #2  0x000000000148f46a in aio_poll
> >>>> (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622 
> >>>> #3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at
> >>>> /root/qemu/iothread.c:59 #4  0x000003ff8e007bc4 in start_thread ()
> >>>> from target:/lib64/libpthread.so.0 #5  0x000003ff8df0a9f2 in
> >>>> thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>> Thread 2 (Thread 0x3ff8afff910 (LWP 21142)): #0  0x000003ff8df04336
> >>>> in syscall () from target:/lib64/libc.so.6 #1  0x00000000014933ce
> >>>> in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>,
> >>>> val=4294967295) at /root/qemu/include/qemu/futex.h:26 #2
> >>>> 0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4
> >>>> <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399 
> >>>> #3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at
> >>>> /root/qemu/util/rcu.c:249 #4  0x000003ff8e007bc4 in start_thread ()
> >>>> from target:/lib64/libpthread.so.0 #5  0x000003ff8df0a9f2 in
> >>>> thread_start () from target:/lib64/libc.so.6
> >>>>
> >>>>
> >>>> Looks like this
> >>>>   <qemu:arg value='-drive'/>
> >>>>   <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
> >>>>   <qemu:arg value='-device'/>
> >>>>   <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>
> >>>>
> >>>>
> >>>> is necessary to trigger the hang; with num-queues=1 everything seems to work fine, so it might
> >>>> be a virtio issue after all?
> >>>>
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 19:12         ` Dr. David Alan Gilbert
@ 2017-04-25 13:27           ` Christian Borntraeger
  2017-04-26  9:59           ` Christian Borntraeger
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-25 13:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/24/2017 09:12 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
>>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>>>> David, Juan,
>>>>>>
>>>>>> I can trigger a hang of postcopy migration (when I do it early) so
>>>>>> that both sides are in paused state. Looks like one thread is still
>>>>>> loading vmstates for virtio and this load accesses guest memory and
>>>>>> triggers a userfault.
>>>>>
>>>>> It's perfectly legal for the destination to cause a userfault as it
>>>>> loads the virtio queue - the virtio queue should be being loaded by
>>>>> the main thread from the buffer while the 'listening' thread is
>>>>> waiting for the incoming page data.
>>>>>
>>>>> Can you turn on the following tracing please:
>>>>> destination:
>>>>>   postcopy_ram_fault_thread_request
>>>>>   postcopy_place_page
>>>>>   postcopy_place_page_zero
>>>>>
>>>>> source:
>>>>>   migrate_handle_rp_req_pages
>>>>>   ram_save_queue_pages
>>>>>
>>>>> You should see:
>>>>>   virtio does the access
>>>>>   userfault generates a fault
>>>>>   postcopy_ram_fault_thread_request sends the request to the source
>>>>>
>>>>> the source sees migrate_handle_rp_req_pages queues it with
>>>>> ram_save_queue_pages
>>>>>
>>>>> the destination sees the page arrive and postcopy_place_page or
>>>>> postcopy_place_page_zero
>>>>>
>>>>> some of that might disappear if the page was already on its way.
>>>>
>>>>
>>>> the last event on the source are
>>>> [..]
>>>> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
>>>> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
>>>
>>> How do you see those on the source???
>>
>> It was the previous migrate (I did it in a loop, back and forth).
>> The problem happens on the migrate back.
> 
> OK, good.
> 
>>>
>>>> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
>>>> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
>>>>
>>>
>>> Is that a typo? I'm expecting those two 'len' fields to be the same?
>>
>> Yes, it's a cut'n'paste "miss the last byte"
> 
> Good.
> 
> Ok, before digging further, is this a new bug or does it happen on older
> QEMU?  Have you got a rune to reproduce it on x86?

No x86 yet. It also seems to happen with 2.8, but only between two specific
machines (and one of those is down for maintenance right now).
Will do the experiments as soon as it is back up.



>>
>>>
>>>> On the target a see lots of
>>>>
>>>> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
>>>> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
>>>> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
>>>> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
>>>> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
>>>> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
>>>>
>>>> So we have about 2 seconds of traffic going on after that request,
>>>> I assume its precopy related.
>>>>
>>>> Looking back on the target history there was
>>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>>
>>>> In fact it seems to be the first and only request:
>>>>
>>>> # cat /tmp/test0.trace | grep -v postcopy_place_page
>>>>
>>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>
>>> OK, does the HVA there correspond to the address that your virtio device is blocking on?
>>
>> yes it is the same page.
>>
>>
>>> (or at least the start of the page)
>>> Do you see a postcopy_place_page with a host= the same HVA ?
>>
>> no
> 
> Hmm, that's bad.
> The flow is:
>     Precopy
>        (a) Send pages
>     Switch to postcopy
>        (b) Send discards for pages that have been changed after
>            (a)
>     Postcopy
>        (c) Keep sending pages until we run out
>        (d) But send pages we're asked for first
> 
>  (d) might be ignored if the source thinks the page was already sent or not dirty.
> 
> So we need to figure out:
>   1) If the source sent the pages during (a)
>   2) If the source discarded it during (b)
>   3) If it sent it again any time in c/d
>   4) If it ignored the request from d
> 
> 
> So please turn on the traces:
>     get_queued_page_not_dirty                should help with (4)
>     get_queued_page                          should help with (4)
>     ram_discard_range                        should help with (2)
>     loadvm_postcopy_ram_handle_discard       should help with (2)
>     qemu_savevm_send_postcopy_ram_discard    should help with (2)
> 
> add near the top of ram_save_page in ram.c:
>   fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);
> 
>    should help with 1, 3
> 
> So let's see if your page ever gets sent in ram_save_page, and whether it gets
> discarded after the page arrived but before we hit the point where it hangs.
> Another (slimmer) possibility is that the dirty page count is wrong,
> so the source thinks it has finished too soon - but we'll only look at
> that if all the above doesn't help.
> 
> Dave
> 
>>>
>>> From the source backtrace, I think the source thinks it's sent everything
>>> and is waiting for the return-path-thread to close.
>>>
>>> Dave
>>>
>>>>>
>>>>> Dave
>>>>>
>>>>>> Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)): #0  0x0000000001017130
>>>>>> in lduw_he_p (ptr=0x3ff498d9002) at
>>>>>> /root/qemu/include/qemu/bswap.h:317 <---- according to the host
>>>>>> kernel this thread hangs in handle_userfault. #1
>>>>>> 0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at
>>>>>> /root/qemu/include/qemu/bswap.h:359 #2  0x0000000001025840 in
>>>>>> address_space_lduw_internal_cached (cache=0x283491d0, addr=2,
>>>>>> attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at
>>>>>> /root/qemu/memory_ldst.inc.c:284 #3  0x00000000010259a6 in
>>>>>> address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=...,
>>>>>> result=0x0) at /root/qemu/memory_ldst.inc.c:315 #4
>>>>>> 0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0,
>>>>>> addr=2) at /root/qemu/memory_ldst.inc.c:334 #5  0x0000000001116c10
>>>>>> in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0,
>>>>>> pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166 #6
>>>>>> 0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at
>>>>>> /root/qemu/hw/virtio/virtio.c:263 #7  0x000000000111daea in
>>>>>> virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at
>>>>>> /root/qemu/hw/virtio/virtio.c:2168 #8  0x000000000111d0cc in
>>>>>> virtio_device_get (f=0x28353600, opaque=0x28275090, size=0,
>>>>>> field=0x16adf38 <__compound_literal.0>) at
>>>>>> /root/qemu/hw/virtio/virtio.c:1975 #9  0x00000000012a7f50 in
>>>>>> vmstate_load_state (f=0x28353600, vmsd=0x14db480
>>>>>> <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at
>>>>>> /root/qemu/migration/vmstate.c:128 #10 0x00000000010cbc08 in
>>>>>> vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at
>>>>>> /root/qemu/migration/savevm.c:732 #11 0x00000000010ced6e in
>>>>>> qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8
>>>>>> <mis_current>) at /root/qemu/migration/savevm.c:1890 #12
>>>>>> 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600,
>>>>>> mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947 
>>>>>> #13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8
>>>>>> <mis_current>) at /root/qemu/migration/savevm.c:1707 #14
>>>>>> 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at
>>>>>> /root/qemu/migration/savevm.c:1770 #15 0x00000000010cf056 in
>>>>>> qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>)
>>>>>> at /root/qemu/migration/savevm.c:1960 #16 0x00000000010cf2b6 in
>>>>>> qemu_loadvm_state (f=0x282c7d20) at
>>>>>> /root/qemu/migration/savevm.c:2020 #17 0x000000000129d234 in
>>>>>> process_incoming_migration_co (opaque=0x282c7d20) at
>>>>>> /root/qemu/migration/migration.c:404 #18 0x00000000014b6ca2 in
>>>>>> coroutine_trampoline (i0=0, i1=674533168) at
>>>>>> /root/qemu/util/coroutine-ucontext.c:79 #19 0x000003ff8de530ea in
>>>>>> __makecontext_ret () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 8 (Thread 0x3ff40b38910 (LWP 21165)): #0  0x000003ff8df04336
>>>>>> in syscall () from target:/lib64/libc.so.6 #1  0x00000000014933ce
>>>>>> in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at
>>>>>> /root/qemu/include/qemu/futex.h:26 #2  0x00000000014935d0 in
>>>>>> qemu_event_wait (ev=0x17169b0 <mis_current+8>) at
>>>>>> /root/qemu/util/qemu-thread-posix.c:399 #3  0x00000000010ce094 in
>>>>>> postcopy_ram_listen_thread (opaque=0x282c7d20) at
>>>>>> /root/qemu/migration/savevm.c:1532 #4  0x000003ff8e007bc4 in
>>>>>> start_thread () from target:/lib64/libpthread.so.0 #5
>>>>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 7 (Thread 0x3ff41338910 (LWP 21164)): #0  0x000003ff8defd72c
>>>>>> in ?? () from target:/lib64/libc.so.6 #1  0x00000000012af920 in
>>>>>> postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at
>>>>>> /root/qemu/migration/postcopy-ram.c:437 #2  0x000003ff8e007bc4 in
>>>>>> start_thread () from target:/lib64/libpthread.so.0 #3
>>>>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 6 (Thread 0x3ff41b38910 (LWP 21163)): #0  0x000003ff8e00d450
>>>>>> in pthread_cond_wait@@GLIBC_2.3.2 () from
>>>>>> target:/lib64/libpthread.so.0 #1  0x0000000001492d66 in
>>>>>> qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at
>>>>>> /root/qemu/util/qemu-thread-posix.c:133 #2  0x00000000010c6080 in
>>>>>> do_data_decompress (opaque=0x283023d8) at
>>>>>> /root/qemu/migration/ram.c:2277 #3  0x000003ff8e007bc4 in
>>>>>> start_thread () from target:/lib64/libpthread.so.0 #4
>>>>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 5 (Thread 0x3ff42338910 (LWP 21162)): #0  0x000003ff8e00d450
>>>>>> in pthread_cond_wait@@GLIBC_2.3.2 () from
>>>>>> target:/lib64/libpthread.so.0 #1  0x0000000001492d66 in
>>>>>> qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at
>>>>>> /root/qemu/util/qemu-thread-posix.c:133 #2  0x00000000010c6080 in
>>>>>> do_data_decompress (opaque=0x28302360) at
>>>>>> /root/qemu/migration/ram.c:2277 #3  0x000003ff8e007bc4 in
>>>>>> start_thread () from target:/lib64/libpthread.so.0 #4
>>>>>> 0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 4 (Thread 0x3ff89287910 (LWP 21148)): #0  0x000003ff8e010abc
>>>>>> in __lll_lock_wait () from target:/lib64/libpthread.so.0 #1
>>>>>> 0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from
>>>>>> target:/lib64/libpthread.so.0 #2  0x000003ff8e00d4ee in
>>>>>> pthread_cond_wait@@GLIBC_2.3.2 () from
>>>>>> target:/lib64/libpthread.so.0 #3  0x0000000001492d66 in
>>>>>> qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688
>>>>>> <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133 #4
>>>>>> 0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at
>>>>>> /root/qemu/cpus.c:1085 #5  0x00000000010773aa in
>>>>>> qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123 
>>>>>> #6  0x000003ff8e007bc4 in start_thread () from
>>>>>> target:/lib64/libpthread.so.0 #7  0x000003ff8df0a9f2 in
>>>>>> thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)): #0  0x000003ff8defd832
>>>>>> in ppoll () from target:/lib64/libc.so.6 #1  0x000000000148b092 in
>>>>>> qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at
>>>>>> /root/qemu/util/qemu-timer.c:322 #2  0x000000000148f46a in aio_poll
>>>>>> (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622 
>>>>>> #3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at
>>>>>> /root/qemu/iothread.c:59 #4  0x000003ff8e007bc4 in start_thread ()
>>>>>> from target:/lib64/libpthread.so.0 #5  0x000003ff8df0a9f2 in
>>>>>> thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 2 (Thread 0x3ff8afff910 (LWP 21142)): #0  0x000003ff8df04336
>>>>>> in syscall () from target:/lib64/libc.so.6 #1  0x00000000014933ce
>>>>>> in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>,
>>>>>> val=4294967295) at /root/qemu/include/qemu/futex.h:26 #2
>>>>>> 0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4
>>>>>> <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399 
>>>>>> #3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at
>>>>>> /root/qemu/util/rcu.c:249 #4  0x000003ff8e007bc4 in start_thread ()
>>>>>> from target:/lib64/libpthread.so.0 #5  0x000003ff8df0a9f2 in
>>>>>> thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>>
>>>>>> Looks like this
>>>>>>   <qemu:arg value='-drive'/>
>>>>>>   <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
>>>>>>   <qemu:arg value='-device'/>
>>>>>>   <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>
>>>>>>
>>>>>>
>>>>>> is necessary to trigger the hang; with num-queues=1 everything seems to work fine, so it might
>>>>>> be a virtio issue after all?
>>>>>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-24 19:12         ` Dr. David Alan Gilbert
  2017-04-25 13:27           ` Christian Borntraeger
@ 2017-04-26  9:59           ` Christian Borntraeger
  2017-04-26 10:13             ` Christian Borntraeger
  2017-04-26 11:01             ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-26  9:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/24/2017 09:12 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
>>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>>>> David, Juan,
>>>>>>
>>>>>> I can trigger a hang of postcopy migration (when I do it early) so
>>>>>> that both sides are in paused state. Looks like one thread is still
>>>>>> loading vmstates for virtio and this load accesses guest memory and
>>>>>> triggers a userfault.
>>>>>
>>>>> It's perfectly legal for the destination to cause a userfault as it
>>>>> loads the virtio queue - the virtio queue should be being loaded by
>>>>> the main thread from the buffer while the 'listening' thread is
>>>>> waiting for the incoming page data.
>>>>>
>>>>> Can you turn on the following tracing please:
>>>>> destination:
>>>>>   postcopy_ram_fault_thread_request
>>>>>   postcopy_place_page
>>>>>   postcopy_place_page_zero
>>>>>
>>>>> source:
>>>>>   migrate_handle_rp_req_pages
>>>>>   ram_save_queue_pages
>>>>>
>>>>> You should see:
>>>>>   virtio does the access
>>>>>   userfault generates a fault
>>>>>   postcopy_ram_fault_thread_request sends the request to the source
>>>>>
>>>>> the source sees migrate_handle_rp_req_pages queues it with
>>>>> ram_save_queue_pages
>>>>>
>>>>> the destination sees the page arrive and postcopy_place_page or
>>>>> postcopy_place_page_zero
>>>>>
>>>>> some of that might disappear if the page was already on its way.
>>>>
>>>>
>>>> the last event on the source are
>>>> [..]
>>>> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
>>>> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
>>>
>>> How do you see those on the source???
>>
>> It was the previous migrate (I did it in a loop, back and forth).
>> The problem happens on the migrate back.
> 
> OK, good.
> 
>>>
>>>> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
>>>> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
>>>>
>>>
>>> Is that a typo? I'm expecting those two 'len' fields to be the same?
>>
>> Yes, it's a cut'n'paste "miss the last byte"
> 
> Good.
> 
> Ok, before digging further, is this a new bug or does it happen on older
> QEMU?  Have you got a rune to reproduce it on x86?
> 
>>
>>>
>>>> On the target a see lots of
>>>>
>>>> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
>>>> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
>>>> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
>>>> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
>>>> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
>>>> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
>>>>
>>>> So we have about 2 seconds of traffic going on after that request,
>>>> I assume its precopy related.
>>>>
>>>> Looking back on the target history there was
>>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>>
>>>> In fact it seems to be the first and only request:
>>>>
>>>> # cat /tmp/test0.trace | grep -v postcopy_place_page
>>>>
>>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>
>>> OK, does the HVA there correspond to the address that your virtio device is blocking on?
>>
>> yes it is the same page.
>>
>>
>>> (or at least the start of the page)
>>> Do you see a postcopy_place_page with a host= the same HVA ?
>>
>> no
> 
> Hmm, that's bad.
> The flow is:
>     Precopy
>        (a) Send pages
>     Switch to postcopy
>        (b) Send discards for pages that have been changed after
>            (a)
>     Postcopy
>        (c) Keep sending pages until we run out
>        (d) But send pages we're asked for first
> 
>  (d) might be ignored if the source thinks the page was already sent or not dirty.
> 
> So we need to figure out:
>   1) If the source sent the pages during (a)
>   2) If the source discarded it during (b)
>   3) If it sent it again any time in c/d
>   4) If it ignored the request from d
> 
> 
> So please turn on the traces:
>     get_queued_page_not_dirty                should help with (4)
>     get_queued_page                          should help with (4)
>     ram_discard_range                        should help with (2)
>     loadvm_postcopy_ram_handle_discard       should help with (2)
>     qemu_savevm_send_postcopy_ram_discard    should help with (2)
> 
> add near the top of ram_save_page in ram.c:
>   fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);
> 
>    should help with 1, 3
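
(Aside: the excerpts below are filtered from much larger trace files. The
filtering is roughly the following kind of helper - the `pid@seconds.micros:event args`
line format is the one in the logs, but the helper itself, names included,
is only an illustrative sketch, not anything in the QEMU tree:)

```python
# Sketch: filter QEMU trace output down to the events that mention one
# guest-RAM offset.  The sample lines and the 3c6b000 offset are from this
# run; the helper is hypothetical.
import re

# pid@seconds.micros:event args  (format as seen in the quoted logs)
TRACE_RE = re.compile(r'\d+@(\d+\.\d+):(\S+)\s*(.*)')

def events_touching(lines, offset):
    """Return (timestamp, event) pairs for lines whose args mention offset."""
    out = []
    for line in lines:
        m = TRACE_RE.match(line)
        if m and offset in m.group(3):
            out.append((float(m.group(1)), m.group(2)))
    return out

sample = [
    "28863@1493200554.818195:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000",
    "28863@1493200554.818208:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000",
    "28863@1493200554.818469:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000 (sent=0)",
    "28863@1493200554.819000:postcopy_place_page host=0x3ff92246000",
]

for ts, ev in events_touching(sample, "3c6b000"):
    print(ev)
```

Running it over a real capture would just be `events_touching(open("source.trace"), "3c6b000")`,
with the file name being whatever your trace run produced.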


source:
28863@1493200554.418978:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419047:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419056:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419064:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419071:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419077:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419083:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419088:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419094:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419100:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419106:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419112:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419117:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419123:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419130:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419138:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419145:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419151:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419156:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419163:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419169:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419174:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419179:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419184:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419189:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
28863@1493200554.419208:qemu_savevm_send_postcopy_ram_discard s390.ram: 9d
28863@1493200554.818195:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000
28863@1493200554.818208:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000
28863@1493200554.818469:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000 (sent=0)

I also get 
ram_save_page: s390.ram:3c6b000
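
(To check Dave's question (2) mechanically: the target's ram_discard_range
lines give `start: <hex-start> <hex-len>` pairs, so one can test whether the
stuck offset ever fell inside a discarded range. A small sketch - the two
sample ranges are from the log below, the helper names are hypothetical:)

```python
# Sketch: parse ram_discard_range trace lines into (start, end) byte ranges
# and check whether a given guest-RAM offset was discarded.  Illustrative
# only; the 3c6b000 offset is the one this migration is stuck on.
import re

def parse_discards(lines):
    """Extract (start, end) ranges from ram_discard_range trace lines (hex)."""
    ranges = []
    for line in lines:
        m = re.search(r'ram_discard_range \S+ start: ([0-9a-f]+) ([0-9a-f]+)', line)
        if m:
            start, length = int(m.group(1), 16), int(m.group(2), 16)
            ranges.append((start, start + length))
    return ranges

def was_discarded(ranges, offset):
    """True if offset lies inside any half-open [start, end) discard range."""
    return any(lo <= offset < hi for lo, hi in ranges)

sample = [
    "10929@1493200554.428226:loadvm_postcopy_ram_handle_discard ",
    "10929@1493200554.428235:ram_discard_range s390.ram: start: 1000 a000",
    "10929@1493200554.437558:ram_discard_range s390.ram: start: 3475000 8b000",
]

# prints False: 0x3c6b000 is outside both sample ranges
print(was_discarded(parse_discards(sample), 0x3c6b000))
```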


target (postcopy_place lines filtered out)
10929@1493200553.657966:ram_discard_range s390.ram: start: 0 5dc00000
10929@1493200554.428226:loadvm_postcopy_ram_handle_discard 
10929@1493200554.428235:ram_discard_range s390.ram: start: 1000 a000
10929@1493200554.428247:ram_discard_range s390.ram: start: c000 4000
10929@1493200554.428252:ram_discard_range s390.ram: start: 12000 ee000
10929@1493200554.428441:ram_discard_range s390.ram: start: b72000 2000
10929@1493200554.428444:ram_discard_range s390.ram: start: b75000 2000
10929@1493200554.428447:ram_discard_range s390.ram: start: b79000 1000
10929@1493200554.428450:ram_discard_range s390.ram: start: b7e000 1000
10929@1493200554.428452:ram_discard_range s390.ram: start: b85000 1000
10929@1493200554.428455:ram_discard_range s390.ram: start: ba0000 1000
10929@1493200554.428457:ram_discard_range s390.ram: start: bad000 1000
10929@1493200554.428459:ram_discard_range s390.ram: start: bb7000 1000
10929@1493200554.428462:ram_discard_range s390.ram: start: bbe000 1000
10929@1493200554.428464:loadvm_postcopy_ram_handle_discard 
10929@1493200554.428465:ram_discard_range s390.ram: start: bc8000 1000
10929@1493200554.428468:ram_discard_range s390.ram: start: c16000 2000
10929@1493200554.428471:ram_discard_range s390.ram: start: ca3000 1000
10929@1493200554.428474:ram_discard_range s390.ram: start: cc3000 1000
10929@1493200554.428476:ram_discard_range s390.ram: start: cc9000 1000
10929@1493200554.428478:ram_discard_range s390.ram: start: ce9000 1000
10929@1493200554.428480:ram_discard_range s390.ram: start: cff000 1000
10929@1493200554.428483:ram_discard_range s390.ram: start: d05000 1000
10929@1493200554.428485:ram_discard_range s390.ram: start: d09000 6000
10929@1493200554.428491:ram_discard_range s390.ram: start: d11000 1000
10929@1493200554.428493:ram_discard_range s390.ram: start: d16000 1000
10929@1493200554.428496:ram_discard_range s390.ram: start: d1a000 1e000
10929@1493200554.428521:loadvm_postcopy_ram_handle_discard 
10929@1493200554.428522:ram_discard_range s390.ram: start: d39000 42000
10929@1493200554.428582:ram_discard_range s390.ram: start: d80000 3000
10929@1493200554.428587:ram_discard_range s390.ram: start: d84000 3000
10929@1493200554.428590:ram_discard_range s390.ram: start: d88000 2000
10929@1493200554.428593:ram_discard_range s390.ram: start: d8f000 1000
10929@1493200554.428596:ram_discard_range s390.ram: start: d91000 1000
10929@1493200554.428598:ram_discard_range s390.ram: start: d94000 1000
10929@1493200554.428600:ram_discard_range s390.ram: start: dbb000 1000
10929@1493200554.428603:ram_discard_range s390.ram: start: dbf000 7000
10929@1493200554.428610:ram_discard_range s390.ram: start: dc8000 1000
10929@1493200554.428612:ram_discard_range s390.ram: start: dca000 1000
10929@1493200554.428614:ram_discard_range s390.ram: start: dcd000 1000
10929@1493200554.428616:loadvm_postcopy_ram_handle_discard 
10929@1493200554.428617:ram_discard_range s390.ram: start: dd0000 5000
10929@1493200554.428623:ram_discard_range s390.ram: start: dd6000 7000
10929@1493200554.428630:ram_discard_range s390.ram: start: de0000 3000
10929@1493200554.428634:ram_discard_range s390.ram: start: de8000 3000
10929@1493200554.428637:ram_discard_range s390.ram: start: dec000 1000
10929@1493200554.428640:ram_discard_range s390.ram: start: dee000 12000
10929@1493200554.428655:ram_discard_range s390.ram: start: 1100000 2300000
10929@1493200554.437553:ram_discard_range s390.ram: start: 3423000 3000
10929@1493200554.437558:ram_discard_range s390.ram: start: 3475000 8b000
10929@1493200554.437697:ram_discard_range s390.ram: start: 35e3000 3000
10929@1493200554.437701:ram_discard_range s390.ram: start: 3600000 1000
10929@1493200554.437704:ram_discard_range s390.ram: start: 3606000 1000
10929@1493200554.437706:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437707:ram_discard_range s390.ram: start: 360d000 1000
10929@1493200554.437709:ram_discard_range s390.ram: start: 3610000 1000
10929@1493200554.437711:ram_discard_range s390.ram: start: 3616000 1000
10929@1493200554.437713:ram_discard_range s390.ram: start: 361d000 1000
10929@1493200554.437716:ram_discard_range s390.ram: start: 3620000 1000
10929@1493200554.437726:ram_discard_range s390.ram: start: 3626000 1000
10929@1493200554.437729:ram_discard_range s390.ram: start: 362d000 1000
10929@1493200554.437731:ram_discard_range s390.ram: start: 3630000 1000
10929@1493200554.437734:ram_discard_range s390.ram: start: 3636000 1000
10929@1493200554.437736:ram_discard_range s390.ram: start: 363d000 1000
10929@1493200554.437738:ram_discard_range s390.ram: start: 3640000 1000
10929@1493200554.437740:ram_discard_range s390.ram: start: 3646000 1000
10929@1493200554.437742:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437743:ram_discard_range s390.ram: start: 364d000 1000
10929@1493200554.437746:ram_discard_range s390.ram: start: 3650000 1000
10929@1493200554.437748:ram_discard_range s390.ram: start: 3656000 1000
10929@1493200554.437750:ram_discard_range s390.ram: start: 365d000 1000
10929@1493200554.437752:ram_discard_range s390.ram: start: 3660000 1000
10929@1493200554.437754:ram_discard_range s390.ram: start: 3666000 1000
10929@1493200554.437757:ram_discard_range s390.ram: start: 366d000 1000
10929@1493200554.437759:ram_discard_range s390.ram: start: 3670000 1000
10929@1493200554.437761:ram_discard_range s390.ram: start: 3676000 1000
10929@1493200554.437763:ram_discard_range s390.ram: start: 367d000 4000
10929@1493200554.437768:ram_discard_range s390.ram: start: 3686000 1000
10929@1493200554.437770:ram_discard_range s390.ram: start: 368d000 1000
10929@1493200554.437772:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437773:ram_discard_range s390.ram: start: 3690000 1000
10929@1493200554.437775:ram_discard_range s390.ram: start: 3696000 1000
10929@1493200554.437778:ram_discard_range s390.ram: start: 369d000 1000
10929@1493200554.437780:ram_discard_range s390.ram: start: 36a0000 1000
10929@1493200554.437782:ram_discard_range s390.ram: start: 36a6000 1000
10929@1493200554.437784:ram_discard_range s390.ram: start: 36ad000 1000
10929@1493200554.437786:ram_discard_range s390.ram: start: 36b0000 1000
10929@1493200554.437789:ram_discard_range s390.ram: start: 36b6000 1000
10929@1493200554.437791:ram_discard_range s390.ram: start: 36bd000 1000
10929@1493200554.437793:ram_discard_range s390.ram: start: 36c0000 1000
10929@1493200554.437795:ram_discard_range s390.ram: start: 36c6000 1000
10929@1493200554.437798:ram_discard_range s390.ram: start: 36cd000 1000
10929@1493200554.437800:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437801:ram_discard_range s390.ram: start: 36d0000 1000
10929@1493200554.437803:ram_discard_range s390.ram: start: 36d6000 1000
10929@1493200554.437805:ram_discard_range s390.ram: start: 36dd000 1000
10929@1493200554.437807:ram_discard_range s390.ram: start: 36e0000 1000
10929@1493200554.437810:ram_discard_range s390.ram: start: 36e6000 1000
10929@1493200554.437812:ram_discard_range s390.ram: start: 36ed000 1000
10929@1493200554.437814:ram_discard_range s390.ram: start: 36f0000 1000
10929@1493200554.437816:ram_discard_range s390.ram: start: 36f6000 1000
10929@1493200554.437819:ram_discard_range s390.ram: start: 36fd000 1000
10929@1493200554.437821:ram_discard_range s390.ram: start: 37e0000 1000
10929@1493200554.437824:ram_discard_range s390.ram: start: 37e6000 1000
10929@1493200554.437826:ram_discard_range s390.ram: start: 37ed000 1000
10929@1493200554.437828:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437829:ram_discard_range s390.ram: start: 37f0000 1000
10929@1493200554.437831:ram_discard_range s390.ram: start: 37f6000 1000
10929@1493200554.437833:ram_discard_range s390.ram: start: 37fd000 1000
10929@1493200554.437836:ram_discard_range s390.ram: start: 3800000 1000
10929@1493200554.437838:ram_discard_range s390.ram: start: 3806000 1000
10929@1493200554.437840:ram_discard_range s390.ram: start: 380d000 1000
10929@1493200554.437842:ram_discard_range s390.ram: start: 3810000 1000
10929@1493200554.437845:ram_discard_range s390.ram: start: 3816000 1000
10929@1493200554.437847:ram_discard_range s390.ram: start: 381d000 3000
10929@1493200554.437850:ram_discard_range s390.ram: start: 3850000 1000
10929@1493200554.437854:ram_discard_range s390.ram: start: 3856000 1000
10929@1493200554.437858:ram_discard_range s390.ram: start: 385d000 1000
10929@1493200554.437860:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437861:ram_discard_range s390.ram: start: 3860000 1000
10929@1493200554.437863:ram_discard_range s390.ram: start: 3866000 1000
10929@1493200554.437865:ram_discard_range s390.ram: start: 386d000 1000
10929@1493200554.437868:ram_discard_range s390.ram: start: 3870000 1000
10929@1493200554.437870:ram_discard_range s390.ram: start: 3876000 1000
10929@1493200554.437872:ram_discard_range s390.ram: start: 387d000 1000
10929@1493200554.437874:ram_discard_range s390.ram: start: 3880000 1000
10929@1493200554.437877:ram_discard_range s390.ram: start: 3886000 1000
10929@1493200554.437879:ram_discard_range s390.ram: start: 388d000 1000
10929@1493200554.437881:ram_discard_range s390.ram: start: 3890000 1000
10929@1493200554.437883:ram_discard_range s390.ram: start: 3896000 1000
10929@1493200554.437885:ram_discard_range s390.ram: start: 389d000 1000
10929@1493200554.437887:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437888:ram_discard_range s390.ram: start: 38a0000 1000
10929@1493200554.437891:ram_discard_range s390.ram: start: 38a6000 1000
10929@1493200554.437893:ram_discard_range s390.ram: start: 38ad000 1000
10929@1493200554.437895:ram_discard_range s390.ram: start: 38b0000 1000
10929@1493200554.437897:ram_discard_range s390.ram: start: 38b6000 1000
10929@1493200554.437899:ram_discard_range s390.ram: start: 38bd000 1000
10929@1493200554.437902:ram_discard_range s390.ram: start: 38c0000 1000
10929@1493200554.437904:ram_discard_range s390.ram: start: 38c6000 1000
10929@1493200554.437906:ram_discard_range s390.ram: start: 38cd000 1000
10929@1493200554.437908:ram_discard_range s390.ram: start: 38d0000 1000
10929@1493200554.437911:ram_discard_range s390.ram: start: 38d6000 1000
10929@1493200554.437913:ram_discard_range s390.ram: start: 38dd000 1000
10929@1493200554.437915:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437916:ram_discard_range s390.ram: start: 38e0000 1000
10929@1493200554.437919:ram_discard_range s390.ram: start: 38e6000 1000
10929@1493200554.437921:ram_discard_range s390.ram: start: 38ed000 3000
10929@1493200554.437925:ram_discard_range s390.ram: start: 3900000 1000
10929@1493200554.437927:ram_discard_range s390.ram: start: 3906000 1000
10929@1493200554.437930:ram_discard_range s390.ram: start: 390d000 1000
10929@1493200554.437932:ram_discard_range s390.ram: start: 3920000 1000
10929@1493200554.437934:ram_discard_range s390.ram: start: 3922000 1000
10929@1493200554.437936:ram_discard_range s390.ram: start: 3924000 1000
10929@1493200554.437938:ram_discard_range s390.ram: start: 3926000 1000
10929@1493200554.437941:ram_discard_range s390.ram: start: 392b000 1000
10929@1493200554.437943:ram_discard_range s390.ram: start: 3930000 1000
10929@1493200554.437945:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437946:ram_discard_range s390.ram: start: 3936000 1000
10929@1493200554.437948:ram_discard_range s390.ram: start: 393d000 1000
10929@1493200554.437950:ram_discard_range s390.ram: start: 3940000 1000
10929@1493200554.437953:ram_discard_range s390.ram: start: 3946000 1000
10929@1493200554.437955:ram_discard_range s390.ram: start: 394d000 1000
10929@1493200554.437957:ram_discard_range s390.ram: start: 3950000 1000
10929@1493200554.437960:ram_discard_range s390.ram: start: 3956000 1000
10929@1493200554.437962:ram_discard_range s390.ram: start: 395d000 1000
10929@1493200554.437964:ram_discard_range s390.ram: start: 3960000 1000
10929@1493200554.437966:ram_discard_range s390.ram: start: 3966000 1000
10929@1493200554.437969:ram_discard_range s390.ram: start: 396d000 1000
10929@1493200554.437971:ram_discard_range s390.ram: start: 3970000 1000
10929@1493200554.437973:loadvm_postcopy_ram_handle_discard 
10929@1493200554.437974:ram_discard_range s390.ram: start: 3976000 1000
10929@1493200554.437976:ram_discard_range s390.ram: start: 397d000 1000
10929@1493200554.437978:ram_discard_range s390.ram: start: 3980000 1000
10929@1493200554.437981:ram_discard_range s390.ram: start: 3986000 1000
10929@1493200554.437984:ram_discard_range s390.ram: start: 398d000 1000
10929@1493200554.437986:ram_discard_range s390.ram: start: 3990000 1000
10929@1493200554.437988:ram_discard_range s390.ram: start: 3996000 1000
10929@1493200554.437990:ram_discard_range s390.ram: start: 399d000 3000
10929@1493200554.437994:ram_discard_range s390.ram: start: 39ad000 8000
10929@1493200554.438002:ram_discard_range s390.ram: start: 39c8000 1000
10929@1493200554.438004:ram_discard_range s390.ram: start: 39ca000 1000
10929@1493200554.438006:ram_discard_range s390.ram: start: 39cc000 1000
10929@1493200554.438008:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438009:ram_discard_range s390.ram: start: 39ce000 1000
10929@1493200554.438011:ram_discard_range s390.ram: start: 39e0000 1000
10929@1493200554.438014:ram_discard_range s390.ram: start: 39e6000 1000
10929@1493200554.438016:ram_discard_range s390.ram: start: 39ed000 1000
10929@1493200554.438018:ram_discard_range s390.ram: start: 39f0000 1000
10929@1493200554.438021:ram_discard_range s390.ram: start: 39f6000 1000
10929@1493200554.438023:ram_discard_range s390.ram: start: 39fd000 1000
10929@1493200554.438025:ram_discard_range s390.ram: start: 3a00000 1000
10929@1493200554.438027:ram_discard_range s390.ram: start: 3a06000 1000
10929@1493200554.438029:ram_discard_range s390.ram: start: 3a0d000 1000
10929@1493200554.438031:ram_discard_range s390.ram: start: 3a10000 1000
10929@1493200554.438034:ram_discard_range s390.ram: start: 3a16000 1000
10929@1493200554.438036:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438037:ram_discard_range s390.ram: start: 3a1d000 1000
10929@1493200554.438039:ram_discard_range s390.ram: start: 3a20000 1000
10929@1493200554.438041:ram_discard_range s390.ram: start: 3a26000 1000
10929@1493200554.438043:ram_discard_range s390.ram: start: 3a2d000 1000
10929@1493200554.438045:ram_discard_range s390.ram: start: 3a30000 1000
10929@1493200554.438048:ram_discard_range s390.ram: start: 3a36000 1000
10929@1493200554.438050:ram_discard_range s390.ram: start: 3a3d000 1000
10929@1493200554.438052:ram_discard_range s390.ram: start: 3a40000 1000
10929@1493200554.438054:ram_discard_range s390.ram: start: 3a46000 1000
10929@1493200554.438056:ram_discard_range s390.ram: start: 3a4d000 1000
10929@1493200554.438059:ram_discard_range s390.ram: start: 3a50000 1000
10929@1493200554.438061:ram_discard_range s390.ram: start: 3a56000 1000
10929@1493200554.438063:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438067:ram_discard_range s390.ram: start: 3a5d000 1000
10929@1493200554.438069:ram_discard_range s390.ram: start: 3a60000 1000
10929@1493200554.438071:ram_discard_range s390.ram: start: 3a66000 1000
10929@1493200554.438073:ram_discard_range s390.ram: start: 3a6d000 1000
10929@1493200554.438075:ram_discard_range s390.ram: start: 3a70000 1000
10929@1493200554.438078:ram_discard_range s390.ram: start: 3a76000 1000
10929@1493200554.438080:ram_discard_range s390.ram: start: 3a7d000 4000
10929@1493200554.438084:ram_discard_range s390.ram: start: 3a86000 1000
10929@1493200554.438087:ram_discard_range s390.ram: start: 3a8d000 1000
10929@1493200554.438089:ram_discard_range s390.ram: start: 3a90000 1000
10929@1493200554.438091:ram_discard_range s390.ram: start: 3a96000 1000
10929@1493200554.438093:ram_discard_range s390.ram: start: 3a9d000 1000
10929@1493200554.438095:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438096:ram_discard_range s390.ram: start: 3aa0000 1000
10929@1493200554.438098:ram_discard_range s390.ram: start: 3aa6000 1000
10929@1493200554.438100:ram_discard_range s390.ram: start: 3aad000 1000
10929@1493200554.438103:ram_discard_range s390.ram: start: 3ab0000 1000
10929@1493200554.438105:ram_discard_range s390.ram: start: 3ab6000 1000
10929@1493200554.438107:ram_discard_range s390.ram: start: 3abd000 1000
10929@1493200554.438109:ram_discard_range s390.ram: start: 3ac0000 1000
10929@1493200554.438111:ram_discard_range s390.ram: start: 3ac6000 1000
10929@1493200554.438114:ram_discard_range s390.ram: start: 3acd000 1000
10929@1493200554.438116:ram_discard_range s390.ram: start: 3ad0000 1000
10929@1493200554.438119:ram_discard_range s390.ram: start: 3ad6000 1000
10929@1493200554.438121:ram_discard_range s390.ram: start: 3add000 1000
10929@1493200554.438123:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438124:ram_discard_range s390.ram: start: 3ae0000 1000
10929@1493200554.438126:ram_discard_range s390.ram: start: 3ae6000 1000
10929@1493200554.438128:ram_discard_range s390.ram: start: 3aed000 1000
10929@1493200554.438131:ram_discard_range s390.ram: start: 3af0000 1000
10929@1493200554.438133:ram_discard_range s390.ram: start: 3af6000 1000
10929@1493200554.438135:ram_discard_range s390.ram: start: 3afd000 1000
10929@1493200554.438137:ram_discard_range s390.ram: start: 3b00000 1000
10929@1493200554.438140:ram_discard_range s390.ram: start: 3b06000 1000
10929@1493200554.438142:ram_discard_range s390.ram: start: 3b0d000 1000
10929@1493200554.438144:ram_discard_range s390.ram: start: 3b10000 1000
10929@1493200554.438146:ram_discard_range s390.ram: start: 3b16000 1000
10929@1493200554.438149:ram_discard_range s390.ram: start: 3b1d000 4000
10929@1493200554.438154:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438155:ram_discard_range s390.ram: start: 3b26000 1000
10929@1493200554.438157:ram_discard_range s390.ram: start: 3b2d000 1000
10929@1493200554.438160:ram_discard_range s390.ram: start: 3b30000 1000
10929@1493200554.438162:ram_discard_range s390.ram: start: 3b36000 1000
10929@1493200554.438164:ram_discard_range s390.ram: start: 3b3d000 1000
10929@1493200554.438166:ram_discard_range s390.ram: start: 3b40000 1000
10929@1493200554.438168:ram_discard_range s390.ram: start: 3b46000 1000
10929@1493200554.438170:ram_discard_range s390.ram: start: 3b4d000 1000
10929@1493200554.438173:ram_discard_range s390.ram: start: 3b50000 1000
10929@1493200554.438175:ram_discard_range s390.ram: start: 3b56000 1000
10929@1493200554.438177:ram_discard_range s390.ram: start: 3b5d000 1000
10929@1493200554.438179:ram_discard_range s390.ram: start: 3b60000 1000
10929@1493200554.438181:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438182:ram_discard_range s390.ram: start: 3b66000 1000
10929@1493200554.438184:ram_discard_range s390.ram: start: 3b6d000 1000
10929@1493200554.438187:ram_discard_range s390.ram: start: 3b70000 1000
10929@1493200554.438189:ram_discard_range s390.ram: start: 3b76000 1000
10929@1493200554.438191:ram_discard_range s390.ram: start: 3b7d000 1000
10929@1493200554.438193:ram_discard_range s390.ram: start: 3b80000 1000
10929@1493200554.438196:ram_discard_range s390.ram: start: 3b86000 1000
10929@1493200554.438198:ram_discard_range s390.ram: start: 3b8d000 1000
10929@1493200554.438200:ram_discard_range s390.ram: start: 3b90000 1000
10929@1493200554.438202:ram_discard_range s390.ram: start: 3b96000 1000
10929@1493200554.438204:ram_discard_range s390.ram: start: 3b9d000 1000
10929@1493200554.438206:ram_discard_range s390.ram: start: 3ba0000 1000
10929@1493200554.438209:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438210:ram_discard_range s390.ram: start: 3ba6000 1000
10929@1493200554.438212:ram_discard_range s390.ram: start: 3bad000 3000
10929@1493200554.438215:ram_discard_range s390.ram: start: 3bd0000 1000
10929@1493200554.438218:ram_discard_range s390.ram: start: 3bd6000 1000
10929@1493200554.438220:ram_discard_range s390.ram: start: 3bdd000 1000
10929@1493200554.438222:ram_discard_range s390.ram: start: 3be0000 1000
10929@1493200554.438224:ram_discard_range s390.ram: start: 3be6000 1000
10929@1493200554.438226:ram_discard_range s390.ram: start: 3bed000 1000
10929@1493200554.438229:ram_discard_range s390.ram: start: 3bf0000 1000
10929@1493200554.438231:ram_discard_range s390.ram: start: 3bf6000 1000
10929@1493200554.438233:ram_discard_range s390.ram: start: 3bfd000 1000
10929@1493200554.438235:ram_discard_range s390.ram: start: 3c06000 4000
10929@1493200554.438240:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438241:ram_discard_range s390.ram: start: 3c79000 3000
10929@1493200554.438246:ram_discard_range s390.ram: start: 3c80000 1000
10929@1493200554.438248:ram_discard_range s390.ram: start: 3c86000 1000
10929@1493200554.438251:ram_discard_range s390.ram: start: 3c8d000 1000
10929@1493200554.438253:ram_discard_range s390.ram: start: 3c90000 1000
10929@1493200554.438255:ram_discard_range s390.ram: start: 3c96000 1000
10929@1493200554.438257:ram_discard_range s390.ram: start: 3c9d000 1000
10929@1493200554.438259:ram_discard_range s390.ram: start: 3ca0000 1000
10929@1493200554.438261:ram_discard_range s390.ram: start: 3ca6000 1000
10929@1493200554.438264:ram_discard_range s390.ram: start: 3cad000 1000
10929@1493200554.438266:ram_discard_range s390.ram: start: 3cb0000 1000
10929@1493200554.438268:ram_discard_range s390.ram: start: 3cb6000 1000
10929@1493200554.438270:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438271:ram_discard_range s390.ram: start: 3cbd000 1000
10929@1493200554.438274:ram_discard_range s390.ram: start: 3cc0000 1000
10929@1493200554.438276:ram_discard_range s390.ram: start: 3cc6000 1000
10929@1493200554.438278:ram_discard_range s390.ram: start: 3ccd000 1000
10929@1493200554.438281:ram_discard_range s390.ram: start: 3cd0000 1000
10929@1493200554.438283:ram_discard_range s390.ram: start: 3cd6000 1000
10929@1493200554.438285:ram_discard_range s390.ram: start: 3cdd000 1000
10929@1493200554.438287:ram_discard_range s390.ram: start: 3ce0000 1000
10929@1493200554.438289:ram_discard_range s390.ram: start: 3ce6000 1000
10929@1493200554.438292:ram_discard_range s390.ram: start: 3ced000 1000
10929@1493200554.438294:ram_discard_range s390.ram: start: 3cf0000 1000
10929@1493200554.438296:ram_discard_range s390.ram: start: 3cf6000 1000
10929@1493200554.438298:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438299:ram_discard_range s390.ram: start: 3cfd000 1000
10929@1493200554.438301:ram_discard_range s390.ram: start: 3d55000 2000
10929@1493200554.438304:ram_discard_range s390.ram: start: 3da3000 5000
10929@1493200554.438310:ram_discard_range s390.ram: start: 3dc0000 13000
10929@1493200554.438326:ram_discard_range s390.ram: start: 3dd4000 f000
10929@1493200554.438339:ram_discard_range s390.ram: start: 3de4000 3000
10929@1493200554.438343:ram_discard_range s390.ram: start: 3de8000 1000
10929@1493200554.438345:ram_discard_range s390.ram: start: 3dea000 1000
10929@1493200554.438347:ram_discard_range s390.ram: start: 3f4a000 1e000
10929@1493200554.438372:ram_discard_range s390.ram: start: 3fa8000 2000
10929@1493200554.438375:ram_discard_range s390.ram: start: 4000000 1000
10929@1493200554.438377:ram_discard_range s390.ram: start: 4006000 1000
10929@1493200554.438379:loadvm_postcopy_ram_handle_discard 
10929@1493200554.438380:ram_discard_range s390.ram: start: 400d000 1000
10929@1493200554.438383:ram_discard_range s390.ram: start: 4010000 1000
10929@1493200554.438385:ram_discard_range s390.ram: start: 4016000 1000
10929@1493200554.438387:ram_discard_range s390.ram: start: 401d000 3000
10929@1493200554.438391:ram_discard_range s390.ram: start: 4040000 1000
10929@1493200554.438393:ram_discard_range s390.ram: start: 4042000 1000
10929@1493200554.438395:ram_discard_range s390.ram: start: 40e3000 1000
10929@1493200554.438397:ram_discard_range s390.ram: start: 417e000 2215000
10929@1493200554.445269:ram_discard_range s390.ram: start: 2258d000 3b673000
10929@1493200554.817596:postcopy_ram_fault_thread_request Request for HVA=3ff3df6b000 rb=s390.ram offset=3c6b000


> 
> So let's see if your page ever gets sent in ram_save_page, and whether it
> gets discarded prior to hitting the point where it hangs but after the page
> arrived.
> Another (slimmer) possibility is that the number of dirty pages is wrong,
> so the source thinks it has finished too soon - but we'll only look at
> that if all the above doesn't help.
> 
> Dave
> 
>>>
>>> From the source backtrace, I think the source thinks it's sent everything
>>> and is waiting for the return-path-thread to close.
>>>
>>> Dave
>>>
>>>>>
>>>>> Dave
>>>>>
>>>>>> Thread 1 (Thread 0x3ffa2f45f00 (LWP 21122)):
>>>>>> #0  0x0000000001017130 in lduw_he_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:317   <---- according to the host kernel this thread hangs in handle_userfault.
>>>>>> #1  0x0000000001017342 in lduw_le_p (ptr=0x3ff498d9002) at /root/qemu/include/qemu/bswap.h:359
>>>>>> #2  0x0000000001025840 in address_space_lduw_internal_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at /root/qemu/memory_ldst.inc.c:284
>>>>>> #3  0x00000000010259a6 in address_space_lduw_le_cached (cache=0x283491d0, addr=2, attrs=..., result=0x0) at /root/qemu/memory_ldst.inc.c:315
>>>>>> #4  0x0000000001025ad6 in lduw_le_phys_cached (cache=0x283491d0, addr=2) at /root/qemu/memory_ldst.inc.c:334
>>>>>> #5  0x0000000001116c10 in virtio_lduw_phys_cached (vdev=0x28275090, cache=0x283491d0, pa=2) at /root/qemu/include/hw/virtio/virtio-access.h:166
>>>>>> #6  0x0000000001117940 in vring_used_idx (vq=0x3ffa2e1c090) at /root/qemu/hw/virtio/virtio.c:263
>>>>>> #7  0x000000000111daea in virtio_load (vdev=0x28275090, f=0x28353600, version_id=2) at /root/qemu/hw/virtio/virtio.c:2168
>>>>>> #8  0x000000000111d0cc in virtio_device_get (f=0x28353600, opaque=0x28275090, size=0, field=0x16adf38 <__compound_literal.0>) at /root/qemu/hw/virtio/virtio.c:1975
>>>>>> #9  0x00000000012a7f50 in vmstate_load_state (f=0x28353600, vmsd=0x14db480 <vmstate_virtio_blk>, opaque=0x28275090, version_id=2) at /root/qemu/migration/vmstate.c:128
>>>>>> #10 0x00000000010cbc08 in vmstate_load (f=0x28353600, se=0x28279e60, version_id=2) at /root/qemu/migration/savevm.c:732
>>>>>> #11 0x00000000010ced6e in qemu_loadvm_section_start_full (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1890
>>>>>> #12 0x00000000010cf00c in qemu_loadvm_state_main (f=0x28353600, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1947
>>>>>> #13 0x00000000010ce5bc in loadvm_handle_cmd_packaged (mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1707
>>>>>> #14 0x00000000010ce87e in loadvm_process_command (f=0x282c7d20) at /root/qemu/migration/savevm.c:1770
>>>>>> #15 0x00000000010cf056 in qemu_loadvm_state_main (f=0x282c7d20, mis=0x17169a8 <mis_current>) at /root/qemu/migration/savevm.c:1960
>>>>>> #16 0x00000000010cf2b6 in qemu_loadvm_state (f=0x282c7d20) at /root/qemu/migration/savevm.c:2020
>>>>>> #17 0x000000000129d234 in process_incoming_migration_co (opaque=0x282c7d20) at /root/qemu/migration/migration.c:404
>>>>>> #18 0x00000000014b6ca2 in coroutine_trampoline (i0=0, i1=674533168) at /root/qemu/util/coroutine-ucontext.c:79
>>>>>> #19 0x000003ff8de530ea in __makecontext_ret () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 8 (Thread 0x3ff40b38910 (LWP 21165)):
>>>>>> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
>>>>>> #1  0x00000000014933ce in qemu_futex_wait (f=0x17169b0 <mis_current+8>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
>>>>>> #2  0x00000000014935d0 in qemu_event_wait (ev=0x17169b0 <mis_current+8>) at /root/qemu/util/qemu-thread-posix.c:399
>>>>>> #3  0x00000000010ce094 in postcopy_ram_listen_thread (opaque=0x282c7d20) at /root/qemu/migration/savevm.c:1532
>>>>>> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 7 (Thread 0x3ff41338910 (LWP 21164)):
>>>>>> #0  0x000003ff8defd72c in ?? () from target:/lib64/libc.so.6
>>>>>> #1  0x00000000012af920 in postcopy_ram_fault_thread (opaque=0x17169a8 <mis_current>) at /root/qemu/migration/postcopy-ram.c:437
>>>>>> #2  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #3  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 6 (Thread 0x3ff41b38910 (LWP 21163)):
>>>>>> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
>>>>>> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302408, mutex=0x283023e0) at /root/qemu/util/qemu-thread-posix.c:133
>>>>>> #2  0x00000000010c6080 in do_data_decompress (opaque=0x283023d8) at /root/qemu/migration/ram.c:2277
>>>>>> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 5 (Thread 0x3ff42338910 (LWP 21162)):
>>>>>> #0  0x000003ff8e00d450 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
>>>>>> #1  0x0000000001492d66 in qemu_cond_wait (cond=0x28302390, mutex=0x28302368) at /root/qemu/util/qemu-thread-posix.c:133
>>>>>> #2  0x00000000010c6080 in do_data_decompress (opaque=0x28302360) at /root/qemu/migration/ram.c:2277
>>>>>> #3  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #4  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 4 (Thread 0x3ff89287910 (LWP 21148)):
>>>>>> #0  0x000003ff8e010abc in __lll_lock_wait () from target:/lib64/libpthread.so.0
>>>>>> #1  0x000003ff8e00bcb6 in __pthread_mutex_cond_lock () from target:/lib64/libpthread.so.0
>>>>>> #2  0x000003ff8e00d4ee in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
>>>>>> #3  0x0000000001492d66 in qemu_cond_wait (cond=0x2831eb00, mutex=0x1711688 <qemu_global_mutex>) at /root/qemu/util/qemu-thread-posix.c:133
>>>>>> #4  0x000000000107720e in qemu_kvm_wait_io_event (cpu=0x282e09b0) at /root/qemu/cpus.c:1085
>>>>>> #5  0x00000000010773aa in qemu_kvm_cpu_thread_fn (arg=0x282e09b0) at /root/qemu/cpus.c:1123
>>>>>> #6  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #7  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 3 (Thread 0x3ff8a7ff910 (LWP 21145)):
>>>>>> #0  0x000003ff8defd832 in ppoll () from target:/lib64/libc.so.6
>>>>>> #1  0x000000000148b092 in qemu_poll_ns (fds=0x3ff840008c0, nfds=1, timeout=-1) at /root/qemu/util/qemu-timer.c:322
>>>>>> #2  0x000000000148f46a in aio_poll (ctx=0x2827e7b0, blocking=true) at /root/qemu/util/aio-posix.c:622
>>>>>> #3  0x00000000011a78f6 in iothread_run (opaque=0x2827d740) at /root/qemu/iothread.c:59
>>>>>> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>> Thread 2 (Thread 0x3ff8afff910 (LWP 21142)):
>>>>>> #0  0x000003ff8df04336 in syscall () from target:/lib64/libc.so.6
>>>>>> #1  0x00000000014933ce in qemu_futex_wait (f=0x1b2b9b4 <rcu_call_ready_event>, val=4294967295) at /root/qemu/include/qemu/futex.h:26
>>>>>> #2  0x00000000014935d0 in qemu_event_wait (ev=0x1b2b9b4 <rcu_call_ready_event>) at /root/qemu/util/qemu-thread-posix.c:399
>>>>>> #3  0x00000000014b3e40 in call_rcu_thread (opaque=0x0) at /root/qemu/util/rcu.c:249
>>>>>> #4  0x000003ff8e007bc4 in start_thread () from target:/lib64/libpthread.so.0
>>>>>> #5  0x000003ff8df0a9f2 in thread_start () from target:/lib64/libc.so.6
>>>>>>
>>>>>>
>>>>>> Looks like this
>>>>>>   <qemu:arg value='-drive'/>
>>>>>>   <qemu:arg value='driver=null-aio,id=null1,if=none,size=100G'/>
>>>>>>   <qemu:arg value='-device'/>
>>>>>>   <qemu:arg value='virtio-blk-ccw,scsi=off,drive=null1,serial=null1,num-queues=4'/>
>>>>>> is necessary to trigger the hang; with num-queues=1 everything seems to
>>>>>> work fine, so it might be a virtio issue after all?
>>>>>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26  9:59           ` Christian Borntraeger
@ 2017-04-26 10:13             ` Christian Borntraeger
  2017-04-26 11:01             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-26 10:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/26/2017 11:59 AM, Christian Borntraeger wrote:

> source:
> 28863@1493200554.418978:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419047:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419056:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419064:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419071:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419077:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419083:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419088:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419094:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419100:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419106:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419112:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419117:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419123:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419130:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419138:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419145:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419151:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419156:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419163:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419169:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419174:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419179:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419184:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419189:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419208:qemu_savevm_send_postcopy_ram_discard s390.ram: 9d
> 28863@1493200554.818195:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000
> 28863@1493200554.818208:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000
> 28863@1493200554.818469:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000 (sent=0)
> 
> I also get 
> ram_save_page: s390.ram:3c6b000

Looks like this one happens before the discards.


* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26  9:59           ` Christian Borntraeger
  2017-04-26 10:13             ` Christian Borntraeger
@ 2017-04-26 11:01             ` Dr. David Alan Gilbert
  2017-04-26 11:37               ` Christian Borntraeger
  2017-04-26 11:45               ` Christian Borntraeger
  1 sibling, 2 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-04-26 11:01 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 04/24/2017 09:12 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
> >>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >>>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
> >>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >>>>>> David, Juan,
> >>>>>>
> >>>>>> I can trigger a hang of postcopy migration (when I do it early) so
> >>>>>> that both sides are in paused state. Looks like one thread is still
> >>>>>> loading vmstates for virtio and this load accesses guest memory and
> >>>>>> triggers a userfault.
> >>>>>
> >>>>> It's perfectly legal for the destination to cause a userfault as it
> >>>>> loads the virtio queue - the virtio queue should be being loaded by
> >>>>> the main thread from the buffer while the 'listening' thread is
> >>>>> waiting for the incoming page data.
> >>>>>
> >>>>> Can you turn on the following tracing please:
> >>>>> destination: postcopy_ram_fault_thread_request postcopy_place_page
> >>>>>              postcopy_place_page_zero
> >>>>>
> >>>>> source: migrate_handle_rp_req_pages ram_save_queue_pages
> >>>>>
> >>>>> You should see: virtio does the access, userfault generates a fault,
> >>>>> and postcopy_ram_fault_thread_request sends the request to the source;
> >>>>>
> >>>>> the source sees migrate_handle_rp_req_pages and queues it with
> >>>>> ram_save_queue_pages;
> >>>>>
> >>>>> the destination sees the page arrive and postcopy_place_page or
> >>>>> postcopy_place_page_zero.
> >>>>>
> >>>>> some of that might disappear if the page was already on its way.
> >>>>
> >>>>
> >>>> the last events on the source are
> >>>> [..]
> >>>> 58412@1493037953.747988:postcopy_place_page host=0x3ff92246000
> >>>> 58412@1493037953.747992:postcopy_place_page host=0x3ff92247000
> >>>
> >>> How do you see those on the source???
> >>
> >> It was the previous migrate (I did it in a loop, back and forth).
> >> The problem happens on the migrate back.
> > 
> > OK, good.
> > 
> >>>
> >>>> 58412@1493037956.804210:migrate_handle_rp_req_pages in s390.ram at 41d9000 len 1000
> >>>> 58412@1493037956.804216:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
> >>>>
> >>>
> >>> Is that a typo? I'm expecting those two 'len' fields to be the same?
> >>
> >> Yes, it's a cut'n'paste "missed the last byte".
> > 
> > Good.
> > 
> > Ok, before digging further, is this a new bug or does it happen on older
> > QEMU?  Have you got a rune to reproduce it on x86?
> > 
> >>
> >>>
> >>>> On the target I see lots of
> >>>>
> >>>> 39741@1493037958.833710:postcopy_place_page_zero host=0x3ff9befa000
> >>>> 39741@1493037958.833716:postcopy_place_page host=0x3ff9befb000
> >>>> 39741@1493037958.833759:postcopy_place_page host=0x3ff9befc000
> >>>> 39741@1493037958.833818:postcopy_place_page host=0x3ff9befd000
> >>>> 39741@1493037958.833819:postcopy_place_page_zero host=0x3ff9befe000
> >>>> 39741@1493037958.833822:postcopy_place_page host=0x3ff9beff000
> >>>>
> >>>> So we have about 2 seconds of traffic going on after that request,
> >>>> I assume it's precopy-related.
> >>>>
> >>>> Looking back on the target history there was
> >>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> >>>>
> >>>> In fact it seems to be the first and only request:
> >>>>
> >>>> # cat /tmp/test0.trace | grep -v postcopy_place_page
> >>>>
> >>>> 39741@1493037956.804337:postcopy_ram_fault_thread_request Request for HVA=3ff618d9000 rb=s390.ram offset=41d9000
> >>>
> >>> OK, does the HVA there correspond to the address that your virtio device is blocking on?
> >>
> >> yes it is the same page.
> >>
> >>
> >>> (or at least the start of the page)
> >>> Do you see a postcopy_place_page with a host= the same HVA ?
> >>
> >> no
> > 
> > Hmm, that's bad.
> > The flow is:
> >     Precopy
> >        (a) Send pages
> >     Switch to postcopy
> >        (b) Send discards for pages that have been changed after
> >            (a)
> >     Postcopy
> >        (c) Keep sending pages until we run out
> >        (d) But send pages we're asked for first
> > 
> >  (d) might be ignored if the source thinks the page was already sent or not dirty.
> > 
> > So we need to figure out:
> >   1) If the source sent the pages during (a)
> >   2) If the source discarded it during (b)
> >   3) If it sent it again any time in c/d
> >   4) If it ignored the request from d
> > 
> > 
> > So please turn on the traces:
> >     get_queued_page_not_dirty                should help with (4)
> >     get_queued_page                          should help with (4)
> >     ram_discard_range                        should help with (2)
> >     loadvm_postcopy_ram_handle_discard       should help with (2)
> >     qemu_savevm_send_postcopy_ram_discard    should help with (2)
> > 
> > add near the top of ram_save_page in ram.c:
> >   fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);
> > 
> >    should help with 1, 3
> 
> 
> source:
> 28863@1493200554.418978:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419047:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419056:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419064:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419071:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419077:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419083:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419088:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419094:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419100:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419106:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419112:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419117:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419123:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419130:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419138:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419145:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419151:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419156:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419163:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419169:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419174:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419179:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419184:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419189:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
> 28863@1493200554.419208:qemu_savevm_send_postcopy_ram_discard s390.ram: 9d
> 28863@1493200554.818195:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000
> 28863@1493200554.818208:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000
> 28863@1493200554.818469:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000 (sent=0)

OK, the source is ignoring the request because it thinks the page has
already been sent and hasn't been redirtied.

> I also get 
> ram_save_page: s390.ram:3c6b000

and you said in your other message that happens before the discards, which
means it's precopy, and that's why you don't see the 'place_page'.

> target (without postcopy_place)
> 10929@1493200553.657966:ram_discard_range s390.ram: start: 0 5dc00000

That's start of day I think, so we can ignore that (I think!)

> 10929@1493200554.428226:loadvm_postcopy_ram_handle_discard 

....

> 10929@1493200554.438231:ram_discard_range s390.ram: start: 3bf6000 1000
> 10929@1493200554.438233:ram_discard_range s390.ram: start: 3bfd000 1000
> 10929@1493200554.438235:ram_discard_range s390.ram: start: 3c06000 4000
> 10929@1493200554.438240:loadvm_postcopy_ram_handle_discard 
> 10929@1493200554.438241:ram_discard_range s390.ram: start: 3c79000 3000

Well, it hasn't discarded the 3c6b000 page, so the page we received
during precopy should still be there, and it shouldn't fault.

In migration/ram.c in ram_load, before the line:
           switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {

can you add:
    fprintf(stderr, "%s: addr=%zx flags=%zx host=%p\n", __func__, addr, flags, host);

that should show the incoming blocks during precopy and we should be able
to see the page coming in.
Hmm, I have a theory: if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS,
then try changing ram_handle_compressed to always do the memset.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26 11:01             ` Dr. David Alan Gilbert
@ 2017-04-26 11:37               ` Christian Borntraeger
  2017-04-26 11:45               ` Christian Borntraeger
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-26 11:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/26/2017 01:01 PM, Dr. David Alan Gilbert wrote:
> [...]
> In migration/ram.c in ram_load, before the line:
>            switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
> 
> can you add:
>     fprintf(stderr, "%s: addr=%zx flags=%zx host=%p\n", __func__, addr, flags, host);
> 
> that should show the incoming blocks during precopy and we should be able
> to see the page coming in.
> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> then try changing ram_handle_compressed to always do the memset.

Lots of output with the following flags (some have bit 1 set, others not):

flags=10
flags=22
flags=28
flags=4
flags=8

The interesting area is:

ram_load: addr=3c6a000 flags=28 host=0x3ff21f6a000
ram_load: addr=3c6b000 flags=28 host=0x3ff21f6b000
ram_load: addr=3c6c000 flags=28 host=0x3ff21f6c000
ram_load: addr=3c6d000 flags=22 host=0x3ff21f6d000
ram_load: addr=3c6e000 flags=28 host=0x3ff21f6e000
ram_load: addr=3c6f000 flags=22 host=0x3ff21f6f000

and this time it hung on address 3c6d000 (which has different flags
than the others).

I can send you the full file if you like but your theory
seems to fully match the log.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26 11:01             ` Dr. David Alan Gilbert
  2017-04-26 11:37               ` Christian Borntraeger
@ 2017-04-26 11:45               ` Christian Borntraeger
  2017-04-26 12:07                 ` Christian Borntraeger
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-26 11:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/26/2017 01:01 PM, Dr. David Alan Gilbert wrote:
> [...]
> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> then try changing ram_handle_compressed to always do the memset.

FWIW, changing ram_handle_compressed to always memset makes the problem go away.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26 11:45               ` Christian Borntraeger
@ 2017-04-26 12:07                 ` Christian Borntraeger
  2017-06-30 16:31                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-04-26 12:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel

On 04/26/2017 01:45 PM, Christian Borntraeger wrote:

>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
>> then try changing ram_handle_compressed to always do the memset.
> 
> FWIW, changing ram_handle_compressed to always memset makes the problem go away.

It is still running fine now with the "always memset" change.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-04-26 12:07                 ` Christian Borntraeger
@ 2017-06-30 16:31                   ` Dr. David Alan Gilbert
  2017-07-03 14:28                     ` Christian Borntraeger
  2017-07-03 16:25                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-06-30 16:31 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel, thuth

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
> 
> >> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> >> then try changing ram_handle_compressed to always do the memset.
> > 
> > FWIW, changing ram_handle_compressed to always memset makes the problem go away.
> 
> It is still running fine now with the "always memset change"

Did we ever nail down a fix for this; as I remember Andrea said
we shouldn't need to do that memset, but we came to the conclusion
it was something specific to how s390 protection keys worked.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-06-30 16:31                   ` Dr. David Alan Gilbert
@ 2017-07-03 14:28                     ` Christian Borntraeger
  2017-07-03 16:25                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-07-03 14:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, Michael S. Tsirkin, qemu-devel, thuth,
	Andrea Arcangeli, Martin Schwidefsky

On 06/30/2017 06:31 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
>>
>>>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
>>>> then try changing ram_handle_compressed to always do the memset.
>>>
>>> FWIW, changing ram_handle_compressed to always memset makes the problem go away.
>>
>> It is still running fine now with the "always memset change"
> 
> Did we ever nail down a fix for this; as I remember Andrea said
> we shouldn't need to do that memset, but we came to the conclusion
> it was something specific to how s390 protection keys worked.

It was specific to s390. Newer Linux guests do not use storage keys,
so we enable them lazily. When a guest goes from keyless to keyed, the
kernel marks the VM as no longer using the shared zero page, then walks
all pages of that guest and zaps the empty_zero_page from the page
tables. Normal code will then fault in a COW copy of the zero page on
the next access.
The postcopy logic now at least reads all zero pages to prevent faults
on these addresses. But if a running guest switches to keyed mode
afterwards (while postcopy is still in progress), we can still get a
fault for a zero page.

Still not sure about the best solution (see s390_enable_skey in
arch/s390/mm/gmap.c).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-06-30 16:31                   ` Dr. David Alan Gilbert
  2017-07-03 14:28                     ` Christian Borntraeger
@ 2017-07-03 16:25                     ` Michael S. Tsirkin
  2017-07-03 19:07                       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-07-03 16:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Christian Borntraeger, Juan Quintela, qemu-devel, thuth

On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> > On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
> > 
> > >> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> > >> then try changing ram_handle_compressed to always do the memset.
> > > 
> > > FWIW, changing ram_handle_compressed to always memset makes the problem go away.
> > 
> > It is still running fine now with the "always memset change"
> 
> Did we ever nail down a fix for this; as I remember Andrea said
> we shouldn't need to do that memset, but we came to the conclusion
> it was something specific to how s390 protection keys worked.
> 
> Dave

No we didn't. Let's merge that for now, with a comment that
we don't really understand what's going on?

> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-07-03 16:25                     ` Michael S. Tsirkin
@ 2017-07-03 19:07                       ` Dr. David Alan Gilbert
  2017-07-04  7:48                         ` Christian Borntraeger
  0 siblings, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2017-07-03 19:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Juan Quintela, qemu-devel, thuth

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> > > On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
> > > 
> > > >> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> > > >> then try changing ram_handle_compressed to always do the memset.
> > > > 
> > > > FWIW, changing ram_handle_compressed to always memset makes the problem go away.
> > > 
> > > It is still running fine now with the "always memset change"
> > 
> > Did we ever nail down a fix for this; as I remember Andrea said
> > we shouldn't need to do that memset, but we came to the conclusion
> > it was something specific to how s390 protection keys worked.
> > 
> > Dave
> 
> No we didn't. Let's merge that for now, with a comment that
> we don't really understand what's going on?

Hmm no, I don't really want to change the !s390 behaviour, especially
since it causes allocation that we otherwise avoid and Andrea's
reply to the original post pointed out it's not necessary.

Dave

> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-07-03 19:07                       ` Dr. David Alan Gilbert
@ 2017-07-04  7:48                         ` Christian Borntraeger
  2017-07-04  8:16                           ` Martin Schwidefsky
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2017-07-04  7:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Michael S. Tsirkin
  Cc: Juan Quintela, qemu-devel, thuth, Martin Schwidefsky, Andrea Arcangeli

On 07/03/2017 09:07 PM, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
>> On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
>>>>
>>>>>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
>>>>>> then try changing ram_handle_compressed to always do the memset.
>>>>>
>>>>> FWIW, changing ram_handle_compressed to always memset makes the problem go away.
>>>>
>>>> It is still running fine now with the "always memset change"
>>>
>>> Did we ever nail down a fix for this; as I remember Andrea said
>>> we shouldn't need to do that memset, but we came to the conclusion
>>> it was something specific to how s390 protection keys worked.
>>>
>>> Dave
>>
>> No we didn't. Let's merge that for now, with a comment that
>> we don't really understand what's going on?
> 
> Hmm no, I don't really want to change the !s390 behaviour, especially
> since it causes allocation that we otherwise avoid and Andrea's
> reply tothe original post pointed out it's not necessary.


Since storage keys are per physical page we must not have shared pages.
Therefore in s390_enable_skey we already do a break_ksm (via ksm_madvise),
in other words we already allocate pages on the keyless->keyed switch.

So why not do the same for zero pages - instead of invalidating the page
table entry, lets just do a proper COW.

Something like this maybe (Andrea, Martin any better way to do that?)


diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 4fb3d3c..11475c7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2149,13 +2149,18 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
 static int __s390_enable_skey(pte_t *pte, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
 {
+       struct page *page;
        /*
-        * Remove all zero page mappings,
+        * Remove all zero page mappings with a COW
         * after establishing a policy to forbid zero page mappings
         * following faults for that page will get fresh anonymous pages
         */
-       if (is_zero_pfn(pte_pfn(*pte)))
-               ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
+       if (is_zero_pfn(pte_pfn(*pte))) {
+               if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1)
+                       put_page(page);
+               else
+                       return -ENOMEM;
+       }
        /* Clear storage key */
        ptep_zap_key(walk->mm, addr, pte);
        return 0;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-07-04  7:48                         ` Christian Borntraeger
@ 2017-07-04  8:16                           ` Martin Schwidefsky
  2017-07-04  8:27                             ` Christian Borntraeger
  2017-07-10  8:05                             ` Christian Borntraeger
  0 siblings, 2 replies; 22+ messages in thread
From: Martin Schwidefsky @ 2017-07-04  8:16 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Dr. David Alan Gilbert, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, thuth, Andrea Arcangeli

On Tue, 4 Jul 2017 09:48:11 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 07/03/2017 09:07 PM, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@redhat.com) wrote:  
> >> On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:  
> >>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
> >>>> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
> >>>>  
> >>>>>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
> >>>>>> then try changing ram_handle_compressed to always do the memset.  
> >>>>>
> >>>>> FWIW, changing ram_handle_compressed to always memset makes the problem go away.  
> >>>>
> >>>> It is still running fine now with the "always memset change"  
> >>>
> >>> Did we ever nail down a fix for this; as I remember Andrea said
> >>> we shouldn't need to do that memset, but we came to the conclusion
> >>> it was something specific to how s390 protection keys worked.
> >>>
> >>> Dave  
> >>
> >> No we didn't. Let's merge that for now, with a comment that
> >> we don't really understand what's going on?  
> > 
> > Hmm no, I don't really want to change the !s390 behaviour, especially
> > since it causes allocation that we otherwise avoid and Andrea's
> > reply tothe original post pointed out it's not necessary.  
> 
> 
> Since storage keys are per physical page we must not have shared pages.
> Therefore in s390_enable_skey we already do a break_ksm (via ksm_madvise),
> in other words we already allocate pages on the keyless->keyed switch.
> 
> So why not do the same for zero pages - instead of invalidating the page
> table entry, lets just do a proper COW.
> 
> Something like this maybe (Andrea, Martin any better way to do that?)
> 
> 
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 4fb3d3c..11475c7 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2149,13 +2149,18 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
>  static int __s390_enable_skey(pte_t *pte, unsigned long addr,
>                               unsigned long next, struct mm_walk *walk)
>  {
> +       struct page *page;
>         /*
> -        * Remove all zero page mappings,
> +        * Remove all zero page mappings with a COW
>          * after establishing a policy to forbid zero page mappings
>          * following faults for that page will get fresh anonymous pages
>          */
> -       if (is_zero_pfn(pte_pfn(*pte)))
> -               ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
> +       if (is_zero_pfn(pte_pfn(*pte))) {
> +               if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1)
> +                       put_page(page);
> +               else
> +                       return -ENOMEM;
> +       }
>         /* Clear storage key */
>         ptep_zap_key(walk->mm, addr, pte);
>         return 0;

I do not quite get the problem here. The zero-page mappings are always
marked as _PAGE_SPECIAL. These should be safe to replace with an empty
pte. We do mark all VMAs as unmergeable prior to the page table walk
that replaces all zero-page mappings. What is the get_user_pages() call
supposed to do?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-07-04  8:16                           ` Martin Schwidefsky
@ 2017-07-04  8:27                             ` Christian Borntraeger
  2017-07-10  8:05                             ` Christian Borntraeger
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-07-04  8:27 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Dr. David Alan Gilbert, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, thuth, Andrea Arcangeli

On 07/04/2017 10:16 AM, Martin Schwidefsky wrote:
> On Tue, 4 Jul 2017 09:48:11 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 07/03/2017 09:07 PM, Dr. David Alan Gilbert wrote:
>>> * Michael S. Tsirkin (mst@redhat.com) wrote:  
>>>> On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:  
>>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
>>>>>> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
>>>>>>  
>>>>>>>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
>>>>>>>> then try changing ram_handle_compressed to always do the memset.  
>>>>>>>
>>>>>>> FWIW, changing ram_handle_compressed to always memset makes the problem go away.  
>>>>>>
>>>>>> It is still running fine now with the "always memset change"  
>>>>>
>>>>> Did we ever nail down a fix for this; as I remember Andrea said
>>>>> we shouldn't need to do that memset, but we came to the conclusion
>>>>> it was something specific to how s390 protection keys worked.
>>>>>
>>>>> Dave  
>>>>
>>>> No we didn't. Let's merge that for now, with a comment that
>>>> we don't really understand what's going on?  
>>>
>>> Hmm no, I don't really want to change the !s390 behaviour, especially
>>> since it causes allocation that we otherwise avoid and Andrea's
>>> reply tothe original post pointed out it's not necessary.  
>>
>>
>> Since storage keys are per physical page we must not have shared pages.
>> Therefore in s390_enable_skey we already do a break_ksm (via ksm_madvise),
>> in other words we already allocate pages on the keyless->keyed switch.
>>
>> So why not do the same for zero pages - instead of invalidating the page
>> table entry, lets just do a proper COW.
>>
>> Something like this maybe (Andrea, Martin any better way to do that?)
>>
>>
>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>> index 4fb3d3c..11475c7 100644
>> --- a/arch/s390/mm/gmap.c
>> +++ b/arch/s390/mm/gmap.c
>> @@ -2149,13 +2149,18 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
>>  static int __s390_enable_skey(pte_t *pte, unsigned long addr,
>>                               unsigned long next, struct mm_walk *walk)
>>  {
>> +       struct page *page;
>>         /*
>> -        * Remove all zero page mappings,
>> +        * Remove all zero page mappings with a COW
>>          * after establishing a policy to forbid zero page mappings
>>          * following faults for that page will get fresh anonymous pages
>>          */
>> -       if (is_zero_pfn(pte_pfn(*pte)))
>> -               ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
>> +       if (is_zero_pfn(pte_pfn(*pte))) {
>> +               if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1)
>> +                       put_page(page);
>> +               else
>> +                       return -ENOMEM;
>> +       }
>>         /* Clear storage key */
>>         ptep_zap_key(walk->mm, addr, pte);
>>         return 0;
> 
> I do not quite get the problem here. The zero-page mappings are always
> marked as _PAGE_SPECIAL. These should be safe to replace with an empty
> pte. We do mark all VMAs as unmergeable prior to the page table walk
> that replaces all zero-page mappings. What is the get_user_pages() call
> supposed to do?

The problem is that postcopy migration will do a read on the target system
for zero pages (to fault in the zero pages).
The postcopy logic then relies on such a page never delivering a
userfault again.

If we now enable storage keys on the target host, we break the zero pages
with an empty PTE (lazy faulting).
With an empty PTE we will get a userfault if this page is accessed
again on the target. Postcopy "knows" that this page was already transferred
and basically ignores this fault. Since postcopy will never resolve this
userfault, the guest hangs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] postcopy migration hangs while loading virtio state
  2017-07-04  8:16                           ` Martin Schwidefsky
  2017-07-04  8:27                             ` Christian Borntraeger
@ 2017-07-10  8:05                             ` Christian Borntraeger
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-07-10  8:05 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Dr. David Alan Gilbert, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, thuth, Andrea Arcangeli

On 07/04/2017 10:16 AM, Martin Schwidefsky wrote:
> On Tue, 4 Jul 2017 09:48:11 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 07/03/2017 09:07 PM, Dr. David Alan Gilbert wrote:
>>> * Michael S. Tsirkin (mst@redhat.com) wrote:  
>>>> On Fri, Jun 30, 2017 at 05:31:39PM +0100, Dr. David Alan Gilbert wrote:  
>>>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:  
>>>>>> On 04/26/2017 01:45 PM, Christian Borntraeger wrote:
>>>>>>  
>>>>>>>> Hmm, I have a theory, if the flags field has bit 1 set, i.e. RAM_SAVE_FLAG_COMPRESS
>>>>>>>> then try changing ram_handle_compressed to always do the memset.  
>>>>>>>
>>>>>>> FWIW, changing ram_handle_compressed to always memset makes the problem go away.  
>>>>>>
>>>>>> It is still running fine now with the "always memset change"  
>>>>>
>>>>> Did we ever nail down a fix for this; as I remember Andrea said
>>>>> we shouldn't need to do that memset, but we came to the conclusion
>>>>> it was something specific to how s390 protection keys worked.
>>>>>
>>>>> Dave  
>>>>
>>>> No we didn't. Let's merge that for now, with a comment that
>>>> we don't really understand what's going on?  
>>>
>>> Hmm no, I don't really want to change the !s390 behaviour, especially
>>> since it causes allocation that we otherwise avoid and Andrea's
>>> reply tothe original post pointed out it's not necessary.  
>>
>>
>> Since storage keys are per physical page we must not have shared pages.
>> Therefore in s390_enable_skey we already do a break_ksm (via ksm_madvise),
>> in other words we already allocate pages on the keyless->keyed switch.
>>
>> So why not do the same for zero pages - instead of invalidating the page
>> table entry, lets just do a proper COW.
>>
>> Something like this maybe (Andrea, Martin any better way to do that?)
>>
>>
>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>> index 4fb3d3c..11475c7 100644
>> --- a/arch/s390/mm/gmap.c
>> +++ b/arch/s390/mm/gmap.c
>> @@ -2149,13 +2149,18 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
>>  static int __s390_enable_skey(pte_t *pte, unsigned long addr,
>>                               unsigned long next, struct mm_walk *walk)
>>  {
>> +       struct page *page;
>>         /*
>> -        * Remove all zero page mappings,
>> +        * Remove all zero page mappings with a COW
>>          * after establishing a policy to forbid zero page mappings
>>          * following faults for that page will get fresh anonymous pages
>>          */
>> -       if (is_zero_pfn(pte_pfn(*pte)))
>> -               ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
>> +       if (is_zero_pfn(pte_pfn(*pte))) {
>> +               if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1)
>> +                       put_page(page);
>> +               else
>> +                       return -ENOMEM;
>> +       }
>>         /* Clear storage key */
>>         ptep_zap_key(walk->mm, addr, pte);
>>         return 0;
> 
> I do not quite get the problem here. The zero-page mappings are always
> marked as _PAGE_SPECIAL. These should be safe to replace with an empty
> pte. We do mark all VMAs as unmergeable prior to the page table walk
> that replaces all zero-page mappings. What is the get_user_pages() call
> supposed to do?
> 


After talking to Martin, we decided that it might be a good trial to simply
not use the empty zero page at all for KVM guests. Something like this


diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 57057fb..65ab11d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -505,7 +505,7 @@ static inline int mm_alloc_pgste(struct mm_struct *mm)
  * In the case that a guest uses storage keys
  * faults should no longer be backed by zero pages
  */
-#define mm_forbids_zeropage mm_use_skey
+#define mm_forbids_zeropage mm_has_pgste
 static inline int mm_use_skey(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 4fb3d3c..88f502a 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2149,13 +2149,6 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
 static int __s390_enable_skey(pte_t *pte, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
 {
-       /*
-        * Remove all zero page mappings,
-        * after establishing a policy to forbid zero page mappings
-        * following faults for that page will get fresh anonymous pages
-        */
-       if (is_zero_pfn(pte_pfn(*pte)))
-               ptep_xchg_direct(walk->mm, addr, pte, __pte(_PAGE_INVALID));
        /* Clear storage key */
        ptep_zap_key(walk->mm, addr, pte);
        return 0;



seems to do the trick. I will have a look at what that means for memory
usage in the usual cases.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-07-10  8:06 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-24 10:31 [Qemu-devel] postcopy migration hangs while loading virtio state Christian Borntraeger
2017-04-24 10:53 ` Dr. David Alan Gilbert
2017-04-24 13:02   ` Christian Borntraeger
2017-04-24 14:35     ` Dr. David Alan Gilbert
2017-04-24 18:17       ` Christian Borntraeger
2017-04-24 19:12         ` Dr. David Alan Gilbert
2017-04-25 13:27           ` Christian Borntraeger
2017-04-26  9:59           ` Christian Borntraeger
2017-04-26 10:13             ` Christian Borntraeger
2017-04-26 11:01             ` Dr. David Alan Gilbert
2017-04-26 11:37               ` Christian Borntraeger
2017-04-26 11:45               ` Christian Borntraeger
2017-04-26 12:07                 ` Christian Borntraeger
2017-06-30 16:31                   ` Dr. David Alan Gilbert
2017-07-03 14:28                     ` Christian Borntraeger
2017-07-03 16:25                     ` Michael S. Tsirkin
2017-07-03 19:07                       ` Dr. David Alan Gilbert
2017-07-04  7:48                         ` Christian Borntraeger
2017-07-04  8:16                           ` Martin Schwidefsky
2017-07-04  8:27                             ` Christian Borntraeger
2017-07-10  8:05                             ` Christian Borntraeger
2017-04-24 13:11   ` Christian Borntraeger
