From: Lukas Straub <lukasstraub2@web.de>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>,
	Zhanghailiang <zhang.zhanghailiang@huawei.com>,
	qemu-devel <qemu-devel@nongnu.org>
Subject: Re: [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd
Date: Wed, 20 May 2020 22:44:50 +0200	[thread overview]
Message-ID: <20200520224450.5a0bf201@luklap> (raw)
In-Reply-To: <20200519145020.GG2798@work-vm>

On Tue, 19 May 2020 15:50:20 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Lukas Straub (lukasstraub2@web.de) wrote:
> > On Mon, 18 May 2020 12:55:34 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:  
> > > > > This causes the migration thread to hang if we failover during checkpoint. A
> > > > > shutdown fd won't cause network traffic anyway.
> > > > >     
> > > > 
> > > > I'm not quite sure whether this modification has side effects on the normal migration process;
> > > > there are several places calling it.
> > > > 
> > > > Maybe Juan and Dave can help ;)
> > > >     
> > > > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > > > ---
> > > > >  migration/qemu-file.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > > > > index 1c3a358a14..0748b5810f 100644
> > > > > --- a/migration/qemu-file.c
> > > > > +++ b/migration/qemu-file.c
> > > > > @@ -660,7 +660,7 @@ int64_t qemu_ftell(QEMUFile *f)
> > > > >  int qemu_file_rate_limit(QEMUFile *f)
> > > > >  {
> > > > >      if (f->shutdown) {
> > > > > -        return 1;
> > > > > +        return 0;
> > > > >      }
> > > 
> > > This looks wrong to me; I'd be curious to understand how it's hanging
> > > for you.
> > > '1' means 'stop what you're doing', 0 means carry on; carrying on with a
> > > shutdown fd sounds wrong.
> > > 
> > > If we look at ram.c we have:
> > > 
> > >         while ((ret = qemu_file_rate_limit(f)) == 0 ||
> > >                 !QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
> > >             int pages;
> > >         ....
> > > 
> > > so if it returns '1', as it does at the moment it should cause it to
> > > exit the ram_save_iterate loop - which is what we want if it's failing.
> > > Thus I think you need to find the actual place it's stuck in this case -
> > > I suspect it's repeatedly calling ram_save_iterate and then exiting it,
> > > but if that's happening perhaps we're missing a qemu_file_get_error
> > > check somewhere.  
> > 
> > Hi,
> > the problem is in ram_save_host_page and migration_rate_limit. Here is a backtrace:
> 
> Ah...
> 
> > #0  0x00007f7b502921a8 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f7ada7fb3f0, clockid=0, expected=0, futex_word=0x55bc358b9908) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
> > #1  do_futex_wait (sem=sem@entry=0x55bc358b9908, abstime=abstime@entry=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:112
> > #2  0x00007f7b502922d3 in __new_sem_wait_slow (sem=0x55bc358b9908, abstime=0x7f7ada7fb3f0, clockid=0) at sem_waitcommon.c:184
> > #3  0x000055bc3382b6c1 in qemu_sem_timedwait (sem=0x55bc358b9908, ms=100) at util/qemu-thread-posix.c:306
> > #4  0x000055bc3363950b in migration_rate_limit () at migration/migration.c:3365  
> 
> OK, so how about:
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b6b662e016..4e885385a8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3356,6 +3356,10 @@ bool migration_rate_limit(void)
>      bool urgent = false;
>      migration_update_counters(s, now);
>      if (qemu_file_rate_limit(s->to_dst_file)) {
> +
> +        if (qemu_file_get_error(mis->from_src_file)) {
> +            return false;
> +        }
>          /*
>           * Wait for a delay to do rate limiting OR
>           * something urgent to post the semaphore.
> 
> Does that work?

Yes, this works well when the check uses s->to_dst_file instead of mis->from_src_file.
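
A minimal standalone sketch of that check, to make the control flow explicit.
It does not use QEMU's real types: FakeFile, over_limit, wait_calls and the
fake_* helpers are hypothetical stand-ins, the 100 ms semaphore wait from the
backtrace is replaced by a counter, and it assumes the shut-down file also has
an error recorded, which is what the new check relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMUFile: only the fields this sketch needs. */
typedef struct {
    bool shutdown;    /* fd was shut down, e.g. during failover */
    int last_error;   /* 0 if no error, negative errno otherwise */
    bool over_limit;  /* simplified "bandwidth budget used up" flag */
} FakeFile;

/* Counts how often the sketch would have blocked in the rate-limit wait. */
static int wait_calls;

static int fake_file_get_error(const FakeFile *f)
{
    return f->last_error;
}

/* Same convention as in the quoted patch: non-zero means "stop sending". */
static int fake_file_rate_limit(const FakeFile *f)
{
    if (f->shutdown) {
        return 1;
    }
    if (fake_file_get_error(f)) {
        return 1;
    }
    return f->over_limit ? 1 : 0;
}

/* Stand-in for the timed semaphore wait seen in frame #3 of the backtrace. */
static void fake_rate_limit_wait(void)
{
    wait_calls++;
}

/* Model of the proposed change, checking the outgoing (to_dst) file. */
static bool fake_migration_rate_limit(const FakeFile *to_dst_file)
{
    bool urgent = false;

    assert(to_dst_file != NULL);
    if (fake_file_rate_limit(to_dst_file)) {
        /* New check: a failed file will never drain, so do not wait on it. */
        if (fake_file_get_error(to_dst_file)) {
            return false;
        }
        fake_rate_limit_wait();
    }
    return urgent;
}
```

In this model a shut-down file with an error recorded returns immediately
without reaching the wait, while a healthy file that is merely over its limit
still waits once per call.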

Regards,
Lukas Straub

> I wonder if we also need to kick the rate_limit_sem when we yank the
> socket.
> 
> Dave
> 
> > #5  0x000055bc332b70d3 in ram_save_host_page (rs=0x7f7acc001a70, pss=0x7f7ada7fb4b0, last_stage=true) at /home/lukas/qemu/migration/ram.c:1696
> > #6  0x000055bc332b71fa in ram_find_and_save_block (rs=0x7f7acc001a70, last_stage=true) at /home/lukas/qemu/migration/ram.c:1750
> > #7  0x000055bc332b8bbd in ram_save_complete (f=0x55bc36661330, opaque=0x55bc33fbc678 <ram_state>) at /home/lukas/qemu/migration/ram.c:2606
> > #8  0x000055bc3364112c in qemu_savevm_state_complete_precopy_iterable (f=0x55bc36661330, in_postcopy=false) at migration/savevm.c:1344
> > #9  0x000055bc33641556 in qemu_savevm_state_complete_precopy (f=0x55bc36661330, iterable_only=true, inactivate_disks=false) at migration/savevm.c:1442
> > #10 0x000055bc33641982 in qemu_savevm_live_state (f=0x55bc36661330) at migration/savevm.c:1569
> > #11 0x000055bc33645407 in colo_do_checkpoint_transaction (s=0x55bc358b9840, bioc=0x7f7acc059990, fb=0x7f7acc4627b0) at migration/colo.c:464
> > #12 0x000055bc336457ca in colo_process_checkpoint (s=0x55bc358b9840) at migration/colo.c:589
> > #13 0x000055bc336459e4 in migrate_start_colo_process (s=0x55bc358b9840) at migration/colo.c:666
> > #14 0x000055bc336393d7 in migration_iteration_finish (s=0x55bc358b9840) at migration/migration.c:3312
> > #15 0x000055bc33639753 in migration_thread (opaque=0x55bc358b9840) at migration/migration.c:3477
> > #16 0x000055bc3382bbb5 in qemu_thread_start (args=0x55bc357c27c0) at util/qemu-thread-posix.c:519
> > #17 0x00007f7b50288f27 in start_thread (arg=<optimized out>) at pthread_create.c:479
> > #18 0x00007f7b501ba31f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> > 
> > It hangs in ram_save_host_page for at least 10 minutes.
> > 
> > Regards,
> > Lukas Straub
> >   
> > > Dave
> > >   
> > > > >      if (qemu_file_get_error(f)) {
> > > > >          return 1;
> > > > > --
> > > > > 2.20.1    
> > > >     
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > >   
> >   
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 



Thread overview: 22+ messages
2020-05-11 11:10 [PATCH 0/6] colo: migration related bugfixes Lukas Straub
2020-05-11 11:10 ` [PATCH 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
2020-05-13 11:31   ` Re: " Zhanghailiang
2020-05-11 11:10 ` [PATCH 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
2020-05-13  9:47   ` Dr. David Alan Gilbert
2020-05-13 19:15     ` Lukas Straub
2020-05-11 11:10 ` [PATCH 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
2020-05-14 12:45   ` Re: " Zhanghailiang
2020-05-11 11:10 ` [PATCH 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
2020-05-15  6:24   ` Zhanghailiang
2020-05-11 11:10 ` [PATCH 5/6] migration/qemu-file.c: Don't ratelimit a shutdown fd Lukas Straub
2020-05-14 13:05   ` Re: " Zhanghailiang
2020-05-18 11:55     ` Dr. David Alan Gilbert
2020-05-19 13:08       ` Lukas Straub
2020-05-19 14:50         ` Dr. David Alan Gilbert
2020-05-20 20:44           ` Lukas Straub [this message]
2020-05-11 11:11 ` [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
2020-05-14 13:27   ` Re: " Zhanghailiang
2020-05-14 14:31     ` Lukas Straub
2020-05-15  1:45       ` Zhanghailiang
2020-05-15  1:53   ` Zhanghailiang
2020-06-01 16:50 ` [PATCH 0/6] colo: migration related bugfixes Dr. David Alan Gilbert
