All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O
@ 2011-08-26 10:59 Daniel P. Berrange
  2011-08-26 11:25 ` Daniel P. Berrange
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel P. Berrange @ 2011-08-26 10:59 UTC (permalink / raw)
  To: qemu-devel

From: "Daniel P. Berrange" <berrange@redhat.com>

There are two common cases where migrate_cancel is intended to be
used

  1. When migration is not converging due to an overactive
     guest and insufficient network bandwidth
  2. When migration is stuck due a network outage, waiting
     for the TCP transmit timeout to occurr & return an I/O
     error for send()

In the second case, if you attempt to use 'migrate_cancel' it
will also get stuck. This can be seen by attempting to migrate
to a QEMU which has been SIGSTOP'd

  $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \
       -monitor stdio -vnc :2 -incoming tcp:localhost:9000
   QEMU 0.14.1 monitor - type 'help' for more information
   (qemu)
   <Ctrl-Z>
   [1]+  Stopped

And in another shell

  $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \
        -monitor stdio -vnc :1
   QEMU 0.14.1 monitor - type 'help' for more information
   (qemu) migrate -d tcp:localhost:9000
   (qemu) info migrate
   Migration status: active
   transferred ram: 416 kbytes
   remaining ram: 621624 kbytes
   total ram: 623040 kbytes
   (qemu) migrate_cancel

This last command will never return, until the first QEMU is
resumed. Looking at the stack trace in GDB you see

 #0  0x0000003a8320e4c2 in __libc_send (fd=10, buf=0x1bc7c70, n=19777, flags=0)
    at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
 #1  0x000000000048fb1e in socket_write (s=<optimized out>, buf=<optimized out>, size=<optimized out>)
    at migration-tcp.c:39
 #2  0x000000000048eba4 in migrate_fd_put_buffer (opaque=0x1b76ad0, data=0x1bc7c70, size=19777)
    at migration.c:324
 #3  0x000000000048e442 in buffered_flush (s=0x1b76b90) at buffered_file.c:87
 #4  0x000000000048e4cf in buffered_close (opaque=0x1b76b90) at buffered_file.c:177
 #5  0x0000000000496d57 in qemu_fclose (f=0x1bbfc10) at savevm.c:479
 #6  0x000000000048f4ca in migrate_fd_cleanup (s=0x1b76ad0) at migration.c:291
 #7  0x000000000048f035 in do_migrate_cancel (mon=<optimized out>, qdict=<optimized out>,
    ret_data=<optimized out>) at migration.c:136[snip]
 [snip]

The migration_fd_cleanup method is where the problem really starts.
Specifically it does

    if (s->file) {
        DPRINTF("closing file\n");
        if (qemu_fclose(s->file) != 0) {
            ret = -1;
        }
        s->file = NULL;
    }

    if (s->fd != -1)
        close(s->fd);

And gets stuck in the qemu_fclose() bit because that method (rightly) tries
to flush all outstanding buffers before closing. Unfortunately while this is
desirable when migration ends successfully, it is undesirable when we are
failing/cancelling migration.

It is hard to tell qemu_fclose() that it shouldn't flush buffers directly,
so the alternative is to ensure that this method fails quickly when it
attempts I/O. This is easily achieved, simply by closing 's->fd' before
calling qemu_fclose.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
---
 migration.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/migration.c b/migration.c
index f5959b4..a432c3b 100644
--- a/migration.c
+++ b/migration.c
@@ -286,6 +286,13 @@ int migrate_fd_cleanup(FdMigrationState *s)
 
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
 
+    if ((s->state == MIG_STATE_ERROR ||
+         s->state == MIG_STATE_CANCELLED) &&
+        s->fd != -1) {
+        close(s->fd);
+        s->fd = -1;
+    }
+
     if (s->file) {
         DPRINTF("closing file\n");
         if (qemu_fclose(s->file) != 0) {
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O
  2011-08-26 10:59 [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O Daniel P. Berrange
@ 2011-08-26 11:25 ` Daniel P. Berrange
  2011-08-26 13:48   ` Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel P. Berrange @ 2011-08-26 11:25 UTC (permalink / raw)
  To: qemu-devel

On Fri, Aug 26, 2011 at 11:59:28AM +0100, Daniel P. Berrange wrote:
> From: "Daniel P. Berrange" <berrange@redhat.com>
> 
> There are two common cases where migrate_cancel is intended to be
> used
> 
>   1. When migration is not converging due to an overactive
>      guest and insufficient network bandwidth
>   2. When migration is stuck due a network outage, waiting
>      for the TCP transmit timeout to occurr & return an I/O
>      error for send()
> 
> In the second case, if you attempt to use 'migrate_cancel' it
> will also get stuck. This can be seen by attempting to migrate
> to a QEMU which has been SIGSTOP'd
> 
>   $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \
>        -monitor stdio -vnc :2 -incoming tcp:localhost:9000
>    QEMU 0.14.1 monitor - type 'help' for more information
>    (qemu)
>    <Ctrl-Z>
>    [1]+  Stopped
> 
> And in another shell
> 
>   $ ./x86_64-softmmu/qemu-system-x86_64 -cdrom ~/boot.iso -m 600 \
>         -monitor stdio -vnc :1
>    QEMU 0.14.1 monitor - type 'help' for more information
>    (qemu) migrate -d tcp:localhost:9000
>    (qemu) info migrate
>    Migration status: active
>    transferred ram: 416 kbytes
>    remaining ram: 621624 kbytes
>    total ram: 623040 kbytes
>    (qemu) migrate_cancel
> 
> This last command will never return, until the first QEMU is
> resumed. Looking at the stack trace in GDB you see
> 
>  #0  0x0000003a8320e4c2 in __libc_send (fd=10, buf=0x1bc7c70, n=19777, flags=0)
>     at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
>  #1  0x000000000048fb1e in socket_write (s=<optimized out>, buf=<optimized out>, size=<optimized out>)
>     at migration-tcp.c:39
>  #2  0x000000000048eba4 in migrate_fd_put_buffer (opaque=0x1b76ad0, data=0x1bc7c70, size=19777)
>     at migration.c:324
>  #3  0x000000000048e442 in buffered_flush (s=0x1b76b90) at buffered_file.c:87
>  #4  0x000000000048e4cf in buffered_close (opaque=0x1b76b90) at buffered_file.c:177
>  #5  0x0000000000496d57 in qemu_fclose (f=0x1bbfc10) at savevm.c:479
>  #6  0x000000000048f4ca in migrate_fd_cleanup (s=0x1b76ad0) at migration.c:291
>  #7  0x000000000048f035 in do_migrate_cancel (mon=<optimized out>, qdict=<optimized out>,
>     ret_data=<optimized out>) at migration.c:136[snip]
>  [snip]
> 
> The migration_fd_cleanup method is where the problem really starts.
> Specifically it does
> 
>     if (s->file) {
>         DPRINTF("closing file\n");
>         if (qemu_fclose(s->file) != 0) {
>             ret = -1;
>         }
>         s->file = NULL;
>     }
> 
>     if (s->fd != -1)
>         close(s->fd);
> 
> And gets stuck in the qemu_fclose() bit because that method (rightly) tries
> to flush all outstanding buffers before closing. Unfortunately while this is
> desirable when migration ends successfully, it is undesirable when we are
> failing/cancelling migration.
> 
> It is hard to tell qemu_fclose() that it shouldn't flush buffers directly,
> so the alternative is to ensure that this method fails quickly when it
> attempts I/O. This is easily achieved, simply by closing 's->fd' before
> calling qemu_fclose.
> 
> Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> ---
>  migration.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/migration.c b/migration.c
> index f5959b4..a432c3b 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -286,6 +286,13 @@ int migrate_fd_cleanup(FdMigrationState *s)
>  
>      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>  
> +    if ((s->state == MIG_STATE_ERROR ||
> +         s->state == MIG_STATE_CANCELLED) &&
> +        s->fd != -1) {
> +        close(s->fd);
> +        s->fd = -1;
> +    }
> +
>      if (s->file) {
>          DPRINTF("closing file\n");
>          if (qemu_fclose(s->file) != 0) {

This approach results in 'socket_write' doing a send(-1) which
gets back a EBADF errno, causing the flush to abort.  An alternative
approach is to make the migrate_fd_put_buffer short-circuit the
send() call by checking the migration state thus:

diff --git a/migration.c b/migration.c
index f5959b4..6448d0b 100644
--- a/migration.c
+++ b/migration.c
@@ -319,6 +319,11 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
     FdMigrationState *s = opaque;
     ssize_t ret;
 
+    if (s->state == MIG_STATE_ERROR ||
+        s->state == MIG_STATE_CANCELLED) {
+        return -EIO;
+    }
+
     do {
         ret = s->write(s, data, size);
     } while (ret == -1 && ((s->get_error(s)) == EINTR));


I think I slightly prefer this second option, since it avoids the EBADF
scenario. Other opinions ?

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O
  2011-08-26 11:25 ` Daniel P. Berrange
@ 2011-08-26 13:48   ` Paolo Bonzini
  2012-06-01  5:04     ` Amos Kong
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2011-08-26 13:48 UTC (permalink / raw)
  To: qemu-devel

On 08/26/2011 01:25 PM, Daniel P. Berrange wrote:
> diff --git a/migration.c b/migration.c
> index f5959b4..6448d0b 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -319,6 +319,11 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
>       FdMigrationState *s = opaque;
>       ssize_t ret;
>
> +    if (s->state == MIG_STATE_ERROR ||
> +        s->state == MIG_STATE_CANCELLED) {
> +        return -EIO;
> +    }
> +
>       do {
>           ret = s->write(s, data, size);
>       } while (ret == -1&&  ((s->get_error(s)) == EINTR));
>
>
> I think I slightly prefer this second option, since it avoids the EBADF
> scenario. Other opinions ?

I agree.

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O
  2011-08-26 13:48   ` Paolo Bonzini
@ 2012-06-01  5:04     ` Amos Kong
  0 siblings, 0 replies; 4+ messages in thread
From: Amos Kong @ 2012-06-01  5:04 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gren, qemu-devel

On Fri, Aug 26, 2011 at 9:48 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/26/2011 01:25 PM, Daniel P. Berrange wrote:
>>
>> diff --git a/migration.c b/migration.c
>> index f5959b4..6448d0b 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -319,6 +319,11 @@ ssize_t migrate_fd_put_buffer(void *opaque, const
>> void *data, size_t size)
>>      FdMigrationState *s = opaque;
>>      ssize_t ret;
>>
>> +    if (s->state == MIG_STATE_ERROR ||
>> +        s->state == MIG_STATE_CANCELLED) {
>> +        return -EIO;
>> +    }
>> +
>>      do {
>>          ret = s->write(s, data, size);
>>      } while (ret == -1&&  ((s->get_error(s)) == EINTR));
>>
>>
>>
>> I think I slightly prefer this second option, since it avoids the EBADF
>> scenario. Other opinions ?
>
>
> I agree.

Hello Paolo, berrange,

What's the status of this patch? Gren also touched this problem
(failed to migrate_cancel).

Old thread: http://lists.nongnu.org/archive/html/qemu-devel/2011-08/msg03245.html


Amos.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-06-01  5:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-26 10:59 [Qemu-devel] [PATCH] Ensure migrate_cancel does not block doing I/O Daniel P. Berrange
2011-08-26 11:25 ` Daniel P. Berrange
2011-08-26 13:48   ` Paolo Bonzini
2012-06-01  5:04     ` Amos Kong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.