* [PATCH 1/1] migration: Terminate multifd threads on yank
@ 2021-07-30  7:40 Leonardo Bras
  2021-08-02 15:35 ` Dr. David Alan Gilbert
  2021-08-03  6:41 ` Lukas Straub
  0 siblings, 2 replies; 6+ messages in thread
From: Leonardo Bras @ 2021-07-30  7:40 UTC
  To: Juan Quintela, Dr. David Alan Gilbert, Lukas Straub
  Cc: Li Xiaohui, Leonardo Bras, qemu-devel

From the source host's viewpoint, losing a connection during migration will
cause the sockets to get stuck in the sendmsg() syscall, waiting for
the receiving side to reply.

In migration, yank works by shutting down the migration QIOChannel fd.
This causes a failure in the next sendmsg() for that fd, and the whole
migration gets cancelled.

In multifd, due to having multiple sockets in multiple threads,
on a connection loss there will be extra sockets stuck in sendmsg(),
and because they will be holding their own mutex, there is a good chance
the main migration thread can get stuck in multifd_send_pages()
waiting for one of those mutexes.

While it's waiting, the main migration thread can't run sendmsg() on
its fd, and therefore can't cause the migration to be cancelled, thus
causing yank not to work.

Fix this by shutting down all migration fds (including multifd ones),
so no thread gets stuck in sendmsg() while holding a lock, thus
allowing the main migration thread to properly cancel migration when
yank is used.

The same procedure is not needed for yank to work on the receiving
host, since ops->recv_pages() is kept outside the mutex-protected
code in multifd_recv_thread().

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1970337
Reported-by: Li Xiaohui <xiaohli@redhat.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c        | 11 +++++++++++
 migration/multifd.h        |  1 +
 migration/yank_functions.c |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 377da78f5b..744a180dfe 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1040,6 +1040,17 @@ void multifd_recv_sync_main(void)
     trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
 }
 
+void multifd_shutdown(void)
+{
+    if (!migrate_use_multifd()) {
+        return;
+    }
+
+    if (multifd_send_state) {
+        multifd_send_terminate_threads(NULL);
+    }
+}
+
 static void *multifd_recv_thread(void *opaque)
 {
     MultiFDRecvParams *p = opaque;
diff --git a/migration/multifd.h b/migration/multifd.h
index 8d6751f5ed..0517213bdf 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,6 +22,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 void multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
+void multifd_shutdown(void);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index 8c08aef14a..9335a64f00 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -15,12 +15,14 @@
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
 #include "qemu-file.h"
+#include "multifd.h"
 
 void migration_yank_iochannel(void *opaque)
 {
     QIOChannel *ioc = QIO_CHANNEL(opaque);
 
     qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    multifd_shutdown();
 }
 
 /* Return whether yank is supported on this ioc */
-- 
2.32.0




* Re: [PATCH 1/1] migration: Terminate multifd threads on yank
From: Dr. David Alan Gilbert @ 2021-08-02 15:35 UTC
  To: Leonardo Bras; +Cc: Li Xiaohui, Lukas Straub, qemu-devel, Juan Quintela

* Leonardo Bras (leobras@redhat.com) wrote:
> From the source host's viewpoint, losing a connection during migration will
> cause the sockets to get stuck in the sendmsg() syscall, waiting for
> the receiving side to reply.
> 
> In migration, yank works by shutting down the migration QIOChannel fd.
> This causes a failure in the next sendmsg() for that fd, and the whole
> migration gets cancelled.
> 
> In multifd, due to having multiple sockets in multiple threads,
> on a connection loss there will be extra sockets stuck in sendmsg(),
> and because they will be holding their own mutex, there is a good chance
> the main migration thread can get stuck in multifd_send_pages()
> waiting for one of those mutexes.
> 
> While it's waiting, the main migration thread can't run sendmsg() on
> its fd, and therefore can't cause the migration to be cancelled, thus
> causing yank not to work.
> 
> Fix this by shutting down all migration fds (including multifd ones),
> so no thread gets stuck in sendmsg() while holding a lock, thus
> allowing the main migration thread to properly cancel migration when
> yank is used.
> 
> The same procedure is not needed for yank to work on the receiving
> host, since ops->recv_pages() is kept outside the mutex-protected
> code in multifd_recv_thread().
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1970337
> Reported-by: Li Xiaohui <xiaohli@redhat.com>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd.c        | 11 +++++++++++
>  migration/multifd.h        |  1 +
>  migration/yank_functions.c |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 377da78f5b..744a180dfe 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1040,6 +1040,17 @@ void multifd_recv_sync_main(void)
>      trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
>  }
>  
> +void multifd_shutdown(void)
> +{
> +    if (!migrate_use_multifd()) {
> +        return;
> +    }
> +
> +    if (multifd_send_state) {
> +        multifd_send_terminate_threads(NULL);
> +    }

That calls:
    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        qemu_mutex_lock(&p->mutex);
        p->quit = true;
        qemu_sem_post(&p->sem);
        qemu_mutex_unlock(&p->mutex);
    }

so why doesn't this also get stuck in the same mutex you're trying to
fix?

Does qio_channel_shutdown() actually cause a shutdown on all the fds
for the multifd?

(I've just seen the multifd/cancel test fail stuck in multifd_send_sync_main
waiting on one of the locks).

Dave

> +}
> +
>  static void *multifd_recv_thread(void *opaque)
>  {
>      MultiFDRecvParams *p = opaque;
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8d6751f5ed..0517213bdf 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -22,6 +22,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>  void multifd_recv_sync_main(void);
>  void multifd_send_sync_main(QEMUFile *f);
>  int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
> +void multifd_shutdown(void);
>  
>  /* Multifd Compression flags */
>  #define MULTIFD_FLAG_SYNC (1 << 0)
> diff --git a/migration/yank_functions.c b/migration/yank_functions.c
> index 8c08aef14a..9335a64f00 100644
> --- a/migration/yank_functions.c
> +++ b/migration/yank_functions.c
> @@ -15,12 +15,14 @@
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
>  #include "qemu-file.h"
> +#include "multifd.h"
>  
>  void migration_yank_iochannel(void *opaque)
>  {
>      QIOChannel *ioc = QIO_CHANNEL(opaque);
>  
>      qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +    multifd_shutdown();
>  }
>  
>  /* Return whether yank is supported on this ioc */
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH 1/1] migration: Terminate multifd threads on yank
From: Lukas Straub @ 2021-08-03  6:41 UTC
  To: Leonardo Bras
  Cc: qemu-devel, Li Xiaohui, Dr. David Alan Gilbert, Juan Quintela

On Fri, 30 Jul 2021 04:40:45 -0300
Leonardo Bras <leobras@redhat.com> wrote:

> From the source host's viewpoint, losing a connection during migration will
> cause the sockets to get stuck in the sendmsg() syscall, waiting for
> the receiving side to reply.
> 
> In migration, yank works by shutting down the migration QIOChannel fd.
> This causes a failure in the next sendmsg() for that fd, and the whole
> migration gets cancelled.
> 
> In multifd, due to having multiple sockets in multiple threads,
> on a connection loss there will be extra sockets stuck in sendmsg(),
> and because they will be holding their own mutex, there is a good chance
> the main migration thread can get stuck in multifd_send_pages()
> waiting for one of those mutexes.
> 
> While it's waiting, the main migration thread can't run sendmsg() on
> its fd, and therefore can't cause the migration to be cancelled, thus
> causing yank not to work.
> 
> Fix this by shutting down all migration fds (including multifd ones),
> so no thread gets stuck in sendmsg() while holding a lock, thus
> allowing the main migration thread to properly cancel migration when
> yank is used.
> 
> The same procedure is not needed for yank to work on the receiving
> host, since ops->recv_pages() is kept outside the mutex-protected
> code in multifd_recv_thread().
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1970337
> Reported-by: Li Xiaohui <xiaohli@redhat.com>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---

Hi,
There is an easier explanation: I forgot the send side of multifd
altogether (I thought it was covered by migration_channel_connect()).
So yank won't actually shutdown() the multifd sockets on the send side.

In the bug report you wrote:
> (As a test, I called qio_channel_shutdown() in every multifd iochannel and yank worked just fine, but I could not retry migration, because it was still 'ongoing')
That sounds like a bug in the error handling for multifd. But quickly
looking at the code, it should properly fail the migration.

BTW: You can shut down outgoing sockets from outside of qemu with the
'ss' utility, like this: 'sudo ss -K dst <destination ip> dport = <destination port>'

Regards,
Lukas Straub


* Re: [PATCH 1/1] migration: Terminate multifd threads on yank
From: Leonardo Bras Soares Passos @ 2021-08-03  7:02 UTC
  To: Dr. David Alan Gilbert
  Cc: Li Xiaohui, Lukas Straub, qemu-devel, Juan Quintela

Hello Dave,

> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 377da78f5b..744a180dfe 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -1040,6 +1040,17 @@ void multifd_recv_sync_main(void)
> >      trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
> >  }
> >
> > +void multifd_shutdown(void)
> > +{
> > +    if (!migrate_use_multifd()) {
> > +        return;
> > +    }
> > +
> > +    if (multifd_send_state) {
> > +        multifd_send_terminate_threads(NULL);
> > +    }
>
> That calls:
>     for (i = 0; i < migrate_multifd_channels(); i++) {
>         MultiFDSendParams *p = &multifd_send_state->params[i];
>
>         qemu_mutex_lock(&p->mutex);
>         p->quit = true;
>         qemu_sem_post(&p->sem);
>         qemu_mutex_unlock(&p->mutex);
>     }
>
> so why doesn't this also get stuck in the same mutex you're trying to
> fix?

You are right, I got confused over the locks.
I need to get a better look at the code, and truly understand why this
patch fixes (?) the issue.

>
> Does qio_channel_shutdown() actually cause a shutdown on all the fds
> for the multifd?

As far as I tested, it does shut down a single fd, but whenever this fd
fails in its first sendmsg() it causes the migration to fail, and all
the other fds get shut down as well.

>
> (I've just seen the multifd/cancel test fail stuck in multifd_send_sync_main
> waiting on one of the locks).
>
> Dave
>

> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

I will do a little more reading / debugging in this code.
Thanks Dave!




* Re: [PATCH 1/1] migration: Terminate multifd threads on yank
From: Leonardo Bras Soares Passos @ 2021-08-03  7:18 UTC
  To: Lukas Straub
  Cc: qemu-devel, Li Xiaohui, Dr. David Alan Gilbert, Juan Quintela

Hello Lukas,

On Tue, Aug 3, 2021 at 3:42 AM Lukas Straub <lukasstraub2@web.de> wrote:
> Hi,
> There is an easier explanation: I forgot the send side of multifd
> altogether (I thought it was covered by migration_channel_connect()).
> So yank won't actually shutdown() the multifd sockets on the send side.

If I understood that correctly, it seems to abort the migration (and
therefore close all fds) once the fd that went through
qio_channel_shutdown() gets to sendmsg(), which can take a while.
But it really does not close the fds before that.

>
> In the bug report you wrote:
> > (As a test, I called qio_channel_shutdown() in every multifd iochannel and yank worked just fine, but I could not retry migration, because it was still 'ongoing')
> That sounds like a bug in the error handling for multifd. But quickly
> looking at the code, it should properly fail the migration.

In the end, simply asking each thread to exit gave me a smoother
migration abort.
>
> BTW: You can shut down outgoing sockets from outside of qemu with the
> 'ss' utility, like this: 'sudo ss -K dst <destination ip> dport = <destination port>'

Very nice tool, thanks for sharing!

>
> Regards,
> Lukas Straub

Best regards,
Leonardo Bras




* Re: [PATCH 1/1] migration: Terminate multifd threads on yank
From: Lukas Straub @ 2021-08-03  8:25 UTC
  To: Leonardo Bras Soares Passos
  Cc: qemu-devel, Li Xiaohui, Dr. David Alan Gilbert, Juan Quintela

On Tue, 3 Aug 2021 04:18:42 -0300
Leonardo Bras Soares Passos <leobras@redhat.com> wrote:

> Hello Lukas,
> 
> On Tue, Aug 3, 2021 at 3:42 AM Lukas Straub <lukasstraub2@web.de> wrote:
> > Hi,
> > There is an easier explanation: I forgot the send side of multifd
> > altogether (I thought it was covered by migration_channel_connect()).
> > So yank won't actually shutdown() the multifd sockets on the send side.  
> 
> If I understood that correctly, it seems to abort the migration (and
> therefore close all fds) once the fd that went through
> qio_channel_shutdown() gets to sendmsg(), which can take a while.

How long is "can take a while"? Until some TCP connection times out?
That would mean that it is hanging somewhere else.

I mean in precopy migration the multifd send threads should be fully
utilized and always sending something until the migration finishes. In
that case it is likely that all the threads become stuck in
qio_channel_write_all() if the connection breaks silently (i.e.
discards packets or the destination is powered off, no connection
reset) since there are no TCP ACKs arriving from the destination side
-> kernel TCP buffer becomes full -> qio_channel_write_all() blocks.
Thus, shutdown() on the sockets should be enough to get the threads
unstuck and notice that the connection broke.
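
A minimal sketch of that effect outside of QEMU, assuming Linux: a
sender keeps writing to a socket nobody reads from until send() blocks
on a full buffer, and a shutdown() on that socket then makes the blocked
send() fail. An AF_UNIX socketpair stands in for the migration TCP
connection, a forked child stands in for a multifd send thread, and
sender_errno_after_shutdown() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for usleep() under strict -std= modes */
#endif
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of QEMU.
 * Returns the errno the sender saw when its send() finally failed,
 * or -1 if the setup failed or the sender never got unstuck.
 */
int sender_errno_after_shutdown(void)
{
    int sv[2];
    int status;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Sender: nobody reads sv[1], so send() soon blocks. */
        char buf[4096];

        memset(buf, 0, sizeof(buf));
        alarm(10); /* safety net: get killed rather than hang forever */
        for (;;) {
            if (send(sv[0], buf, sizeof(buf), MSG_NOSIGNAL) < 0) {
                _exit(errno & 0xff);
            }
        }
    }

    /* Crude wait so the sender has time to fill the buffer and block. */
    usleep(200 * 1000);

    /* shutdown() acts on the socket itself, so the child is affected too. */
    shutdown(sv[0], SHUT_RDWR);

    if (waitpid(pid, &status, 0) < 0) {
        status = -1;
    }
    close(sv[0]);
    close(sv[1]);
    return (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

On Linux the sender is expected to report EPIPE. The sleep is only a
heuristic: even if the sender has not blocked yet, its next send() after
the shutdown() fails the same way.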

If something else hangs, the question is where...

> But it really does not close the fds before that.

Note: shutdown() is not close().
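
The difference can be sketched as follows, assuming Linux and using an
AF_UNIX socketpair purely for illustration: after shutdown() the
descriptor is still open but can no longer send, while after close()
the descriptor itself is gone. Both function names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd still refers to an open descriptor, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/*
 * Hypothetical check, not part of QEMU.
 * Returns 0 if the expected behaviour was observed, 1 if not,
 * and -1 if the socketpair could not be created.
 */
int check_shutdown_vs_close(void)
{
    int sv[2];
    char byte = 0;
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    /* After shutdown(): the fd is still valid, but sending fails. */
    shutdown(sv[0], SHUT_RDWR);
    ok = fd_is_open(sv[0]) &&
         send(sv[0], &byte, 1, MSG_NOSIGNAL) == -1 && errno == EPIPE;

    /* After close(): the fd number no longer refers to anything. */
    close(sv[0]);
    ok = ok && !fd_is_open(sv[0]);

    close(sv[1]);
    return ok ? 0 : 1;
}
```

So a channel that was only shut down still has to be closed later by
whoever owns it.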

> >
> > In the bug report you wrote:
> > > (As a test, I called qio_channel_shutdown() in every multifd iochannel and yank worked just fine, but I could not retry migration, because it was still 'ongoing')  
> > That sounds like a bug in the error handling for multifd. But quickly
> > looking at the code, it should properly fail the migration.  
> 
> In the end, simply asking each thread to exit gave me a smoother
> migration abort.
> >
> > BTW: You can shut down outgoing sockets from outside of qemu with the
> > 'ss' utility, like this: 'sudo ss -K dst <destination ip> dport = <destination port>'
> 
> Very nice tool, thanks for sharing!
> 
> >
> > Regards,
> > Lukas Straub  
> 
> Best regards,
> Leonardo Bras
> 



Thread overview: 6+ messages
2021-07-30  7:40 [PATCH 1/1] migration: Terminate multifd threads on yank Leonardo Bras
2021-08-02 15:35 ` Dr. David Alan Gilbert
2021-08-03  7:02   ` Leonardo Bras Soares Passos
2021-08-03  6:41 ` Lukas Straub
2021-08-03  7:18   ` Leonardo Bras Soares Passos
2021-08-03  8:25     ` Lukas Straub