[PATCH v1 0/2] Zero copy improvements (QIOChannel + multifd)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1 0/2] Zero copy improvements (QIOChannel + multifd)
@ 2022-06-28  1:09 Leonardo Bras
  2022-06-28  1:09 ` [PATCH v1 1/2] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent Leonardo Bras
  2022-06-28  1:09 ` [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working Leonardo Bras
  0 siblings, 2 replies; 11+ messages in thread
From: Leonardo Bras @ 2022-06-28  1:09 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

The first patch avoid spuriously returning 1 [*] when zero-copy flush is
attempted before any buffer was sent using MSG_ZEROCOPY.

[*] zero-copy not being used, even though it's enabled and supported
by kernel

The second patch should be applied on top of Juan Quintela's patchset that
reduces the frequency of multifd syncs, or else it could potentially send 
20 warnings per second.
									...
Leonardo Bras (2):
  QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing
    sent
  migration/multifd: Warn user when zerocopy not working

 io/channel-socket.c | 8 +++++++-
 migration/multifd.c | 3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)

-- 
2.36.1



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v1 1/2] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent
  2022-06-28  1:09 [PATCH v1 0/2] Zero copy improvements (QIOChannel + multifd) Leonardo Bras
@ 2022-06-28  1:09 ` Leonardo Bras
  2022-06-28  1:09 ` [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working Leonardo Bras
  1 sibling, 0 replies; 11+ messages in thread
From: Leonardo Bras @ 2022-06-28  1:09 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

If flush is called when no buffer was sent with MSG_ZEROCOPY, it currently
returns 1. This return code should be used only when Linux fails to use
MSG_ZEROCOPY on a lot of sendmsg().

Fix this by returning early from flush if no sendmsg(...,MSG_ZEROCOPY)
was attempted.

Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX")
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 io/channel-socket.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index 4466bb1cd4..698c086b70 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -716,12 +716,18 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     struct cmsghdr *cm;
     char control[CMSG_SPACE(sizeof(*serr))];
     int received;
-    int ret = 1;
+    int ret;
+
+    if (!sioc->zero_copy_queued) {
+        return 0;
+    }
 
     msg.msg_control = control;
     msg.msg_controllen = sizeof(control);
     memset(control, 0, sizeof(control));
 
+    ret = 1;
+
     while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
         received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
         if (received < 0) {
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28  1:09 [PATCH v1 0/2] Zero copy improvements (QIOChannel + multifd) Leonardo Bras
  2022-06-28  1:09 ` [PATCH v1 1/2] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent Leonardo Bras
@ 2022-06-28  1:09 ` Leonardo Bras
  2022-06-28  7:53   ` Daniel P. Berrangé
  1 sibling, 1 reply; 11+ messages in thread
From: Leonardo Bras @ 2022-06-28  1:09 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

Some errors, like the lack of Scatter-Gather support by the network
interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
zero-copy, which causes it to fall back to the default copying mechanism.

After each full dirty-bitmap scan there should be a zero-copy flush
happening, which checks for errors each of the previous calls to
sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
warn the user about it.

Since it happens once each full dirty-bitmap scan, even in worst case
scenario it should not print a lot of warnings, and will allow tracking
how many dirty-bitmap iterations were not able to use zero-copy send.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 684c014c86..9c62aec84e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
             if (ret < 0) {
                 error_report_err(err);
                 return -1;
+            } else if (ret == 1) {
+                warn_report("The network device is not able to use "
+                            "zero-copy-send: copying is being used");
             }
         }
     }
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28  1:09 ` [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working Leonardo Bras
@ 2022-06-28  7:53   ` Daniel P. Berrangé
  2022-06-28 12:32     ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28  7:53 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> Some errors, like the lack of Scatter-Gather support by the network
> interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> zero-copy, which causes it to fall back to the default copying mechanism.

How common is this lack of SG support ? What NICs did you have that
were affected ?

> After each full dirty-bitmap scan there should be a zero-copy flush
> happening, which checks for errors each of the previous calls to
> sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> warn the user about it.
> 
> Since it happens once each full dirty-bitmap scan, even in worst case
> scenario it should not print a lot of warnings, and will allow tracking
> how many dirty-bitmap iterations were not able to use zero-copy send.

For long running migrations which are not converging, or converging
very slowly there could be 100's of passes.



> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 684c014c86..9c62aec84e 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
>              if (ret < 0) {
>                  error_report_err(err);
>                  return -1;
> +            } else if (ret == 1) {
> +                warn_report("The network device is not able to use "
> +                            "zero-copy-send: copying is being used");
>              }
>          }
>      }
> -- 
> 2.36.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28  7:53   ` Daniel P. Berrangé
@ 2022-06-28 12:32     ` Leonardo Bras Soares Passos
  2022-06-28 13:02       ` Daniel P. Berrangé
  0 siblings, 1 reply; 11+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-06-28 12:32 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > Some errors, like the lack of Scatter-Gather support by the network
> > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > zero-copy, which causes it to fall back to the default copying mechanism.
>
> How common is this lack of SG support ? What NICs did you have that
> were affected ?

I am not aware of any NIC without SG available for testing, nor have
any idea on how common they are.
But since we can detect sendmsg() falling back to copying we should
warn the user if this ever happens.

There is also a case in IPv6 related to fragmentation that may cause
MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
covered.

>
> > After each full dirty-bitmap scan there should be a zero-copy flush
> > happening, which checks for errors each of the previous calls to
> > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > warn the user about it.
> >
> > Since it happens once each full dirty-bitmap scan, even in worst case
> > scenario it should not print a lot of warnings, and will allow tracking
> > how many dirty-bitmap iterations were not able to use zero-copy send.
>
> For long running migrations which are not converging, or converging
> very slowly there could be 100's of passes.
>

I could change it so it only warns once, if that is too much output.

Best regards,
Leo

>
>
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 684c014c86..9c62aec84e 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
> >              if (ret < 0) {
> >                  error_report_err(err);
> >                  return -1;
> > +            } else if (ret == 1) {
> > +                warn_report("The network device is not able to use "
> > +                            "zero-copy-send: copying is being used");
> >              }
> >          }
> >      }
> > --
> > 2.36.1
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28 12:32     ` Leonardo Bras Soares Passos
@ 2022-06-28 13:02       ` Daniel P. Berrangé
  2022-06-28 13:52         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 13:02 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > Some errors, like the lack of Scatter-Gather support by the network
> > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > zero-copy, which causes it to fall back to the default copying mechanism.
> >
> > How common is this lack of SG support ? What NICs did you have that
> > were affected ?
> 
> I am not aware of any NIC without SG available for testing, nor have
> any idea on how common they are.
> But since we can detect sendmsg() falling back to copying we should
> warn the user if this ever happens.
> 
> There is also a case in IPv6 related to fragmentation that may cause
> MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> covered.
> 
> >
> > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > happening, which checks for errors each of the previous calls to
> > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > warn the user about it.
> > >
> > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > scenario it should not print a lot of warnings, and will allow tracking
> > > how many dirty-bitmap iterations were not able to use zero-copy send.
> >
> > For long running migrations which are not converging, or converging
> > very slowly there could be 100's of passes.
> >
> 
> I could change it so it only warns once, if that is too much output.

Well I'm mostly wondering what we're expecting the user todo with this
information. Generally a log file containing warnings ends up turning
into a bug report. If we think it is important for users and/or mgmt
apps to be aware of this info, then it might be better to actually
put a field in the query-migrate stats to report if zero-copy is
being honoured or not, and just have a trace point in this location
instead.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28 13:02       ` Daniel P. Berrangé
@ 2022-06-28 13:52         ` Dr. David Alan Gilbert
  2022-06-28 16:53           ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28 13:52 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras Soares Passos, Juan Quintela, Peter Xu, qemu-devel

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > >
> > > How common is this lack of SG support ? What NICs did you have that
> > > were affected ?
> > 
> > I am not aware of any NIC without SG available for testing, nor have
> > any idea on how common they are.
> > But since we can detect sendmsg() falling back to copying we should
> > warn the user if this ever happens.
> > 
> > There is also a case in IPv6 related to fragmentation that may cause
> > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > covered.
> > 
> > >
> > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > happening, which checks for errors each of the previous calls to
> > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > warn the user about it.
> > > >
> > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > >
> > > For long running migrations which are not converging, or converging
> > > very slowly there could be 100's of passes.
> > >
> > 
> > I could change it so it only warns once, if that is too much output.
> 
> Well I'm mostly wondering what we're expecting the user todo with this
> information. Generally a log file containing warnings ends up turning
> into a bug report. If we think it is important for users and/or mgmt
> apps to be aware of this info, then it might be better to actually
> put a field in the query-migrate stats to report if zero-copy is
> being honoured or not,

Yeh just a counter would work there I think.

> and just have a trace point in this location
> instead.

Yeh.

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28 13:52         ` Dr. David Alan Gilbert
@ 2022-06-28 16:53           ` Leonardo Bras Soares Passos
  2022-06-28 16:56             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-06-28 16:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé, Juan Quintela, Peter Xu, qemu-devel

On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > > >
> > > > How common is this lack of SG support ? What NICs did you have that
> > > > were affected ?
> > >
> > > I am not aware of any NIC without SG available for testing, nor have
> > > any idea on how common they are.
> > > But since we can detect sendmsg() falling back to copying we should
> > > warn the user if this ever happens.
> > >
> > > There is also a case in IPv6 related to fragmentation that may cause
> > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > covered.
> > >
> > > >
> > > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > > happening, which checks for errors each of the previous calls to
> > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > > warn the user about it.
> > > > >
> > > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > > >
> > > > For long running migrations which are not converging, or converging
> > > > very slowly there could be 100's of passes.
> > > >
> > >
> > > I could change it so it only warns once, if that is too much output.
> >
> > Well I'm mostly wondering what we're expecting the user todo with this
> > information.


My rationale on that:
- zero-copy-send is a feature that is supposed to improve send
throughput by reducing cpu usage.
- there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
- if this happens, there will be a potential throughput decrease on sendmsg()
- the user (or management app) need to know when zero-copy-send is
degrading throughput, so it can be disabled
- this is also important for performance testing, given it can be
confusing having zero-copy-send improving throughput in some cases,
and degrading in others, without any apparent reason why.

> > Generally a log file containing warnings ends up turning
> > into a bug report. If we think it is important for users and/or mgmt
> > apps to be aware of this info, then it might be better to actually
> > put a field in the query-migrate stats to report if zero-copy is
> > being honoured or not,
>
> Yeh just a counter would work there I think.

The warning idea was totally due to my inexperience on this mgmt app
interface, since I had no other idea on how to deal with that.

I think having it in query-migrate is a much better idea than a
warning, since it should be much easier to parse and disable
zero-copy-send if desired.
Even in my current qemu test script, it's much better having it in
query-migrate.

>
> > and just have a trace point in this location
> > instead.
>
> Yeh.
>

Yeap, the counter idea seems great!
Will it be always printed there, or only when zero-copy-send is enabled?

Best regards,
Leo

> Dave
>
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28 16:53           ` Leonardo Bras Soares Passos
@ 2022-06-28 16:56             ` Dr. David Alan Gilbert
  2022-07-01  6:18               ` Leonardo Brás
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28 16:56 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Daniel P. Berrangé, Juan Quintela, Peter Xu, qemu-devel

* Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > > > >
> > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > were affected ?
> > > >
> > > > I am not aware of any NIC without SG available for testing, nor have
> > > > any idea on how common they are.
> > > > But since we can detect sendmsg() falling back to copying we should
> > > > warn the user if this ever happens.
> > > >
> > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > covered.
> > > >
> > > > >
> > > > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > > > happening, which checks for errors each of the previous calls to
> > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > > > warn the user about it.
> > > > > >
> > > > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > > > >
> > > > > For long running migrations which are not converging, or converging
> > > > > very slowly there could be 100's of passes.
> > > > >
> > > >
> > > > I could change it so it only warns once, if that is too much output.
> > >
> > > Well I'm mostly wondering what we're expecting the user todo with this
> > > information.
> 
> 
> My rationale on that:
> - zero-copy-send is a feature that is supposed to improve send
> throughput by reducing cpu usage.
> - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> - if this happens, there will be a potential throughput decrease on sendmsg()
> - the user (or management app) need to know when zero-copy-send is
> degrading throughput, so it can be disabled
> - this is also important for performance testing, given it can be
> confusing having zero-copy-send improving throughput in some cases,
> and degrading in others, without any apparent reason why.
> 
> > > Generally a log file containing warnings ends up turning
> > > into a bug report. If we think it is important for users and/or mgmt
> > > apps to be aware of this info, then it might be better to actually
> > > put a field in the query-migrate stats to report if zero-copy is
> > > being honoured or not,
> >
> > Yeh just a counter would work there I think.
> 
> The warning idea was totally due to my inexperience on this mgmt app
> interface, since I had no other idea on how to deal with that.

Yeh it's not too silly an idea!
The way some of these warning or stats get to us can be a bit random,
but sometimes can confuse things.

> I think having it in query-migrate is a much better idea than a
> warning, since it should be much easier to parse and disable
> zero-copy-send if desired.
> Even in my current qemu test script, it's much better having it in
> query-migrate.
> 
> >
> > > and just have a trace point in this location
> > > instead.
> >
> > Yeh.
> >
> 
> Yeap, the counter idea seems great!
> Will it be always printed there, or only when zero-copy-send is enabled?

You could make it either if it's enabled or if it's none zero.
(I guess you want it to reset to 0 at the start of a new migration).

Dave

> 
> Best regards,
> Leo
> 
> > Dave
> >
> > > With regards,
> > > Daniel
> > > --
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-06-28 16:56             ` Dr. David Alan Gilbert
@ 2022-07-01  6:18               ` Leonardo Brás
  2022-07-04  9:19                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Leonardo Brás @ 2022-07-01  6:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé, Juan Quintela, Peter Xu, qemu-devel

On Tue, 2022-06-28 at 17:56 +0100, Dr. David Alan Gilbert wrote:
> * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > 
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos
> > > > wrote:
> > > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé
> > > > > <berrange@redhat.com> wrote:
> > > > > > 
> > > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > > Some errors, like the lack of Scatter-Gather support by the
> > > > > > > network
> > > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail
> > > > > > > on using
> > > > > > > zero-copy, which causes it to fall back to the default copying
> > > > > > > mechanism.
> > > > > > 
> > > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > > were affected ?
> > > > > 
> > > > > I am not aware of any NIC without SG available for testing, nor have
> > > > > any idea on how common they are.
> > > > > But since we can detect sendmsg() falling back to copying we should
> > > > > warn the user if this ever happens.
> > > > > 
> > > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > > covered.
> > > > > 
> > > > > > 
> > > > > > > After each full dirty-bitmap scan there should be a zero-copy
> > > > > > > flush
> > > > > > > happening, which checks for errors each of the previous calls to
> > > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy,
> > > > > > > then
> > > > > > > warn the user about it.
> > > > > > > 
> > > > > > > Since it happens once each full dirty-bitmap scan, even in worst
> > > > > > > case
> > > > > > > scenario it should not print a lot of warnings, and will allow
> > > > > > > tracking
> > > > > > > how many dirty-bitmap iterations were not able to use zero-copy
> > > > > > > send.
> > > > > > 
> > > > > > For long running migrations which are not converging, or converging
> > > > > > very slowly there could be 100's of passes.
> > > > > > 
> > > > > 
> > > > > I could change it so it only warns once, if that is too much output.
> > > > 
> > > > Well I'm mostly wondering what we're expecting the user todo with this
> > > > information.
> > 
> > 
> > My rationale on that:
> > - zero-copy-send is a feature that is supposed to improve send
> > throughput by reducing cpu usage.
> > - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> > - if this happens, there will be a potential throughput decrease on
> > sendmsg()
> > - the user (or management app) need to know when zero-copy-send is
> > degrading throughput, so it can be disabled
> > - this is also important for performance testing, given it can be
> > confusing having zero-copy-send improving throughput in some cases,
> > and degrading in others, without any apparent reason why.
> > 
> > > > Generally a log file containing warnings ends up turning
> > > > into a bug report. If we think it is important for users and/or mgmt
> > > > apps to be aware of this info, then it might be better to actually
> > > > put a field in the query-migrate stats to report if zero-copy is
> > > > being honoured or not,
> > > 
> > > Yeh just a counter would work there I think.
> > 
> > The warning idea was totally due to my inexperience on this mgmt app
> > interface, since I had no other idea on how to deal with that.
> 
> Yeh it's not too silly an idea!
> The way some of these warning or stats get to us can be a bit random,
> but sometimes can confuse things.
> 
> > I think having it in query-migrate is a much better idea than a
> > warning, since it should be much easier to parse and disable
> > zero-copy-send if desired.
> > Even in my current qemu test script, it's much better having it in
> > query-migrate.
> > 
> > > 
> > > > and just have a trace point in this location
> > > > instead.
> > > 
> > > Yeh.
> > > 
> > 
> > Yeap, the counter idea seems great!
> > Will it be always printed there, or only when zero-copy-send is enabled?
> 
> You could make it either if it's enabled or if it's none zero.
> (I guess you want it to reset to 0 at the start of a new migration).
> 
> Dave

Thanks for this feedback!

I have everything already working, but I am struggling with a good property
name. 

I am currently using zero_copy_copied (or zero-copy-copied in json), but it does
not look like a good Migration stat name. 

Do you have any suggestion?

Best regards,
Leo


> 
> > 
> > Best regards,
> > Leo
> > 
> > > Dave
> > > 
> > > > With regards,
> > > > Daniel
> > > > --
> > > > > : https://berrange.com      -o-   
> > > > > https://www.flickr.com/photos/dberrange :|
> > > > > : https://libvirt.org         -o-           
> > > > > https://fstop138.berrange.com :|
> > > > > : https://entangle-photo.org    -o-   
> > > > > https://www.instagram.com/dberrange :|
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> > 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working
  2022-07-01  6:18               ` Leonardo Brás
@ 2022-07-04  9:19                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2022-07-04  9:19 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Daniel P. Berrangé, Juan Quintela, Peter Xu, qemu-devel

* Leonardo Brás (leobras@redhat.com) wrote:
> On Tue, 2022-06-28 at 17:56 +0100, Dr. David Alan Gilbert wrote:
> > * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > > On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > > 
> > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos
> > > > > wrote:
> > > > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé
> > > > > > <berrange@redhat.com> wrote:
> > > > > > > 
> > > > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > > > Some errors, like the lack of Scatter-Gather support by the
> > > > > > > > network
> > > > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail
> > > > > > > > on using
> > > > > > > > zero-copy, which causes it to fall back to the default copying
> > > > > > > > mechanism.
> > > > > > > 
> > > > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > > > were affected ?
> > > > > > 
> > > > > > I am not aware of any NIC without SG available for testing, nor have
> > > > > > any idea on how common they are.
> > > > > > But since we can detect sendmsg() falling back to copying we should
> > > > > > warn the user if this ever happens.
> > > > > > 
> > > > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > > > covered.
> > > > > > 
> > > > > > > 
> > > > > > > > After each full dirty-bitmap scan there should be a zero-copy
> > > > > > > > flush
> > > > > > > > happening, which checks for errors each of the previous calls to
> > > > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy,
> > > > > > > > then
> > > > > > > > warn the user about it.
> > > > > > > > 
> > > > > > > > Since it happens once each full dirty-bitmap scan, even in worst
> > > > > > > > case
> > > > > > > > scenario it should not print a lot of warnings, and will allow
> > > > > > > > tracking
> > > > > > > > how many dirty-bitmap iterations were not able to use zero-copy
> > > > > > > > send.
> > > > > > > 
> > > > > > > For long running migrations which are not converging, or converging
> > > > > > > very slowly there could be 100's of passes.
> > > > > > > 
> > > > > > 
> > > > > > I could change it so it only warns once, if that is too much output.
> > > > > 
> > > > > Well I'm mostly wondering what we're expecting the user todo with this
> > > > > information.
> > > 
> > > 
> > > My rationale on that:
> > > - zero-copy-send is a feature that is supposed to improve send
> > > throughput by reducing cpu usage.
> > > - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> > > - if this happens, there will be a potential throughput decrease on
> > > sendmsg()
> > > - the user (or management app) need to know when zero-copy-send is
> > > degrading throughput, so it can be disabled
> > > - this is also important for performance testing, given it can be
> > > confusing having zero-copy-send improving throughput in some cases,
> > > and degrading in others, without any apparent reason why.
> > > 
> > > > > Generally a log file containing warnings ends up turning
> > > > > into a bug report. If we think it is important for users and/or mgmt
> > > > > apps to be aware of this info, then it might be better to actually
> > > > > put a field in the query-migrate stats to report if zero-copy is
> > > > > being honoured or not,
> > > > 
> > > > Yeh just a counter would work there I think.
> > > 
> > > The warning idea was totally due to my inexperience on this mgmt app
> > > interface, since I had no other idea on how to deal with that.
> > 
> > Yeh it's not too silly an idea!
> > The way some of these warning or stats get to us can be a bit random,
> > but sometimes can confuse things.
> > 
> > > I think having it in query-migrate is a much better idea than a
> > > warning, since it should be much easier to parse and disable
> > > zero-copy-send if desired.
> > > Even in my current qemu test script, it's much better having it in
> > > query-migrate.
> > > 
> > > > 
> > > > > and just have a trace point in this location
> > > > > instead.
> > > > 
> > > > Yeh.
> > > > 
> > > 
> > > Yeap, the counter idea seems great!
> > > Will it be always printed there, or only when zero-copy-send is enabled?
> > 
> > You could make it either if it's enabled or if it's none zero.
> > (I guess you want it to reset to 0 at the start of a new migration).
> > 
> > Dave
> 
> Thanks for this feedback!
> 
> I have everything already working, but I am struggling with a good property
> name. 
> 
> I am currently using zero_copy_copied (or zero-copy-copied in json), but it does
> not look like a good Migration stat name. 
> 
> Do you have any suggestion?

Shrug; I'm not going to fuss over the name too much as long as it's
reasonable. 'zero-copied' might be OK.

Dave

> Best regards,
> Leo
> 
> 
> > 
> > > 
> > > Best regards,
> > > Leo
> > > 
> > > > Dave
> > > > 
> > > > > With regards,
> > > > > Daniel
> > > > > --
> > > > > > : https://berrange.com      -o-   
> > > > > > https://www.flickr.com/photos/dberrange :|
> > > > > > : https://libvirt.org         -o-           
> > > > > > https://fstop138.berrange.com :|
> > > > > > : https://entangle-photo.org    -o-   
> > > > > > https://www.instagram.com/dberrange :|
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > 
> > > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-07-04  9:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28  1:09 [PATCH v1 0/2] Zero copy improvements (QIOChannel + multifd) Leonardo Bras
2022-06-28  1:09 ` [PATCH v1 1/2] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent Leonardo Bras
2022-06-28  1:09 ` [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working Leonardo Bras
2022-06-28  7:53   ` Daniel P. Berrangé
2022-06-28 12:32     ` Leonardo Bras Soares Passos
2022-06-28 13:02       ` Daniel P. Berrangé
2022-06-28 13:52         ` Dr. David Alan Gilbert
2022-06-28 16:53           ` Leonardo Bras Soares Passos
2022-06-28 16:56             ` Dr. David Alan Gilbert
2022-07-01  6:18               ` Leonardo Brás
2022-07-04  9:19                 ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.