All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leonardo Bras Soares Passos <leobras@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"Jagannathan Raman" <jag.raman@oracle.com>,
	qemu-block@nongnu.org, "Juan Quintela" <quintela@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Fam Zheng" <fam@euphon.net>
Subject: Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Date: Thu, 2 Sep 2021 06:34:01 -0300	[thread overview]
Message-ID: <CAJ6HWG6dd+timQM27-NTumvwDM2bFawRsnmrZumdzGZ8hCR3dQ@mail.gmail.com> (raw)
In-Reply-To: <YTCPngNB+l70sl06@redhat.com>

On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thank you for the feedback!
> >
> > Comments inline.
> >
> > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> >
> > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in
> > a somehow synchronous way, but returning errp (and sometimes closing the
> > channel because of it) was a poor implementation.
> >
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer
> > >    after system call return without possibly modifying the data in
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > >
> >
> > By the above piece of documentation, I get there is no problem in
> > overwriting the buffer, but a corrupt, or newer version of the memory may
> > be sent instead of the original one. I am pointing this out because there
> > are workloads like page migration that would not be impacted, given
> > once the buffer is changed, it will dirty the page and it will be re-sent.
>
> The idea of having the buffer overwritten is still seriously worrying
> me. I get the idea that in theory the "worst" that should happen is
> that we unexpectedly transmit newer guest memory contents. There are
> so many edge cases in migration code though that I'm worried there
> might be scenarios in which that is still going to cause problems.

I agree we should keep a eye on that, maybe have David Gilbert's opinion
on that.

>
> The current migration code has a synchronization point at the end of
> each iteration of the migration loop. At the end of each iteration,
> the source has a guarantee that all pages from the dirty bitmap have
> now been sent. With the ZEROCOPY feature we have entirely removed all
> synchronization points from the source host wrt memory sending. At
> best we know that the pages will have been sent if we get to close()
> without seeing errors, unless we're going to explicitly track the
> completion messages. The TCP re-transmit semantics are especially
> worrying to me, because the re-transmit is liable to send different
> data than was in the original lost TCP packet.
>
> Maybe everything is still safe because TCP is a reliable ordered
> stream, and thus eventually the dst will get the newest guest
> page contents. I'm still very worried that we have code in the
> source that implicitly relies on there being some synchronization
> points that we've effectively removed.
>
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > That is a good point.
> > Proposing a new optional method like io_async_writev() + an error
> > checking mechanism could do the job.
> > io_async_writev() could fall-back to io_writev() in cases where it's not
> > implemented.
> >
> > I am not sure about the error checking yet.
> > Options I can see are:
> > 1 - A callback, as you suggested, which IIUC would be provided by
> > code using the QIOChannel, and would only fix the reported errors,
> > leaving the responsibility of checking for errors to the IOChannel code.
>
> A callback is fairly simple, but potentially limits performance. Reading
> the kernel docs it seems they intentionally merge notifications to enable
> a whole contiguous set to be processed in one go. A callback would
> effectively be unmerging them requiring processing one a time. Not
> sure how much this matters to our usage.
>
> > 2 - A new method, maybe io_async_errck(), which would be called
> > whenever the using code wants to deal with pending errors. It could
> > return an array/list of IOVs that need to be re-sent, for example,
> > and code using QIOChannel could deal with it however it wants.
>
> Considering that we're using TCP, we get a reliable, ordered data
> stream. So there should never be a failure that requires use to
> know specific IOVs to re-sent - the kernel still handles TCP
> re-transmit - we just have to still have the buffer available.
> As noted above though, the re-transmit is liable to send different
> data than the original transmit, which worries me.
>
> So the only errors we should get from the completion notifications
> would be errors that are completely fatal for the TCP stream. So
> for errors, we only need to know whether an error has ocurred or
> not. The second reason for the completion notifications is for
> providing a synchronization, to know when the buffer can be released
> or overwritten. That is the only scenario we need to know list of
> IOVs that completed.
>
>
>
> > > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > > >                           "Unable to write to socket");
> > > >          return -1;
> > > >      }
> > > > +
> > > > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > > > +        sioc->errq_pending += niov;
> > > > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > > > +            qio_channel_socket_errq_proc(sioc, errp);
> > > > +        }
> > > > +    }
> > >
> > > This silently looses any errors set from upto the final
> > > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.
> >
> > You are right.
> >
> > >
> > > Also means if you have a series of writes without
> > > MSG_ZEROCOPY, it'll delay checking any pending
> > > errors.
> >
> > That's expected... if there are only happening sends without MSG_ZEROCOPY,
> > it means the ones sent with zerocopy can wait. The problem would be
> > the above case.
>
> Well it depends whether there are synchonization requirements in the
> caller. For example, current migration code knows that at the end of
> each iteration sending pages, that the data has all actally been
> sent. With this new logic, we might have done several more iterations
> of the migration loop before the data for the original iteration has
> been sent. The writes() for the original iteration may well be sending
> the data from the later iterations. Possibly this is ok, but it is
> a clear difference in the data that will actually go out on the wire
> with this code.
>
> > > I would suggest checkig in close(), but as mentioned
> > > earlier, I think the design is flawed because the caller
> > > fundamentally needs to know about completion for every
> > > single write they make in order to know when the buffer
> > > can be released / reused.
> >
> > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > activated with a
> > parameter flag, or on a different method if callback is preferred):
> > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > syscall after each packet sent, and this means the fd gets a signal after each
> > sendmsg() happens, with error or not.
> >
> > We could harness this with a poll() and a relatively high timeout:
> > - We stop sending packets, and then call poll().
> > - Whenever poll() returns 0, it means a timeout happened, and so it
> > took too long
> > without sendmsg() happening, meaning all the packets are sent.
> > - If it returns anything else, we go back to fixing the errors found (re-send)
> >
> > The problem may be defining the value of this timeout, but it could be
> > called only
> > when zerocopy is active.
>
> Maybe we need to check completions at the end of each iteration of the
> migration dirtypage loop ?

Sorry, I am really new to this, and I still couldn't understand why would we
need to check at the end of each iteration, instead of doing a full check at the
end.

Is that a case of a fatal error on TCP stream, in which a lot of packets
were reported as 'sent' but we may lose track of which were in fact sent?

Would the poll() approach above be enough for a 'flush()' ?

>
>
> > > > +static void
> > > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > > > +                                bool enabled)
> > > > +{
> > > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > > +    int v = enabled ? 1 : 0;
> > > > +    int ret;
> > > > +
> > > > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > > > +    if (ret >= 0) {
> > > > +        sioc->zerocopy_enabled = true;
> > > > +    }
> > > > +}
> > >
> > > Surely we need to tell the caller wether this succeeed, otherwise
> > > the later sendmsg() is going to fail with EINVAL on older kernels
> > > where MSG_ZEROCOPY is not supported.
> >
> > Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
> > should have been
> > something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
> > way it would
> > reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
> > otherwise.
> >
> > or something like this, if we want it to stick with zerocopy if
> > setting it off fails.
> > if (ret >= 0) {
> >     sioc->zerocopy_enabled = enabled;
> > }
>
> Yes, that is a bug fix we need, but actually I was refering
> to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> from sendmsg.

Agree, something like io_writev(,, sioc->zerocopy_enabled ?
MSG_ZEROCOPY : 0, errp)'
should do, right?
(or an io_async_writev(), that will fall_back to io_writev() if
zerocopy is disabled)

>
> > > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > > index 4ce890a538..bf44b0f7b0 100644
> > > > --- a/io/channel-tls.c
> > > > +++ b/io/channel-tls.c
> > > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> > > >      qio_channel_set_delay(tioc->master, enabled);
> > > >  }
> > > >
> > > > +
> > > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > > > +                                         bool enabled)
> > > > +{
> > > > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > > > +
> > > > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > > > +}
> > >
> > > This is going to be unsafe because gnutls will discard/reuse the
> > > buffer for the ciphertext after every write(). I can't see a
> > > way to safely enable MSG_ZEROCOPY when TLS is used.
> >
> > Yeah, that makes sense.
> > It would make more sense to implement KTLS, as IIRC it kind of does
> > 'zerocopy', since it saves the encrypted data directly in kernel buffer.
>
> I would hope it is zerocopy, in so much as the kernel can directly
> use the userspace pages as the plaintext, and then the ciphertext
> in kernel space can be sent directly without further copies.
>
> What i'm not sure is whether this means it also becomes an effectively
> asynchronous API for transmitting data. ie does it have the same
> constraints about needing to lock pages, and not modify data in the
> pages? I've not investigated it in any detail, but I don't recall
> that being mentioned, and if it was the case, it would be impossible
> to enable that transparently in gnutls as its a major semantic changed.

I think it somehow waits for the user buffer to be encrypted to the kernel
buffer, and then returns, behaving like a normal sendmsg().
(except if it's using kernel async cripto API,  which I don't think is the case)

>
> > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > flag to async_enabled. If KTLS is not available, it would fall back to
> > using gnuTLS on io_writev, just like it would happen in zerocopy.
>
> If gnutls is going to support KTLS in a way we can use, I dont want to
> have any of that duplicated in QEMU code. I want to be able delegate
> as much as possible to gnutls, at least so that QEMU isn't in the loop
> for any crypto sensitive parts, as that complicates certification
> of crypto.

Yeah, that's a very good argument :)
I think it's probably the case to move from the current callback implementation
to the implementation in which we give gnuTLS the file descriptor.

( I was thinking on implementing an simpler direct gnuTLS version only
for the 'zerocopy' QIOChannel_TLS, but I think it would not make much sense.)



  reply	other threads:[~2021-09-02  9:34 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
2021-09-01 20:54   ` Eric Blake
2021-09-02  8:26     ` Leonardo Bras Soares Passos
2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
2021-08-31 12:57   ` Daniel P. Berrangé
2021-08-31 20:27     ` Peter Xu
2021-09-01  8:50       ` Daniel P. Berrangé
2021-09-01 15:52         ` Peter Xu
2021-09-01 15:59           ` Daniel P. Berrangé
2021-09-02  7:07         ` Leonardo Bras Soares Passos
2021-09-02  6:59       ` Leonardo Bras Soares Passos
2021-09-07 16:44         ` Peter Xu
2021-09-08 20:13           ` Leonardo Bras Soares Passos
2021-09-08 21:04             ` Peter Xu
2021-09-02  6:38     ` Leonardo Bras Soares Passos
2021-09-02  8:47       ` Daniel P. Berrangé
2021-09-02  9:34         ` Leonardo Bras Soares Passos [this message]
2021-09-02  9:49           ` Daniel P. Berrangé
2021-09-02 10:19             ` Leonardo Bras Soares Passos
2021-09-02 10:28               ` Daniel P. Berrangé
2021-09-07 11:06                 ` Dr. David Alan Gilbert
2021-09-07 18:09                   ` Peter Xu
2021-09-08  8:30                     ` Dr. David Alan Gilbert
2021-09-08 15:24                       ` Peter Xu
2021-09-09  8:49                         ` Dr. David Alan Gilbert
2021-09-08 20:25                   ` Leonardo Bras Soares Passos
2021-09-08 21:09                     ` Peter Xu
2021-09-08 21:57                       ` Daniel P. Berrangé
2021-09-09  2:05                         ` Peter Xu
2021-09-09  4:58                           ` Leonardo Bras Soares Passos
2021-09-09 16:40                             ` Peter Xu
2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
2021-08-31 13:16   ` Daniel P. Berrangé
2021-08-31 20:29     ` Peter Xu
2021-09-01  8:53       ` Daniel P. Berrangé
2021-09-01 15:35         ` Peter Xu
2021-09-01 15:44           ` Daniel P. Berrangé
2021-09-01 16:01             ` Peter Xu
2021-09-02  7:57             ` Leonardo Bras Soares Passos
2021-09-07 11:13             ` Dr. David Alan Gilbert
2021-09-08 15:26               ` Daniel P. Berrangé
2021-09-02  7:23           ` Jason Wang
2021-09-02  8:08             ` Leonardo Bras Soares Passos
2021-09-02  7:27       ` Leonardo Bras Soares Passos
2021-09-02  7:22     ` Leonardo Bras Soares Passos
2021-09-02  8:20       ` Daniel P. Berrangé
2021-09-02  8:52         ` Leonardo Bras Soares Passos
2021-09-02  9:20           ` Daniel P. Berrangé
2021-09-02  9:49             ` Leonardo Bras Soares Passos
2021-09-02  9:59               ` Daniel P. Berrangé
2021-09-02 10:25                 ` Leonardo Bras Soares Passos
2021-09-07 11:17             ` Dr. David Alan Gilbert
2021-09-07 18:32       ` Peter Xu
2021-09-08  2:59         ` Jason Wang
2021-09-08  3:24           ` Peter Xu
2021-09-08  3:26             ` Jason Wang
2021-09-08  8:19           ` Dr. David Alan Gilbert
2021-09-08 15:19             ` Peter Xu
2021-09-09  1:10               ` Jason Wang
2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
2021-09-01 19:21   ` Leonardo Bras Soares Passos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJ6HWG6dd+timQM27-NTumvwDM2bFawRsnmrZumdzGZ8hCR3dQ@mail.gmail.com \
    --to=leobras@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=fam@euphon.net \
    --cc=jag.raman@oracle.com \
    --cc=john.g.johnson@oracle.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.