All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"Jagannathan Raman" <jag.raman@oracle.com>,
	qemu-block@nongnu.org, "Juan Quintela" <quintela@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	qemu-devel@nongnu.org, "Leonardo Bras" <leobras@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Fam Zheng" <fam@euphon.net>
Subject: Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Date: Wed, 1 Sep 2021 16:59:20 +0100	[thread overview]
Message-ID: <YS+jWOxYq0qzohY6@redhat.com> (raw)
In-Reply-To: <YS+hrV7Rz5iiwRlH@t490s>

On Wed, Sep 01, 2021 at 11:52:13AM -0400, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > > > 
> > > > > To make it work, three steps are needed:
> > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > > 3 - Process the socket's error queue, dealing with any error
> > > > 
> > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > > 
> > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > > from a synchronous call to an asynchronous call.
> > > > 
> > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > > until an asynchronous completion notification has been received from
> > > > the socket error queue. These notifications are not required to
> > > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > > to the buffer if a re-transmit is needed.
> > > > 
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > 
> > > >   "Page pinning also changes system call semantics. It temporarily 
> > > >    shares the buffer between process and network stack. Unlike with
> > > >    copying, the process cannot immediately overwrite the buffer 
> > > >    after system call return without possibly modifying the data in 
> > > >    flight. Kernel integrity is not affected, but a buggy program
> > > >    can possibly corrupt its own data stream."
> > > > 
> > > > AFAICT, the design added in this patch does not provide any way
> > > > to honour these requirements around buffer lifetime.
> > > > 
> > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > > way. The buffer lifetime requirements imply need for an API
> > > > design that is fundamentally different for asynchronous usage,
> > > > with a callback to notify when the write has finished/failed.
> > > 
> > > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > > being available and it's not obvious at all.  Just to mention that the initial
> > > user of this work will make sure all zero copy buffers will be guest pages only
> > > (as it's only used in multi-fd), so they should always be there during the
> > > process.
> > 
> > That is not the case when migration is using TLS, because the buffers
> > transmitted are the ciphertext buffer, not the plaintext guest page.
> 
> Agreed.
> 
> > 
> > > In short, we may just want to at least having a way to make sure all zero
> > > copied buffers are finished using and they're sent after some function returns
> > > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > > error queue and keep those information somewhere too.
> > > 
> > > Some other side notes that reached my mind..
> > > 
> > > The qio_channel_writev_full() may not be suitable for async operations, as the
> > > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > > zero copy on the channel?
> > 
> > All the APIs that exist today are fundamentally only suitable for sync
> > operations. Supporting async correctly will definitely a brand new APIs
> > separate from what exists today.
> 
> Yes.  What I wanted to say is maybe we can still keep the io_writev() interface
> untouched, but just add a new interface at qio_channel_writev_full() level.
> 
> IOW, we could comment on io_writev() that it can be either sync or async now,
> just like sendmsg() has that implication too with different flag passed in.
> When calling io_writev() with different upper helpers, QIO channel could
> identify what flag to pass over to io_writev().

I don't think we should overload any of the existing methods with extra
parameters for async. Introduce a completely new set of methods exclusively
for this alternative interaction model.  I can kinda understand why they
took the approach they did with sendmsg, but I wouldn't use it as design
motivation for QEMU (except as example of what not to do).

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



  reply	other threads:[~2021-09-01 16:06 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
2021-09-01 20:54   ` Eric Blake
2021-09-02  8:26     ` Leonardo Bras Soares Passos
2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
2021-08-31 12:57   ` Daniel P. Berrangé
2021-08-31 20:27     ` Peter Xu
2021-09-01  8:50       ` Daniel P. Berrangé
2021-09-01 15:52         ` Peter Xu
2021-09-01 15:59           ` Daniel P. Berrangé [this message]
2021-09-02  7:07         ` Leonardo Bras Soares Passos
2021-09-02  6:59       ` Leonardo Bras Soares Passos
2021-09-07 16:44         ` Peter Xu
2021-09-08 20:13           ` Leonardo Bras Soares Passos
2021-09-08 21:04             ` Peter Xu
2021-09-02  6:38     ` Leonardo Bras Soares Passos
2021-09-02  8:47       ` Daniel P. Berrangé
2021-09-02  9:34         ` Leonardo Bras Soares Passos
2021-09-02  9:49           ` Daniel P. Berrangé
2021-09-02 10:19             ` Leonardo Bras Soares Passos
2021-09-02 10:28               ` Daniel P. Berrangé
2021-09-07 11:06                 ` Dr. David Alan Gilbert
2021-09-07 18:09                   ` Peter Xu
2021-09-08  8:30                     ` Dr. David Alan Gilbert
2021-09-08 15:24                       ` Peter Xu
2021-09-09  8:49                         ` Dr. David Alan Gilbert
2021-09-08 20:25                   ` Leonardo Bras Soares Passos
2021-09-08 21:09                     ` Peter Xu
2021-09-08 21:57                       ` Daniel P. Berrangé
2021-09-09  2:05                         ` Peter Xu
2021-09-09  4:58                           ` Leonardo Bras Soares Passos
2021-09-09 16:40                             ` Peter Xu
2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
2021-08-31 13:16   ` Daniel P. Berrangé
2021-08-31 20:29     ` Peter Xu
2021-09-01  8:53       ` Daniel P. Berrangé
2021-09-01 15:35         ` Peter Xu
2021-09-01 15:44           ` Daniel P. Berrangé
2021-09-01 16:01             ` Peter Xu
2021-09-02  7:57             ` Leonardo Bras Soares Passos
2021-09-07 11:13             ` Dr. David Alan Gilbert
2021-09-08 15:26               ` Daniel P. Berrangé
2021-09-02  7:23           ` Jason Wang
2021-09-02  8:08             ` Leonardo Bras Soares Passos
2021-09-02  7:27       ` Leonardo Bras Soares Passos
2021-09-02  7:22     ` Leonardo Bras Soares Passos
2021-09-02  8:20       ` Daniel P. Berrangé
2021-09-02  8:52         ` Leonardo Bras Soares Passos
2021-09-02  9:20           ` Daniel P. Berrangé
2021-09-02  9:49             ` Leonardo Bras Soares Passos
2021-09-02  9:59               ` Daniel P. Berrangé
2021-09-02 10:25                 ` Leonardo Bras Soares Passos
2021-09-07 11:17             ` Dr. David Alan Gilbert
2021-09-07 18:32       ` Peter Xu
2021-09-08  2:59         ` Jason Wang
2021-09-08  3:24           ` Peter Xu
2021-09-08  3:26             ` Jason Wang
2021-09-08  8:19           ` Dr. David Alan Gilbert
2021-09-08 15:19             ` Peter Xu
2021-09-09  1:10               ` Jason Wang
2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
2021-09-01 19:21   ` Leonardo Bras Soares Passos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YS+jWOxYq0qzohY6@redhat.com \
    --to=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=fam@euphon.net \
    --cc=jag.raman@oracle.com \
    --cc=john.g.johnson@oracle.com \
    --cc=leobras@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.