qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Leonardo Bras Soares Passos <leobras@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"Jagannathan Raman" <jag.raman@oracle.com>,
	qemu-block@nongnu.org, "Juan Quintela" <quintela@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Fam Zheng" <fam@euphon.net>
Subject: Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
Date: Thu, 2 Sep 2021 10:20:10 +0100	[thread overview]
Message-ID: <YTCXSoETM7UfeMQE@redhat.com> (raw)
In-Reply-To: <CAJ6HWG45Xb2uc3OdJctL7SFxmrH0ZhDwQMBDAmS1TuDgiCrCMA@mail.gmail.com>

On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > Hello Daniel, thanks for the feedback !
> > >
> > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > >
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > >
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > other data being sent at the default copying approach.
> > > > >
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > >  migration/multifd.c      | 9 ++++++---
> > > > >  migration/multifd.h      | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > >
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >              }
> > > > >
> > > > >              if (used) {
> > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > +                                                          &local_err);
> > > >
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> >
> > Yes, by default limit QEMU sees will be something very small.
> >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > Maybe an option would be trying to mlock all guest memory before setting
> > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > I told you in the previous mail).
> >
> > Currently ability to lock memory is something that has to be configured
> > when QEMU starts, and it requires libvirt to grant suitable permissions
> > to QEMU. Memory locking is generally undesirable because it prevents
> > memory overcommit. Or rather if you are allowing memory overcommit, then
> > allowing memory locking is a way to kill your entire host.
> 
> You mean it's gonna consume too much memory, or something else?

Essentially yes. 

> > I don't think we can unconditionally grant ability to lock arbitrary
> > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > migration later. Granting it at runtime feels questionable as you now
> > need to track and predict how much locked memory you've allowed, and
> > also have possible problems with revokation.
> 
> (I am really new to this, so please forgive me if I am asking dumb or
> overly basic questions)
> 
> What does revokation means in this context?
> You give the process hability to lock n MB of memory, and then you take it?
> Why would that happen? Is Locked memory a limited resource?

Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
total.

So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
which exceeds physical RAM. Most of the time this may well be fine
as the VMs don't concurrently need their full RAM allocation, and
worst case they'll get pushed to swap as the kernel re-shares
memory in respose to load. So perhaps each VM only needs 20 GB
resident at any time, but over time one VM can burst upto 32 GB
and then 12 GB of it get swapped out later when inactive.

But now consider if we allowed 2 of the VMs to lock memory for
purposes of migration. Those 2 VMs can now pin 64 GB of memory
in the worst case, leaving no free memory for the 3rd VM or
for the OS. This will likely take down the entire host, regardless
of swap availability.

IOW, if you are overcomitting RAM you have to be extremely
careful about allowing any VM to lock memory. If you do decide
to allow memory locking, you need to make sure that the worst
case locked memory amount still leaves enough unlocked memory
for the OS to be able to effectively manage the overcommit
load via swap.  We definitely can't grant memory locking to
VMs at startup in this scenario, and if we grant it at runtime,
we need to be able to revoke it again later.

These overcommit numbers are a bit more extreme that you'd 
usually do in real world, but it illustrates the genreal
problem. Also bear in mind that QEMU has memory overhead
beyond the guest RAM block, which varies over time, making
accounting quite hard. We have to also assume that QEMU
could have been compromised by a guest breakout, so we
can't assume that migration will play nice - we have to
assume the worst case possible, given the process ulimits.


> > Overall the memory locking needs look like a significant constraint that
> > will affect ability to use this feature.
> >
> 
> I Agree, there is a lot to take in account.
> In any way, those constraints could be checked at the same function as
> the setsockopt() right?

QEMU could possibly check its ulimits to see if it is possible, but it
would be safer to require a migration capability to be explicitly set
by the mgmt app to opt-in to zerocopy.

> (Setting up configs to improve the chance of zerocopy would probably only
> happen at/before qemu starting, right?)

Usually, but you can change ulimits on the fly for a process. I'm just
not sure of semantics if you reduce limits and existing usage exceeds
the reduced value.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



  reply	other threads:[~2021-09-02  9:21 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
2021-09-01 20:54   ` Eric Blake
2021-09-02  8:26     ` Leonardo Bras Soares Passos
2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
2021-08-31 12:57   ` Daniel P. Berrangé
2021-08-31 20:27     ` Peter Xu
2021-09-01  8:50       ` Daniel P. Berrangé
2021-09-01 15:52         ` Peter Xu
2021-09-01 15:59           ` Daniel P. Berrangé
2021-09-02  7:07         ` Leonardo Bras Soares Passos
2021-09-02  6:59       ` Leonardo Bras Soares Passos
2021-09-07 16:44         ` Peter Xu
2021-09-08 20:13           ` Leonardo Bras Soares Passos
2021-09-08 21:04             ` Peter Xu
2021-09-02  6:38     ` Leonardo Bras Soares Passos
2021-09-02  8:47       ` Daniel P. Berrangé
2021-09-02  9:34         ` Leonardo Bras Soares Passos
2021-09-02  9:49           ` Daniel P. Berrangé
2021-09-02 10:19             ` Leonardo Bras Soares Passos
2021-09-02 10:28               ` Daniel P. Berrangé
2021-09-07 11:06                 ` Dr. David Alan Gilbert
2021-09-07 18:09                   ` Peter Xu
2021-09-08  8:30                     ` Dr. David Alan Gilbert
2021-09-08 15:24                       ` Peter Xu
2021-09-09  8:49                         ` Dr. David Alan Gilbert
2021-09-08 20:25                   ` Leonardo Bras Soares Passos
2021-09-08 21:09                     ` Peter Xu
2021-09-08 21:57                       ` Daniel P. Berrangé
2021-09-09  2:05                         ` Peter Xu
2021-09-09  4:58                           ` Leonardo Bras Soares Passos
2021-09-09 16:40                             ` Peter Xu
2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
2021-08-31 13:16   ` Daniel P. Berrangé
2021-08-31 20:29     ` Peter Xu
2021-09-01  8:53       ` Daniel P. Berrangé
2021-09-01 15:35         ` Peter Xu
2021-09-01 15:44           ` Daniel P. Berrangé
2021-09-01 16:01             ` Peter Xu
2021-09-02  7:57             ` Leonardo Bras Soares Passos
2021-09-07 11:13             ` Dr. David Alan Gilbert
2021-09-08 15:26               ` Daniel P. Berrangé
2021-09-02  7:23           ` Jason Wang
2021-09-02  8:08             ` Leonardo Bras Soares Passos
2021-09-02  7:27       ` Leonardo Bras Soares Passos
2021-09-02  7:22     ` Leonardo Bras Soares Passos
2021-09-02  8:20       ` Daniel P. Berrangé
2021-09-02  8:52         ` Leonardo Bras Soares Passos
2021-09-02  9:20           ` Daniel P. Berrangé [this message]
2021-09-02  9:49             ` Leonardo Bras Soares Passos
2021-09-02  9:59               ` Daniel P. Berrangé
2021-09-02 10:25                 ` Leonardo Bras Soares Passos
2021-09-07 11:17             ` Dr. David Alan Gilbert
2021-09-07 18:32       ` Peter Xu
2021-09-08  2:59         ` Jason Wang
2021-09-08  3:24           ` Peter Xu
2021-09-08  3:26             ` Jason Wang
2021-09-08  8:19           ` Dr. David Alan Gilbert
2021-09-08 15:19             ` Peter Xu
2021-09-09  1:10               ` Jason Wang
2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
2021-09-01 19:21   ` Leonardo Bras Soares Passos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YTCXSoETM7UfeMQE@redhat.com \
    --to=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=fam@euphon.net \
    --cc=jag.raman@oracle.com \
    --cc=john.g.johnson@oracle.com \
    --cc=leobras@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).