qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Leonardo Bras Soares Passos <leobras@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"Jagannathan Raman" <jag.raman@oracle.com>,
	qemu-block@nongnu.org, "Juan Quintela" <quintela@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Fam Zheng" <fam@euphon.net>
Subject: Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
Date: Thu, 2 Sep 2021 07:25:46 -0300	[thread overview]
Message-ID: <CAJ6HWG4o2bkukJWXjjAyBjoAvRGiNr_7Sc5dQJQsmqufrrn3rQ@mail.gmail.com> (raw)
In-Reply-To: <YTCgZnhwUCrPp1lP@redhat.com>

On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > Hello Daniel, thanks for the feedback !
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > >
> > > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > > flags for qio_channel_write*().
> > > > > > > >
> > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > > other data being sent at the default copying approach.
> > > > > > > >
> > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > > ---
> > > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > > >              }
> > > > > > > >
> > > > > > > >              if (used) {
> > > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > > +                                                          &local_err);
> > > > > > >
> > > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > > resource usage implications
> > > > > > >
> > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > >
> > > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > >
> > > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > > >
> > > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > > implies locked memory (eg PCI assignment)
> > > > > >
> > > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > Yes, by default limit QEMU sees will be something very small.
> > > > >
> > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > > memory for the user before using it, or adding a capability to qemu before.
> > > > > >
> > > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > > > I told you in the previous mail).
> > > > >
> > > > > Currently ability to lock memory is something that has to be configured
> > > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > > allowing memory locking is a way to kill your entire host.
> > > >
> > > > You mean it's gonna consume too much memory, or something else?
> > >
> > > Essentially yes.
> >
> > Well, maybe we can check for available memory before doing that,
> > but maybe it's too much effort.
>
> From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
> app needs to enforce policy based on the worst case that the
> VM configuration allows for.
>
> Checking current available memory is not viable, because even
> if you see 10 GB free, QEMU can't know if that free memory is
> there to satisfy other VMs's worst case needs, or its own. QEMU
> needs to be explicitly told whether its OK to use locked memory,
> and an enforcement policy applied to it.


Yeah, it makes sense to let the mgmt app deal with that and enable/disable
the MSG_ZEROCOPY on migration whenever it sees fit.


>
>
> > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > > total.
> > >
> > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > > which exceeds physical RAM. Most of the time this may well be fine
> > > as the VMs don't concurrently need their full RAM allocation, and
> > > worst case they'll get pushed to swap as the kernel re-shares
> > > memory in respose to load. So perhaps each VM only needs 20 GB
> > > resident at any time, but over time one VM can burst upto 32 GB
> > > and then 12 GB of it get swapped out later when inactive.
> > >
> > > But now consider if we allowed 2 of the VMs to lock memory for
> > > purposes of migration. Those 2 VMs can now pin 64 GB of memory
> > > in the worst case, leaving no free memory for the 3rd VM or
> > > for the OS. This will likely take down the entire host, regardless
> > > of swap availability.
> > >
> > > IOW, if you are overcomitting RAM you have to be extremely
> > > careful about allowing any VM to lock memory. If you do decide
> > > to allow memory locking, you need to make sure that the worst
> > > case locked memory amount still leaves enough unlocked memory
> > > for the OS to be able to effectively manage the overcommit
> > > load via swap.  We definitely can't grant memory locking to
> > > VMs at startup in this scenario, and if we grant it at runtime,
> > > we need to be able to revoke it again later.
> > >
> > > These overcommit numbers are a bit more extreme that you'd
> > > usually do in real world, but it illustrates the genreal
> > > problem. Also bear in mind that QEMU has memory overhead
> > > beyond the guest RAM block, which varies over time, making
> > > accounting quite hard. We have to also assume that QEMU
> > > could have been compromised by a guest breakout, so we
> > > can't assume that migration will play nice - we have to
> > > assume the worst case possible, given the process ulimits.
> > >
> >
> > Yeah, that makes sense. Thanks for this illustration and elucidation !
> >
> > I assume there is no way of asking the OS to lock memory, and if there is
> > no space available, it fails and rolls back the locking.
>
> Yes & no. On most Linux configs though it ends up being no, because
> instead of getting a nice failure, when host memory pressure occurs,
> the OOM Killer wakes up and just culls processes.

oh, right the OOM Killer :)

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Thanks!

Best regards,
Leonardo



  reply	other threads:[~2021-09-02 10:33 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
2021-09-01 20:54   ` Eric Blake
2021-09-02  8:26     ` Leonardo Bras Soares Passos
2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
2021-08-31 12:57   ` Daniel P. Berrangé
2021-08-31 20:27     ` Peter Xu
2021-09-01  8:50       ` Daniel P. Berrangé
2021-09-01 15:52         ` Peter Xu
2021-09-01 15:59           ` Daniel P. Berrangé
2021-09-02  7:07         ` Leonardo Bras Soares Passos
2021-09-02  6:59       ` Leonardo Bras Soares Passos
2021-09-07 16:44         ` Peter Xu
2021-09-08 20:13           ` Leonardo Bras Soares Passos
2021-09-08 21:04             ` Peter Xu
2021-09-02  6:38     ` Leonardo Bras Soares Passos
2021-09-02  8:47       ` Daniel P. Berrangé
2021-09-02  9:34         ` Leonardo Bras Soares Passos
2021-09-02  9:49           ` Daniel P. Berrangé
2021-09-02 10:19             ` Leonardo Bras Soares Passos
2021-09-02 10:28               ` Daniel P. Berrangé
2021-09-07 11:06                 ` Dr. David Alan Gilbert
2021-09-07 18:09                   ` Peter Xu
2021-09-08  8:30                     ` Dr. David Alan Gilbert
2021-09-08 15:24                       ` Peter Xu
2021-09-09  8:49                         ` Dr. David Alan Gilbert
2021-09-08 20:25                   ` Leonardo Bras Soares Passos
2021-09-08 21:09                     ` Peter Xu
2021-09-08 21:57                       ` Daniel P. Berrangé
2021-09-09  2:05                         ` Peter Xu
2021-09-09  4:58                           ` Leonardo Bras Soares Passos
2021-09-09 16:40                             ` Peter Xu
2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
2021-08-31 13:16   ` Daniel P. Berrangé
2021-08-31 20:29     ` Peter Xu
2021-09-01  8:53       ` Daniel P. Berrangé
2021-09-01 15:35         ` Peter Xu
2021-09-01 15:44           ` Daniel P. Berrangé
2021-09-01 16:01             ` Peter Xu
2021-09-02  7:57             ` Leonardo Bras Soares Passos
2021-09-07 11:13             ` Dr. David Alan Gilbert
2021-09-08 15:26               ` Daniel P. Berrangé
2021-09-02  7:23           ` Jason Wang
2021-09-02  8:08             ` Leonardo Bras Soares Passos
2021-09-02  7:27       ` Leonardo Bras Soares Passos
2021-09-02  7:22     ` Leonardo Bras Soares Passos
2021-09-02  8:20       ` Daniel P. Berrangé
2021-09-02  8:52         ` Leonardo Bras Soares Passos
2021-09-02  9:20           ` Daniel P. Berrangé
2021-09-02  9:49             ` Leonardo Bras Soares Passos
2021-09-02  9:59               ` Daniel P. Berrangé
2021-09-02 10:25                 ` Leonardo Bras Soares Passos [this message]
2021-09-07 11:17             ` Dr. David Alan Gilbert
2021-09-07 18:32       ` Peter Xu
2021-09-08  2:59         ` Jason Wang
2021-09-08  3:24           ` Peter Xu
2021-09-08  3:26             ` Jason Wang
2021-09-08  8:19           ` Dr. David Alan Gilbert
2021-09-08 15:19             ` Peter Xu
2021-09-09  1:10               ` Jason Wang
2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
2021-09-01 19:21   ` Leonardo Bras Soares Passos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJ6HWG4o2bkukJWXjjAyBjoAvRGiNr_7Sc5dQJQsmqufrrn3rQ@mail.gmail.com \
    --to=leobras@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=fam@euphon.net \
    --cc=jag.raman@oracle.com \
    --cc=john.g.johnson@oracle.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).