Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-arch@vger.kernel.org,
	mhocko@kernel.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
	arnd@arndb.de, keescook@chromium.org, gerg@linux-m68k.org,
	ktkhai@virtuozzo.com, christian.brauner@ubuntu.com,
	peterz@infradead.org, esyr@redhat.com, jgg@ziepe.ca,
	christian@kellner.me, areber@redhat.com, cyphar@cyphar.com
Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Date: Mon, 3 Aug 2020 16:03:59 -0400
Message-ID: <e4bf2e19-adc2-ad5e-f516-e8014500456d@oracle.com> (raw)
In-Reply-To: <1596469370.29091.13.camel@HansenPartnership.com>

On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> What is wrong with live migration between one qemu process and
>> another qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would only support sriov guests if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a linux
guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, pause time is 100 - 200 msec.  The pause time is a handful
of seconds if the guest uses an nvme drive because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a 3rd party escrow 
process and pass  them back to the new qemu process, but the shm that vfio has 
already registered must be remapped at the same VA as the previous process, and 
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings 
and breaks the app. MAP_FIXED_NOREPLACE respects existing mappings but cannot map 
the shm and breaks the app.  Adding a feature that reserves VAs would fix that, we 
have experimnted with one.  Fixing the vfio kernel implementation to not use the 
original VA base would also work, but I don't know how doable/difficult that would be.

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy 
already-running instances that use anon memory.  That problem solves itself if we 
get these rfe's into linux and qemu, and eventually users shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve


  reply index

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-27 17:11 Anthony Yznaga
2020-07-27 17:07 ` Eric W. Biederman
2020-07-27 18:00   ` Steven Sistare
2020-07-28 13:40     ` Christian Brauner
2020-07-27 17:11 ` [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages() Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
2020-07-28 13:38   ` Eric W. Biederman
2020-07-28 17:44     ` Anthony Yznaga
2020-07-29 13:52   ` Kirill A. Shutemov
2020-07-29 23:20     ` Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
2020-07-28 13:22   ` Kirill Tkhai
2020-07-28 14:06     ` Steven Sistare
2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
2020-07-28 17:28   ` Anthony Yznaga
2020-07-28 14:23 ` Andy Lutomirski
2020-07-28 14:30   ` Steven Sistare
2020-07-30 15:22 ` Matthew Wilcox
2020-07-30 15:27   ` Christian Brauner
2020-07-30 15:34     ` Matthew Wilcox
2020-07-30 15:54       ` Christian Brauner
2020-07-31  9:12     ` Stefan Hajnoczi
2020-07-30 15:59   ` Steven Sistare
2020-07-30 17:12     ` Matthew Wilcox
2020-07-30 17:35       ` Steven Sistare
2020-07-30 17:49         ` Matthew Wilcox
2020-07-30 18:27           ` Steven Sistare
2020-07-30 21:58             ` Eric W. Biederman
2020-07-31 14:57               ` Steven Sistare
2020-07-31 15:27                 ` Matthew Wilcox
2020-07-31 16:11                   ` Steven Sistare
2020-07-31 16:56                     ` Jason Gunthorpe
2020-07-31 17:15                       ` Steven Sistare
2020-07-31 17:48                         ` Jason Gunthorpe
2020-07-31 17:55                           ` Steven Sistare
2020-07-31 17:23                     ` Matthew Wilcox
2020-08-03 15:28                 ` Eric W. Biederman
2020-08-03 15:42                   ` James Bottomley
2020-08-03 20:03                     ` Steven Sistare [this message]
     [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
2020-08-04 11:13                       ` Matthew Wilcox
2020-08-03 19:29                   ` Steven Sistare
2020-07-31 19:41 ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e4bf2e19-adc2-ad5e-f516-e8014500456d@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=akpm@linux-foundation.org \
    --cc=anthony.yznaga@oracle.com \
    --cc=areber@redhat.com \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=christian.brauner@ubuntu.com \
    --cc=christian@kellner.me \
    --cc=cyphar@cyphar.com \
    --cc=ebiederm@xmission.com \
    --cc=esyr@redhat.com \
    --cc=gerg@linux-m68k.org \
    --cc=hpa@zytor.com \
    --cc=jgg@ziepe.ca \
    --cc=keescook@chromium.org \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git