Linux-mm Archive on lore.kernel.org
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Anthony Yznaga" <anthony.yznaga@oracle.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-arch@vger.kernel.org,
	mhocko@kernel.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
	arnd@arndb.de, ebiederm@xmission.com, keescook@chromium.org,
	gerg@linux-m68k.org, ktkhai@virtuozzo.com, peterz@infradead.org,
	esyr@redhat.com, jgg@ziepe.ca, christian@kellner.me,
	areber@redhat.com, cyphar@cyphar.com, steven.sistare@oracle.com,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	jasowang@redhat.com, "Philippe Mathieu-Daudé" <f4bug@amsat.org>,
	"Peter Xu" <peterx@redhat.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Alexander Graf" <graf@amazon.com>,
	vgoyal@redhat.com
Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Date: Fri, 31 Jul 2020 10:12:01 +0100
Message-ID: <20200731091201.GA173964@stefanha-x1.localdomain>
In-Reply-To: <20200730152705.ol42jppnl4xfhl32@wittgenstein>


Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size is purely a virtual memory concept: it fixes the extent of the
memory view but does not consume resources (there is no physical memory
backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram", 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8 GiB, 10 GiB) of memfd at 8 GiB into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4 GiB, 12 GiB) of filefd at 0 into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);
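
The snippets above use struct memview_map_fd_info and two ioctl
requests without defining them. Since the driver is only a proposal, no
header exists. The following is a purely hypothetical sketch of what a
uapi header could look like. The field types, field order, the 'M'
magic, the request numbers and the flag values are all assumptions made
for illustration. Only the names come from this email.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

#define GiB (1ULL << 30)

/* Hypothetical flag values; the email only names the flags. */
#define MEMVIEW_MAP_READ  (1U << 0)
#define MEMVIEW_MAP_WRITE (1U << 1)
#define MEMVIEW_MAP_EXEC  (1U << 2)

/*
 * Hypothetical layout. Fields are ordered so that the struct has no
 * implicit padding on common 64-bit ABIs.
 */
struct memview_map_fd_info {
    __s32 fd;         /* source file descriptor */
    __u32 flags;      /* MEMVIEW_MAP_* permissions */
    __u64 fd_offset;  /* byte offset into the source fd */
    __u64 size;       /* length of the range in bytes */
    __u64 mem_offset; /* byte offset into the memory view */
};

/* Hypothetical request encodings. */
#define MEMVIEW_MAP_FD     _IOW('M', 1, struct memview_map_fd_info)
#define MEMVIEW_GET_VIEWFD _IO('M', 2)
```

MEMVIEW_UNMAP_FD is left out because the email does not say what
argument it takes.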

The memory view has now been populated like this:

Range (GiB)   Fd               Permissions
0-8           big-file.iso     read
8-10          guest-ram        read+write
10-16         <none>           <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...
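
The email does not say how viewfd travels to process B. A common choice,
and the one vhost-user already uses for guest RAM fds, is SCM_RIGHTS
ancillary data over an AF_UNIX socket. A minimal sketch, assuming the
two processes already share a connected socket (the helper names send_fd
and recv_fd are made up for this example):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a connected AF_UNIX socket. Returns 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of real data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from a connected AF_UNIX socket. Returns the fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Process A would call send_fd(sock, viewfd) and process B would call
recv_fd(sock). The received descriptor is a new fd number in process B
that refers to the same open file.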

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, an access 4 GiB from the start of the memory view is an
access 8 GiB into big-file.iso. That's because the 8 GiB of big-file.iso
mapped at view offset 0 starts at fd_offset 4 GiB (4 GiB + 4 GiB = 8 GiB).
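
The offset translation can be sketched in userspace C. This is only a
model of the lookup the fault handler would need, not driver code. The
struct memview_range type, the memview_resolve() helper and the flag
values are invented for this example. The layout table mirrors the two
MEMVIEW_MAP_FD calls shown earlier.

```c
#include <stddef.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)

/* Hypothetical flag values; the email only names the flags. */
#define MEMVIEW_MAP_READ  (1u << 0)
#define MEMVIEW_MAP_WRITE (1u << 1)

/* One populated range of the memory view (hypothetical model type). */
struct memview_range {
    const char *name;    /* stands in for the source fd */
    uint64_t mem_offset; /* start of the range inside the view */
    uint64_t fd_offset;  /* matching start inside the source fd */
    uint64_t size;       /* length of the range */
    unsigned flags;      /* MEMVIEW_MAP_* permissions */
};

static const struct memview_range layout[] = {
    { "big-file.iso", 0 * GiB, 4 * GiB, 8 * GiB, MEMVIEW_MAP_READ },
    { "guest-ram",    8 * GiB, 8 * GiB, 2 * GiB,
      MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE },
};

/*
 * Translate an offset in the view to an offset in the backing fd.
 * Returns the matching range, or NULL if nothing is mapped there.
 */
static const struct memview_range *
memview_resolve(uint64_t view_off, uint64_t *fd_off)
{
    for (size_t i = 0; i < sizeof(layout) / sizeof(layout[0]); i++) {
        const struct memview_range *r = &layout[i];

        if (view_off >= r->mem_offset &&
            view_off - r->mem_offset < r->size) {
            *fd_off = r->fd_offset + (view_off - r->mem_offset);
            return r;
        }
    }
    return NULL;
}
```

memview_resolve(4 * GiB, &off) finds the big-file.iso range and sets off
to 8 GiB, matching the example above. An offset in [10 GiB, 16 GiB)
returns NULL because nothing is mapped there.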

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.
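
The AND rule is small enough to write out. A sketch, assuming
hypothetical values for the MEMVIEW_MAP_* flags (the email does not
define them) and made-up helper names:

```c
#include <sys/mman.h>

/* Hypothetical flag values; the email only names the flags. */
#define MEMVIEW_MAP_READ  (1u << 0)
#define MEMVIEW_MAP_WRITE (1u << 1)
#define MEMVIEW_MAP_EXEC  (1u << 2)

/* Convert memory view flags into the equivalent PROT_* bits. */
static int memview_flags_to_prot(unsigned flags)
{
    int prot = PROT_NONE;

    if (flags & MEMVIEW_MAP_READ)
        prot |= PROT_READ;
    if (flags & MEMVIEW_MAP_WRITE)
        prot |= PROT_WRITE;
    if (flags & MEMVIEW_MAP_EXEC)
        prot |= PROT_EXEC;
    return prot;
}

/*
 * Effective protection for a page: what process B asked for in
 * mmap/mprotect ANDed with what process A granted for that range.
 */
static int memview_effective_prot(int vma_prot, unsigned map_flags)
{
    return vma_prot & memview_flags_to_prot(map_flags);
}
```

With the layout above, process B's PROT_READ | PROT_WRITE mmap yields
PROT_READ for the big-file.iso range and PROT_READ | PROT_WRITE for the
guest-ram range. Neither side can grant more than the other allows.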

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page without sufficient permissions, or a page
that process A has not configured with ioctl calls, a SIGSEGV/SIGBUS
signal is raised.
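
The map/unmap/fault interplay can be modelled in userspace. This is a
toy model under stated assumptions, not a driver implementation: all
type and function names are invented, unmap only removes a range that
was mapped with exactly the same offset and size, overlapping ranges are
not handled, and the model does not decide between SIGSEGV and SIGBUS
because the email leaves that open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the email only names the flags. */
#define MEMVIEW_MAP_READ  (1u << 0)
#define MEMVIEW_MAP_WRITE (1u << 1)
#define MEMVIEW_MAX_RANGES 8

struct memview_range {
    uint64_t mem_offset;
    uint64_t size;
    unsigned flags;
    bool used;
};

struct memview {
    struct memview_range ranges[MEMVIEW_MAX_RANGES];
};

enum memview_result { MEMVIEW_ACCESS_OK, MEMVIEW_ACCESS_SIGNAL };

/* Models ioctl(MEMVIEW_MAP_FD): record a range and its permissions. */
static int memview_map(struct memview *mv, uint64_t mem_offset,
                       uint64_t size, unsigned flags)
{
    for (size_t i = 0; i < MEMVIEW_MAX_RANGES; i++) {
        if (!mv->ranges[i].used) {
            mv->ranges[i] = (struct memview_range){
                mem_offset, size, flags, true };
            return 0;
        }
    }
    return -1; /* table full */
}

/* Models ioctl(MEMVIEW_UNMAP_FD). A real driver would also zap ptes. */
static int memview_unmap(struct memview *mv, uint64_t mem_offset,
                         uint64_t size)
{
    for (size_t i = 0; i < MEMVIEW_MAX_RANGES; i++) {
        struct memview_range *r = &mv->ranges[i];

        if (r->used && r->mem_offset == mem_offset && r->size == size) {
            r->used = false;
            return 0;
        }
    }
    return -1; /* no such range */
}

/* Models the fault handler's decision for one access by process B. */
static enum memview_result
memview_check_access(const struct memview *mv, uint64_t off, bool write)
{
    unsigned need = write ? MEMVIEW_MAP_WRITE : MEMVIEW_MAP_READ;

    for (size_t i = 0; i < MEMVIEW_MAX_RANGES; i++) {
        const struct memview_range *r = &mv->ranges[i];

        if (r->used && off >= r->mem_offset &&
            off - r->mem_offset < r->size)
            return (r->flags & need) ? MEMVIEW_ACCESS_OK
                                     : MEMVIEW_ACCESS_SIGNAL;
    }
    return MEMVIEW_ACCESS_SIGNAL; /* nothing mapped at this offset */
}
```

Because every access is re-checked against the current table, a read
that succeeded before memview_unmap() is refused afterwards. That is the
revocation property the real driver would get by zapping ptes and
re-faulting.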

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan
