From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" <longpeng2@huawei.com>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>
Subject: RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
Date: Mon, 16 Aug 2021 06:54:52 +0000
Message-ID: <49528665cce5490eaa2961fe7d282752@huawei.com>
In-Reply-To: <55720e1b39cff0a0f882d8610e7906dc80ea0a01.camel@oracle.com>

Hi Khalid,

Thanks for your reply, please see below :)

> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, August 14, 2021 3:49 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>; Matthew Wilcox <willy@infradead.org>
> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.) wrote:
> > Hi Matthew,
> >
> > > -----Original Message-----
> > > From: Matthew Wilcox [mailto:willy@infradead.org]
> > > Sent: Monday, July 12, 2021 9:30 AM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > <longpeng2@huawei.com>
> > > Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> > > <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> > > linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> > >
> > > On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> > > Infrastructure Service Product Dept.) wrote:
> > > > Let me describe my use case more clearly (just ignore if you're
> > > > not interested in it):
> > > >
> > > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > > allocated VA range is [0x40000000,0x140000000)
> > > >
> > > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > > [0x80000000,0x100000000) will be shared by its child.
> > > >
> > > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > >
> > > > 4. Prog B notices the shared ranges (e.g. via input parameters or
> > > > ...)
> > > > and remaps them to a contiguous VA range.
> > >
> > > This is dangerous.  There must be an active step for Prog B to
> > > accept Prog A's ranges into its address space.  Otherwise Prog A
> > > could almost completely fill Prog B's address space and so control
> > > where Prog B places its mappings.  It could also provoke a latent
> > > bug in Prog B if it doesn't handle address space exhaustion
> > > gracefully.
> > >
> > > I had a proposal to handle this.  Would it meet your requirements?
> > > https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead
> > > .org/
> >
> > I noticed your proposal of project Sileby and I think it can meet
> > Steven's requirement, but I'm not sure whether it's suitable for mine
> > because there's no sample code yet. Is it in progress?
> 
> Hi Mike,
> 
> I am working on refining the ideas from project Sileby. I am also working on
> designing the implementation. Since the original concept, the mshare API has
> evolved further. Here is what it looks like:
> 

What are the actual use cases of Sileby? We can already share anonymous memory
via memfd.
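
For reference, this is roughly what I mean by sharing anonymous memory through a memfd. It is only a minimal sketch: the helper name memfd_share_demo() and the "shared-demo" label are made up for illustration, and the child inherits the fd through fork() (it could equally be passed some other way).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a memfd of `size` bytes, let a forked child write `msg` into it,
 * and copy what the parent then sees into `out`.
 * Returns 0 on success, -1 on error.
 */
int memfd_share_demo(size_t size, const char *msg, char *out, size_t outlen)
{
    assert(size > 0 && outlen > 0);

    int fd = memfd_create("shared-demo", 0);  /* the name is only a debug label */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: map the inherited fd on its own and write into it. */
        char *c = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (c == MAP_FAILED)
            _exit(1);
        strncpy(c, msg, size - 1);  /* the file is zero-filled, so it stays terminated */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    /* Parent: map only after the child has exited; the data is still there. */
    char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    strncpy(out, p, outlen - 1);
    out[outlen - 1] = '\0';
    munmap(p, size);
    return 0;
}
```

Note that this shares pages, not page tables: each process still builds its own PTEs for the mapping.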

> The mshare API consists of two system calls - mshare() and
> mshare_unlink()
> 
> mshare
> ======
> 
> int mshare(char *name,void *addr, size_t length, int oflags, mode_t
> mode)
> 
> mshare() creates and opens a new, or opens an existing shared memory area that
> will be shared at PTE level. name refers to shared object name that exists under
> /dev/mshare (this is subject to change. There might be better ways to manage the
> names for mshare'd areas). addr is the starting address of this shared memory
> area and length is the size of this area. oflags can be one of:
> 
>     O_RDONLY opens shared memory area for read only access by everyone
>     O_RDWR opens shared memory area for read and write access
>     O_CREAT creates the named shared memory area if it does not exist
>     O_EXCL If O_CREAT was also specified, and a shared memory area
>         exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under /dev/mshare.
> 
> Return Value
> ------------
> 
> mshare() returns a file descriptor. A read from this file descriptor returns two long
> values - (1) starting address, and (2) size of the shared memory area.
> 
> Notes
> -----
> 
> PTEs are shared at pgdir level and hence it imposes following requirements on the
> address and size given to the mshare():
> 
>     - Starting address must be aligned to pgdir size (512GB on x86_64)

This limitation does not seem reasonable; why is it needed?

>     - Size must be a multiple of pgdir size
>     - Any mappings created in this address range at any time become
>     shared automatically
>     - Shared address range can have unmapped addresses in it. Any
>     access to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared between
> threads, so a write to a MAP_PRIVATE mapping will create a page which is shared
> between all the sharers. The first process that declares an address range mshare'd
> can continue to map objects in the shared area. All other processes that want
> mshare'd access to this memory area can do so by calling mshare(). After this call,
> the address range given by mshare becomes a shared range in its address space.
> Anonymous mappings will be shared and not COWed.
> 
> 
> mshare_unlink
> =============
> 
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all processes
> have unmapped the shared object, the shared address range references are
> de-allocated and destroyed.
> 
> Return Value
> ------------
> 
> mshare_unlink() returns 0 on success or -1 on error.
> 
> 
> Example
> =======
> 
> A process can create an mshare'd area and map objects into it as
> follows:
> 
>     fd = mshare("junk",  TB(1), GB(512), O_CREAT|O_RDWR, 0600);
> 
>     /* Map objects in the shared address space and/or Write data */
> 
>     mshare_unlink("junk");
> 

Using a name to identify the range seems insecure and easy to attack.
How about using the fd instead? We can pass the fd to another process that is
permitted to access the mshare'd memory.
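
To illustrate what passing the fd could look like, here is a minimal sketch using SCM_RIGHTS over an AF_UNIX socket. The helper names send_fd() and recv_fd() are mine, and nothing here is specific to mshare: any fd can be handed over this way, and only a process that receives it can use it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for exactly one fd, aligned for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send `fd` over the connected AF_UNIX socket `sock`. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;  /* one byte of real data travels with the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description as the sender's fd.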

> Another process can then access this shared memory area with another call to
> mshare():
> 
>     fd = mshare("junk", TB(1), GB(512), O_RDWR, 0600);
> 
>     /* Read and write data in TB(1)-((TB(1)+GB(512)) range */
> 
>     mshare_unlink("junk");
> 
> 
> >
> > According to the abstract of Sileby, I have two questions:
> > 1. Do you plan to support sharing file-mapped memory? e.g.
> > Prog A's 4G memory is backed by 2M hugetlb.
> 
> Yes, file-mapped memory sharing support is planned.
> 
> > 2. Does each mshare fd contain only one shared VMA? For a large-memory
> > process (1T~4T in our env), there may be hundreds of memory ranges that
> > need to be shared; would this take too much fd space?
> >
> 
> No, each fd can support all VMAs covered by the address range with a size that is
> multiple of pgdir size.
> 

I have also drafted an internal proposal to meet our requirements.
Our requirements are:
1. Process A can specify some ranges to be shared with another process B.
2. The ranges are shared in the order specified by process A.
3. The ranges can span more than one VMA; some of the VMAs can be anon and
  others can be file-mapped.

The proposal introduces a char device named /dev/mshare that interacts with users
through ioctl interfaces. It currently supports three commands:
- MSHARE_SIGN: specify a range to be shared
- MSHARE_INFO: get info about the shared ranges
- MSHARE_MMAP: map the shared ranges into the address space

Here is an example to show how to use it.

1. Suppose process A wants to share four ranges:
  a. 0x1000 ~ 0x2000 --> Anonymous
  b. 0x4000 ~ 0x8000 --> PFN mapped range
  c. 0x200000 ~ 0x400000 --> 2M hugetlb
  d. 0x500000 ~ 0x501000 --> Anonymous


2. Process A writes data into these four ranges in the following sequence:
  "b -- c -- a -- d"
  (so process B should map the ranges in the same order)


3. Process A specifies the ranges:

  FD = open("/dev/mshare");
  ioctl(FD, MSHARE_SIGN, range b);
  ioctl(FD, MSHARE_SIGN, range c);
  ioctl(FD, MSHARE_SIGN, range a);
  ioctl(FD, MSHARE_SIGN, range d);
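
Spelled out a little more, step 3 might look like the sketch below. Everything about the interface here is HYPOTHETICAL: struct mshare_range, its fields, the 'M' ioctl magic and the command number are placeholders picked for illustration, not what the draft actually uses, and mshare_sign_all() is just a helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL layout and ioctl number, for illustration only. */
struct mshare_range {
    uint64_t start;  /* first byte of the range */
    uint64_t end;    /* one past the last byte of the range */
};
#define MSHARE_SIGN _IOW('M', 0x01, struct mshare_range)

/* The four example ranges, listed in the order they must be shared. */
static const struct mshare_range demo_ranges[] = {
    { 0x4000,   0x8000   },  /* b: PFN mapped range */
    { 0x200000, 0x400000 },  /* c: 2M hugetlb */
    { 0x1000,   0x2000   },  /* a: anonymous */
    { 0x500000, 0x501000 },  /* d: anonymous */
};

/*
 * Register `n` ranges on `fd` in array order.
 * Returns 0 on success, -1 on the first empty range or failing ioctl.
 */
int mshare_sign_all(int fd, const struct mshare_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].start >= r[i].end)
            return -1;
        if (ioctl(fd, MSHARE_SIGN, &r[i]) < 0)
            return -1;
    }
    return 0;
}
```

The order of the array is what encodes the "b -- c -- a -- d" sequence, so the caller does not need a separate index field.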


4. Process A makes sure the close-on-exec flag (FD_CLOEXEC) of FD is clear


5. Process A now forks and execs process B with the parameter shared_fd=FD:

  ./bin_B shared_fd=FD
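
On the process B side, picking the fd number out of that parameter could be done as in this sketch. The "shared_fd=" syntax is just the convention from this example, and parse_shared_fd() is a made-up helper name.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an argument of the form "shared_fd=<n>".
 * Returns the fd number, or -1 if the argument is malformed.
 */
int parse_shared_fd(const char *arg)
{
    static const char prefix[] = "shared_fd=";
    const size_t plen = sizeof(prefix) - 1;
    char *end;
    long v;

    if (arg == NULL || strncmp(arg, prefix, plen) != 0)
        return -1;
    /* Require a leading digit: rejects "", signs and leading whitespace. */
    if (arg[plen] < '0' || arg[plen] > '9')
        return -1;

    errno = 0;
    v = strtol(arg + plen, &end, 10);
    if (errno != 0 || *end != '\0' || v > INT_MAX)
        return -1;
    return (int)v;
}
```

Process B should still treat the number as untrusted until an ioctl on it succeeds, since any caller can put an arbitrary value on the command line.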


6. Process B gets the info via the MSHARE_INFO ioctl:

  ioctl( shared_fd, MSHARE_INFO, &info );

  The info includes:
    num: the count of the shared ranges
    totalvm: the total vm size of these shared ranges
    align: align requirement


7. Process B uses MSHARE_MMAP to map these ranges into its address space:

  p = mmap ( info.totalvm, info.align );
  ioctl(shared_fd, MSHARE_MMAP, {MMAP_ALL, address fixed, p} );


I have written a draft of it and it works locally. But I'm not sure the concept is right,
so I'm glad to know that you're working on Sileby; maybe I can build on your work.


> --
> Khalid

