linux-mm.kvack.org archive mirror
From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" <longpeng2@huawei.com>
To: Khalid Aziz <khalid.aziz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>
Subject: RE: [RFC PATCH 0/6] Add support for shared PTEs across processes
Date: Sat, 22 Jan 2022 01:39:46 +0000	[thread overview]
Message-ID: <b34ded1e11154eabbce07618bf0a6676@huawei.com> (raw)
In-Reply-To: <0ec88ae7-9740-835d-1f07-60bd57081fcd@oracle.com>



> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, January 22, 2022 12:42 AM
> To: Matthew Wilcox <willy@infradead.org>; Barry Song <21cnbao@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Arnd Bergmann <arnd@arndb.de>;
> Dave Hansen <dave.hansen@linux.intel.com>; David Hildenbrand
> <david@redhat.com>; LKML <linux-kernel@vger.kernel.org>; Linux-MM
> <linux-mm@kvack.org>; Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) <longpeng2@huawei.com>; Mike Rapoport <rppt@kernel.org>; Suren
> Baghdasaryan <surenb@google.com>
> Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
> 
> On 1/21/22 07:47, Matthew Wilcox wrote:
> > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> >>>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>>> this file returns two long values - (1) starting address, and (2)
> >>>>> size of the mshare'd region.
> >>>>>
> >>>>> --
> >>>>> int mshare_unlink(char *name)
> >>>>>
> >>>>> A shared address range created by mshare() can be destroyed using
> >>>>> mshare_unlink() which removes the shared named object. Once all
> >>>>> processes have unmapped the shared object, the shared address range
> >>>>> references are de-allocated and destroyed.
> >>>>
> >>>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>> I am still struggling with the user scenarios of these new APIs. This patch
> >>>> assumes multiple processes will have the same virtual address for the shared
> >>>> area? How can this be guaranteed when different processes can map different
> >>>> stacks, heaps, libraries, and files?
> >>>
> >>> The two processes choose to share a chunk of their address space.
> >>> They can map anything they like in that shared area, and then also
> >>> anything they like in the areas that aren't shared.  They can choose
> >>> for that shared area to have the same address in both processes
> >>> or different locations in each process.
> >>>
> >>> If two processes want to put a shared library in that shared address
> >>> space, that should work.  They probably would need to agree to use
> >>> the same virtual address for the shared page tables for that to work.
> >>
> >> we depend on the ELF loader and ld.so to map libraries dynamically, so we
> >> hardly have a chance in user code to call mshare() to map libraries at the
> >> application level?
> >
> > If somebody wants to modify ld.so to take advantage of mshare(), they
> > could.  That wasn't our primary motivation here, so if it turns out to
> > not work for that usecase, well, that's a shame.
> >
> >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> >>> memory, you can share _anything_ that's mmapable.
> >>
> >> yep, we can call mshare() on any kind of memory, for example when multiple
> >> processes use SYSV shmem, POSIX shmem, or mmap the same file. But it seems
> >> more sensible to let the kernel do this automatically rather than depending
> >> on users calling mshare(). It is difficult for users to decide which areas
> >> mshare() should be applied to; users might want to call mshare() on every
> >> shared area to save the memory consumed by duplicated PTEs. Unlike SYSV
> >> shmem and POSIX shmem, which are features for inter-process communication,
> >> mshare() looks less like a feature for applications than a feature at the
> >> whole-system level. Why would applications have to call something that
> >> doesn't directly help them? Without mshare(), those applications still work
> >> without any problem, right? Is there anything in mshare() that is a
> >> must-have for applications, or is mshare() only a hint from applications,
> >> like madvise()?
> >
> > Our use case is that we have some very large files stored on persistent
> > memory which we want to mmap in thousands of processes.  So the first
> > one shares a chunk of its address space and mmaps all the files into
> > that chunk of address space.  Subsequent processes find that a suitable
> > address space already exists and use it, sharing the page tables and
> > avoiding the calls to mmap.
> >
> > Sharing page tables is akin to running multiple threads in a single
> > address space; except that only part of the address space is the same.
> > There does need to be a certain amount of trust between the processes
> > sharing the address space.  You don't want to do it to an unsuspecting
> > process.
> >
> 
> Hello Barry,
> 
> mshare() is really meant for sharing data across unrelated processes by
> sharing address space explicitly, hence the opt-in requirement. As Matthew
> said, the processes sharing this virtual address space need to have a level
> of trust. Permissions on the msharefs files control who can access this
> shared address space. It is possible to adapt this mechanism to share
> stacks, libraries, etc., but that is not the intent. This feature will be
> used by applications that normally share data with multiple processes
> through shared mappings, and it helps them avoid the overhead of a large
> number of duplicated PTEs, which consume memory. The extra memory consumed
> by PTEs reduces the amount of memory available to applications and can
> result in an out-of-memory condition. An example from patch 0/6:
> 
> "On a database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant."
> 

The memory overhead of PTEs would be significantly reduced if hugetlbfs were
used in this case, so why not use hugetlbfs instead?

> --
> Khalid
