From: Andy Lutomirski <luto@kernel.org>
To: Khalid Aziz <khalid.aziz@oracle.com>, Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>, Jonathan Corbet <corbet@lwn.net>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz,
	Kees Cook <keescook@chromium.org>,
	kirill@shutemov.name, kucharsk@gmail.com, linkinjeon@kernel.org,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	longpeng2@huawei.com, markhemm@googlemail.com,
	Peter Collingbourne <pcc@google.com>,
	Mike Rapoport <rppt@kernel.org>,
	sieberf@amazon.com, sjpark@amazon.de,
	Suren Baghdasaryan <surenb@google.com>,
	tst@schoebel-theuer.de, Iurii Zaikin <yzaikin@google.com>
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
Date: Sun, 3 Jul 2022 13:54:26 -0700
Message-ID: <48e40b61-f506-72a1-0839-08bc9db483cc@kernel.org>
In-Reply-To: <e5bebb34-5858-815c-9c2c-254a95b86b07@oracle.com>

On 6/29/22 10:38, Khalid Aziz wrote:
> On 5/30/22 22:24, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>> are removed when the last task sharing the PTEs exits. Add a check
>>> for shared PTEs and skip them.
>>>
>>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>>> ---
>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>                  } else {
>>>                          /*
>>>                           * Optimization: gather nearby vmas into one call down
>>> +                        * as long as they all belong to the same mm (that
>>> +                        * may not be the case if a vma is part of mshare'd
>>> +                        * range
>>>                           */
>>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>> -                              && !is_vm_hugetlb_page(next)) {
>>> +                              && !is_vm_hugetlb_page(next)
>>> +                              && vma->vm_mm == tlb->mm) {
>>>                                  vma = next;
>>>                                  next = vma->vm_next;
>>>                                  unlink_anon_vmas(vma);
>>>                                  unlink_file_vma(vma);
>>>                          }
>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>> -                               floor, next ? next->vm_start : ceiling);
>>> +                       /*
>>> +                        * Free pgd only if pgd is not allocated for an
>>> +                        * mshare'd range
>>> +                        */
>>> +                       if (vma->vm_mm == tlb->mm)
>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>> +                                       floor, next ? next->vm_start : ceiling);
>>>                  }
>>>                  vma = next;
>>>          }
>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>          pgd_t *pgd;
>>>          unsigned long next;
>>>
>>> +       /*
>>> +        * If this is an mshare'd page, do not unmap it since it might
>>> +        * still be in use.
>>> +        */
>>> +       if (vma->vm_mm != tlb->mm)
>>> +               return;
>>> +
>>
>> expect unmap, have you ever tested reverse mapping in vmscan, especially
>> folio_referenced()? are all vmas in those processes sharing page table still
>> in the rmap of the shared page?
>> without shared PTE, if 1000 processes share one page, we are reading 1000
>> PTEs, with it, are we reading just one? or are we reading the same PTE
>> 1000 times? Have you tested it?
>>
> 
> We are treating mshared region same as threads sharing address space. 
> There is one PTE that is being used by all processes and the VMA 
> maintained in the separate mshare mm struct that also holds the shared 
> PTE is the one that gets added to rmap. This is a different model with 
> mshare in that it adds an mm struct that is separate from the mm structs 
> of the processes that refer to the vma and pte in mshare mm struct. Do 
> you see issues with rmap in this model?

I think this patch is actually the most interesting bit of the series by 
far.  Most of the rest is defining an API (which is important!) and 
figuring out semantics.  This patch changes something rather fundamental 
about how user address spaces work: what vmas live in them.  So let's 
figure out its effects.

I admit I'm rather puzzled about what vm_mm is for in the first place. 
In current kernels (without your patch), I think it's a pretty hard 
requirement for vm_mm to equal the mm for all vmas in an mm.  After a 
quick and incomplete survey, vm_mm seems to be mostly used as a somewhat 
lazy way to find the mm.  Let's see:

file_operations->mmap doesn't receive an mm_struct.  Instead it infers 
the mm from vm_mm.  (Why?  I don't know.)

Some walk_page_range users seem to dig the mm out of vm_mm instead of 
mm_walk.

Some manual address space walkers start with an mm, don't bother passing 
it around, and dig it back out of vm_mm.  For example, unuse_vma() 
and all its helpers.

The only real exception I've found so far is rmap: AFAICS (on quick 
inspection -- I could be wrong), rmap can map from a folio to a bunch of 
vmas, and the vmas' mms are not stored separately but instead determined 
by vm_mm.
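
To make the rmap concern concrete, here is a small userspace toy model,
not kernel code.  The toy_* names are hypothetical stand-ins that keep
only the fields relevant here.  It assumes the model Khalid describes,
where the only vma added to rmap is the one living in the separate
mshare mm.  Since the walker can only learn an mm through vm_mm, the
processes that actually use the shared PTE are invisible to it:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); real kernel structures are far richer. */
struct toy_mm {
	int id;
};

struct toy_vma {
	struct toy_mm *vm_mm;	/* the only route from a vma to an mm */
};

/* A folio's reverse map: the vmas that map it.  No mm is stored here. */
struct toy_folio {
	struct toy_vma **vmas;
	size_t nr_vmas;
};

/* Count how many rmap entries resolve, via vm_mm, to the given mm. */
static size_t toy_rmap_count_for_mm(const struct toy_folio *folio,
				    const struct toy_mm *mm)
{
	size_t n = 0;

	assert(folio != NULL && mm != NULL);
	for (size_t i = 0; i < folio->nr_vmas; i++)
		if (folio->vmas[i]->vm_mm == mm)
			n++;
	return n;
}
```

With one shared vma in the rmap, a lookup for the mshare mm finds one
entry and a lookup for any process mm finds none, however many
processes share the page.  Whether that is what rmap's callers want is
exactly the open question.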



Your patch makes me quite nervous.  You're potentially breaking any 
kernel code path that assumes that mms only contain vmas that have vm_mm 
== mm.  And you're potentially causing rmap to be quite confused.  I 
think that if you're going to take this approach, you need to clearly 
define the new semantics of vm_mm and audit or clean up every user of 
vm_mm in the tree.  This may be nontrivial (especially rmap), although a 
cleanup of everything else to stop using vm_mm might be valuable.
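
As a sketch of the assumption at stake (a userspace toy, not kernel
code; the toy_* types are hypothetical and the singly linked vm_next
list just mirrors the one visible in the quoted patch), this is the
check that I believe holds in current kernels and that stops holding
once an mm can contain a vma belonging to a different mm:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); only the fields needed for the check. */
struct toy_mm;

struct toy_vma {
	struct toy_mm *vm_mm;		/* back-pointer to the owning mm */
	struct toy_vma *vm_next;	/* next vma in the mm's list */
};

struct toy_mm {
	struct toy_vma *mmap;		/* head of the vma list */
};

/* Returns true if every vma reachable from mm points back at mm. */
static bool toy_mm_owns_all_vmas(const struct toy_mm *mm)
{
	assert(mm != NULL);
	for (const struct toy_vma *vma = mm->mmap; vma; vma = vma->vm_next)
		if (vma->vm_mm != mm)
			return false;
	return true;
}
```

Any code path that silently relies on this returning true is a
candidate for the audit.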

But I'm wondering if it would be better to attack this from a different 
direction.  Right now, there's a hardcoded assumption that an mm owns 
every page table it references.  That's really the thing you're 
changing.  To me, it seems that a magical vma that shares page tables 
should still be a vma that belongs to its mm_struct -- munmap() and 
potentially other m***() operations should all work on it, existing 
find_vma() users should work, etc.

So maybe instead there should be new behavior (by a VM_ flag or 
otherwise) that indicates that a vma owns its PTEs.  It could even be a 
vm_operation, although if anyone ever wants regular file mappings to 
share PTEs, then a vm_operation doesn't really make sense.
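
Roughly what I have in mind, as a userspace toy rather than a proposal
for actual code.  TOY_VM_OWNS_PTES and every toy_* name are
hypothetical, and treating "the vma owns its PTEs" as "mm teardown must
not free them" is just one reading of the idea.  The point is that the
vma keeps vm_mm == mm, and teardown consults the flag instead of
comparing vm_mm against tlb->mm:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: page tables belong to the vma, not to the mm. */
#define TOY_VM_OWNS_PTES 0x1UL

struct toy_mm;

struct toy_vma {
	struct toy_mm *vm_mm;		/* always the containing mm */
	unsigned long vm_flags;
	struct toy_vma *vm_next;
	bool pt_freed;			/* set when teardown frees its tables */
};

struct toy_mm {
	struct toy_vma *mmap;
};

/* The mm may free page tables only for vmas that do not own them. */
static bool toy_mm_may_free_pgtables(const struct toy_vma *vma)
{
	return !(vma->vm_flags & TOY_VM_OWNS_PTES);
}

/* Tear down an mm; returns how many vmas had their page tables freed. */
static size_t toy_free_pgtables(struct toy_mm *mm)
{
	size_t freed = 0;

	for (struct toy_vma *vma = mm->mmap; vma; vma = vma->vm_next) {
		assert(vma->vm_mm == mm);	/* invariant is preserved */
		if (!toy_mm_may_free_pgtables(vma))
			continue;		/* shared tables outlive this mm */
		vma->pt_freed = true;
		freed++;
	}
	return freed;
}
```

Nothing here says who eventually frees the shared tables; that would
still need the last-user accounting the series already has to do.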

