linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Paul McKenney <paulmckrcu@fb.com>
Subject: synchronize_rcu in munmap?
Date: Mon, 8 Feb 2021 13:26:43 +0000	[thread overview]
Message-ID: <20210208132643.GP308988@casper.infradead.org> (raw)

TLDR: I think we're going to need to call synchronize_rcu() in munmap(),
and I know this can be quite expensive.  Is this a reasonable price to
pay for avoiding taking the mmap_sem for page faults?

We're taking the first tentative steps towards moving the page fault
handler from being protected by the mmap_sem to being protected by the
RCU lock.  The first part is replacing the rbtree with the maple tree,
so a walk can be RCU safe.  Liam has taken care of that part.  The next
step is to introduce an RCU user that's perhaps less critical than the
page fault handler to see the consequences.  So I chose /proc/$pid/maps.
This also covers smaps and numa_maps.  It's probably the simplest user,
and taking the mmap_sem in this path actually affects some important
workloads.

Here's the tree:
https://git.infradead.org/users/willy/linux-maple.git/shortlog/refs/heads/proc-vma-rcu

The first of the two patches simply frees VMAs through the RCU mechanism.
That means we can now walk the tree and read from the VMA, knowing
the VMA will remain a VMA until we drop the RCU lock.  It might be
removed from the tree by a racing munmap(), but the memory will not be
reallocated.

The second patch converts the seq_file business of finding the VMAs
from being protected by the mmap_sem to the rcu_lock.

Problem: We're quite free about modifying VMAs.  So vm_start and vm_end
might get modified while we're in the middle of the walk.  I think
that's OK in some circumstances (eg a second mmap causes vm_end to
be increased and we miss the increase), but others are less OK.  For
example mremap(MREMAP_MAYMOVE) may result in us seeing output from
/proc/$pid/maps like this:

7f38db534000-7f38db81a000 r--p 00000000 fe:01 55416                      /usr/lib/locale/locale-archive
7f38db81a000-7f38db83f000 r--p 00000000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so
55d237566000-55d237567000 rw-p 00000000 00:00 0
7f38db98a000-7f38db9d4000 r--p 00170000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f38db9d4000-7f38db9d5000 ---p 001ba000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so

and I'm not sure how comfortable we are with userspace parsers coping
with that kind of situation.  We should probably allocate a new VMA
and free the old one in this situation.

Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
saying:
        mmap_assert_locked(walk.mm);
which made me realise that smaps is also going to walk the page tables.
So the page tables have to be pinned by the existence of the VMA.
Which means the page tables must be freed by the same RCU callback that
frees the VMA.  But doing that means that a task which calls mmap();
munmap(); mmap(); must avoid allocating the same address for the second
mmap (until the RCU grace period has elapsed), otherwise threads on
other CPUs may see the stale PTEs instead of the new ones.

Solution 1: Move the page table freeing into the RCU callback, call
synchronize_rcu() in munmap().

Solution 2: Refcount the VMA and free the page tables on refcount
dropping to zero.  This doesn't actually work because the stale PTE
problem still exists.

Solution 3: When unmapping a VMA, instead of erasing the VMA from the
maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
TLB shootdown has happened, erase the entry and it can then be allocated.
If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
a dead entry.

Thoughts?


             reply	other threads:[~2021-02-08 13:26 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-08 13:26 Matthew Wilcox [this message]
2021-02-09 14:29 ` synchronize_rcu in munmap? Matthew Wilcox
2021-02-09 17:19   ` Laurent Dufour
2021-02-09 17:38     ` Jason Gunthorpe
2021-02-09 19:58       ` Matthew Wilcox
2021-02-10 16:42         ` Jason Gunthorpe
2021-02-09 17:08 ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210208132643.GP308988@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=ldufour@linux.ibm.com \
    --cc=linux-mm@kvack.org \
    --cc=paulmckrcu@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).