* synchronize_rcu in munmap?
@ 2021-02-08 13:26 Matthew Wilcox
  2021-02-09 14:29 ` Matthew Wilcox
  2021-02-09 17:08 ` Jason Gunthorpe
  0 siblings, 2 replies; 7+ messages in thread
From: Matthew Wilcox @ 2021-02-08 13:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Liam R. Howlett, Laurent Dufour, Paul McKenney

TLDR: I think we're going to need to call synchronize_rcu() in munmap(),
and I know this can be quite expensive.  Is this a reasonable price to
pay for avoiding taking the mmap_sem for page faults?

We're taking the first tentative steps towards moving the page fault
handler from being protected by the mmap_sem to being protected by the
RCU lock.  The first part is replacing the rbtree with the maple tree,
so a walk can be RCU safe.  Liam has taken care of that part.  The next
step is to introduce an RCU user that's perhaps less critical than the
page fault handler to see the consequences.  So I chose /proc/$pid/maps.
This also covers smaps and numa_maps.  It's probably the simplest user,
and taking the mmap_sem in this path actually affects some important
workloads.

Here's the tree:
https://git.infradead.org/users/willy/linux-maple.git/shortlog/refs/heads/proc-vma-rcu

The first of the two patches simply frees VMAs through the RCU mechanism.
That means we can now walk the tree and read from the VMA, knowing
the VMA will remain a VMA until we drop the RCU lock.  It might be
removed from the tree by a racing munmap(), but the memory will not be
reallocated.
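The mechanics here are just the standard call_rcu() pattern; a sketch
(field and cache names are illustrative, not necessarily what is in
the tree):

```c
/* Sketch only: free a VMA through RCU instead of immediately.
 * Assumes an rcu_head (vm_rcu here) has been added to
 * struct vm_area_struct. */
static void vm_area_free_rcu_cb(struct rcu_head *head)
{
	struct vm_area_struct *vma =
		container_of(head, struct vm_area_struct, vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
	/* Readers under rcu_read_lock() may still hold a pointer to
	 * this VMA; the memory is not reused until a grace period
	 * has elapsed. */
	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
}
```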

The second patch converts the seq_file business of finding the VMAs
from being protected by the mmap_sem to the rcu_lock.

Problem: We're quite free about modifying VMAs.  So vm_start and vm_end
might get modified while we're in the middle of the walk.  I think
that's OK in some circumstances (eg a second mmap causes vm_end to
be increased and we miss the increase), but others are less OK.  For
example mremap(MREMAP_MAYMOVE) may result in us seeing output from
/proc/$pid/maps like this:

7f38db534000-7f38db81a000 r--p 00000000 fe:01 55416                      /usr/lib/locale/locale-archive
7f38db81a000-7f38db83f000 r--p 00000000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so
55d237566000-55d237567000 rw-p 00000000 00:00 0
7f38db98a000-7f38db9d4000 r--p 00170000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f38db9d4000-7f38db9d5000 ---p 001ba000 fe:01 123093081                  /usr/lib/x86_64-linux-gnu/libc-2.31.so

and I'm not sure how comfortable we are with userspace parsers coping
with that kind of situation.  We should probably allocate a new VMA
and free the old one in this situation.

Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
saying:
        mmap_assert_locked(walk.mm);
which made me realise that smaps is also going to walk the page tables.
So the page tables have to be pinned by the existence of the VMA.
Which means the page tables must be freed by the same RCU callback that
frees the VMA.  But doing that means that a task which calls mmap();
munmap(); mmap(); must avoid allocating the same address for the second
mmap (until the RCU grace period has elapsed), otherwise threads on
other CPUs may see the stale PTEs instead of the new ones.

Solution 1: Move the page table freeing into the RCU callback, call
synchronize_rcu() in munmap().

Solution 2: Refcount the VMA and free the page tables on refcount
dropping to zero.  This doesn't actually work because the stale PTE
problem still exists.

Solution 3: When unmapping a VMA, instead of erasing the VMA from the
maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
TLB shootdown has happened, erase the entry and it can then be allocated.
If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
a dead entry.
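To make Solution 1 concrete, a sketch (the helper names are made up;
the point is the ordering: the page tables go to the same RCU callback
as the VMA, and munmap() waits out the grace period so the range
cannot be reused over stale PTEs):

```c
/* Sketch of Solution 1: the VMA and its page tables are freed by
 * the same RCU callback, and munmap() blocks until the grace
 * period ends. */
static void vma_and_tables_free_rcu(struct rcu_head *head)
{
	struct vm_area_struct *vma =
		container_of(head, struct vm_area_struct, vm_rcu);

	free_pgtables_for_range(vma);		/* hypothetical helper */
	kmem_cache_free(vm_area_cachep, vma);
}

static void unmap_region_rcu(struct vm_area_struct *vma)
{
	call_rcu(&vma->vm_rcu, vma_and_tables_free_rcu);
	synchronize_rcu();	/* the expensive part: a full grace
				 * period on every munmap() */
}
```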

Thoughts?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: synchronize_rcu in munmap?
  2021-02-08 13:26 synchronize_rcu in munmap? Matthew Wilcox
@ 2021-02-09 14:29 ` Matthew Wilcox
  2021-02-09 17:19   ` Laurent Dufour
  2021-02-09 17:08 ` Jason Gunthorpe
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2021-02-09 14:29 UTC (permalink / raw)
  To: linux-mm; +Cc: Liam R. Howlett, Laurent Dufour, Paul McKenney

On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> saying:
>         mmap_assert_locked(walk.mm);
> which made me realise that smaps is also going to walk the page tables.
> So the page tables have to be pinned by the existence of the VMA.
> Which means the page tables must be freed by the same RCU callback that
> frees the VMA.  But doing that means that a task which calls mmap();
> munmap(); mmap(); must avoid allocating the same address for the second
> mmap (until the RCU grace period has elapsed), otherwise threads on
> other CPUs may see the stale PTEs instead of the new ones.
> 
> Solution 1: Move the page table freeing into the RCU callback, call
> synchronize_rcu() in munmap().
> 
> Solution 2: Refcount the VMA and free the page tables on refcount
> dropping to zero.  This doesn't actually work because the stale PTE
> problem still exists.
> 
> Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
> TLB shootdown has happened, erase the entry and it can then be allocated.
> If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
> a dead entry.

Solution 4: RCU free the page table pages and teach pagewalk.c to
be RCU-safe.  That means that it will have to use rcu_dereference()
or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
under the rcu read lock instead of disabling interrupts.
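Concretely, a step of walk_pmd_range() might become something like
this sketch:

```c
/* Sketch of an RCU-safe pmd step: snapshot the entry once, because
 * a racing munmap() can clear *pmdp at any time, and only
 * rcu_read_lock() keeps the pmd page itself from being freed and
 * reused under us. */
pmd_t *pmdp = pmd_offset(pudp, addr);
pmd_t pmdval = READ_ONCE(*pmdp);

if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
	goto next;
/* Use only the pmdval snapshot from here on; never re-read *pmdp. */
```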



* Re: synchronize_rcu in munmap?
  2021-02-08 13:26 synchronize_rcu in munmap? Matthew Wilcox
  2021-02-09 14:29 ` Matthew Wilcox
@ 2021-02-09 17:08 ` Jason Gunthorpe
  1 sibling, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2021-02-09 17:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Liam R. Howlett, Laurent Dufour, Paul McKenney

On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> TLDR: I think we're going to need to call synchronize_rcu() in munmap(),
> and I know this can be quite expensive.  Is this a reasonable price to
> pay for avoiding taking the mmap_sem for page faults?

In my experience we have been ripping out synchronize_rcu() in places
that intersect with userspace; it can be very slow, and if something
does more than one iteration things get bad.

Based on that, munmap() seems like a poor place to put synchronize_rcu().

> Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> saying:
>         mmap_assert_locked(walk.mm);
> which made me realise that smaps is also going to walk the page
> tables.

Aren't the presence/absence of the page table levels themselves
managed under the page table locks? I thought that unused levels were
wiped in some lazy fashion after the TLB flush, based on being empty?

There is a state graph of allowed page entry transitions under the
read side of the mmap sem, and allocated -> freed table is not
allowed (freed -> allocated is OK though, IIRC).

I think this has nothing to do with the mmap_sem acting as a VMA lock,
it feels like some extra behavior.

Jason



* Re: synchronize_rcu in munmap?
  2021-02-09 14:29 ` Matthew Wilcox
@ 2021-02-09 17:19   ` Laurent Dufour
  2021-02-09 17:38     ` Jason Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Laurent Dufour @ 2021-02-09 17:19 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm; +Cc: Liam R. Howlett, Paul McKenney

Le 09/02/2021 à 15:29, Matthew Wilcox a écrit :
> On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
>> Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
>> saying:
>>          mmap_assert_locked(walk.mm);
>> which made me realise that smaps is also going to walk the page tables.
>> So the page tables have to be pinned by the existence of the VMA.
>> Which means the page tables must be freed by the same RCU callback that
>> frees the VMA.  But doing that means that a task which calls mmap();
>> munmap(); mmap(); must avoid allocating the same address for the second
>> mmap (until the RCU grace period has elapsed), otherwise threads on
>> other CPUs may see the stale PTEs instead of the new ones.
>>
>> Solution 1: Move the page table freeing into the RCU callback, call
>> synchronize_rcu() in munmap().
>>
>> Solution 2: Refcount the VMA and free the page tables on refcount
>> dropping to zero.  This doesn't actually work because the stale PTE
>> problem still exists.
>>
>> Solution 3: When unmapping a VMA, instead of erasing the VMA from the
>> maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
>> TLB shootdown has happened, erase the entry and it can then be allocated.
>> If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
>> a dead entry.
> 
> Solution 4: RCU free the page table pages and teach pagewalk.c to
> be RCU-safe.  That means that it will have to use rcu_dereference()
> or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
> under the rcu read lock instead of disabling interrupts.

I might be wrong, but my understanding is that an RCU grace period
cannot complete while a CPU has IRQs disabled. So as a first step,
GUP-fast might continue to disable interrupts to make walking the page
directories safe.
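That is the trick GUP-fast relies on today; a simplified sketch
(function names approximate the lockless walk in mm/gup.c):

```c
/* Sketch of the existing GUP-fast pattern: with local IRQs off,
 * this CPU can neither pass through an RCU quiescent state nor
 * acknowledge the TLB-shootdown IPI, so any page table page seen
 * during the walk cannot be freed and reused until IRQs are
 * re-enabled. */
unsigned long flags;

local_irq_save(flags);
gup_pgd_range(start, end, gup_flags, pages, &nr_pinned); /* lockless walk */
local_irq_restore(flags);
```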




* Re: synchronize_rcu in munmap?
  2021-02-09 17:19   ` Laurent Dufour
@ 2021-02-09 17:38     ` Jason Gunthorpe
  2021-02-09 19:58       ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2021-02-09 17:38 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Matthew Wilcox, linux-mm, Liam R. Howlett, Paul McKenney

On Tue, Feb 09, 2021 at 06:19:35PM +0100, Laurent Dufour wrote:
> Le 09/02/2021 à 15:29, Matthew Wilcox a écrit :
> > On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> > > Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> > > saying:
> > >          mmap_assert_locked(walk.mm);
> > > which made me realise that smaps is also going to walk the page tables.
> > > So the page tables have to be pinned by the existence of the VMA.
> > > Which means the page tables must be freed by the same RCU callback that
> > > frees the VMA.  But doing that means that a task which calls mmap();
> > > munmap(); mmap(); must avoid allocating the same address for the second
> > > mmap (until the RCU grace period has elapsed), otherwise threads on
> > > other CPUs may see the stale PTEs instead of the new ones.
> > > 
> > > Solution 1: Move the page table freeing into the RCU callback, call
> > > synchronize_rcu() in munmap().
> > > 
> > > Solution 2: Refcount the VMA and free the page tables on refcount
> > > dropping to zero.  This doesn't actually work because the stale PTE
> > > problem still exists.
> > > 
> > > Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> > > maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
> > > TLB shootdown has happened, erase the entry and it can then be allocated.
> > > If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
> > > a dead entry.
> > 
> > Solution 4: RCU free the page table pages and teach pagewalk.c to
> > be RCU-safe.  That means that it will have to use rcu_dereference()
> > or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
> > under the rcu read lock instead of disabling interrupts.
> 
> I might be wrong, but my understanding is that an RCU grace period
> cannot complete while a CPU has IRQs disabled. So as a first step,
> GUP-fast might continue to disable interrupts to make walking the
> page directories safe.

Yes, this is right. PPC already uses RCU for the TLB flush and the
GUP-fast trick is safe against that.

The comments for PPC say the downside of RCU is having to do an
allocation in paths that really don't want to fail on memory
exhaustion.

pagewalk.c needs to call its ops in a sleepable context; otherwise
it could just use the normal page table locks. Not sure RCU could be
fit in here?

Jason



* Re: synchronize_rcu in munmap?
  2021-02-09 17:38     ` Jason Gunthorpe
@ 2021-02-09 19:58       ` Matthew Wilcox
  2021-02-10 16:42         ` Jason Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2021-02-09 19:58 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Laurent Dufour, linux-mm, Liam R. Howlett, Paul McKenney

On Tue, Feb 09, 2021 at 01:38:22PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 09, 2021 at 06:19:35PM +0100, Laurent Dufour wrote:
> > Le 09/02/2021 à 15:29, Matthew Wilcox a écrit :
> > > On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> > > > Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> > > > saying:
> > > >          mmap_assert_locked(walk.mm);
> > > > which made me realise that smaps is also going to walk the page tables.
> > > > So the page tables have to be pinned by the existence of the VMA.
> > > > Which means the page tables must be freed by the same RCU callback that
> > > > frees the VMA.  But doing that means that a task which calls mmap();
> > > > munmap(); mmap(); must avoid allocating the same address for the second
> > > > mmap (until the RCU grace period has elapsed), otherwise threads on
> > > > other CPUs may see the stale PTEs instead of the new ones.
> > > > 
> > > > Solution 1: Move the page table freeing into the RCU callback, call
> > > > synchronize_rcu() in munmap().
> > > > 
> > > > Solution 2: Refcount the VMA and free the page tables on refcount
> > > > dropping to zero.  This doesn't actually work because the stale PTE
> > > > problem still exists.
> > > > 
> > > > Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> > > > maple tree, put a "dead" entry in its place.  Once the RCU freeing and the
> > > > TLB shootdown has happened, erase the entry and it can then be allocated.
> > > > If we do that MAP_FIXED will have to synchronize_rcu() if it overlaps
> > > > a dead entry.
> > > 
> > > Solution 4: RCU free the page table pages and teach pagewalk.c to
> > > be RCU-safe.  That means that it will have to use rcu_dereference()
> > > or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
> > > under the rcu read lock instead of disabling interrupts.
> > 
> > I might be wrong, but my understanding is that an RCU grace period
> > cannot complete while a CPU has IRQs disabled. So as a first step,
> > GUP-fast might continue to disable interrupts to make walking the
> > page directories safe.
> 
> Yes, this is right. PPC already uses RCU for the TLB flush and the
> GUP-fast trick is safe against that.
> 
> The comments for PPC say the downside of RCU is having to do an
> allocation in paths that really don't want to fail on memory
> exhaustion.
> 
> pagewalk.c needs to call its ops in a sleepable context; otherwise
> it could just use the normal page table locks. Not sure RCU could be
> fit in here?

Depends on the caller of walk_page_*() whether the ops need to sleep
or not.  The specific problem we're trying to solve here is avoiding
taking the mmap_sem in /proc/$pid/smaps.  Now, we could just disable
interrupts instead of taking the mmap_sem, but I was hoping to do better.

So let's call that Solution 5:
 - smaps disables interrupts while calling pagewalk.
 - pagewalk accepts that it can be called locklessly (uses
   ptep_get_lockless() and so on)
 - smaps figures out how to handle races with khugepaged
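The middle step is essentially what GUP-fast already does for PTEs; a
sketch of a lockless leaf step in pagewalk.c (ptep_get_lockless()
exists to handle configurations like 32-bit PAE, where a PTE cannot
be loaded in a single atomic read):

```c
/* Sketch: lockless PTE read in a pagewalk leaf.  The ops callback
 * gets a snapshot; the real entry may already have changed. */
pte_t *ptep = pte_offset_map(pmdp, addr);
pte_t pte = ptep_get_lockless(ptep);

if (pte_present(pte))
	err = ops->pte_entry(&pte, addr, addr + PAGE_SIZE, walk);
pte_unmap(ptep);
```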



* Re: synchronize_rcu in munmap?
  2021-02-09 19:58       ` Matthew Wilcox
@ 2021-02-10 16:42         ` Jason Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 16:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Laurent Dufour, linux-mm, Liam R. Howlett, Paul McKenney

On Tue, Feb 09, 2021 at 07:58:17PM +0000, Matthew Wilcox wrote:

> > pagewalk.c needs to call its ops in a sleepable context; otherwise
> > it could just use the normal page table locks. Not sure RCU could
> > be fit in here?
> 
> Depends on the caller of walk_page_*() whether the ops need to sleep
> or not.

We could create a non-sleeping-op version of walk_page that uses the
PTL locks; that would avoid the need for the mmap_sem in places that
can tolerate non-sleeping ops. Page tables can't be freed while the
PTL spinlocks are held.
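Roughly (a sketch; error handling and the pmd-level details are
omitted):

```c
/* Sketch of a non-sleeping leaf walk: holding the PTE-level lock
 * pins the page table page, because a table is only freed after
 * its entries are cleared under this same lock. */
spinlock_t *ptl;
pte_t *start_pte, *ptep;
int err = 0;

start_pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
for (ptep = start_pte; addr < end; addr += PAGE_SIZE, ptep++) {
	err = ops->pte_entry(ptep, addr, addr + PAGE_SIZE, walk);
	if (err)
		break;
}
pte_unmap_unlock(start_pte, ptl);
```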

This is certainly easier to reason about than trying to understand if
races from having no locks are OK or not.

> The specific problem we're trying to solve here is avoiding
> taking the mmap_sem in /proc/$pid/smaps.

I thought you were trying to remove the mmap sem around the VMA
related things?

The secondary role the mmap_sem has for the page table itself is
unrelated to the VMA.

To be compatible with pagewalk.c's no-PTL-locks design, anything that
calls tlb_finish_mmu() with freed_tables=1 also has to hold the
write side of the mmap_sem. This ensures that the page table memory
cannot be freed under the read side of the mmap_sem.

If you delete the mmap_sem as a VMA lock we can still keep this idea
by adding a new rwsem, as in the sketch below. If someone thinks of a
smarter way to serialize this later, then it is at least clearly
documented.

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 03c33c93a582b9..98ee4d0d9416d3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -287,6 +287,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	inc_tlb_flush_pending(tlb->mm);
 }
 
+#define page_table_memory_write_lock(mm) mmap_assert_write_locked(mm)
+#define page_table_memory_write_unlock(mm)
+
 /**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
@@ -299,6 +302,16 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
+	if (tlb->freed_tables) {
+		/*
+		 * Page table levels to be freed are now removed from the page
+		 * table itself and on the free list in the mmu_gather. Exclude
+		 * any readers of this memory before we progress to freeing.
+		 */
+		page_table_memory_write_lock(tlb->mm);
+		page_table_memory_write_unlock(tlb->mm);
+	}
+
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
 	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f17706..c190565ee0b404 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -311,6 +311,15 @@ static int walk_page_test(unsigned long start, unsigned long end,
 	return 0;
 }
 
+/*
+ * Write side of the page_table_memory is held across any place that will kfree
+ * a page table level, eg calls to tlb_finish_mmu() where struct mmu_gather
+ * freed_tables = 1. It is a sleepable alternative to the page table spin locks
+ * that allows semi-lockless reading.
+ */
+#define page_table_memory_read_lock(mm) mmap_assert_locked(mm)
+#define page_table_memory_read_unlock(mm)
+
 static int __walk_page_range(unsigned long start, unsigned long end,
 			struct mm_walk *walk)
 {
@@ -324,12 +333,16 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
+	page_table_memory_read_lock(walk->mm);
+
 	if (vma && is_vm_hugetlb_page(vma)) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
 		err = walk_pgd_range(start, end, walk);
 
+	page_table_memory_read_unlock(walk->mm);
+
 	if (vma && ops->post_vma)
 		ops->post_vma(walk);


