linux-mm.kvack.org archive mirror
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>,
	linux-mm@kvack.org, "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Paul McKenney <paulmckrcu@fb.com>
Subject: Re: synchronize_rcu in munmap?
Date: Wed, 10 Feb 2021 12:42:52 -0400	[thread overview]
Message-ID: <20210210164252.GM4718@ziepe.ca> (raw)
In-Reply-To: <20210209195817.GZ308988@casper.infradead.org>

On Tue, Feb 09, 2021 at 07:58:17PM +0000, Matthew Wilcox wrote:

> > The pagewalk.c needs to call its ops in a sleepable context, otherwise
> > it could just use the normal page table locks.. Not sure RCU could be
> > fit into here?
> 
> Depends on the caller of walk_page_*() whether the ops need to sleep
> or not.

We could create a non-sleeping-op version of walk_page_*() that uses
the PTL locks; that would avoid needing the mmap_sem in places that
can tolerate non-sleeping ops. Page tables can't be freed while the
PTL spinlocks are held.

This is certainly easier to reason about than trying to understand if
races from having no locks are OK or not.

> The specific problem we're trying to solve here is avoiding
> taking the mmap_sem in /proc/$pid/smaps.

I thought you were trying to remove the mmap_sem around the
VMA-related things?

The secondary role the mmap_sem has for the page tables themselves is
unrelated to the VMAs.

To be compatible with pagewalk.c's no-PTL-locks design, anything that
calls tlb_finish_mmu() with freed_tables=1 also has to hold the write
side of the mmap_sem. This ensures that the page table memory cannot
be freed under the read side of the mmap_sem.

If you delete the mmap_sem as a VMA lock we can still keep this idea
by adding a new rwsem, ending up with something like the below. If
someone later thinks of a smarter way to serialize this, it is at
least clearly documented.

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 03c33c93a582b9..98ee4d0d9416d3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -287,6 +287,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	inc_tlb_flush_pending(tlb->mm);
 }
 
+#define page_table_memory_write_lock(mm) mmap_assert_write_locked(mm)
+#define page_table_memory_write_unlock(mm)
+
 /**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
@@ -299,6 +302,16 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 void tlb_finish_mmu(struct mmu_gather *tlb,
 		unsigned long start, unsigned long end)
 {
+	if (tlb->freed_tables) {
+		/*
+		 * Page table levels to be freed are now removed from the page
+		 * table itself and on the free list in the mmu_gather. Exclude
+		 * any readers of this memory before we progress to freeing.
+		 */
+		page_table_memory_write_lock(tlb->mm);
+		page_table_memory_write_unlock(tlb->mm);
+	}
+
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
 	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f17706..c190565ee0b404 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -311,6 +311,15 @@ static int walk_page_test(unsigned long start, unsigned long end,
 	return 0;
 }
 
+/*
+ * Write side of the page_table_memory is held across any place that will kfree
+ * a page table level, eg calls to tlb_finish_mmu() where struct mmu_gather
+ * freed_tables = 1. It is a sleepable alternative to the page table spin locks
+ * that allows semi-lockless reading.
+ */
+#define page_table_memory_read_lock(mm) mmap_assert_locked(mm)
+#define page_table_memory_read_unlock(mm)
+
 static int __walk_page_range(unsigned long start, unsigned long end,
 			struct mm_walk *walk)
 {
@@ -324,12 +333,16 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
+	page_table_memory_read_lock(walk->mm);
+
 	if (vma && is_vm_hugetlb_page(vma)) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
 		err = walk_pgd_range(start, end, walk);
 
+	page_table_memory_read_unlock(walk->mm);
+
 	if (vma && ops->post_vma)
 		ops->post_vma(walk);


Thread overview: 7+ messages
2021-02-08 13:26 synchronize_rcu in munmap? Matthew Wilcox
2021-02-09 14:29 ` Matthew Wilcox
2021-02-09 17:19   ` Laurent Dufour
2021-02-09 17:38     ` Jason Gunthorpe
2021-02-09 19:58       ` Matthew Wilcox
2021-02-10 16:42         ` Jason Gunthorpe [this message]
2021-02-09 17:08 ` Jason Gunthorpe
