From: Minchan Kim <minchan@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Miguel de Dios <migueldedios@google.com>,
	Wei Wang <wvw@google.com>, Johannes Weiner <hannes@cmpxchg.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [PATCH] mm: release the spinlock on zap_pte_range
Date: Wed, 31 Jul 2019 15:14:40 +0900	[thread overview]
Message-ID: <20190731061440.GC155569@google.com> (raw)
In-Reply-To: <20190730124207.da70f92f19dc021bf052abd0@linux-foundation.org>

On Tue, Jul 30, 2019 at 12:42:07PM -0700, Andrew Morton wrote:
> On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > >  	flush_tlb_batched_pending(mm);
> > > >  	arch_enter_lazy_mmu_mode();
> > > >  	do {
> > > > -		pte_t ptent = *pte;
> > > > +		pte_t ptent;
> > > > +
> > > > +		if (progress >= 32) {
> > > > +			progress = 0;
> > > > +			if (need_resched())
> > > > +				break;
> > > > +		}
> > > > +		progress += 8;
> > > 
> > > Why 8?
> > 
> > Just copied from copy_pte_range.
> 
> copy_pte_range() does
> 
> 		if (pte_none(*src_pte)) {
> 			progress++;
> 			continue;
> 		}
> 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> 							vma, addr, rss);
> 		if (entry.val)
> 			break;
> 		progress += 8;
> 
> which appears to be an attempt to balance the cost of copy_one_pte()
> against the cost of not calling copy_one_pte().
> 

Indeed.

> Your code doesn't do this balancing and hence can be simpler.

Based on the way copy_pte_range balances the cost of copy_one_pte, it seems
we should balance the cost of mark_page_accessed against the cost of not
calling mark_page_accessed. IOW, add 8 only when mark_page_accessed is called.

However, an individual mark_page_accessed call is not that heavy since it
uses a pagevec, and the caller cannot know whether the target page will be
activated or will merely get PG_referenced, which is cheap. Thus, I agree,
let's not make it complicated.
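
For reference, such a balancing variant might look roughly like the sketch
below inside the zap loop (hypothetical; the patch below deliberately keeps
the simpler fixed-weight counter):

	do {
		if (progress >= 32) {
			progress = 0;
			if (need_resched())
				break;
		}

		ptent = *pte;
		if (pte_none(ptent)) {
			progress++;			/* cheap iteration */
			continue;
		}
		/* ... existing zap work for a present page ... */
		if (pte_young(ptent) &&
		    likely(!(vma->vm_flags & VM_SEQ_READ))) {
			mark_page_accessed(page);	/* may touch the LRU */
			progress += 8;			/* weight the heavy path */
		} else {
			progress++;
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);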

> 
> It all seems a bit overdesigned.  need_resched() is cheap.  It's
> possibly a mistake to check need_resched() on *every* loop because some
> crazy scheduling load might livelock us.  But surely it would be enough
> to do something like
> 
> 	if (progress++ && need_resched()) {
> 		<reschedule>
> 		progress = 0;
> 	}
> 
> and leave it at that?

Seems like this?

From bb1d7aaf520e98a6f9d988c25121602c28e12e67 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 29 Jul 2019 15:28:48 +0900
Subject: [PATCH] mm: release the spinlock on zap_pte_range

In our testing (camera recording), Miguel and Wei found that unmap_page_range
easily takes more than 6ms with preemption disabled. Looking into it, the
reason is that it holds the page table spinlock for the entire 512-page
operation within a PMD. 6.2ms is far from trivial for user experience: if an
RT task cannot run during that window, it can cause dropped frames or audio
glitches.

I had time to benchmark it by adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test device
is a 2018 premium mobile device.
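
Something like the following sketch (illustrative only, not part of the
patch) is enough to see how long the lock is held:

	/*
	 * Hypothetical instrumentation sketch: sample sched_clock() around
	 * the locked PTE walk and report how long the page table spinlock
	 * was held. Names and placement are illustrative.
	 */
	u64 start = sched_clock();

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/* ... the existing PTE loop ... */
	pte_unmap_unlock(start_pte, ptl);

	trace_printk("zap_pte_range: ptl held %llu ns\n",
		     sched_clock() - start);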

I could get a 2ms delay rather easily when releasing 2MB (i.e., 512 pages)
with the task running on a little core, even without any IPI or LRU lock
contention. That is already too heavy.

If I remove activate_page, 35-40% of zap_pte_range's overhead is gone, so
most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.

Thus, this patch adds a preemption point once every 32 iterations of the
loop.

Reported-by: Miguel de Dios <migueldedios@google.com>
Reported-by: Wei Wang <wvw@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e796372927fd..8bfcef09da674 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	int progress = 0;
 	int force_flush = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
@@ -1022,7 +1023,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
-		pte_t ptent = *pte;
+		pte_t ptent;
+
+		if (progress++ >= 32) {
+			progress = 0;
+			if (need_resched())
+				break;
+		}
+
+		ptent = *pte;
 		if (pte_none(ptent))
 			continue;
 
@@ -1123,8 +1132,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	if (force_flush) {
 		force_flush = 0;
 		tlb_flush_mmu(tlb);
-		if (addr != end)
-			goto again;
+	}
+
+	if (addr != end) {
+		progress = 0;
+		goto again;
 	}
 
 	return addr;
-- 
2.22.0.709.g102302147b-goog

