* [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault
@ 2020-07-07 18:54 Yang Shi
  2020-07-08  8:00 ` Will Deacon
  0 siblings, 1 reply; 5+ messages in thread

From: Yang Shi @ 2020-07-07 18:54 UTC (permalink / raw)
To: hannes, catalin.marinas, will.deacon, akpm
Cc: yang.shi, xuyu, linux-mm, linux-kernel, linux-arm-kernel

Recently we found a regression when running the will_it_scale/page_fault3
test on arm64: over 70% down for the multi-process cases and over 20% down
for the multi-thread cases. It turns out the regression is caused by commit
89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
calling balance_dirty_pages() in write fault").

The test mmaps a memory-sized file and then writes to the mapping; this
makes all memory dirty and triggers dirty-page throttling, and with that
commit the fault handler releases mmap_sem and retries the page fault.
The retried page fault sees the correct PTEs installed by the first try,
then updates the access flags and flushes TLBs. The regression is caused
by the excessive TLB flushes. This is fine on x86 since x86 doesn't need
a TLB flush for an access flag update.

A page fault may be retried due to:
1. Waiting for page readahead
2. Waiting for a page to be swapped in
3. Waiting for dirty-page throttling

The first two cases don't have PTEs set up at all, so the retried page
fault installs the PTEs and never reaches the access flag update. But
the #3 case usually has PTEs installed, so the retried page fault does
reach it. It seems unnecessary to update access flags for #3, since a
retried page fault is not a real "second access", so it sounds safe to
skip the access flag update for retried page faults.

With this fix the test results get back to normal.
Reported-by: Xu Yu <xuyu@linux.alibaba.com>
Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
I'm not sure if this is safe for non-x86 machines; we did some tests on
arm64, but there may still be corner cases not covered.

 mm/memory.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87ec87c..3d4e671 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (vmf->flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
-		entry = pte_mkdirty(entry);
 	}
+
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
+		entry = pte_mkdirty(entry);
+	else if (vmf->flags & FAULT_FLAG_TRIED)
+		goto unlock;
+
 	entry = pte_mkyoung(entry);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
 				  vmf->flags & FAULT_FLAG_WRITE)) {
--
1.8.3.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault
  2020-07-07 18:54 [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault Yang Shi
@ 2020-07-08  8:00 ` Will Deacon
  2020-07-08 16:40   ` Yang Shi
  0 siblings, 1 reply; 5+ messages in thread

From: Will Deacon @ 2020-07-08 8:00 UTC (permalink / raw)
To: Yang Shi
Cc: hannes, catalin.marinas, will.deacon, akpm, xuyu, linux-mm,
    linux-kernel, linux-arm-kernel

On Wed, Jul 08, 2020 at 02:54:32AM +0800, Yang Shi wrote:
> Recently we found a regression when running the will_it_scale/page_fault3
> test on arm64: over 70% down for the multi-process cases and over 20% down
> for the multi-thread cases. It turns out the regression is caused by commit
> 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
> calling balance_dirty_pages() in write fault").
>
> The test mmaps a memory-sized file and then writes to the mapping; this
> makes all memory dirty and triggers dirty-page throttling, and with that
> commit the fault handler releases mmap_sem and retries the page fault.
> The retried page fault sees the correct PTEs installed by the first try,
> then updates the access flags and flushes TLBs. The regression is caused
> by the excessive TLB flushes. This is fine on x86 since x86 doesn't need
> a TLB flush for an access flag update.
>
> A page fault may be retried due to:
> 1. Waiting for page readahead
> 2. Waiting for a page to be swapped in
> 3. Waiting for dirty-page throttling
>
> The first two cases don't have PTEs set up at all, so the retried page
> fault installs the PTEs and never reaches the access flag update. But
> the #3 case usually has PTEs installed, so the retried page fault does
> reach it. It seems unnecessary to update access flags for #3, since a
> retried page fault is not a real "second access", so it sounds safe to
> skip the access flag update for retried page faults.
>
> With this fix the test results get back to normal.
>
> Reported-by: Xu Yu <xuyu@linux.alibaba.com>
> Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
> Tested-by: Xu Yu <xuyu@linux.alibaba.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> I'm not sure if this is safe for non-x86 machines; we did some tests on
> arm64, but there may still be corner cases not covered.
>
>  mm/memory.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87ec87c..3d4e671 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
>  		if (!pte_write(entry))
>  			return do_wp_page(vmf);
> -		entry = pte_mkdirty(entry);
>  	}
> +
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
> +		entry = pte_mkdirty(entry);
> +	else if (vmf->flags & FAULT_FLAG_TRIED)
> +		goto unlock;
> +

Can you rewrite this as:

	if (vmf->flags & FAULT_FLAG_TRIED)
		goto unlock;

	if (vmf->flags & FAULT_FLAG_WRITE)
		entry = pte_mkdirty(entry);

? (I'm half-asleep this morning and there are people screaming and
shouting outside my window, so this might be rubbish)

If you _can_ make that change, then I don't understand why the existing
pte_mkdirty() line needs to move at all. Couldn't you just add:

	if (vmf->flags & FAULT_FLAG_TRIED)
		goto unlock;

after the existing "vmf->flags & FAULT_FLAG_WRITE" block?

Will

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault
  2020-07-08  8:00 ` Will Deacon
@ 2020-07-08 16:40   ` Yang Shi
  2020-07-08 17:29     ` Catalin Marinas
  0 siblings, 1 reply; 5+ messages in thread

From: Yang Shi @ 2020-07-08 16:40 UTC (permalink / raw)
To: Will Deacon
Cc: hannes, catalin.marinas, will.deacon, akpm, xuyu, linux-mm,
    linux-kernel, linux-arm-kernel

On 7/8/20 1:00 AM, Will Deacon wrote:
> On Wed, Jul 08, 2020 at 02:54:32AM +0800, Yang Shi wrote:
>> Recently we found a regression when running the will_it_scale/page_fault3
>> test on arm64: over 70% down for the multi-process cases and over 20% down
>> for the multi-thread cases. It turns out the regression is caused by commit
>> 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
>> calling balance_dirty_pages() in write fault").
>>
>> The test mmaps a memory-sized file and then writes to the mapping; this
>> makes all memory dirty and triggers dirty-page throttling, and with that
>> commit the fault handler releases mmap_sem and retries the page fault.
>> The retried page fault sees the correct PTEs installed by the first try,
>> then updates the access flags and flushes TLBs. The regression is caused
>> by the excessive TLB flushes. This is fine on x86 since x86 doesn't need
>> a TLB flush for an access flag update.
>>
>> A page fault may be retried due to:
>> 1. Waiting for page readahead
>> 2. Waiting for a page to be swapped in
>> 3. Waiting for dirty-page throttling
>>
>> The first two cases don't have PTEs set up at all, so the retried page
>> fault installs the PTEs and never reaches the access flag update. But
>> the #3 case usually has PTEs installed, so the retried page fault does
>> reach it. It seems unnecessary to update access flags for #3, since a
>> retried page fault is not a real "second access", so it sounds safe to
>> skip the access flag update for retried page faults.
>>
>> With this fix the test results get back to normal.
>>
>> Reported-by: Xu Yu <xuyu@linux.alibaba.com>
>> Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
>> Tested-by: Xu Yu <xuyu@linux.alibaba.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> I'm not sure if this is safe for non-x86 machines; we did some tests on
>> arm64, but there may still be corner cases not covered.
>>
>>  mm/memory.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 87ec87c..3d4e671 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>  	if (vmf->flags & FAULT_FLAG_WRITE) {
>>  		if (!pte_write(entry))
>>  			return do_wp_page(vmf);
>> -		entry = pte_mkdirty(entry);
>>  	}
>> +
>> +	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
>> +		entry = pte_mkdirty(entry);
>> +	else if (vmf->flags & FAULT_FLAG_TRIED)
>> +		goto unlock;
>> +
> Can you rewrite this as:
>
> 	if (vmf->flags & FAULT_FLAG_TRIED)
> 		goto unlock;
>
> 	if (vmf->flags & FAULT_FLAG_WRITE)
> 		entry = pte_mkdirty(entry);

Yes, it does the same.

> ? (I'm half-asleep this morning and there are people screaming and
> shouting outside my window, so this might be rubbish)
>
> If you _can_ make that change, then I don't understand why the existing
> pte_mkdirty() line needs to move at all. Couldn't you just add:
>
> 	if (vmf->flags & FAULT_FLAG_TRIED)
> 		goto unlock;
>
> after the existing "vmf->flags & FAULT_FLAG_WRITE" block?

The intention is to not set the dirty bit if this is a retried page
fault, since the bit should already be set by the first try. And I'm not
quite sure whether the TLB needs to be flushed on non-x86 when the dirty
bit is set. If that is unnecessary, then the above change does make
sense.

> Will

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault
  2020-07-08 16:40 ` Yang Shi
@ 2020-07-08 17:29   ` Catalin Marinas
  2020-07-08 18:13     ` Yang Shi
  0 siblings, 1 reply; 5+ messages in thread

From: Catalin Marinas @ 2020-07-08 17:29 UTC (permalink / raw)
To: Yang Shi
Cc: Will Deacon, hannes, will.deacon, akpm, xuyu, linux-mm,
    linux-kernel, linux-arm-kernel

On Wed, Jul 08, 2020 at 09:40:11AM -0700, Yang Shi wrote:
> On 7/8/20 1:00 AM, Will Deacon wrote:
> > On Wed, Jul 08, 2020 at 02:54:32AM +0800, Yang Shi wrote:
> > > Recently we found a regression when running the will_it_scale/page_fault3
> > > test on arm64: over 70% down for the multi-process cases and over 20% down
> > > for the multi-thread cases. It turns out the regression is caused by commit
> > > 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
> > > calling balance_dirty_pages() in write fault").
> > >
> > > The test mmaps a memory-sized file and then writes to the mapping; this
> > > makes all memory dirty and triggers dirty-page throttling, and with that
> > > commit the fault handler releases mmap_sem and retries the page fault.
> > > The retried page fault sees the correct PTEs installed by the first try,
> > > then updates the access flags and flushes TLBs. The regression is caused
> > > by the excessive TLB flushes. This is fine on x86 since x86 doesn't need
> > > a TLB flush for an access flag update.
> > >
> > > A page fault may be retried due to:
> > > 1. Waiting for page readahead
> > > 2. Waiting for a page to be swapped in
> > > 3. Waiting for dirty-page throttling
> > >
> > > The first two cases don't have PTEs set up at all, so the retried page
> > > fault installs the PTEs and never reaches the access flag update. But
> > > the #3 case usually has PTEs installed, so the retried page fault does
> > > reach it.

Is this the access flag or the dirty flag? On arm64 we distinguish
between the two. Setting the access flag on arm64 doesn't need TLB
flushing since an inaccessible entry is not allowed to be cached in the
TLB. However, setting the dirty bit (clearing read-only on arm64) does
require a TLB flush, and ptep_set_access_flags() takes care of this.

> > > It seems unnecessary to update access flags for #3, since a
> > > retried page fault is not a real "second access", so it sounds safe to
> > > skip the access flag update for retried page faults.
> > >
> > > With this fix the test results get back to normal.
> > >
> > > Reported-by: Xu Yu <xuyu@linux.alibaba.com>
> > > Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
> > > Tested-by: Xu Yu <xuyu@linux.alibaba.com>
> > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > ---
> > > I'm not sure if this is safe for non-x86 machines; we did some tests on
> > > arm64, but there may still be corner cases not covered.
> > >
> > >  mm/memory.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 87ec87c..3d4e671 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > >  	if (vmf->flags & FAULT_FLAG_WRITE) {
> > >  		if (!pte_write(entry))
> > >  			return do_wp_page(vmf);
> > > -		entry = pte_mkdirty(entry);
> > >  	}
> > > +
> > > +	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
> > > +		entry = pte_mkdirty(entry);
> > > +	else if (vmf->flags & FAULT_FLAG_TRIED)
> > > +		goto unlock;
> > > +
> > Can you rewrite this as:
> >
> > 	if (vmf->flags & FAULT_FLAG_TRIED)
> > 		goto unlock;
> >
> > 	if (vmf->flags & FAULT_FLAG_WRITE)
> > 		entry = pte_mkdirty(entry);
>
> Yes, it does the same.
>
> > ? (I'm half-asleep this morning and there are people screaming and
> > shouting outside my window, so this might be rubbish)
> >
> > If you _can_ make that change, then I don't understand why the existing
> > pte_mkdirty() line needs to move at all. Couldn't you just add:
> >
> > 	if (vmf->flags & FAULT_FLAG_TRIED)
> > 		goto unlock;
> >
> > after the existing "vmf->flags & FAULT_FLAG_WRITE" block?
>
> The intention is to not set the dirty bit if this is a retried page
> fault, since the bit should already be set by the first try. And I'm not
> quite sure whether the TLB needs to be flushed on non-x86 when the dirty
> bit is set. If that is unnecessary, then the above change does make
> sense.

It is necessary on arm32/arm64 since pte_mkdirty() clears the read-only
bit.

But do we have a guarantee that every time handle_mm_fault() returns
VM_FAULT_RETRY, the pte has already been updated?

--
Catalin

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault
  2020-07-08 17:29 ` Catalin Marinas
@ 2020-07-08 18:13   ` Yang Shi
  0 siblings, 0 replies; 5+ messages in thread

From: Yang Shi @ 2020-07-08 18:13 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, hannes, will.deacon, akpm, xuyu, linux-mm,
    linux-kernel, linux-arm-kernel

On 7/8/20 10:29 AM, Catalin Marinas wrote:
> On Wed, Jul 08, 2020 at 09:40:11AM -0700, Yang Shi wrote:
>> On 7/8/20 1:00 AM, Will Deacon wrote:
>>> On Wed, Jul 08, 2020 at 02:54:32AM +0800, Yang Shi wrote:
>>>> Recently we found a regression when running the will_it_scale/page_fault3
>>>> test on arm64: over 70% down for the multi-process cases and over 20% down
>>>> for the multi-thread cases. It turns out the regression is caused by commit
>>>> 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
>>>> calling balance_dirty_pages() in write fault").
>>>>
>>>> The test mmaps a memory-sized file and then writes to the mapping; this
>>>> makes all memory dirty and triggers dirty-page throttling, and with that
>>>> commit the fault handler releases mmap_sem and retries the page fault.
>>>> The retried page fault sees the correct PTEs installed by the first try,
>>>> then updates the access flags and flushes TLBs. The regression is caused
>>>> by the excessive TLB flushes. This is fine on x86 since x86 doesn't need
>>>> a TLB flush for an access flag update.
>>>>
>>>> A page fault may be retried due to:
>>>> 1. Waiting for page readahead
>>>> 2. Waiting for a page to be swapped in
>>>> 3. Waiting for dirty-page throttling
>>>>
>>>> The first two cases don't have PTEs set up at all, so the retried page
>>>> fault installs the PTEs and never reaches the access flag update. But
>>>> the #3 case usually has PTEs installed, so the retried page fault does
>>>> reach it.
> Is this the access flag or the dirty flag? On arm64 we distinguish
> between the two. Setting the access flag on arm64 doesn't need TLB
> flushing since an inaccessible entry is not allowed to be cached in the
> TLB. However, setting the dirty bit (clearing read-only on arm64) does
> require a TLB flush, and ptep_set_access_flags() takes care of this.

I think it is the dirty bit, if updating the access bit doesn't need a
TLB flush.

>
>>>> With this fix the test results get back to normal.
>>>>
>>>> Reported-by: Xu Yu <xuyu@linux.alibaba.com>
>>>> Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
>>>> Tested-by: Xu Yu <xuyu@linux.alibaba.com>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> ---
>>>> I'm not sure if this is safe for non-x86 machines; we did some tests on
>>>> arm64, but there may still be corner cases not covered.
>>>>
>>>>  mm/memory.c | 7 ++++++-
>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 87ec87c..3d4e671 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>  	if (vmf->flags & FAULT_FLAG_WRITE) {
>>>>  		if (!pte_write(entry))
>>>>  			return do_wp_page(vmf);
>>>> -		entry = pte_mkdirty(entry);
>>>>  	}
>>>> +
>>>> +	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
>>>> +		entry = pte_mkdirty(entry);
>>>> +	else if (vmf->flags & FAULT_FLAG_TRIED)
>>>> +		goto unlock;
>>>> +
>>> Can you rewrite this as:
>>>
>>> 	if (vmf->flags & FAULT_FLAG_TRIED)
>>> 		goto unlock;
>>>
>>> 	if (vmf->flags & FAULT_FLAG_WRITE)
>>> 		entry = pte_mkdirty(entry);
>> Yes, it does the same.
>>
>>> ? (I'm half-asleep this morning and there are people screaming and
>>> shouting outside my window, so this might be rubbish)
>>>
>>> If you _can_ make that change, then I don't understand why the existing
>>> pte_mkdirty() line needs to move at all. Couldn't you just add:
>>>
>>> 	if (vmf->flags & FAULT_FLAG_TRIED)
>>> 		goto unlock;
>>>
>>> after the existing "vmf->flags & FAULT_FLAG_WRITE" block?
>> The intention is to not set the dirty bit if this is a retried page
>> fault, since the bit should already be set by the first try. And I'm not
>> quite sure whether the TLB needs to be flushed on non-x86 when the dirty
>> bit is set. If that is unnecessary, then the above change does make
>> sense.
> It is necessary on arm32/arm64 since pte_mkdirty() clears the read-only
> bit.

Thanks for confirming this. I didn't realize pte_mkdirty() also clears
the read-only bit on arm.

>
> But do we have a guarantee that every time handle_mm_fault() returns
> VM_FAULT_RETRY, the pte has already been updated?

I think so, if I understand the code correctly. As I mentioned in the
commit log, there are 3 places which could return VM_FAULT_RETRY and
then retry the page fault. The first two cases (swap and readahead) just
return before PTEs are installed, so the retried page fault should not
reach here at all (it just finishes the handling in do_fault() or
do_swap_page() like the first try). So only dirty-page throttling may
reach here, since the throttling happens after the PTEs are set up.

The write fault handler already sets the PTEs with the dirty bit; please
see the below code snippet from alloc_set_pte():

	entry = mk_pte(page, vma->vm_page_prot);
	entry = pte_sw_mkyoung(entry);
	if (write)
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

So the retried page fault should come in with the dirty bit set. Of
course a parallel page fault may set up the PTEs, but we just need to
care about write faults. If the parallel page fault set up a writable
and dirty PTE, then the retried fault doesn't need to do anything extra.
If the parallel page fault set up a clean read-only PTE, the below code
should be executed in the retried page fault:

	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);

The retried fault should just call do_wp_page() and then return.

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-07-08 18:14 UTC | newest]

Thread overview: 5+ messages
2020-07-07 18:54 [RFC PATCH] mm: avoid access flag update TLB flush for retried page fault Yang Shi
2020-07-08  8:00 ` Will Deacon
2020-07-08 16:40   ` Yang Shi
2020-07-08 17:29     ` Catalin Marinas
2020-07-08 18:13       ` Yang Shi