* [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
@ 2023-10-24 14:36 Aneesh Kumar K.V
  2023-10-27  9:46 ` Michael Ellerman
  2023-10-27 10:50 ` Matthew Wilcox
  0 siblings, 2 replies; 5+ messages in thread
From: Aneesh Kumar K.V @ 2023-10-24 14:36 UTC (permalink / raw)
  To: linuxppc-dev, mpe, npiggin, christophe.leroy; +Cc: Aneesh Kumar K.V, willy

With commit 9fee28baa601 ("powerpc: implement the new page table range
API") we added set_ptes to the powerpc architecture. The implementation
included calls to arch_enter/leave_lazy_mmu().

This patch removes the usage of arch_enter/leave_lazy_mmu() because
set_pte is not supposed to be used when updating an existing pte entry.
The powerpc architecture relies on this rule to skip the expensive tlb
invalidate, which is not needed when setting up a pte for the first
time. See commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for
updating _PAGE_NUMA bit") for more details.

The patch also makes sure we are not using the interface to update a
valid/present pte entry, by adding a VM_WARN_ON check on all the ptes
we are setting up. Furthermore, we add a comment to set_pte_filter to
clarify that it can only update folio-related flags and cannot filter
out pfn-specific details.
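
(For context: in the generic mm code from the same series, set_pte_at()
is defined in include/linux/pgtable.h as set_ptes() with nr == 1, so
the warning also covers the single-pte case:)

  #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)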

Removing arch_enter/leave_lazy_mmu() also avoids nesting these calls,
which is not supported. For example:

remap_pte_range()
  -> arch_enter_lazy_mmu()
  -> set_ptes()
      -> arch_enter_lazy_mmu()
      -> arch_leave_lazy_mmu()
  -> arch_leave_lazy_mmu()
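
A minimal sketch of that pattern (the helper name is made up for
illustration; simplified from what callers like remap_pte_range() do):

  static void remap_range_sketch(struct mm_struct *mm, unsigned long addr,
                                 pte_t *ptep, pte_t pte, unsigned int nr)
  {
          arch_enter_lazy_mmu_mode();        /* outer enter, by the caller */
          set_ptes(mm, addr, ptep, pte, nr); /* used to enter/leave again  */
          arch_leave_lazy_mmu_mode();        /* outer leave */
  }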

Fixes: 9fee28baa601 ("powerpc: implement the new page table range API")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/pgtable.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 3ba9fe411604..4d69bfb9bc11 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -104,6 +104,8 @@ static pte_t set_pte_filter_hash(pte_t pte) { return pte; }
 /* Embedded type MMU with HW exec support. This is a bit more complicated
  * as we don't have two bits to spare for _PAGE_EXEC and _PAGE_HWEXEC so
  * instead we "filter out" the exec permission for non clean pages.
+ *
+ * This is also called once per folio, so only work with folio->flags here.
  */
 static inline pte_t set_pte_filter(pte_t pte)
 {
@@ -190,29 +192,39 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
 void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		pte_t pte, unsigned int nr)
 {
-	/*
-	 * Make sure hardware valid bit is not set. We don't do
-	 * tlb flush for this update.
-	 */
-	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
 
 	/* Note: mm->context.id might not yet have been assigned as
 	 * this context might not have been activated yet when this
-	 * is called.
+	 * is called. Filter the pte value and use the filtered value
+	 * to set up all the ptes in the range.
 	 */
 	pte = set_pte_filter(pte);
 
-	/* Perform the setting of the PTE */
-	arch_enter_lazy_mmu_mode();
+	/*
+	 * We don't need to call arch_enter/leave_lazy_mmu_mode()
+	 * because we expect set_ptes to only be used on ptes that are
+	 * not present and not hw_valid. Hence there is no translation
+	 * cache flush involved that needs to be batched.
+	 */
 	for (;;) {
+
+		/*
+		 * Make sure hardware valid bit is not set. We don't do
+		 * tlb flush for this update.
+		 */
+		VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
+
+		/* Perform the setting of the PTE */
 		__set_pte_at(mm, addr, ptep, pte, 0);
 		if (--nr == 0)
 			break;
 		ptep++;
-		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
 		addr += PAGE_SIZE;
+		/*
+		 * increment the pfn.
+		 */
+		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
 	}
-	arch_leave_lazy_mmu_mode();
 }
 
 void unmap_kernel_page(unsigned long va)
-- 
2.41.0



* Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  2023-10-24 14:36 [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes Aneesh Kumar K.V
@ 2023-10-27  9:46 ` Michael Ellerman
  2023-10-27 10:50 ` Matthew Wilcox
  1 sibling, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2023-10-27  9:46 UTC (permalink / raw)
  To: linuxppc-dev, npiggin, christophe.leroy, Aneesh Kumar K.V; +Cc: willy

On Tue, 24 Oct 2023 20:06:04 +0530, Aneesh Kumar K.V wrote:
> With commit 9fee28baa601 ("powerpc: implement the new page table range
> API") we added set_ptes to the powerpc architecture. The implementation
> included calls to arch_enter/leave_lazy_mmu().
> 
> This patch removes the usage of arch_enter/leave_lazy_mmu() because
> set_pte is not supposed to be used when updating an existing pte entry.
> The powerpc architecture relies on this rule to skip the expensive tlb
> invalidate, which is not needed when setting up a pte for the first
> time. See commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for
> updating _PAGE_NUMA bit") for more details.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
      https://git.kernel.org/powerpc/c/47b8def9358c5eb888e78b24b7e5b7f2e2e97b8e

cheers


* Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  2023-10-24 14:36 [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes Aneesh Kumar K.V
  2023-10-27  9:46 ` Michael Ellerman
@ 2023-10-27 10:50 ` Matthew Wilcox
  2023-11-02 11:39   ` Michael Ellerman
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2023-10-27 10:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin

On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>  		ptep++;
> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>  		addr += PAGE_SIZE;
> +		/*
> +		 * increment the pfn.
> +		 */
> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));

when i looked at this, it generated shit code.  did you check?
why did you bother changing this?


* Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  2023-10-27 10:50 ` Matthew Wilcox
@ 2023-11-02 11:39   ` Michael Ellerman
  2023-11-11 10:33     ` Christophe Leroy
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2023-11-02 11:39 UTC (permalink / raw)
  To: Matthew Wilcox, Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin

Matthew Wilcox <willy@infradead.org> writes:
> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>  		ptep++;
>> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>  		addr += PAGE_SIZE;
>> +		/*
>> +		 * increment the pfn.
>> +		 */
>> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>
> when i looked at this, it generated shit code.  did you check?

I didn't look ...

<goes and looks>

It's not super clear cut. There's some difference because pfn_pte()
contains two extra VM_BUG_ONs.

But with DEBUG_VM *off* the version using pfn_pte() generates *better*
code, or at least less code, ~160 instructions vs ~200.

For some reason the version using PTE_RPN_SHIFT seems to be byte
swapping the pte an extra two times, each of which generates ~8
instructions. But I can't see why.

I tried a few other things and couldn't come up with anything that
generated better code. But I'll keep poking at it tomorrow.
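
For reference, the two variants side by side (only the lines already in
the patch, with a note on the VM_BUG_ONs mentioned above):

  /* old: bump the raw pte value by one rpn step */
  pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));

  /* new: split into pfn + pgprot and rebuild; pfn_pte() also runs two
   * extra VM_BUG_ONs when DEBUG_VM is enabled */
  pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));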

cheers


* Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  2023-11-02 11:39   ` Michael Ellerman
@ 2023-11-11 10:33     ` Christophe Leroy
  0 siblings, 0 replies; 5+ messages in thread
From: Christophe Leroy @ 2023-11-11 10:33 UTC (permalink / raw)
  To: Michael Ellerman, Matthew Wilcox, Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin



On 02/11/2023 at 12:39, Michael Ellerman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>>   		ptep++;
>>> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>>   		addr += PAGE_SIZE;
>>> +		/*
>>> +		 * increment the pfn.
>>> +		 */
>>> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>
>> when i looked at this, it generated shit code.  did you check?
> 
> I didn't look ...
> 
> <goes and looks>
> 
> It's not super clear cut. There's some difference because pfn_pte()
> contains two extra VM_BUG_ONs.
> 
> But with DEBUG_VM *off* the version using pfn_pte() generates *better*
> code, or at least less code, ~160 instructions vs ~200.
> 
> For some reason the version using PTE_RPN_SHIFT seems to be byte
> swapping the pte an extra two times, each of which generates ~8
> instructions. But I can't see why.
> 
> I tried a few other things and couldn't come up with anything that
> generated better code. But I'll keep poking at it tomorrow.

On PPC32 the version using PTE_RPN_SHIFT is better. Here is what the
main loop of set_ptes() looks like:

  22c:	55 29 f0 be 	srwi    r9,r9,2
  230:	7d 29 03 a6 	mtctr   r9
  234:	39 3f 10 00 	addi    r9,r31,4096
  238:	39 1f 20 00 	addi    r8,r31,8192
  23c:	39 5f 30 00 	addi    r10,r31,12288
  240:	3b ff 40 00 	addi    r31,r31,16384
  244:	91 3e 00 04 	stw     r9,4(r30)
  248:	91 1e 00 08 	stw     r8,8(r30)
  24c:	91 5e 00 0c 	stw     r10,12(r30)
  250:	97 fe 00 10 	stwu    r31,16(r30)
  254:	42 00 ff e0 	bdnz    234 <set_ptes+0x78>

With the version using pfn_pte(), the main loop is:

  218:	54 e9 f8 7e 	srwi    r9,r7,1
  21c:	7d 29 03 a6 	mtctr   r9
  220:	57 e9 00 26 	clrrwi  r9,r31,12
  224:	39 29 10 00 	addi    r9,r9,4096
  228:	57 ff 05 3e 	clrlwi  r31,r31,20
  22c:	7d 29 fb 78 	or      r9,r9,r31
  230:	55 3f 00 26 	clrrwi  r31,r9,12
  234:	3b ff 10 00 	addi    r31,r31,4096
  238:	55 28 05 3e 	clrlwi  r8,r9,20
  23c:	7f ff 43 78 	or      r31,r31,r8
  240:	91 3d 00 04 	stw     r9,4(r29)
  244:	93 fd 00 08 	stw     r31,8(r29)
  248:	3b bd 00 08 	addi    r29,r29,8
  24c:	42 00 ff d4 	bdnz    220 <set_ptes+0x64>

Not only is the loop bigger, it is also only unrolled by 2, while the
first one is unrolled by 4 (r7 and r9 contain the same value).

Therefore, although the PTE_RPN_SHIFT version is 87 instructions while
the other one is only 81, the former looks better.
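
(A sketch of why the raw-add form unrolls so nicely, assuming, as in
the first listing above, that one 1UL << PTE_RPN_SHIFT step advances
the pte by exactly one page: the pte becomes a plain induction
variable, so each store of an unrolled iteration needs only an
independent addi + stw pair. The loop is shown for illustration only;
it is not the actual set_ptes() body:)

  for (i = 0; i < nr; i++)
          ptep[i] = __pte(pte_val(pte) +
                          ((unsigned long)i << PTE_RPN_SHIFT));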

Christophe
