[PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
@ 2021-07-20  6:55 Huang Ying
  2021-07-20 13:36 ` Zi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Huang Ying @ 2021-07-20  6:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Yang Shi, Dan Carpenter,
	Mel Gorman, Christian Borntraeger, Gerald Schaefer,
	Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Zi Yan

Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
via flush_tlb_range().

But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
handling"), the TLB flushing is done in migrate_pages() as in the
following code path anyway.

do_huge_pmd_numa_page
  migrate_misplaced_page
    migrate_pages

So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
unnecessary.  So the code is deleted in this patch to simplify the
code.  This is only code cleanup, there's no visible performance
difference.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 26 --------------------------
 1 file changed, 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..9f21e44c9030 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/*
-	 * Since we took the NUMA fault, we must have observed the !accessible
-	 * bit. Make sure all other CPUs agree with that, to avoid them
-	 * modifying the page we're about to migrate.
-	 *
-	 * Must be done under PTL such that we'll observe the relevant
-	 * inc_tlb_flush_pending().
-	 *
-	 * We are not sure a pending tlb flush here is for a huge page
-	 * mapping or not. Hence use the tlb range variant
-	 */
-	if (mm_tlb_flush_pending(vma->vm_mm)) {
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-		/*
-		 * change_huge_pmd() released the pmd lock before
-		 * invalidating the secondary MMUs sharing the primary
-		 * MMU pagetables (with ->invalidate_range()). The
-		 * mmu_notifier_invalidate_range_end() (which
-		 * internally calls ->invalidate_range()) in
-		 * change_pmd_range() will run after us, so we can't
-		 * rely on it here and we need an explicit invalidate.
-		 */
-		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
-					      haddr + HPAGE_PMD_SIZE);
-	}
-
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	page = vm_normal_page_pmd(vma, haddr, pmd);
 	if (!page)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20  6:55 [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code Huang Ying
@ 2021-07-20 13:36 ` Zi Yan
  2021-07-20 14:25 ` Christian Borntraeger
  2021-07-20 20:48 ` Yang Shi
  2 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2021-07-20 13:36 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Yang Shi, Dan Carpenter,
	Mel Gorman, Christian Borntraeger, Gerald Schaefer,
	Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik

[-- Attachment #1: Type: text/plain, Size: 3023 bytes --]

On 20 Jul 2021, at 2:55, Huang Ying wrote:

> Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> via flush_tlb_range().
>
> But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in migrate_pages() as in the
> following code path anyway.
>
> do_huge_pmd_numa_page
>   migrate_misplaced_page
>     migrate_pages
>
> So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> unnecessary.  So the code is deleted in this patch to simplify the
> code.  This is only code cleanup, there's no visible performance
> difference.

But we save a potential TLB flush here right? Maybe the if statement
is not a common case.

Anyway, LGTM, Thanks. Reviewed-by: Zi Yan <ziy@nvidia.com>

>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 26 --------------------------
>  1 file changed, 26 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index afff3ac87067..9f21e44c9030 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  		goto out;
>  	}
>
> -	/*
> -	 * Since we took the NUMA fault, we must have observed the !accessible
> -	 * bit. Make sure all other CPUs agree with that, to avoid them
> -	 * modifying the page we're about to migrate.
> -	 *
> -	 * Must be done under PTL such that we'll observe the relevant
> -	 * inc_tlb_flush_pending().
> -	 *
> -	 * We are not sure a pending tlb flush here is for a huge page
> -	 * mapping or not. Hence use the tlb range variant
> -	 */
> -	if (mm_tlb_flush_pending(vma->vm_mm)) {
> -		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> -		/*
> -		 * change_huge_pmd() released the pmd lock before
> -		 * invalidating the secondary MMUs sharing the primary
> -		 * MMU pagetables (with ->invalidate_range()). The
> -		 * mmu_notifier_invalidate_range_end() (which
> -		 * internally calls ->invalidate_range()) in
> -		 * change_pmd_range() will run after us, so we can't
> -		 * rely on it here and we need an explicit invalidate.
> -		 */
> -		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> -					      haddr + HPAGE_PMD_SIZE);
> -	}
> -
>  	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>  	page = vm_normal_page_pmd(vma, haddr, pmd);
>  	if (!page)
> -- 
> 2.30.2


—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20  6:55 [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code Huang Ying
  2021-07-20 13:36 ` Zi Yan
@ 2021-07-20 14:25 ` Christian Borntraeger
  2021-07-20 20:53   ` Yang Shi
  2021-07-20 20:48 ` Yang Shi
  2 siblings, 1 reply; 14+ messages in thread
From: Christian Borntraeger @ 2021-07-20 14:25 UTC (permalink / raw)
  To: Huang Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Yang Shi, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Zi Yan,
	Paolo Bonzini, kvm list



On 20.07.21 08:55, Huang Ying wrote:
> Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> via flush_tlb_range().
> 
> But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in migrate_pages() as in the
> following code path anyway.
> 
> do_huge_pmd_numa_page
>    migrate_misplaced_page
>      migrate_pages
> 
> So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> unnecessary.  So the code is deleted in this patch to simplify the
> code.  This is only code cleanup, there's no visible performance
> difference.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c | 26 --------------------------
>   1 file changed, 26 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index afff3ac87067..9f21e44c9030 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   		goto out;
>   	}
>   
> -	/*
> -	 * Since we took the NUMA fault, we must have observed the !accessible
> -	 * bit. Make sure all other CPUs agree with that, to avoid them
> -	 * modifying the page we're about to migrate.
> -	 *
> -	 * Must be done under PTL such that we'll observe the relevant
> -	 * inc_tlb_flush_pending().
> -	 *
> -	 * We are not sure a pending tlb flush here is for a huge page
> -	 * mapping or not. Hence use the tlb range variant
> -	 */
> -	if (mm_tlb_flush_pending(vma->vm_mm)) {
> -		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> -		/*
> -		 * change_huge_pmd() released the pmd lock before
> -		 * invalidating the secondary MMUs sharing the primary
> -		 * MMU pagetables (with ->invalidate_range()). The
> -		 * mmu_notifier_invalidate_range_end() (which
> -		 * internally calls ->invalidate_range()) in
> -		 * change_pmd_range() will run after us, so we can't
> -		 * rely on it here and we need an explicit invalidate.
> -		 */
> -		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> -					      haddr + HPAGE_PMD_SIZE);
> -	}
> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
now in migrate_pages? I am not an expert in that code, but I cant find
an equivalent mmu_notifier in migrate_misplaced_pages.
I might be totally wrong, just something that I noticed.

>   	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>   	page = vm_normal_page_pmd(vma, haddr, pmd);
>   	if (!page)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20  6:55 [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code Huang Ying
  2021-07-20 13:36 ` Zi Yan
  2021-07-20 14:25 ` Christian Borntraeger
@ 2021-07-20 20:48 ` Yang Shi
  2021-07-20 22:21   ` Yang Shi
  2 siblings, 1 reply; 14+ messages in thread
From: Yang Shi @ 2021-07-20 20:48 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Dan Carpenter, Mel Gorman, Christian Borntraeger,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Zi Yan

On Mon, Jul 19, 2021 at 11:56 PM Huang Ying <ying.huang@intel.com> wrote:
>
> Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> via flush_tlb_range().
>
> But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> handling"), the TLB flushing is done in migrate_pages() as in the
> following code path anyway.
>
> do_huge_pmd_numa_page
>   migrate_misplaced_page
>     migrate_pages
>
> So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> unnecessary.  So the code is deleted in this patch to simplify the
> code.  This is only code cleanup, there's no visible performance
> difference.

Yes, there is tlb flush in try_to_migrate(), but it seems mmu notifier
invalidate is missed for the THP migration case. I'm not quite sure
why it is not needed, maybe just missed?

So, you may need the below change too:

diff --git a/mm/rmap.c b/mm/rmap.c
index 2d29a57d29e8..e1c8b654563d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1749,6 +1749,8 @@ static bool try_to_migrate_one(struct page
*page, struct vm_area_struct *vma,
                                       !PageTransCompound(page), page);

                        set_pmd_migration_entry(&pvmw, page);
+                       mmu_notifier_invalidate_range(mm, range.start,
+                                                     range.end);
                        continue;
                }
 #endif

>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 26 --------------------------
>  1 file changed, 26 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index afff3ac87067..9f21e44c9030 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>                 goto out;
>         }
>
> -       /*
> -        * Since we took the NUMA fault, we must have observed the !accessible
> -        * bit. Make sure all other CPUs agree with that, to avoid them
> -        * modifying the page we're about to migrate.
> -        *
> -        * Must be done under PTL such that we'll observe the relevant
> -        * inc_tlb_flush_pending().
> -        *
> -        * We are not sure a pending tlb flush here is for a huge page
> -        * mapping or not. Hence use the tlb range variant
> -        */
> -       if (mm_tlb_flush_pending(vma->vm_mm)) {
> -               flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> -               /*
> -                * change_huge_pmd() released the pmd lock before
> -                * invalidating the secondary MMUs sharing the primary
> -                * MMU pagetables (with ->invalidate_range()). The
> -                * mmu_notifier_invalidate_range_end() (which
> -                * internally calls ->invalidate_range()) in
> -                * change_pmd_range() will run after us, so we can't
> -                * rely on it here and we need an explicit invalidate.
> -                */
> -               mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> -                                             haddr + HPAGE_PMD_SIZE);
> -       }
> -
>         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>         page = vm_normal_page_pmd(vma, haddr, pmd);
>         if (!page)
> --
> 2.30.2
>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20 14:25 ` Christian Borntraeger
@ 2021-07-20 20:53   ` Yang Shi
  2021-07-20 21:04     ` Zi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: Yang Shi @ 2021-07-20 20:53 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Huang Ying, Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Dan Carpenter, Mel Gorman, Gerald Schaefer, Heiko Carstens,
	Hugh Dickins, Andrea Arcangeli, Kirill A . Shutemov,
	Michal Hocko, Vasily Gorbik, Zi Yan, Paolo Bonzini, kvm list

On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
>
>
>
> On 20.07.21 08:55, Huang Ying wrote:
> > Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> > handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> > via flush_tlb_range().
> >
> > But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> > handling"), the TLB flushing is done in migrate_pages() as in the
> > following code path anyway.
> >
> > do_huge_pmd_numa_page
> >    migrate_misplaced_page
> >      migrate_pages
> >
> > So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> > unnecessary.  So the code is deleted in this patch to simplify the
> > code.  This is only code cleanup, there's no visible performance
> > difference.
> >
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> > Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> > Cc: Heiko Carstens <hca@linux.ibm.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Vasily Gorbik <gor@linux.ibm.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > ---
> >   mm/huge_memory.c | 26 --------------------------
> >   1 file changed, 26 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index afff3ac87067..9f21e44c9030 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >               goto out;
> >       }
> >
> > -     /*
> > -      * Since we took the NUMA fault, we must have observed the !accessible
> > -      * bit. Make sure all other CPUs agree with that, to avoid them
> > -      * modifying the page we're about to migrate.
> > -      *
> > -      * Must be done under PTL such that we'll observe the relevant
> > -      * inc_tlb_flush_pending().
> > -      *
> > -      * We are not sure a pending tlb flush here is for a huge page
> > -      * mapping or not. Hence use the tlb range variant
> > -      */
> > -     if (mm_tlb_flush_pending(vma->vm_mm)) {
> > -             flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> > -             /*
> > -              * change_huge_pmd() released the pmd lock before
> > -              * invalidating the secondary MMUs sharing the primary
> > -              * MMU pagetables (with ->invalidate_range()). The
> > -              * mmu_notifier_invalidate_range_end() (which
> > -              * internally calls ->invalidate_range()) in
> > -              * change_pmd_range() will run after us, so we can't
> > -              * rely on it here and we need an explicit invalidate.
> > -              */
> > -             mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> > -                                           haddr + HPAGE_PMD_SIZE);
> > -     }
> > CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
> now in migrate_pages? I am not an expert in that code, but I cant find
> an equivalent mmu_notifier in migrate_misplaced_pages.
> I might be totally wrong, just something that I noticed.

Do you mean the missed mmu notifier invalidate for the THP migration
case? Yes, I noticed that too. But I'm not sure whether it is intended
or just missed.

Zi Yan is the author for THP migration code, he may have some clue.

>
> >       pmd = pmd_modify(oldpmd, vma->vm_page_prot);
> >       page = vm_normal_page_pmd(vma, haddr, pmd);
> >       if (!page)
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20 20:53   ` Yang Shi
@ 2021-07-20 21:04     ` Zi Yan
  2021-07-20 22:19       ` Yang Shi
  0 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2021-07-20 21:04 UTC (permalink / raw)
  To: Yang Shi
  Cc: Christian Borntraeger, Huang Ying, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

[-- Attachment #1: Type: text/plain, Size: 4245 bytes --]

On 20 Jul 2021, at 16:53, Yang Shi wrote:

> On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>>
>>
>>
>> On 20.07.21 08:55, Huang Ying wrote:
>>> Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
>>> handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
>>> via flush_tlb_range().
>>>
>>> But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
>>> handling"), the TLB flushing is done in migrate_pages() as in the
>>> following code path anyway.
>>>
>>> do_huge_pmd_numa_page
>>>    migrate_misplaced_page
>>>      migrate_pages
>>>
>>> So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
>>> unnecessary.  So the code is deleted in this patch to simplify the
>>> code.  This is only code cleanup, there's no visible performance
>>> difference.
>>>
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: Dan Carpenter <dan.carpenter@oracle.com>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
>>> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>>> Cc: Heiko Carstens <hca@linux.ibm.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Vasily Gorbik <gor@linux.ibm.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> ---
>>>   mm/huge_memory.c | 26 --------------------------
>>>   1 file changed, 26 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index afff3ac87067..9f21e44c9030 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>               goto out;
>>>       }
>>>
>>> -     /*
>>> -      * Since we took the NUMA fault, we must have observed the !accessible
>>> -      * bit. Make sure all other CPUs agree with that, to avoid them
>>> -      * modifying the page we're about to migrate.
>>> -      *
>>> -      * Must be done under PTL such that we'll observe the relevant
>>> -      * inc_tlb_flush_pending().
>>> -      *
>>> -      * We are not sure a pending tlb flush here is for a huge page
>>> -      * mapping or not. Hence use the tlb range variant
>>> -      */
>>> -     if (mm_tlb_flush_pending(vma->vm_mm)) {
>>> -             flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
>>> -             /*
>>> -              * change_huge_pmd() released the pmd lock before
>>> -              * invalidating the secondary MMUs sharing the primary
>>> -              * MMU pagetables (with ->invalidate_range()). The
>>> -              * mmu_notifier_invalidate_range_end() (which
>>> -              * internally calls ->invalidate_range()) in
>>> -              * change_pmd_range() will run after us, so we can't
>>> -              * rely on it here and we need an explicit invalidate.
>>> -              */
>>> -             mmu_notifier_invalidate_range(vma->vm_mm, haddr,
>>> -                                           haddr + HPAGE_PMD_SIZE);
>>> -     }
>>> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
>> now in migrate_pages? I am not an expert in that code, but I cant find
>> an equivalent mmu_notifier in migrate_misplaced_pages.
>> I might be totally wrong, just something that I noticed.
>
> Do you mean the missed mmu notifier invalidate for the THP migration
> case? Yes, I noticed that too. But I'm not sure whether it is intended
> or just missed.

From my understand of mmu_notifier document, mmu_notifier_invalidate_range()
is needed only if the PTE is updated to point to a new page or the page pointed
by the PTE is freed. Page migration does not fall into either case.
In addition, in migrate_pages(), more specifically try_to_migrate_one(),
there is a pair of mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end() around the PTE manipulation code, which should
be sufficient to notify secondary TLBs (including KVM) about the PTE change
for page migration. Correct me if I am wrong.

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20 21:04     ` Zi Yan
@ 2021-07-20 22:19       ` Yang Shi
  2021-07-21 15:41         ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Yang Shi @ 2021-07-20 22:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: Christian Borntraeger, Huang Ying, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

On Tue, Jul 20, 2021 at 2:04 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 20 Jul 2021, at 16:53, Yang Shi wrote:
>
> > On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger
> > <borntraeger@de.ibm.com> wrote:
> >>
> >>
> >>
> >> On 20.07.21 08:55, Huang Ying wrote:
> >>> Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> >>> handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> >>> via flush_tlb_range().
> >>>
> >>> But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> >>> handling"), the TLB flushing is done in migrate_pages() as in the
> >>> following code path anyway.
> >>>
> >>> do_huge_pmd_numa_page
> >>>    migrate_misplaced_page
> >>>      migrate_pages
> >>>
> >>> So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> >>> unnecessary.  So the code is deleted in this patch to simplify the
> >>> code.  This is only code cleanup, there's no visible performance
> >>> difference.
> >>>
> >>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Cc: Yang Shi <shy828301@gmail.com>
> >>> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> >>> Cc: Mel Gorman <mgorman@suse.de>
> >>> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> >>> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> >>> Cc: Heiko Carstens <hca@linux.ibm.com>
> >>> Cc: Hugh Dickins <hughd@google.com>
> >>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>> Cc: Michal Hocko <mhocko@suse.com>
> >>> Cc: Vasily Gorbik <gor@linux.ibm.com>
> >>> Cc: Zi Yan <ziy@nvidia.com>
> >>> ---
> >>>   mm/huge_memory.c | 26 --------------------------
> >>>   1 file changed, 26 deletions(-)
> >>>
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index afff3ac87067..9f21e44c9030 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >>>               goto out;
> >>>       }
> >>>
> >>> -     /*
> >>> -      * Since we took the NUMA fault, we must have observed the !accessible
> >>> -      * bit. Make sure all other CPUs agree with that, to avoid them
> >>> -      * modifying the page we're about to migrate.
> >>> -      *
> >>> -      * Must be done under PTL such that we'll observe the relevant
> >>> -      * inc_tlb_flush_pending().
> >>> -      *
> >>> -      * We are not sure a pending tlb flush here is for a huge page
> >>> -      * mapping or not. Hence use the tlb range variant
> >>> -      */
> >>> -     if (mm_tlb_flush_pending(vma->vm_mm)) {
> >>> -             flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> >>> -             /*
> >>> -              * change_huge_pmd() released the pmd lock before
> >>> -              * invalidating the secondary MMUs sharing the primary
> >>> -              * MMU pagetables (with ->invalidate_range()). The
> >>> -              * mmu_notifier_invalidate_range_end() (which
> >>> -              * internally calls ->invalidate_range()) in
> >>> -              * change_pmd_range() will run after us, so we can't
> >>> -              * rely on it here and we need an explicit invalidate.
> >>> -              */
> >>> -             mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> >>> -                                           haddr + HPAGE_PMD_SIZE);
> >>> -     }
> >>> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
> >> now in migrate_pages? I am not an expert in that code, but I cant find
> >> an equivalent mmu_notifier in migrate_misplaced_pages.
> >> I might be totally wrong, just something that I noticed.
> >
> > Do you mean the missed mmu notifier invalidate for the THP migration
> > case? Yes, I noticed that too. But I'm not sure whether it is intended
> > or just missed.
>
> From my understand of mmu_notifier document, mmu_notifier_invalidate_range()
> is needed only if the PTE is updated to point to a new page or the page pointed
> by the PTE is freed. Page migration does not fall into either case.
> In addition, in migrate_pages(), more specifically try_to_migrate_one(),
> there is a pair of mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end() around the PTE manipulation code, which should
> be sufficient to notify secondary TLBs (including KVM) about the PTE change
> for page migration. Correct me if I am wrong.

Thanks, I think you are correct. By looking into commit 7066f0f933a1
("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
the tlb flush and mmu notifier invalidate were needed since the old
numa fault implementation didn't change PTE to migration entry so it
may cause data corruption due to the writes from GPU secondary MMU.

The refactor does use the generic migration code which converts PTE to
migration entry before copying data to the new page.

>
> —
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20 20:48 ` Yang Shi
@ 2021-07-20 22:21   ` Yang Shi
  0 siblings, 0 replies; 14+ messages in thread
From: Yang Shi @ 2021-07-20 22:21 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Dan Carpenter, Mel Gorman, Christian Borntraeger,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Zi Yan

On Tue, Jul 20, 2021 at 1:48 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jul 19, 2021 at 11:56 PM Huang Ying <ying.huang@intel.com> wrote:
> >
> > Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> > handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
> > via flush_tlb_range().
> >
> > But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
> > handling"), the TLB flushing is done in migrate_pages() as in the
> > following code path anyway.
> >
> > do_huge_pmd_numa_page
> >   migrate_misplaced_page
> >     migrate_pages
> >
> > So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
> > unnecessary.  So the code is deleted in this patch to simplify the
> > code.  This is only code cleanup, there's no visible performance
> > difference.
>
> Yes, there is tlb flush in try_to_migrate(), but it seems mmu notifier
> invalidate is missed for the THP migration case. I'm not quite sure
> why it is not needed, maybe just missed?
>
> So, you may need the below change too:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2d29a57d29e8..e1c8b654563d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1749,6 +1749,8 @@ static bool try_to_migrate_one(struct page
> *page, struct vm_area_struct *vma,
>                                        !PageTransCompound(page), page);
>
>                         set_pmd_migration_entry(&pvmw, page);
> +                       mmu_notifier_invalidate_range(mm, range.start,
> +                                                     range.end);
>                         continue;
>                 }
>  #endif

Per the discussion with Zi Yan the mmu notifier invalidate should be
not needed. The patch looks good to me. Reviewed-by: Yang Shi
<shy828301@gmail.com>

>
> >
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> > Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> > Cc: Heiko Carstens <hca@linux.ibm.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Vasily Gorbik <gor@linux.ibm.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > ---
> >  mm/huge_memory.c | 26 --------------------------
> >  1 file changed, 26 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index afff3ac87067..9f21e44c9030 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >                 goto out;
> >         }
> >
> > -       /*
> > -        * Since we took the NUMA fault, we must have observed the !accessible
> > -        * bit. Make sure all other CPUs agree with that, to avoid them
> > -        * modifying the page we're about to migrate.
> > -        *
> > -        * Must be done under PTL such that we'll observe the relevant
> > -        * inc_tlb_flush_pending().
> > -        *
> > -        * We are not sure a pending tlb flush here is for a huge page
> > -        * mapping or not. Hence use the tlb range variant
> > -        */
> > -       if (mm_tlb_flush_pending(vma->vm_mm)) {
> > -               flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> > -               /*
> > -                * change_huge_pmd() released the pmd lock before
> > -                * invalidating the secondary MMUs sharing the primary
> > -                * MMU pagetables (with ->invalidate_range()). The
> > -                * mmu_notifier_invalidate_range_end() (which
> > -                * internally calls ->invalidate_range()) in
> > -                * change_pmd_range() will run after us, so we can't
> > -                * rely on it here and we need an explicit invalidate.
> > -                */
> > -               mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> > -                                             haddr + HPAGE_PMD_SIZE);
> > -       }
> > -
> >         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
> >         page = vm_normal_page_pmd(vma, haddr, pmd);
> >         if (!page)
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-20 22:19       ` Yang Shi
@ 2021-07-21 15:41         ` Sean Christopherson
  2021-07-22  0:26           ` Huang, Ying
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2021-07-21 15:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zi Yan, Christian Borntraeger, Huang Ying, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

On Tue, Jul 20, 2021, Yang Shi wrote:
> On Tue, Jul 20, 2021 at 2:04 PM Zi Yan <ziy@nvidia.com> wrote:
> >
> > On 20 Jul 2021, at 16:53, Yang Shi wrote:
> >
> > > On Tue, Jul 20, 2021 at 7:25 AM Christian Borntraeger
> > > <borntraeger@de.ibm.com> wrote:
> > >>> -     if (mm_tlb_flush_pending(vma->vm_mm)) {
> > >>> -             flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> > >>> -             /*
> > >>> -              * change_huge_pmd() released the pmd lock before
> > >>> -              * invalidating the secondary MMUs sharing the primary
> > >>> -              * MMU pagetables (with ->invalidate_range()). The
> > >>> -              * mmu_notifier_invalidate_range_end() (which
> > >>> -              * internally calls ->invalidate_range()) in
> > >>> -              * change_pmd_range() will run after us, so we can't
> > >>> -              * rely on it here and we need an explicit invalidate.
> > >>> -              */
> > >>> -             mmu_notifier_invalidate_range(vma->vm_mm, haddr,
> > >>> -                                           haddr + HPAGE_PMD_SIZE);
> > >>> -     }
> > >>> CC Paolo/KVM list so we also remove the mmu notifier here. Do we need those
> > >> now in migrate_pages? I am not an expert in that code, but I cant find
> > >> an equivalent mmu_notifier in migrate_misplaced_pages.
> > >> I might be totally wrong, just something that I noticed.
> > >
> > > Do you mean the missed mmu notifier invalidate for the THP migration
> > > case? Yes, I noticed that too. But I'm not sure whether it is intended
> > > or just missed.
> >
> > From my understand of mmu_notifier document, mmu_notifier_invalidate_range()
> > is needed only if the PTE is updated to point to a new page or the page pointed
> > by the PTE is freed. Page migration does not fall into either case.

The "new page" part of

  a page table entry is updated to point to a new page

is referring to a different physical page, i.e. a different pfn, not a different
struct page.  do_huge_pmd_numa_page() is moving a THP between nodes, thus it's
changing the backing pfn and needs to invalidate secondary MMUs at some point.

> > In addition, in migrate_pages(), more specifically try_to_migrate_one(),
> > there is a pair of mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end() around the PTE manipulation code, which should
> > be sufficient to notify secondary TLBs (including KVM) about the PTE change
> > for page migration. Correct me if I am wrong.
> 
> Thanks, I think you are correct. By looking into commit 7066f0f933a1
> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
> the tlb flush and mmu notifier invalidate were needed since the old
> numa fault implementation didn't change PTE to migration entry so it
> may cause data corruption due to the writes from GPU secondary MMU.
> 
> The refactor does use the generic migration code which converts PTE to
> migration entry before copying data to the new page.

That's my understanding as well, based on this blurb from commit 7066f0f933a1.

    The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
    uses the generic migrate_pages which transitions the pte from
    numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
    and all mmu notifiers there before copying the page.

That analysis/justification for removing the invalidate_range() call should be
captured in the changelog.  Confirmation from Andrea would be a nice bonus.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-21 15:41         ` Sean Christopherson
@ 2021-07-22  0:26           ` Huang, Ying
  2021-07-22  7:36             ` Christian Borntraeger
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2021-07-22  0:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Shi, Zi Yan, Christian Borntraeger, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

Sean Christopherson <seanjc@google.com> writes:
>> 
>> Thanks, I think you are correct. By looking into commit 7066f0f933a1
>> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
>> the tlb flush and mmu notifier invalidate were needed since the old
>> numa fault implementation didn't change PTE to migration entry so it
>> may cause data corruption due to the writes from GPU secondary MMU.
>> 
>> The refactor does use the generic migration code which converts PTE to
>> migration entry before copying data to the new page.
>
> That's my understanding as well, based on this blurb from commit 7066f0f933a1.
>
>     The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
>     uses the generic migrate_pages which transitions the pte from
>     numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
>     and all mmu notifiers there before copying the page.
>
> That analysis/justification for removing the invalidate_range() call should be
> captured in the changelog.  Confirmation from Andrea would be a nice bonus.

When we flush CPU TLB for a page that may be shared with device/VM TLB,
we will call MMU notifiers for the page to flush the device/VM TLB.
Right?  So when we replaced CPU TLB flushing in do_huge_pmd_numa_page()
with that in try_to_migrate_one(), we will replace the MMU notifiers
calling too.  Do you agree?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-22  0:26           ` Huang, Ying
@ 2021-07-22  7:36             ` Christian Borntraeger
  2021-07-22 23:10               ` Huang, Ying
  2021-07-23  0:03               ` Huang, Ying
  0 siblings, 2 replies; 14+ messages in thread
From: Christian Borntraeger @ 2021-07-22  7:36 UTC (permalink / raw)
  To: Huang, Ying, Sean Christopherson
  Cc: Yang Shi, Zi Yan, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list



On 22.07.21 02:26, Huang, Ying wrote:
> Sean Christopherson <seanjc@google.com> writes:
>>>
>>> Thanks, I think you are correct. By looking into commit 7066f0f933a1
>>> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
>>> the tlb flush and mmu notifier invalidate were needed since the old
>>> numa fault implementation didn't change PTE to migration entry so it
>>> may cause data corruption due to the writes from GPU secondary MMU.
>>>
>>> The refactor does use the generic migration code which converts PTE to
>>> migration entry before copying data to the new page.
>>
>> That's my understanding as well, based on this blurb from commit 7066f0f933a1.
>>
>>      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
>>      uses the generic migrate_pages which transitions the pte from
>>      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
>>      and all mmu notifiers there before copying the page.
>>
>> That analysis/justification for removing the invalidate_range() call should be
>> captured in the changelog.  Confirmation from Andrea would be a nice bonus.
> 
> When we flush CPU TLB for a page that may be shared with device/VM TLB,
> we will call MMU notifiers for the page to flush the device/VM TLB.
> Right?  So when we replaced CPU TLB flushing in do_huge_pmd_numa_page()
> with that in try_to_migrate_one(), we will replace the MMU notifiers
> calling too.  Do you agree?

Can someone write an updated commit messages that contains this information?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-22  7:36             ` Christian Borntraeger
@ 2021-07-22 23:10               ` Huang, Ying
  2021-07-23  0:03               ` Huang, Ying
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2021-07-22 23:10 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Sean Christopherson, Yang Shi, Zi Yan, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 22.07.21 02:26, Huang, Ying wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>>>
>>>> Thanks, I think you are correct. By looking into commit 7066f0f933a1
>>>> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
>>>> the tlb flush and mmu notifier invalidate were needed since the old
>>>> numa fault implementation didn't change PTE to migration entry so it
>>>> may cause data corruption due to the writes from GPU secondary MMU.
>>>>
>>>> The refactor does use the generic migration code which converts PTE to
>>>> migration entry before copying data to the new page.
>>>
>>> That's my understanding as well, based on this blurb from commit 7066f0f933a1.
>>>
>>>      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
>>>      uses the generic migrate_pages which transitions the pte from
>>>      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
>>>      and all mmu notifiers there before copying the page.
>>>
>>> That analysis/justification for removing the invalidate_range() call should be
>>> captured in the changelog.  Confirmation from Andrea would be a nice bonus.
>> When we flush CPU TLB for a page that may be shared with device/VM
>> TLB,
>> we will call MMU notifiers for the page to flush the device/VM TLB.
>> Right?  So when we replaced CPU TLB flushing in do_huge_pmd_numa_page()
>> with that in try_to_migrate_one(), we will replace the MMU notifiers
>> calling too.  Do you agree?
>
> Can someone write an updated commit messages that contains this information?

OK.  I will update the patch description to add MMU notifiers
description.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-22  7:36             ` Christian Borntraeger
  2021-07-22 23:10               ` Huang, Ying
@ 2021-07-23  0:03               ` Huang, Ying
  2021-07-23 20:19                 ` Andrew Morton
  1 sibling, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2021-07-23  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Borntraeger, Sean Christopherson, Yang Shi, Zi Yan,
	Linux MM, Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 22.07.21 02:26, Huang, Ying wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>>>
>>>> Thanks, I think you are correct. By looking into commit 7066f0f933a1
>>>> ("mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()"),
>>>> the tlb flush and mmu notifier invalidate were needed since the old
>>>> numa fault implementation didn't change PTE to migration entry so it
>>>> may cause data corruption due to the writes from GPU secondary MMU.
>>>>
>>>> The refactor does use the generic migration code which converts PTE to
>>>> migration entry before copying data to the new page.
>>>
>>> That's my understanding as well, based on this blurb from commit 7066f0f933a1.
>>>
>>>      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
>>>      uses the generic migrate_pages which transitions the pte from
>>>      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
>>>      and all mmu notifiers there before copying the page.
>>>
>>> That analysis/justification for removing the invalidate_range() call should be
>>> captured in the changelog.  Confirmation from Andrea would be a nice bonus.
>> When we flush CPU TLB for a page that may be shared with device/VM
>> TLB,
>> we will call MMU notifiers for the page to flush the device/VM TLB.
>> Right?  So when we replaced CPU TLB flushing in do_huge_pmd_numa_page()
>> with that in try_to_migrate_one(), we will replace the MMU notifiers
>> calling too.  Do you agree?
>
> Can someone write an updated commit messages that contains this information?

Hi, Andrew,

Can you help to add the following text to the end of the original patch
description?

"
The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
deleted too.  Because migrate_pages() takes care of that too when CPU
TLB is flushed.
"

Or, if you prefer the complete patch, it's as below.

Best Regards,
Huang, Ying

------------------------------------8<---------------------------------------------
From a7ce0c58dcc0d2f0d87b43b4e93a6623d78c9c25 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Tue, 13 Jul 2021 13:41:37 +0800
Subject: [PATCH -V2] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing
 code

Before the commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
handling"), the TLB flushing is done in do_huge_pmd_numa_page() itself
via flush_tlb_range().

But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault
handling"), the TLB flushing is done in migrate_pages() as in the
following code path anyway.

do_huge_pmd_numa_page
  migrate_misplaced_page
    migrate_pages

So now, the TLB flushing code in do_huge_pmd_numa_page() becomes
unnecessary.  So the code is deleted in this patch to simplify the
code.  This is only code cleanup, there's no visible performance
difference.

The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
deleted too.  Because migrate_pages() takes care of that too when CPU
TLB is flushed.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
---
 mm/huge_memory.c | 26 --------------------------
 1 file changed, 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..9f21e44c9030 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1440,32 +1440,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/*
-	 * Since we took the NUMA fault, we must have observed the !accessible
-	 * bit. Make sure all other CPUs agree with that, to avoid them
-	 * modifying the page we're about to migrate.
-	 *
-	 * Must be done under PTL such that we'll observe the relevant
-	 * inc_tlb_flush_pending().
-	 *
-	 * We are not sure a pending tlb flush here is for a huge page
-	 * mapping or not. Hence use the tlb range variant
-	 */
-	if (mm_tlb_flush_pending(vma->vm_mm)) {
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-		/*
-		 * change_huge_pmd() released the pmd lock before
-		 * invalidating the secondary MMUs sharing the primary
-		 * MMU pagetables (with ->invalidate_range()). The
-		 * mmu_notifier_invalidate_range_end() (which
-		 * internally calls ->invalidate_range()) in
-		 * change_pmd_range() will run after us, so we can't
-		 * rely on it here and we need an explicit invalidate.
-		 */
-		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
-					      haddr + HPAGE_PMD_SIZE);
-	}
-
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	page = vm_normal_page_pmd(vma, haddr, pmd);
 	if (!page)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code
  2021-07-23  0:03               ` Huang, Ying
@ 2021-07-23 20:19                 ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2021-07-23 20:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Christian Borntraeger, Sean Christopherson, Yang Shi, Zi Yan,
	Linux MM, Linux Kernel Mailing List, Dan Carpenter, Mel Gorman,
	Gerald Schaefer, Heiko Carstens, Hugh Dickins, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko, Vasily Gorbik, Paolo Bonzini,
	kvm list

On Fri, 23 Jul 2021 08:03:42 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> Can you help to add the following text to the end of the original patch
> description?
> 
> "
> The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
> deleted too.  Because migrate_pages() takes care of that too when CPU
> TLB is flushed.
> "

Done.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-07-23 20:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20  6:55 [PATCH] mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code Huang Ying
2021-07-20 13:36 ` Zi Yan
2021-07-20 14:25 ` Christian Borntraeger
2021-07-20 20:53   ` Yang Shi
2021-07-20 21:04     ` Zi Yan
2021-07-20 22:19       ` Yang Shi
2021-07-21 15:41         ` Sean Christopherson
2021-07-22  0:26           ` Huang, Ying
2021-07-22  7:36             ` Christian Borntraeger
2021-07-22 23:10               ` Huang, Ying
2021-07-23  0:03               ` Huang, Ying
2021-07-23 20:19                 ` Andrew Morton
2021-07-20 20:48 ` Yang Shi
2021-07-20 22:21   ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).