From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Benjamin Herrenschmidt <benh@au1.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	linuxppc-dev@lists.ozlabs.org, paulus@samba.org
Subject: Re: [PATCH -V10 00/15] THP support for PPC64
Date: Tue, 18 Jun 2013 23:24:30 +0530
Message-ID: <8761xbi9t5.fsf@linux.vnet.ibm.com>
In-Reply-To: <1371348007.21896.62.camel@pasglop>

Benjamin Herrenschmidt <benh@au1.ibm.com> writes:

> On Wed, 2013-06-05 at 20:58 +0530, Aneesh Kumar K.V wrote:
>> This is the second patchset needed to support THP on ppc64. Some of
>> the changes included in this series are tricky in that it changes the
>> powerpc linux page table walk subtly. We also overload few of the pte
>> flags for ptes at PMD level (huge page PTEs).
>> 
>> The related mm/ changes are already merged to Andrew's -mm tree.
>
> [Andrea, question for you near the end ]
>
> So I'm trying to understand how you handle races between hash_page
> and collapse.
>
> The generic collapse code does:
>
> 	_pmd = pmdp_clear_flush(vma, address, pmd);
>
> Which expects the architecture to essentially have stopped any
> concurrent walk by the time it returns.
>
> Your implementation of the above does this:
>
> 		pmd = *pmdp;
> 		pmd_clear(pmdp);
> 		/*
> 		 * Now invalidate the hpte entries in the range
> 		 * covered by pmd. This make sure we take a
> 		 * fault and will find the pmd as none, which will
> 		 * result in a major fault which takes mmap_sem and
> 		 * hence wait for collapse to complete. Without this
> 		 * the __collapse_huge_page_copy can result in copying
> 		 * the old content.
> 		 */
> 		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
>
> So we clear the pmd after making a copy of it. This will eventually
> prevent a tablewalk but only eventually, once that store becomes visible
> to other processors, which may take a while. Then you proceed to flush
> the hash table for all the underlying PTEs.
>
> So at this point, hash_page might *still* see the old pmd. Unless I
> missed something, you did nothing that will prevent that (the only way
> to lock against hash_page is really an IPI & wait or to take the PTE's
> busy and make them !present or something). So as far as I can tell,
> a concurrent hash_page can still sneak into the hash some "small"
> entries after you have supposedly flushed them.
>

We are depending on the pmd being none. But as you said, that doesn't
take care of a hash_page that is already in progress. As per the
discussion, I am listing below the option that uses synchronize_sched().
The other option we have is to clear _PAGE_USER.

commit f69f11ba6a957aac81ea8096b244005c450a2059
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Tue Jun 18 19:17:17 2013 +0530

    powerpc/THP: Wait for all hash_page calls to finish before invalidating HPTE entries
    
    When we collapse normal pages to hugepage, we first clear the pmd, then invalidate all
    the PTE entries. The assumption here is that any low level page fault will see pmd as
    none and take the slow path that will wait on mmap_sem. But we could very well be in
    a hash_page with local ptep pointer value. Such a hash_page can result in adding new
    HPTE entries for normal subpages/small pages. That means we could be modifying the
    page content as we copy them to a huge page. Fix this by waiting for hash_page to finish
    after marking the pmd none and before invalidating HPTE entries. We use the heavy
    synchronize_sched(). This should be ok as we do this in the background khugepaged thread
    and not in application context. Also, if we find collapse slow, we can increase
    the scan rate.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index bbecac4..92b733e 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -532,6 +532,7 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pmd_t *pmdp)
 {
 	pmd_t pmd;
+	struct mm_struct *mm = vma->vm_mm;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	if (pmd_trans_huge(*pmdp)) {
@@ -542,6 +543,16 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		 */
 		pmd = *pmdp;
 		pmd_clear(pmdp);
+		spin_unlock(&mm->page_table_lock);
+		/*
+		 * Wait for all pending hash_page to finish
+		 * We can do this by waiting for a context switch to happen on
+		 * the cpus. Any new hash_page after this will see pmd none
+		 * and fallback to code that takes mmap_sem and hence will block
+		 * for collapse to finish.
+		 */
+		synchronize_sched();
+		spin_lock(&mm->page_table_lock);
 		/*
 		 * Now invalidate the hpte entries in the range
 		 * covered by pmd. This make sure we take a
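
To make the ordering argument concrete, here is a small userspace toy model
of the race, written as a sketch only. It is not kernel code: toy_mm,
toy_walker, walker_begin(), walker_finish() and
stale_hptes_after_collapse() are all hypothetical names, the "hash table"
is just a counter, and the synchronize_sched() wait is modelled by simply
letting the in-flight walker run to completion before the flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; every name here is hypothetical, not a kernel API. */
struct toy_mm {
	bool pmd_present;	/* stands in for the Linux pmd entry */
	int hpte_count;		/* stands in for small-page HPTEs in the hash */
};

struct toy_walker {
	bool in_flight;		/* between sampling the pmd and inserting */
	bool saw_pmd;		/* the pmd state the walker sampled */
};

/* First half of a hash_page-like fault: sample the pmd exactly once. */
static void walker_begin(struct toy_walker *w, const struct toy_mm *mm)
{
	w->saw_pmd = mm->pmd_present;
	w->in_flight = true;
}

/* Second half: insert an HPTE based on the (possibly stale) sample. */
static void walker_finish(struct toy_walker *w, struct toy_mm *mm)
{
	if (w->in_flight && w->saw_pmd)
		mm->hpte_count++;
	w->in_flight = false;
}

/*
 * Run one collapse against one racing walker and return the number of
 * small-page HPTEs left once both have finished.
 */
int stale_hptes_after_collapse(bool wait_for_walkers)
{
	struct toy_mm mm = { .pmd_present = true, .hpte_count = 0 };
	struct toy_walker w = { .in_flight = false, .saw_pmd = false };

	walker_begin(&w, &mm);		/* walker samples a present pmd */
	mm.pmd_present = false;		/* collapse: pmd_clear() */

	if (wait_for_walkers)
		walker_finish(&w, &mm);	/* models the grace-period wait */

	mm.hpte_count = 0;		/* collapse: flush HPTEs in the range */

	walker_finish(&w, &mm);		/* no-op if the walker already ran */

	/* A walker that starts only now sees pmd none and inserts nothing. */
	walker_begin(&w, &mm);
	walker_finish(&w, &mm);

	assert(!w.in_flight);
	return mm.hpte_count;
}
```

In this model, stale_hptes_after_collapse(false) leaves one stale entry
behind, because the walker inserts after the flush, while
stale_hptes_after_collapse(true) leaves none. That is the property the
patch is after; the model says nothing about the cost of the wait itself.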

