* [PATCH v5 0/8] Support for contiguous pte hugepages
@ 2017-06-19 17:01 Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 1/8] arm64: hugetlb: Refactor find_num_contig Punit Agrawal
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar

Hi Andrew,

This is v5 of the patchset to update the hugetlb code to support
contiguous hugepages. The previous version of the patchset can be
found at [0].

The main changes in this version are updates to Patches 4 and 7 to
address issues highlighted in the previous postings (an ltp failure and
a build failure).

Please update the patches in your queue with this version.

Thanks,
Punit

Changes since v4:

* Patch 4 updated to fix arm64 ltp failure (pth_str01, pth_str03) [1]
* Patch 7 updated to fix build failure when CONFIG_HUGETLB_PAGE is disabled

[0] https://lkml.org/lkml/2017/5/24/463
[1] https://lkml.org/lkml/2017/6/5/332

Punit Agrawal (5):
  mm, gup: Ensure real head page is ref-counted when using hugepages
  mm/hugetlb: add size parameter to huge_pte_offset()
  mm/hugetlb: Allow architectures to override huge_pte_clear()
  mm/hugetlb: Introduce set_huge_swap_pte_at() helper
  mm: rmap: Use correct helper when poisoning hugepages

Steve Capper (2):
  arm64: hugetlb: Refactor find_num_contig
  arm64: hugetlb: Remove spurious calls to huge_ptep_offset

Will Deacon (1):
  mm, gup: Remove broken VM_BUG_ON_PAGE compound check for hugepages

 arch/arm64/mm/hugetlbpage.c     | 53 +++++++++++++++++------------------------
 arch/ia64/mm/hugetlbpage.c      |  4 ++--
 arch/metag/mm/hugetlbpage.c     |  3 ++-
 arch/mips/mm/hugetlbpage.c      |  3 ++-
 arch/parisc/mm/hugetlbpage.c    |  3 ++-
 arch/powerpc/mm/hugetlbpage.c   |  2 +-
 arch/s390/include/asm/hugetlb.h |  2 +-
 arch/s390/mm/hugetlbpage.c      |  3 ++-
 arch/sh/mm/hugetlbpage.c        |  3 ++-
 arch/sparc/mm/hugetlbpage.c     |  3 ++-
 arch/tile/mm/hugetlbpage.c      |  3 ++-
 arch/x86/mm/hugetlbpage.c       |  2 +-
 fs/userfaultfd.c                |  7 ++++--
 include/asm-generic/hugetlb.h   |  4 +++-
 include/linux/hugetlb.h         | 18 ++++++++++++--
 mm/gup.c                        | 15 +++++-------
 mm/hugetlb.c                    | 33 +++++++++++++++----------
 mm/page_vma_mapped.c            |  3 ++-
 mm/pagewalk.c                   |  3 ++-
 mm/rmap.c                       |  7 ++++--
 20 files changed, 100 insertions(+), 74 deletions(-)

-- 
2.11.0


* [PATCH v5 1/8] arm64: hugetlb: Refactor find_num_contig
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 2/8] arm64: hugetlb: Remove spurious calls to huge_ptep_offset Punit Agrawal
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Steve Capper, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, mark.rutland, linux-arch, aneesh.kumar,
	David Woods, Punit Agrawal

From: Steve Capper <steve.capper@arm.com>

The huge accessors already check pte_cont() before they need the size
of the contiguous range, so remove the redundant check, along with the
now-unused pte argument, from find_num_contig.
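
For illustration only, the resulting calling pattern might be sketched
in user space as below. Every sketch_* name, the bit position of the
contiguous flag, and the count of 16 entries are assumptions made for
this example and are not the arm64 definitions; the mock also handles
only the pte level, whereas the real helper distinguishes pmd and pte
ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are the kernel definitions. */
#define SKETCH_PAGE_SIZE  4096UL
#define SKETCH_CONT_PTES  16                    /* assumed entries per range */
#define SKETCH_CONT_BIT   (UINT64_C(1) << 52)   /* assumed flag position */

typedef uint64_t sketch_pte_t;

bool sketch_pte_cont(sketch_pte_t pte)
{
	return (pte & SKETCH_CONT_BIT) != 0;
}

/* After the refactor: reports range geometry only, no contiguous-bit test. */
int sketch_find_num_contig(size_t *pgsize)
{
	assert(pgsize != NULL);
	*pgsize = SKETCH_PAGE_SIZE;
	return SKETCH_CONT_PTES;
}

/*
 * The accessor tests the contiguous bit itself and only then asks how
 * many entries to write. Returns the number of entries written.
 */
int sketch_set_huge_pte(sketch_pte_t *ptep, sketch_pte_t pte)
{
	size_t pgsize;
	int i, ncontig;

	if (!sketch_pte_cont(pte)) {
		*ptep = pte;
		return 1;
	}

	ncontig = sketch_find_num_contig(&pgsize);
	/* The mock keeps the output address in the low bits of the entry. */
	for (i = 0; i < ncontig; i++)
		ptep[i] = pte + (uint64_t)i * pgsize;
	return ncontig;
}
```

A non-contiguous entry touches one slot; a contiguous one fills the
whole range, each slot advancing by one page.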

Cc: David Woods <dwoods@mellanox.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
[ Resolved rebase conflicts due to patch re-ordering ]
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
---
 arch/arm64/mm/hugetlbpage.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 69b8200b1cfd..710bf935a473 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -42,15 +42,13 @@ int pud_huge(pud_t pud)
 }
 
 static int find_num_contig(struct mm_struct *mm, unsigned long addr,
-			   pte_t *ptep, pte_t pte, size_t *pgsize)
+			   pte_t *ptep, size_t *pgsize)
 {
 	pgd_t *pgd = pgd_offset(mm, addr);
 	pud_t *pud;
 	pmd_t *pmd;
 
 	*pgsize = PAGE_SIZE;
-	if (!pte_cont(pte))
-		return 1;
 	pud = pud_offset(pgd, addr);
 	pmd = pmd_offset(pud, addr);
 	if ((pte_t *)pmd == ptep) {
@@ -65,15 +63,16 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 {
 	size_t pgsize;
 	int i;
-	int ncontig = find_num_contig(mm, addr, ptep, pte, &pgsize);
+	int ncontig;
 	unsigned long pfn;
 	pgprot_t hugeprot;
 
-	if (ncontig == 1) {
+	if (!pte_cont(pte)) {
 		set_pte_at(mm, addr, ptep, pte);
 		return;
 	}
 
+	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 	pfn = pte_pfn(pte);
 	hugeprot = __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
 	for (i = 0; i < ncontig; i++) {
@@ -188,7 +187,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 		bool is_dirty = false;
 
 		cpte = huge_pte_offset(mm, addr);
-		ncontig = find_num_contig(mm, addr, cpte, *cpte, &pgsize);
+		ncontig = find_num_contig(mm, addr, cpte, &pgsize);
 		/* save the 1st pte to return */
 		pte = ptep_get_and_clear(mm, addr, cpte);
 		for (i = 1, addr += pgsize; i < ncontig; ++i, addr += pgsize) {
@@ -228,7 +227,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 		cpte = huge_pte_offset(vma->vm_mm, addr);
 		pfn = pte_pfn(*cpte);
 		ncontig = find_num_contig(vma->vm_mm, addr, cpte,
-					  *cpte, &pgsize);
+					  &pgsize);
 		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize) {
 			changed |= ptep_set_access_flags(vma, addr, cpte,
 							pfn_pte(pfn,
@@ -251,7 +250,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 		size_t pgsize = 0;
 
 		cpte = huge_pte_offset(mm, addr);
-		ncontig = find_num_contig(mm, addr, cpte, *cpte, &pgsize);
+		ncontig = find_num_contig(mm, addr, cpte, &pgsize);
 		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize)
 			ptep_set_wrprotect(mm, addr, cpte);
 	} else {
@@ -269,7 +268,7 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
 
 		cpte = huge_pte_offset(vma->vm_mm, addr);
 		ncontig = find_num_contig(vma->vm_mm, addr, cpte,
-					  *cpte, &pgsize);
+					  &pgsize);
 		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize)
 			ptep_clear_flush(vma, addr, cpte);
 	} else {
-- 
2.11.0


* [PATCH v5 2/8] arm64: hugetlb: Remove spurious calls to huge_ptep_offset
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 1/8] arm64: hugetlb: Refactor find_num_contig Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 3/8] mm, gup: Remove broken VM_BUG_ON_PAGE compound check for hugepages Punit Agrawal
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Steve Capper, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, mark.rutland, linux-arch, aneesh.kumar,
	David Woods, Punit Agrawal

From: Steve Capper <steve.capper@arm.com>

We don't need to call huge_pte_offset as our accessors are already
supplied with the pte_t *. This patch removes those spurious calls.

Cc: David Woods <dwoods@mellanox.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
[ Resolved rebase conflicts due to patch re-ordering ]
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
---
 arch/arm64/mm/hugetlbpage.c | 37 ++++++++++++++-----------------------
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 710bf935a473..f89aa8fa5855 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -183,21 +183,19 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	if (pte_cont(*ptep)) {
 		int ncontig, i;
 		size_t pgsize;
-		pte_t *cpte;
 		bool is_dirty = false;
 
-		cpte = huge_pte_offset(mm, addr);
-		ncontig = find_num_contig(mm, addr, cpte, &pgsize);
+		ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 		/* save the 1st pte to return */
-		pte = ptep_get_and_clear(mm, addr, cpte);
+		pte = ptep_get_and_clear(mm, addr, ptep);
 		for (i = 1, addr += pgsize; i < ncontig; ++i, addr += pgsize) {
 			/*
 			 * If HW_AFDBM is enabled, then the HW could
 			 * turn on the dirty bit for any of the page
 			 * in the set, so check them all.
 			 */
-			++cpte;
-			if (pte_dirty(ptep_get_and_clear(mm, addr, cpte)))
+			++ptep;
+			if (pte_dirty(ptep_get_and_clear(mm, addr, ptep)))
 				is_dirty = true;
 		}
 		if (is_dirty)
@@ -213,8 +211,6 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			       unsigned long addr, pte_t *ptep,
 			       pte_t pte, int dirty)
 {
-	pte_t *cpte;
-
 	if (pte_cont(pte)) {
 		int ncontig, i, changed = 0;
 		size_t pgsize = 0;
@@ -224,12 +220,11 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			__pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^
 				 pte_val(pte));
 
-		cpte = huge_pte_offset(vma->vm_mm, addr);
-		pfn = pte_pfn(*cpte);
-		ncontig = find_num_contig(vma->vm_mm, addr, cpte,
+		pfn = pte_pfn(pte);
+		ncontig = find_num_contig(vma->vm_mm, addr, ptep,
 					  &pgsize);
-		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize) {
-			changed |= ptep_set_access_flags(vma, addr, cpte,
+		for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize) {
+			changed |= ptep_set_access_flags(vma, addr, ptep,
 							pfn_pte(pfn,
 								hugeprot),
 							dirty);
@@ -246,13 +241,11 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 {
 	if (pte_cont(*ptep)) {
 		int ncontig, i;
-		pte_t *cpte;
 		size_t pgsize = 0;
 
-		cpte = huge_pte_offset(mm, addr);
-		ncontig = find_num_contig(mm, addr, cpte, &pgsize);
-		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize)
-			ptep_set_wrprotect(mm, addr, cpte);
+		ncontig = find_num_contig(mm, addr, ptep, &pgsize);
+		for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize)
+			ptep_set_wrprotect(mm, addr, ptep);
 	} else {
 		ptep_set_wrprotect(mm, addr, ptep);
 	}
@@ -263,14 +256,12 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
 {
 	if (pte_cont(*ptep)) {
 		int ncontig, i;
-		pte_t *cpte;
 		size_t pgsize = 0;
 
-		cpte = huge_pte_offset(vma->vm_mm, addr);
-		ncontig = find_num_contig(vma->vm_mm, addr, cpte,
+		ncontig = find_num_contig(vma->vm_mm, addr, ptep,
 					  &pgsize);
-		for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize)
-			ptep_clear_flush(vma, addr, cpte);
+		for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize)
+			ptep_clear_flush(vma, addr, ptep);
 	} else {
 		ptep_clear_flush(vma, addr, ptep);
 	}
-- 
2.11.0


* [PATCH v5 3/8] mm, gup: Remove broken VM_BUG_ON_PAGE compound check for hugepages
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 1/8] arm64: hugetlb: Refactor find_num_contig Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 2/8] arm64: hugetlb: Remove spurious calls to huge_ptep_offset Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 4/8] mm, gup: Ensure real head page is ref-counted when using hugepages Punit Agrawal
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Will Deacon, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, n-horiguchi, kirill.shutemov, mike.kravetz,
	steve.capper, mark.rutland, linux-arch, aneesh.kumar,
	Punit Agrawal

From: Will Deacon <will.deacon@arm.com>

When operating on hugepages with DEBUG_VM enabled, the GUP code checks the
compound head for each tail page prior to calling page_cache_add_speculative.
This is broken, because on the fast-GUP path (where we don't hold any page
table locks) we can be racing with a concurrent invocation of
split_huge_page_to_list.

split_huge_page_to_list deals with this race by using page_ref_freeze to
freeze the page and force concurrent GUPs to fail whilst the component
pages are modified. This modification includes clearing the compound_head
field for the tail pages, so checking this prior to a successful call
to page_cache_add_speculative can lead to false positives: In fact,
page_cache_add_speculative *already* has this check once the page refcount
has been successfully updated, so we can simply remove the broken calls
to VM_BUG_ON_PAGE.
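
As a rough user-space model of the interaction described above (not
the kernel code), the freeze and the speculative add can be sketched
with C11 atomics. The sketch_* names and the struct layout are
assumptions made for this example. The point it tries to show is that
a frozen count of zero makes the speculative add fail, so any check on
the page is only meaningful after the add has succeeded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page model: a count of zero means frozen (or free). */
struct sketch_page {
	atomic_int refcount;
};

/* Add 'count' references unless the count is currently zero. */
bool sketch_add_speculative(struct sketch_page *page, int count)
{
	int old;

	assert(count > 0);
	old = atomic_load(&page->refcount);
	do {
		if (old == 0)
			return false;	/* frozen: the caller must back off */
	} while (!atomic_compare_exchange_weak(&page->refcount,
					       &old, old + count));
	return true;
}

/* Freeze succeeds only if the count equals 'expected'; it then drops to 0. */
bool sketch_ref_freeze(struct sketch_page *page, int expected)
{
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}
```

In this model a splitter that wins the freeze is free to rewrite the
tail pages, and every concurrent speculative add returns false until
the count is restored.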

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/gup.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b3c7214d710d..e74e0b5a0c7c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1357,7 +1357,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	head = pmd_page(orig);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1396,7 +1395,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	head = pud_page(orig);
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1434,7 +1432,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	head = pgd_page(orig);
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
-- 
2.11.0


* [PATCH v5 4/8] mm, gup: Ensure real head page is ref-counted when using hugepages
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (2 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 3/8] mm, gup: Remove broken VM_BUG_ON_PAGE compound check for hugepages Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 5/8] mm/hugetlb: add size parameter to huge_pte_offset() Punit Agrawal
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar, Michal Hocko

When speculatively taking references to a hugepage using
page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
page returned by pmd_page() is the head page. Although normally true,
this assumption doesn't hold when the hugepage comprises successive
page table entries, such as when using the contiguous bit on arm64 at
the PTE or PMD level.

This can be addressed by ensuring that the page passed to
page_cache_add_speculative() is the real head or by de-referencing the
head page within the function.

We take the first approach to keep the usage pattern aligned with
page_cache_get_speculative() where users already pass the appropriate
page, i.e., the de-referenced head.

Apply the same logic to fix gup_huge_[pud|pgd]() as well.
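
A simplified user-space model of the problem, with hypothetical
sketch_* names and a plain integer refcount assumed purely for
illustration: when a hugepage spans several entries, the page behind a
later entry is a tail page, so the references have to be redirected to
the head it points at.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page model; the kernel's struct page differs. */
struct sketch_page {
	struct sketch_page *head;	/* NULL for a head or non-compound page */
	int refcount;
};

/* Resolve any page of a compound page to its head. */
struct sketch_page *sketch_compound_head(struct sketch_page *page)
{
	assert(page != NULL);
	return page->head ? page->head : page;
}

/*
 * Take 'refs' references for the page that a table entry points at.
 * The count always lands on the real head, whichever entry was used.
 */
struct sketch_page *sketch_take_refs(struct sketch_page *entry_page, int refs)
{
	struct sketch_page *head = sketch_compound_head(entry_page);

	head->refcount += refs;
	return head;
}
```

Passing a tail page here still bumps the head's count, which is the
behaviour the patch restores for the gup_huge_*() helpers.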

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 mm/gup.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e74e0b5a0c7c..6bd39264d0e7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1354,8 +1354,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return __gup_device_huge_pmd(orig, addr, end, pages, nr);
 
 	refs = 0;
-	head = pmd_page(orig);
-	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
 		pages[*nr] = page;
 		(*nr)++;
@@ -1363,6 +1362,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pmd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1392,8 +1392,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return __gup_device_huge_pud(orig, addr, end, pages, nr);
 
 	refs = 0;
-	head = pud_page(orig);
-	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
 		pages[*nr] = page;
 		(*nr)++;
@@ -1401,6 +1400,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pud_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1429,8 +1429,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 
 	BUILD_BUG_ON(pgd_devmap(orig));
 	refs = 0;
-	head = pgd_page(orig);
-	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
+	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	do {
 		pages[*nr] = page;
 		(*nr)++;
@@ -1438,6 +1437,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pgd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
-- 
2.11.0


* [PATCH v5 5/8] mm/hugetlb: add size parameter to huge_pte_offset()
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (3 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 4/8] mm, gup: Ensure real head page is ref-counted when using hugepages Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 6/8] mm/hugetlb: Allow architectures to override huge_pte_clear() Punit Agrawal
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Alexander Viro, Michal Hocko

A poisoned or migrated hugepage is stored as a swap entry in the page
tables. On architectures that support hugepages consisting of contiguous
page table entries (such as on arm64) this leads to ambiguity in
determining the page table entry to return in huge_pte_offset() when a
poisoned entry is encountered.

Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address. Also fix up the
definition/usage of huge_pte_offset() throughout the tree.
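
To illustrate why the size helps, here is a hypothetical sketch of how
an architecture could use it; this patch itself only adds the
parameter. A swap entry carries no contiguous bit, so the walker
cannot tell from the entry alone where the hugepage starts. With the
size supplied, the address can be aligned down to the first entry of
the range. The sketch_* names and the flat, 4K-granule table are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K base pages */

/*
 * Index, in a flat table of base-page entries, of the first entry that
 * backs the hugepage of size 'sz' containing 'addr'.
 */
size_t sketch_huge_entry_index(unsigned long addr, unsigned long sz)
{
	/* sz must be a non-zero power of two for the mask to be valid */
	assert(sz != 0 && (sz & (sz - 1)) == 0);
	return (addr & ~(sz - 1)) >> SKETCH_PAGE_SHIFT;
}
```

With a 64K contiguous range, address 0x13000 resolves to entry 16 (the
start of the range), whereas a 4K lookup of the same address resolves
to entry 19.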

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
 arch/arm64/mm/hugetlbpage.c   |  3 ++-
 arch/ia64/mm/hugetlbpage.c    |  4 ++--
 arch/metag/mm/hugetlbpage.c   |  3 ++-
 arch/mips/mm/hugetlbpage.c    |  3 ++-
 arch/parisc/mm/hugetlbpage.c  |  3 ++-
 arch/powerpc/mm/hugetlbpage.c |  2 +-
 arch/s390/mm/hugetlbpage.c    |  3 ++-
 arch/sh/mm/hugetlbpage.c      |  3 ++-
 arch/sparc/mm/hugetlbpage.c   |  3 ++-
 arch/tile/mm/hugetlbpage.c    |  3 ++-
 arch/x86/mm/hugetlbpage.c     |  2 +-
 fs/userfaultfd.c              |  7 +++++--
 include/linux/hugetlb.h       |  5 +++--
 mm/hugetlb.c                  | 23 ++++++++++++++---------
 mm/page_vma_mapped.c          |  3 ++-
 mm/pagewalk.c                 |  3 ++-
 16 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f89aa8fa5855..656e0ece2289 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -131,7 +131,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 85de86d36fdf..ae35140332f7 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -44,7 +44,7 @@ huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 }
 
 pte_t *
-huge_pte_offset (struct mm_struct *mm, unsigned long addr)
+huge_pte_offset (struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
@@ -92,7 +92,7 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int writ
 	if (REGION_NUMBER(addr) != RGN_HPAGE)
 		return ERR_PTR(-EINVAL);
 
-	ptep = huge_pte_offset(mm, addr);
+	ptep = huge_pte_offset(mm, addr, HPAGE_SIZE);
 	if (!ptep || pte_none(*ptep))
 		return NULL;
 	page = pte_page(*ptep);
diff --git a/arch/metag/mm/hugetlbpage.c b/arch/metag/mm/hugetlbpage.c
index db1b7da91e4f..67fd53e2935a 100644
--- a/arch/metag/mm/hugetlbpage.c
+++ b/arch/metag/mm/hugetlbpage.c
@@ -74,7 +74,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index 74aa6f62468f..cef152234312 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -36,7 +36,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
+		       unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index aa50ac090e9b..5eb8f633b282 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -69,7 +69,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..e46744d3b4ae 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -55,7 +55,7 @@ static unsigned nr_gpages;
 
 #define hugepd_none(hpd)	(hpd_val(hpd) == 0)
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
 	/* Only called for hugetlbfs pages, hence can ignore THP */
 	return __find_linux_pte_or_hugepte(mm->pgd, addr, NULL, NULL);
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 9b4050caa4e9..ae23afc18493 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -176,7 +176,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return (pte_t *) pmdp;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
 	pud_t *pudp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index cc948db74878..d2412d2d6462 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -42,7 +42,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7c29d38e6b99..8989c5e155b3 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -277,7 +277,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index cb10153b5c9f..1f0993945521 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -102,7 +102,8 @@ static pte_t *get_pte(pte_t *base, int index, int level)
 	return ptep;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..ccf509063dfd 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -33,7 +33,7 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 	if (!vma || !is_vm_hugetlb_page(vma))
 		return ERR_PTR(-EINVAL);
 
-	pte = huge_pte_offset(mm, address);
+	pte = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
 
 	/* hugetlb should be locked, and hence, prefaulted */
 	WARN_ON(!pte || pte_none(*pte));
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f7555fc25877..7b9c94837895 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -214,6 +214,7 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
  * hugepmd ranges.
  */
 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 struct vm_area_struct *vma,
 					 unsigned long address,
 					 unsigned long flags,
 					 unsigned long reason)
@@ -224,7 +225,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	pte = huge_pte_offset(mm, address);
+	pte = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
 	if (!pte)
 		goto out;
 
@@ -243,6 +244,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 }
 #else
 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 struct vm_area_struct *vma,
 					 unsigned long address,
 					 unsigned long flags,
 					 unsigned long reason)
@@ -435,7 +437,8 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
 						  reason);
 	else
-		must_wait = userfaultfd_huge_must_wait(ctx, vmf->address,
+		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+						       vmf->address,
 						       vmf->flags, reason);
 	up_read(&mm->mmap_sem);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b857fc8cc2ec..23010a3b2047 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -113,7 +113,8 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
 			unsigned long addr, unsigned long sz);
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
@@ -157,7 +158,7 @@ static inline void hugetlb_show_meminfo(void)
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
 				src_addr, pagep)	({ BUG(); 0; })
-#define huge_pte_offset(mm, address)	0
+#define huge_pte_offset(mm, address, sz)	0
 static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 {
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3eedb187e549..d9f9e4b7381c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3233,7 +3233,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
-		src_pte = huge_pte_offset(src, addr);
+		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
 		dst_pte = huge_pte_alloc(dst, addr, sz);
@@ -3317,7 +3317,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 	for (; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset(mm, address, sz);
 		if (!ptep)
 			continue;
 
@@ -3535,7 +3535,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unmap_ref_private(mm, vma, old_page, address);
 			BUG_ON(huge_pte_none(pte));
 			spin_lock(ptl);
-			ptep = huge_pte_offset(mm, address & huge_page_mask(h));
+			ptep = huge_pte_offset(mm, address & huge_page_mask(h),
+					       huge_page_size(h));
 			if (likely(ptep &&
 				   pte_same(huge_ptep_get(ptep), pte)))
 				goto retry_avoidcopy;
@@ -3574,7 +3575,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * before the page tables are altered
 	 */
 	spin_lock(ptl);
-	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
+	ptep = huge_pte_offset(mm, address & huge_page_mask(h),
+			       huge_page_size(h));
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
 		ClearPagePrivate(new_page);
 
@@ -3861,7 +3863,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	address &= huge_page_mask(h);
 
-	ptep = huge_pte_offset(mm, address);
+	ptep = huge_pte_offset(mm, address, huge_page_size(h));
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
@@ -4118,7 +4120,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 *
 		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
+				      huge_page_size(h));
 		if (pte)
 			ptl = huge_pte_lock(h, mm, pte);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -4257,7 +4260,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address);
+		ptep = huge_pte_offset(mm, address, huge_page_size(h));
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
@@ -4521,7 +4524,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {
-			spte = huge_pte_offset(svma->vm_mm, saddr);
+			spte = huge_pte_offset(svma->vm_mm, saddr,
+					       vma_mmu_pagesize(svma));
 			if (spte) {
 				get_page(virt_to_page(spte));
 				break;
@@ -4617,7 +4621,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm,
+		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index de9c40d7304a..8ec6ba230bb9 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -116,7 +116,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 
 	if (unlikely(PageHuge(pvmw->page))) {
 		/* when pud is not present, pte will be NULL */
-		pvmw->pte = huge_pte_offset(mm, pvmw->address);
+		pvmw->pte = huge_pte_offset(mm, pvmw->address,
+					    PAGE_SIZE << compound_order(page));
 		if (!pvmw->pte)
 			return false;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 60f7856e508f..1a4197965415 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -180,12 +180,13 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	struct hstate *h = hstate_vma(vma);
 	unsigned long next;
 	unsigned long hmask = huge_page_mask(h);
+	unsigned long sz = huge_page_size(h);
 	pte_t *pte;
 	int err = 0;
 
 	do {
 		next = hugetlb_entry_end(h, addr, end);
-		pte = huge_pte_offset(walk->mm, addr & hmask);
+		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
 		if (pte && walk->hugetlb_entry)
 			err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
 		if (err)
-- 
2.11.0

* [PATCH v5 6/8] mm/hugetlb: Allow architectures to override huge_pte_clear()
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (4 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 5/8] mm/hugetlb: add size parameter to huge_pte_offset() Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 7/8] mm/hugetlb: Introduce set_huge_swap_pte_at() helper Punit Agrawal
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC (permalink / raw)
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar, Heiko Carstens

When unmapping a hugepage range, huge_pte_clear() is used to clear the
page table entries that are marked as not present. huge_pte_clear()
internally just ends up calling pte_clear() which does not correctly
deal with hugepages consisting of contiguous page table entries.

Add a size argument to address this issue and allow architectures to
override huge_pte_clear() by wrapping it in a #ifndef block.

Update the s390 implementation with the size parameter as well.

Note that the change only affects huge_pte_clear() - the other generic
hugetlb functions don't need any change.
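
As a rough illustration of what a size-aware override could do, the
userspace sketch below clears every entry backing a contiguous hugepage
instead of only the first one. It is not the arm64 implementation:
pte_t is a stand-in type, the SKETCH_* constants and sketch_* helpers
are hypothetical names, and the entry counts assume 4K base pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pte_t; illustration only. */
typedef struct { uint64_t val; } pte_t;

/* Assumed sizes for a 4K base page configuration. */
#define SKETCH_CONT_PTE_SIZE	(64UL * 1024)		/* 16 contiguous ptes */
#define SKETCH_CONT_PMD_SIZE	(32UL * 1024 * 1024)	/* 16 contiguous pmds */
#define SKETCH_CONT_ENTRIES	16

/* Number of page table entries that back one hugepage of size sz. */
static size_t sketch_num_contig(unsigned long sz)
{
	if (sz == SKETCH_CONT_PTE_SIZE || sz == SKETCH_CONT_PMD_SIZE)
		return SKETCH_CONT_ENTRIES;
	return 1;	/* block mapping: a single pmd or pud entry */
}

/* Clear all entries of the hugepage, not just the one ptep points at. */
static void sketch_huge_pte_clear(pte_t *ptep, unsigned long sz)
{
	size_t i, ncontig = sketch_num_contig(sz);

	assert(ptep != NULL);
	for (i = 0; i < ncontig; i++)
		ptep[i].val = 0;
}
```

Calling sketch_huge_pte_clear(ptep, 64UL * 1024) wipes 16 consecutive
entries, whereas a 2M size wipes a single entry; the generic
pte_clear() based version only ever touches one.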

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/s390/include/asm/hugetlb.h | 2 +-
 include/asm-generic/hugetlb.h   | 4 +++-
 mm/hugetlb.c                    | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cd546a245c68..c0443500baec 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -39,7 +39,7 @@ static inline int prepare_hugepage_range(struct file *file,
 #define arch_clear_hugepage_flags(page)		do { } while (0)
 
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
-				  pte_t *ptep)
+				  pte_t *ptep, unsigned long sz)
 {
 	if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
 		pte_val(*ptep) = _REGION3_ENTRY_EMPTY;
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 99b490b4d05a..540354f94f83 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -31,10 +31,12 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+#ifndef huge_pte_clear
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
-				  pte_t *ptep)
+		    pte_t *ptep, unsigned long sz)
 {
 	pte_clear(mm, addr, ptep);
 }
+#endif
 
 #endif /* _ASM_GENERIC_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d9f9e4b7381c..b20620ff3751 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3338,7 +3338,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 * unmapped and its refcount is dropped, so just clear pte here.
 		 */
 		if (unlikely(!pte_present(pte))) {
-			huge_pte_clear(mm, address, ptep);
+			huge_pte_clear(mm, address, ptep, sz);
 			spin_unlock(ptl);
 			continue;
 		}
-- 
2.11.0

* [PATCH v5 7/8] mm/hugetlb: Introduce set_huge_swap_pte_at() helper
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (5 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 6/8] mm/hugetlb: Allow architectures to override huge_pte_clear() Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 17:01 ` [PATCH v5 8/8] mm: rmap: Use correct helper when poisoning hugepages Punit Agrawal
  2017-06-19 22:01 ` [PATCH v5 0/8] Support for contiguous pte hugepages Andrew Morton
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC (permalink / raw)
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar

set_huge_pte_at(), an architecture callback to populate hugepage ptes,
does not provide the range of virtual memory that is targeted. This
leads to ambiguity when dealing with swap entries on architectures that
support hugepages consisting of contiguous ptes.

Fix the problem by introducing an overridable helper for architectures
needing this support. The helper is called when populating the page
tables with swap entries. The size of the targeted region is passed to
the helper so that it can determine the number of entries to update.

Provide a default implementation that maintains the current behaviour.
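
To make the ambiguity concrete, here is a small userspace sketch that
contrasts a setter with no size information against a size-aware one.
Everything in it is an assumption for illustration: pte_t is a stand-in
type, the sketch_* functions are hypothetical, and the extra entry_sz
argument (the amount of memory mapped by one entry) does not exist in
the helper added by this patch, which takes only sz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pte_t; illustration only. */
typedef struct { uint64_t val; } pte_t;

/* No size information: only the entry that ptep points at is written. */
static void sketch_set_huge_pte_at(pte_t *ptep, pte_t pte)
{
	assert(ptep != NULL);
	ptep[0] = pte;
}

/*
 * Size-aware variant: the same swap entry is written to every slot
 * that backs the hugepage. entry_sz is a made-up parameter giving the
 * size mapped by a single entry (e.g. 4K for a pte).
 */
static void sketch_set_huge_swap_pte_at(pte_t *ptep, pte_t pte,
					unsigned long sz,
					unsigned long entry_sz)
{
	size_t i, ncontig;

	assert(ptep != NULL && entry_sz != 0 && sz >= entry_sz);
	ncontig = sz / entry_sz;
	for (i = 0; i < ncontig; i++)
		ptep[i] = pte;
}
```

With a 64K hugepage built from 4K ptes, the size-aware variant updates
16 entries, so no entry of the hugepage is left holding a stale value.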

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
---
 include/linux/hugetlb.h | 13 +++++++++++++
 mm/hugetlb.c            |  8 +++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 23010a3b2047..af859564509e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -502,6 +502,14 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
 {
 	atomic_long_sub(l, &mm->hugetlb_usage);
 }
+
+#ifndef set_huge_swap_pte_at
+static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned long sz)
+{
+	set_huge_pte_at(mm, addr, ptep, pte);
+}
+#endif
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
@@ -546,6 +554,11 @@ static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
 static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
 {
 }
+
+static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned long sz)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b20620ff3751..2017f3f89ab7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3263,9 +3263,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				 */
 				make_migration_entry_read(&swp_entry);
 				entry = swp_entry_to_pte(swp_entry);
-				set_huge_pte_at(src, addr, src_pte, entry);
+				set_huge_swap_pte_at(src, addr, src_pte,
+						     entry, sz);
 			}
-			set_huge_pte_at(dst, addr, dst_pte, entry);
+			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -4282,7 +4283,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 				make_migration_entry_read(&entry);
 				newpte = swp_entry_to_pte(entry);
-				set_huge_pte_at(mm, address, ptep, newpte);
+				set_huge_swap_pte_at(mm, address, ptep,
+						     newpte, huge_page_size(h));
 				pages++;
 			}
 			spin_unlock(ptl);
-- 
2.11.0

* [PATCH v5 8/8] mm: rmap: Use correct helper when poisoning hugepages
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (6 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 7/8] mm/hugetlb: Introduce set_huge_swap_pte_at() helper Punit Agrawal
@ 2017-06-19 17:01 ` Punit Agrawal
  2017-06-19 22:01 ` [PATCH v5 0/8] Support for contiguous pte hugepages Andrew Morton
  8 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-19 17:01 UTC (permalink / raw)
  To: akpm
  Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar

Using set_pte_at() does not do the right thing when putting down
HWPOISON swap entries for hugepages on architectures that support
contiguous ptes.

Fix this by using set_huge_swap_pte_at(), which was introduced for
exactly this purpose.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/rmap.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..feb2352aa95f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1379,15 +1379,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		update_hiwater_rss(mm);
 
 		if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (PageHuge(page)) {
 				int nr = 1 << compound_order(page);
 				hugetlb_count_sub(nr, mm);
+				set_huge_swap_pte_at(mm, address,
+						     pvmw.pte, pteval,
+						     vma_mmu_pagesize(vma));
 			} else {
 				dec_mm_counter(mm, mm_counter(page));
+				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
 
-			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			set_pte_at(mm, address, pvmw.pte, pteval);
 		} else if (pte_unused(pteval)) {
 			/*
 			 * The guest indicated that the page content is of no
-- 
2.11.0

* Re: [PATCH v5 0/8] Support for contiguous pte hugepages
  2017-06-19 17:01 [PATCH v5 0/8] Support for contiguous pte hugepages Punit Agrawal
                   ` (7 preceding siblings ...)
  2017-06-19 17:01 ` [PATCH v5 8/8] mm: rmap: Use correct helper when poisoning hugepages Punit Agrawal
@ 2017-06-19 22:01 ` Andrew Morton
  2017-06-20 13:39   ` Punit Agrawal
  8 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2017-06-19 22:01 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, n-horiguchi, kirill.shutemov, mike.kravetz,
	steve.capper, mark.rutland, linux-arch, aneesh.kumar

On Mon, 19 Jun 2017 18:01:37 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:

> This is v5 of the patchset to update the hugetlb code to support
> contiguous hugepages. Previous version of the patchset can be found at
> [0].

Dumb question: is there a handy description anywhere which describes
how arm64 implements huge pages?  "contiguous 4k ptes" doesn't sound
like a huge page at all - what's going on here?

* Re: [PATCH v5 0/8] Support for contiguous pte hugepages
  2017-06-19 22:01 ` [PATCH v5 0/8] Support for contiguous pte hugepages Andrew Morton
@ 2017-06-20 13:39   ` Punit Agrawal
  2017-06-20 21:08     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Punit Agrawal @ 2017-06-20 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, n-horiguchi, kirill.shutemov, mike.kravetz,
	steve.capper, mark.rutland, linux-arch, aneesh.kumar

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 19 Jun 2017 18:01:37 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:
>
>> This is v5 of the patchset to update the hugetlb code to support
>> contiguous hugepages. Previous version of the patchset can be found at
>> [0].
>
> Dumb question: is there a handy description anywhere which describes
> how arm64 implements huge pages?  "contiguous 4k ptes" doesn't sound
> like a huge page at all - what's going on here?

Indeed! I should've provided more context with the cover letter.

I couldn't find anything direct to point to, so I am cobbling together
a summary from the commit history[0][1] and the ARM architecture
manual[2].

The architecture supports two flavours of hugepages -

* Block mappings at the pud/pmd level

  These are regular hugepages where a pmd or a pud page table entry
  points to a block of memory. Depending on the PAGE_SIZE in use, the
  following block mapping sizes are supported -

          PMD    PUD
          ---    ---
  4K:      2M     1G
  16K:    32M
  64K:   512M

  For certain use cases, such as HPC and large enterprise workloads,
  folks are using a 64K page size, but the minimum hugepage size of
  512M isn't very practical.

To overcome this ...

* Using the Contiguous bit

  The architecture provides a contiguous bit in the translation table
  entry which acts as a hint to the mmu to indicate that it is one of a
  contiguous set of entries that can be cached in a single TLB entry.

  We use the contiguous bit in Linux to increase the mapping size at the
  pmd and pte (last) level.

  The number of supported contiguous entries varies by page size and
  level of the page table.

  Using the contiguous bit allows additional hugepage sizes -

           CONT PTE    PMD    CONT PMD    PUD
           --------    ---    --------    ---
    4K:         64K     2M         32M     1G
    16K:         2M    32M          1G
    64K:         2M   512M         16G

  Of these, 64K hugepages with 4K pages and 2M hugepages with 64K pages
  have been explicitly requested by a few different users.

Entries with the contiguous bit set must all be modified together.
This makes things like memory poisoning and migration impossible to do
correctly without knowing the size of the hugepage being dealt with,
which is the reason for adding a size parameter to a few of the
hugepage helpers in this series.
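
For anyone wanting to check the arithmetic behind the two tables, a
small C sketch follows. It assumes 8-byte table descriptors with one
table filling exactly one page; the function names are made up for
this example, and the contiguous entry counts are simply inferred by
dividing each CONT size in the table by the size of one entry.

```c
#include <assert.h>
#include <stdint.h>

/* One table fills one page and holds 8-byte descriptors (assumption). */
static uint64_t entries_per_table(uint64_t page_size)
{
	return page_size / 8;
}

/* Memory mapped by a single pmd block entry. */
static uint64_t pmd_block_size(uint64_t page_size)
{
	return page_size * entries_per_table(page_size);
}

/* Memory mapped by a single pud block entry. */
static uint64_t pud_block_size(uint64_t page_size)
{
	return pmd_block_size(page_size) * entries_per_table(page_size);
}

/* Memory mapped by ncontig adjacent entries of entry_size each. */
static uint64_t cont_size(uint64_t entry_size, unsigned int ncontig)
{
	assert(ncontig > 0);
	return entry_size * ncontig;
}
```

With 4K pages this gives 512 entries per table, so 2M per pmd and 1G
per pud. The CONT columns then correspond to 16 ptes / 16 pmds for 4K,
128 ptes / 32 pmds for 16K and 32 ptes / 32 pmds for 64K pages, e.g.
cont_size(pmd_block_size(65536), 32) is 16G.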

Apologies for the length, but I am hoping the context provides
motivation for the changes.

Thanks for pulling the updated version of the patches.

Punit

[0] https://github.com/torvalds/linux/commit/084bd29810a5689e423d2f085255a3200a03a06e
[1] https://github.com/torvalds/linux/commit/66b3923a1a0f77a563b43f43f6ad091354abbfe9
[2] ARM DDI 0487B.a Section D4.3 VMSAv8-64 translation table format
    [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0487b.a/index.html]

* Re: [PATCH v5 0/8] Support for contiguous pte hugepages
  2017-06-20 13:39   ` Punit Agrawal
@ 2017-06-20 21:08     ` Andrew Morton
  2017-06-21 12:32       ` Punit Agrawal
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2017-06-20 21:08 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, n-horiguchi, kirill.shutemov, mike.kravetz,
	steve.capper, mark.rutland, linux-arch, aneesh.kumar

On Tue, 20 Jun 2017 14:39:57 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:

> 
> The architecture supports two flavours of hugepages -
> 
> * Block mappings at the pud/pmd level
> 
>   These are regular hugepages where a pmd or a pud page table entry
>   points to a block of memory. Depending on the PAGE_SIZE in use the
>   following size of block mappings are supported -
> 
>           PMD	PUD
>           ---	---
>   4K:      2M	 1G
>   16K:    32M
>   64K:   512M
> 
>   For certain applications/usecases such as HPC and large enterprise
>   workloads, folks are using 64k page size but the minimum hugepage size
>   of 512MB isn't very practical.
> 
> To overcome this ...
> 
> * Using the Contiguous bit
> 
>   The architecture provides a contiguous bit in the translation table
>   entry which acts as a hint to the mmu to indicate that it is one of a
>   contiguous set of entries that can be cached in a single TLB entry.
> 
>   We use the contiguous bit in Linux to increase the mapping size at the
>   pmd and pte (last) level.
> 
>   The number of supported contiguous entries varies by page size and
>   level of the page table.
> 
>   Using the contiguous bit allows additional hugepage sizes -
> 
>            CONT PTE    PMD    CONT PMD    PUD
>            --------    ---    --------    ---
>     4K:         64K     2M         32M     1G
>     16K:         2M    32M          1G
>     64K:         2M   512M         16G
> 
>   Of these, 64K with 4K and 2M with 64K pages have been explicitly
>   requested by a few different users.
> 
> Entries with the contiguous bit set are required to be modified all
> together - which makes things like memory poisoning and migration
> impossible to do correctly without knowing the size of hugepage being
> dealt with - the reason for adding size parameter to a few of the
> hugepage helpers in this series.
> 

Thanks, I added the above to the 1/n changelog.  Perhaps it's worth
adding something like this to Documentation/vm/hugetlbpage.txt.

* Re: [PATCH v5 0/8] Support for contiguous pte hugepages
  2017-06-20 21:08     ` Andrew Morton
@ 2017-06-21 12:32       ` Punit Agrawal
  0 siblings, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2017-06-21 12:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, n-horiguchi, kirill.shutemov, mike.kravetz,
	steve.capper, mark.rutland, linux-arch, aneesh.kumar

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 20 Jun 2017 14:39:57 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:
>
>> 
>> The architecture supports two flavours of hugepages -
>> 
>> * Block mappings at the pud/pmd level
>> 
>>   These are regular hugepages where a pmd or a pud page table entry
>>   points to a block of memory. Depending on the PAGE_SIZE in use the
>>   following size of block mappings are supported -
>> 
>>           PMD	PUD
>>           ---	---
>>   4K:      2M	 1G
>>   16K:    32M
>>   64K:   512M
>> 
>>   For certain applications/usecases such as HPC and large enterprise
>>   workloads, folks are using 64k page size but the minimum hugepage size
>>   of 512MB isn't very practical.
>> 
>> To overcome this ...
>> 
>> * Using the Contiguous bit
>> 
>>   The architecture provides a contiguous bit in the translation table
>>   entry which acts as a hint to the mmu to indicate that it is one of a
>>   contiguous set of entries that can be cached in a single TLB entry.
>> 
>>   We use the contiguous bit in Linux to increase the mapping size at the
>>   pmd and pte (last) level.
>> 
>>   The number of supported contiguous entries varies by page size and
>>   level of the page table.
>> 
>>   Using the contiguous bit allows additional hugepage sizes -
>> 
>>            CONT PTE    PMD    CONT PMD    PUD
>>            --------    ---    --------    ---
>>     4K:         64K     2M         32M     1G
>>     16K:         2M    32M          1G
>>     64K:         2M   512M         16G
>> 
>>   Of these, 64K with 4K and 2M with 64K pages have been explicitly
>>   requested by a few different users.
>> 
>> Entries with the contiguous bit set are required to be modified all
>> together - which makes things like memory poisoning and migration
>> impossible to do correctly without knowing the size of hugepage being
>> dealt with - the reason for adding size parameter to a few of the
>> hugepage helpers in this series.
>> 
>
> Thanks, I added the above to the 1/n changelog.  Perhaps it's worth
> adding something like this to Documentation/vm/hugetlbpage.txt.

Yes, it would be useful to have this documented.

I'll send a patch once the architecture bits for re-enabling contiguous
hugepages are merged.

Thanks,
Punit
