linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/10] Introduce huge zero page
@ 2012-09-10 13:13 Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

During testing I noticed a big (up to 2.5 times) memory consumption
overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for the big difference is the lack of a zero page in the
THP case: we have to allocate a real huge page even on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        /* 2M-aligned anonymous allocation, so THP can back it */
        assert(posix_memalign((void **)&p, 2 * MB, 200 * MB) == 0);
        /* touch one byte per 4k page: read faults only, no writes */
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.
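
For reference, one way to watch RSS from inside the test program itself
(an illustrative helper, not part of the patch set; it assumes the usual
"VmRSS:" line in /proc/self/status):

#include <stdio.h>
#include <string.h>

/* Return the process' resident set size in kB, as reported by the kernel. */
static long self_rss_kb(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long rss = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "VmRSS:", 6) == 0) {
			sscanf(line + 6, "%ld", &rss);	/* value is in kB */
			break;
		}
	}
	fclose(f);
	return rss;
}

Calling it just before pause() should show roughly the numbers quoted
above.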

v2:
 - Avoid find_vma() if we already have the vma on the stack.
   Suggested by Andrea Arcangeli.
 - Implement refcounting for huge zero page.

Kirill A. Shutemov (10):
  thp: huge zero page: basic preparation
  thp: zap_huge_pmd(): zap huge zero pmd
  thp: copy_huge_pmd(): copy huge zero page
  thp: do_huge_pmd_wp_page(): handle huge zero page
  thp: change_huge_pmd(): keep huge zero page write-protected
  thp: change split_huge_page_pmd() interface
  thp: implement splitting pmd for huge zero page
  thp: setup huge zero page on non-write page fault
  thp: lazy huge zero page allocation
  thp: implement refcounting for huge zero page

 Documentation/vm/transhuge.txt |    4 +-
 arch/x86/kernel/vm86_32.c      |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 include/linux/huge_mm.h        |   14 ++-
 include/linux/mm.h             |    8 +
 mm/huge_memory.c               |  303 ++++++++++++++++++++++++++++++++++++----
 mm/memory.c                    |   11 +--
 mm/mempolicy.c                 |    2 +-
 mm/mprotect.c                  |    2 +-
 mm/mremap.c                    |    2 +-
 mm/pagewalk.c                  |    2 +-
 11 files changed, 301 insertions(+), 51 deletions(-)

-- 
1.7.7.6



* [PATCH v2 01/10] thp: huge zero page: basic preparation
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd Kirill A. Shutemov
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now let's allocate the page on hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() helpers will be used by the following
patches to check whether a pfn/pmd is the huge zero page.
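
For scale: with 4k base pages and 2M PMD-sized huge pages (as on
x86-64), HPAGE_PMD_ORDER is 9, so this boot-time allocation pins 2M
until the lazy-allocation and shrinker patches later in the series. A
quick userspace check of that arithmetic (illustrative only; the x86-64
sizes are assumed):

#include <stdio.h>

int main(void)
{
	/* assumed x86-64 values: 4k base page, 2M PMD-sized huge page */
	unsigned long page_size = 4096, hpage_size = 2UL * 1024 * 1024;
	unsigned int order = 0;

	while ((page_size << order) < hpage_size)
		order++;

	printf("HPAGE_PMD_ORDER = %u, huge zero page pins %lu kB\n",
	       order, hpage_size / 1024);
	return 0;
}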

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..88e0a7a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -167,6 +168,28 @@ out:
 	return err;
 }
 
+static int init_huge_zero_page(void)
+{
+	struct page *hpage;
+
+	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+	if (!hpage)
+		return -ENOMEM;
+
+	huge_zero_pfn = page_to_pfn(hpage);
+	return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+	return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
 #ifdef CONFIG_SYSFS
 
 static ssize_t double_flag_show(struct kobject *kobj,
@@ -550,6 +573,10 @@ static int __init hugepage_init(void)
 	if (err)
 		return err;
 
+	err = init_huge_zero_page();
+	if (err)
+		goto out;
+
 	err = khugepaged_slab_init();
 	if (err)
 		goto out;
@@ -574,6 +601,8 @@ static int __init hugepage_init(void)
 
 	return 0;
 out:
+	if (huge_zero_pfn)
+		__free_page(pfn_to_page(huge_zero_pfn));
 	hugepage_exit_sysfs(hugepage_kobj);
 	return err;
 }
-- 
1.7.7.6



* [PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page Kirill A. Shutemov
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

There is no real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88e0a7a..9dcb9e6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1071,16 +1071,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page;
 		pgtable_t pgtable;
 		pgtable = get_pmd_huge_pte(tlb->mm);
-		page = pmd_page(*pmd);
-		pmd_clear(pmd);
-		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-		page_remove_rmap(page);
-		VM_BUG_ON(page_mapcount(page) < 0);
-		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-		VM_BUG_ON(!PageHead(page));
-		tlb->mm->nr_ptes--;
-		spin_unlock(&tlb->mm->page_table_lock);
-		tlb_remove_page(tlb, page);
+		if (is_huge_zero_pmd(*pmd)) {
+			pmd_clear(pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+			tlb->mm->nr_ptes--;
+			spin_unlock(&tlb->mm->page_table_lock);
+		} else {
+			page = pmd_page(*pmd);
+			pmd_clear(pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			VM_BUG_ON(!PageHead(page));
+			tlb->mm->nr_ptes--;
+			spin_unlock(&tlb->mm->page_table_lock);
+			tlb_remove_page(tlb, page);
+		}
 		pte_free(tlb->mm, pgtable);
 		ret = 1;
 	}
-- 
1.7.7.6



* [PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle " Kirill A. Shutemov
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's easy to copy the huge zero page: just set the destination pmd to
the huge zero page.

It's safe to add this now since we don't map the huge zero page
anywhere yet :-p

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dcb9e6..a534f84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,18 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+	pmd_t entry;
+	entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+	entry = pmd_wrprotect(entry);
+	entry = pmd_mkhuge(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes++;
+}
+
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
@@ -802,6 +814,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
+	if (is_huge_zero_pmd(pmd)) {
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+		ret = 0;
+		goto out_unlock;
+	}
 	if (unlikely(pmd_trans_splitting(pmd))) {
 		/* split huge page running from under us */
 		spin_unlock(&src_mm->page_table_lock);
-- 
1.7.7.6



* [PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected Kirill A. Shutemov
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

On write access to the huge zero page we allocate a new huge page and
clear it.

In the fallback path we create a new page table and point the pte around
the fault address to the newly allocated page. All other ptes are set to
the normal (4k) zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    8 ++++
 mm/huge_memory.c   |  102 ++++++++++++++++++++++++++++++++++++++++++++--------
 mm/memory.c        |    7 ----
 3 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..179a41c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 }
 #endif
 
+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+	extern unsigned long zero_pfn;
+	return zero_pfn;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a534f84..f5029d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -867,6 +867,61 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned long haddr)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	struct page *page;
+	int i, ret = 0;
+
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!page) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+		put_page(page);
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	clear_user_highpage(page, address);
+	__SetPageUptodate(page);
+
+	spin_lock(&mm->page_table_lock);
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		if (haddr == (address & PAGE_MASK)) {
+			entry = mk_pte(page, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			page_add_new_anon_rmap(page, vma, haddr);
+		} else {
+			entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+			entry = pte_mkspecial(entry);
+		}
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	spin_unlock(&mm->page_table_lock);
+
+	ret |= VM_FAULT_WRITE;
+out:
+	return ret;
+}
+
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
@@ -964,17 +1019,19 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
-	struct page *page, *new_page;
+	struct page *page = NULL, *new_page;
 	unsigned long haddr;
 
 	VM_BUG_ON(!vma->anon_vma);
+	haddr = address & HPAGE_PMD_MASK;
+	if (is_huge_zero_pmd(orig_pmd))
+		goto alloc;
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_unlock;
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
-	haddr = address & HPAGE_PMD_MASK;
 	if (page_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
@@ -986,7 +1043,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-
+alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -996,28 +1053,39 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (unlikely(!new_page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
-		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
-						   pmd, orig_pmd, page, haddr);
-		if (ret & VM_FAULT_OOM)
-			split_huge_page(page);
-		put_page(page);
+		if (is_huge_zero_pmd(orig_pmd)) {
+			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+					address, pmd, haddr);
+		} else {
+			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+					pmd, orig_pmd, page, haddr);
+			if (ret & VM_FAULT_OOM)
+				split_huge_page(page);
+			put_page(page);
+		}
 		goto out;
 	}
 	count_vm_event(THP_FAULT_ALLOC);
 
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
-		split_huge_page(page);
-		put_page(page);
+		if (page) {
+			split_huge_page(page);
+			put_page(page);
+		}
 		ret |= VM_FAULT_OOM;
 		goto out;
 	}
 
-	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+	if (is_huge_zero_pmd(orig_pmd))
+		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+	else
+		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
 	spin_lock(&mm->page_table_lock);
-	put_page(page);
+	if (page)
+		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(&mm->page_table_lock);
 		mem_cgroup_uncharge_page(new_page);
@@ -1025,7 +1093,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out;
 	} else {
 		pmd_t entry;
-		VM_BUG_ON(!PageHead(page));
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		entry = pmd_mkhuge(entry);
@@ -1033,8 +1100,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, entry);
-		page_remove_rmap(page);
-		put_page(page);
+		if (is_huge_zero_pmd(orig_pmd))
+			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		if (page) {
+			VM_BUG_ON(!PageHead(page));
+			page_remove_rmap(page);
+			put_page(page);
+		}
 		ret |= VM_FAULT_WRITE;
 	}
 out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dbd92ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
 }
 #endif
 
-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
-	return zero_pfn;
-}
-#endif
-
 /*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
-- 
1.7.7.6



* [PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle " Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 06/10] thp: change split_huge_page_pmd() interface Kirill A. Shutemov
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We want to get a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5029d4..4001f1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1248,6 +1248,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
+		if (is_huge_zero_pmd(entry))
+			entry = pmd_wrprotect(entry);
 		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
-- 
1.7.7.6



* [PATCH v2 06/10] thp: change split_huge_page_pmd() interface
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 07/10] thp: implement splitting pmd for huge zero page Kirill A. Shutemov
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Pass the vma instead of the mm, and add an address parameter.

In most cases we already have the vma on the stack.
split_huge_page_pmd_mm() is provided for the few cases where we have the
mm but not the vma.

This change is preparation for the huge zero pmd splitting
implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |    4 ++--
 arch/x86/kernel/vm86_32.c      |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 include/linux/huge_mm.h        |   14 ++++++++++----
 mm/huge_memory.c               |   24 +++++++++++++++++++-----
 mm/memory.c                    |    4 ++--
 mm/mempolicy.c                 |    2 +-
 mm/mprotect.c                  |    2 +-
 mm/mremap.c                    |    2 +-
 mm/pagewalk.c                  |    2 +-
 10 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
 by just grepping for "pmd_offset" and adding split_huge_page_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(mm, pmd);
++	split_huge_page_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 54abcc0..22840bb 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
-	split_huge_page_pmd(mm, pmd);
+	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..766d5d7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,7 +597,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	split_huge_page_pmd(walk->mm, pmd);
+	split_huge_page_pmd(vma, addr, pmd);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..c68e073 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,12 +92,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
 			    struct vm_area_struct *vma, unsigned long address,
 			    pte_t *pte, pmd_t *pmd, unsigned int flags);
 extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd)				\
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd)			\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
-			__split_huge_page_pmd(__mm, ____pmd);		\
+			__split_huge_page_pmd(__vma, __address,		\
+					____pmd);			\
 	}  while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
@@ -107,6 +109,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd);
 #if HPAGE_PMD_ORDER > MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -174,10 +178,12 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define split_huge_page_pmd(__mm, __pmd)	\
+#define split_huge_page_pmd(__vma, __address, __pmd)	\
 	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
+	do { } while (0)
 #define compound_trans_head(page) compound_head(page)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4001f1a..48ecc46 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2503,19 +2503,23 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd)
 {
 	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
 
-	spin_lock(&mm->page_table_lock);
+	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+	spin_lock(&vma->vm_mm->page_table_lock);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&vma->vm_mm->page_table_lock);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!page_count(page));
 	get_page(page);
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(&vma->vm_mm->page_table_lock);
 
 	split_huge_page(page);
 
@@ -2523,6 +2527,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 	BUG_ON(pmd_trans_huge(*pmd));
 }
 
+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	BUG_ON(vma == NULL);
+	split_huge_page_pmd(vma, address, pmd);
+}
+
 static void split_huge_page_address(struct mm_struct *mm,
 				    unsigned long address)
 {
@@ -2547,7 +2561,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	split_huge_page_pmd(mm, pmd);
+	split_huge_page_pmd_mm(mm, address, pmd);
 }
 
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index dbd92ba..312c21d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1236,7 +1236,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 					BUG();
 				}
 #endif
-				split_huge_page_pmd(vma->vm_mm, pmd);
+				split_huge_page_pmd(vma, addr, pmd);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
@@ -1505,7 +1505,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	}
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
-			split_huge_page_pmd(mm, pmd);
+			split_huge_page_pmd(vma, address, pmd);
 			goto split_fallthrough;
 		}
 		spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ada3be..55ac3b6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
+		split_huge_page_pmd(vma, addr, pmd);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
-				split_huge_page_pmd(vma->vm_mm, pmd);
+				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot))
 				continue;
 			/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index cc06d0e..292ec46 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -156,7 +156,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				need_flush = true;
 				continue;
 			} else if (!err) {
-				split_huge_page_pmd(vma->vm_mm, old_pmd);
+				split_huge_page_pmd(vma, old_addr, old_pmd);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
 		if (!walk->pte_entry)
 			continue;
 
-		split_huge_page_pmd(walk->mm, pmd);
+		split_huge_page_pmd_mm(walk->mm, addr, pmd);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
1.7.7.6



* [PATCH v2 07/10] thp: implement splitting pmd for huge zero page
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 06/10] thp: change split_huge_page_pmd() interface Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 08/10] thp: setup huge zero page on non-write page fault Kirill A. Shutemov
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We can't split the huge zero page itself, but we can split the pmd which
points to it.

On splitting such a pmd we create a page table with all ptes set to the
normal (4k) zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 48ecc46..995894f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1599,6 +1599,7 @@ int split_huge_page(struct page *page)
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
 	BUG_ON(!PageAnon(page));
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
 	return 0;
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(vma->vm_mm);
+	pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(vma->vm_mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 		pmd_t *pmd)
 {
@@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return;
 	}
+	if (is_huge_zero_pmd(*pmd)) {
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		return;
+	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!page_count(page));
 	get_page(page);
-- 
1.7.7.6



* [PATCH v2 08/10] thp: setup huge zero page on non-write page fault
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 07/10] thp: implement splitting pmd for huge zero page Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 09/10] thp: lazy huge zero page allocation Kirill A. Shutemov
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

All code paths seem to be covered now. We can map the huge zero page on
read page faults.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 995894f..c788445 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -750,6 +750,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
+		if (!(flags & FAULT_FLAG_WRITE)) {
+			pgtable_t pgtable;
+			pgtable = pte_alloc_one(mm, haddr);
+			if (unlikely(!pgtable))
+				goto out;
+			spin_lock(&mm->page_table_lock);
+			set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+			spin_unlock(&mm->page_table_lock);
+			return 0;
+		}
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 					  vma, haddr, numa_node_id(), 0);
 		if (unlikely(!page)) {
-- 
1.7.7.6



* [PATCH v2 09/10] thp: lazy huge zero page allocation
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 08/10] thp: setup huge zero page on non-write page fault Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 13:13 ` [PATCH v2 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
  2012-09-12 10:07 ` [PATCH v3 " Kirill A. Shutemov
  10 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Instead of allocating the huge zero page in hugepage_init(), we can
postpone it until the huge zero page is mapped for the first time. This
saves memory if THP is never used.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
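
The pattern is the usual lockless init-once idiom: racing threads may
each allocate, cmpxchg() picks a single winner, and losers free their
copy. A userspace sketch of the same idea (illustrative only; C11
atomics stand in for the kernel's cmpxchg() and get_shared() is a
made-up name):

#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(void *) shared;	/* plays the role of huge_zero_pfn */

/* Return the single shared buffer, allocating it on first use. */
static void *get_shared(size_t size)
{
	void *cur = atomic_load(&shared);
	void *mine, *expected;

	if (cur)
		return cur;

	mine = calloc(1, size);
	if (!mine)
		return NULL;

	expected = NULL;
	if (atomic_compare_exchange_strong(&shared, &expected, mine))
		return mine;		/* we won the race */

	free(mine);			/* someone else installed theirs first */
	return expected;		/* CAS wrote the winner's pointer back */
}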

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c788445..0981b09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -168,21 +168,23 @@ out:
 	return err;
 }
 
-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
 {
 	struct page *hpage;
+	unsigned long pfn;
 
 	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
 	if (!hpage)
 		return -ENOMEM;
-
-	huge_zero_pfn = page_to_pfn(hpage);
+	pfn = page_to_pfn(hpage);
+	if (cmpxchg(&huge_zero_pfn, 0, pfn))
+		__free_page(hpage);
 	return 0;
 }
 
 static inline bool is_huge_zero_pfn(unsigned long pfn)
 {
-	return pfn == huge_zero_pfn;
+	return huge_zero_pfn && pfn == huge_zero_pfn;
 }
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -573,10 +575,6 @@ static int __init hugepage_init(void)
 	if (err)
 		return err;
 
-	err = init_huge_zero_page();
-	if (err)
-		goto out;
-
 	err = khugepaged_slab_init();
 	if (err)
 		goto out;
@@ -601,8 +599,6 @@ static int __init hugepage_init(void)
 
 	return 0;
 out:
-	if (huge_zero_pfn)
-		__free_page(pfn_to_page(huge_zero_pfn));
 	hugepage_exit_sysfs(hugepage_kobj);
 	return err;
 }
@@ -752,6 +748,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (!(flags & FAULT_FLAG_WRITE)) {
 			pgtable_t pgtable;
+			if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+				count_vm_event(THP_FAULT_FALLBACK);
+				goto out;
+			}
 			pgtable = pte_alloc_one(mm, haddr);
 			if (unlikely(!pgtable))
 				goto out;
-- 
1.7.7.6



* [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 09/10] thp: lazy huge zero page allocation Kirill A. Shutemov
@ 2012-09-10 13:13 ` Kirill A. Shutemov
  2012-09-10 14:02   ` Eric Dumazet
  2012-09-12 10:07 ` [PATCH v3 " Kirill A. Shutemov
  10 siblings, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

H. Peter Anvin doesn't like a huge zero page which sticks in memory
forever after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We
free the page only in the shrinker callback if the counter is 1 (only
the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never
reaches zero in put_huge_zero_page() since the shrinker holds one
reference.

Freeing the huge zero page from the shrinker callback helps to avoid
frequent allocate/free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown
on parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.
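
A compressed userspace model of the counting rule above (illustrative
only; the names and C11 atomics are stand-ins for the kernel code, and
it deliberately mirrors the window between cmpxchg() and atomic_set()
discussed in the replies below): the count always includes one reference
owned by the shrinker, so put never reaches zero, and only the
shrinker's 1->0 transition frees the page.

#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int refcount;		/* 0 means "no zero page exists" */
static _Atomic(void *) obj;		/* stands in for the 2M zero page */

static void *get_ref(void)
{
	void *new_obj, *expected;
	int old;

retry:
	old = atomic_load(&refcount);
	while (old != 0)		/* like atomic_inc_not_zero() */
		if (atomic_compare_exchange_weak(&refcount, &old, old + 1))
			return atomic_load(&obj);

	new_obj = calloc(1, 4096);
	if (!new_obj)
		return NULL;
	expected = NULL;
	if (!atomic_compare_exchange_strong(&obj, &expected, new_obj)) {
		free(new_obj);		/* lost the allocation race */
		goto retry;
	}
	/* one reference for the caller, one kept back for the shrinker */
	atomic_store(&refcount, 2);
	return new_obj;
}

static void put_ref(void)
{
	/* never the last reference: the shrinker always holds one */
	assert(atomic_fetch_sub(&refcount, 1) > 1);
}

static void shrinker_scan(void)
{
	int one = 1;
	/* free only when the shrinker's reference is the last one left */
	if (atomic_compare_exchange_strong(&refcount, &one, 0))
		free(atomic_exchange(&obj, NULL));
}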

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  108 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 84 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0981b09..fa740fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/shrinker.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -168,23 +168,13 @@ out:
 	return err;
 }
 
-static int init_huge_zero_pfn(void)
-{
-	struct page *hpage;
-	unsigned long pfn;
-
-	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
-	if (!hpage)
-		return -ENOMEM;
-	pfn = page_to_pfn(hpage);
-	if (cmpxchg(&huge_zero_pfn, 0, pfn))
-		__free_page(hpage);
-	return 0;
-}
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
 
 static inline bool is_huge_zero_pfn(unsigned long pfn)
 {
-	return huge_zero_pfn && pfn == huge_zero_pfn;
+	unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+	return zero_pfn && pfn == zero_pfn;
 }
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -192,6 +182,56 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return is_huge_zero_pfn(pmd_pfn(pmd));
 }
 
+static unsigned long get_huge_zero_page(void)
+{
+	struct page *zero_page;
+retry:
+	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+		return ACCESS_ONCE(huge_zero_pfn);
+
+	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+	if (!zero_page)
+		return 0;
+	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+		__free_page(zero_page);
+		goto retry;
+	}
+
+	/* We take additional reference here. It will be put back by shinker */
+	atomic_set(&huge_zero_refcount, 2);
+	return ACCESS_ONCE(huge_zero_pfn);
+}
+
+static void put_huge_zero_page(void)
+{
+	/*
+	 * Counter should never go to zero here. Only shrinker can put
+	 * last reference.
+	 */
+	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
+}
+
+static int shrink_huge_zero_page(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	if (!sc->nr_to_scan)
+		/* we can free zero page only if last reference remains */
+		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+		unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+		BUG_ON(zero_pfn == 0);
+		__free_page(__pfn_to_page(zero_pfn));
+	}
+
+	return 0;
+}
+
+static struct shrinker huge_zero_page_shrinker = {
+	.shrink = shrink_huge_zero_page,
+	.seeks = DEFAULT_SEEKS,
+};
+
 #ifdef CONFIG_SYSFS
 
 static ssize_t double_flag_show(struct kobject *kobj,
@@ -585,6 +625,8 @@ static int __init hugepage_init(void)
 		goto out;
 	}
 
+	register_shrinker(&huge_zero_page_shrinker);
+
 	/*
 	 * By default disable transparent hugepages on smaller systems,
 	 * where the extra memory used could hurt more than TLB overhead
@@ -722,10 +764,11 @@ static inline struct page *alloc_hugepage(int defrag)
 #endif
 
 static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+		unsigned long zero_pfn)
 {
 	pmd_t entry;
-	entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+	entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	set_pmd_at(mm, haddr, pmd, entry);
@@ -748,15 +791,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (!(flags & FAULT_FLAG_WRITE)) {
 			pgtable_t pgtable;
-			if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
-				count_vm_event(THP_FAULT_FALLBACK);
-				goto out;
-			}
+			unsigned long zero_pfn;
 			pgtable = pte_alloc_one(mm, haddr);
 			if (unlikely(!pgtable))
 				goto out;
+			zero_pfn = get_huge_zero_page();
+			if (unlikely(!zero_pfn)) {
+				pte_free(mm, pgtable);
+				count_vm_event(THP_FAULT_FALLBACK);
+				goto out;
+			}
 			spin_lock(&mm->page_table_lock);
-			set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+			set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+					zero_pfn);
 			spin_unlock(&mm->page_table_lock);
 			return 0;
 		}
@@ -825,7 +872,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 	if (is_huge_zero_pmd(pmd)) {
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+		unsigned long zero_pfn;
+		/*
+		 * get_huge_zero_page() will never allocate a new page here,
+		 * since we already have a zero page to copy. It just takes a
+		 * reference.
+		 */
+		zero_pfn = get_huge_zero_page();
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+				zero_pfn);
 		ret = 0;
 		goto out_unlock;
 	}
@@ -926,6 +981,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 	spin_unlock(&mm->page_table_lock);
+	put_huge_zero_page();
 
 	ret |= VM_FAULT_WRITE;
 out:
@@ -1110,8 +1166,10 @@ alloc:
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, entry);
-		if (is_huge_zero_pmd(orig_pmd))
+		if (is_huge_zero_pmd(orig_pmd)) {
 			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			put_huge_zero_page();
+		}
 		if (page) {
 			VM_BUG_ON(!PageHead(page));
 			page_remove_rmap(page);
@@ -1175,6 +1233,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
+			put_huge_zero_page();
 		} else {
 			page = pmd_page(*pmd);
 			pmd_clear(pmd);
@@ -2538,6 +2597,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(vma->vm_mm, pmd, pgtable);
+	put_huge_zero_page();
 }
 
 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-- 
1.7.7.6



* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 13:13 ` [PATCH v2 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
@ 2012-09-10 14:02   ` Eric Dumazet
  2012-09-10 14:44     ` Kirill A. Shutemov
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2012-09-10 14:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel, Kirill A. Shutemov

On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> after the first allocation. Here's implementation of lockless refcounting
> for huge zero page.
> 
...

> +static unsigned long get_huge_zero_page(void)
> +{
> +	struct page *zero_page;
> +retry:
> +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> +		return ACCESS_ONCE(huge_zero_pfn);
> +
> +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> +	if (!zero_page)
> +		return 0;
> +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> +		__free_page(zero_page);
> +		goto retry;
> +	}

This might break if preemption can happen here?

The second thread might loop forever because huge_zero_refcount is 0
and huge_zero_pfn is not zero.

If preemption is already disabled, a comment would be nice.


> +
> +	/* We take additional reference here. It will be put back by shinker */

typo : shrinker

> +	atomic_set(&huge_zero_refcount, 2);
> +	return ACCESS_ONCE(huge_zero_pfn);
> +}
> +





* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 14:02   ` Eric Dumazet
@ 2012-09-10 14:44     ` Kirill A. Shutemov
  2012-09-10 14:48       ` Eric Dumazet
  2012-09-10 14:57       ` Eric Dumazet
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 14:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel, Kirill A. Shutemov


On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > after the first allocation. Here's implementation of lockless refcounting
> > for huge zero page.
> > 
> ...
> 
> > +static unsigned long get_huge_zero_page(void)
> > +{
> > +	struct page *zero_page;
> > +retry:
> > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > +		return ACCESS_ONCE(huge_zero_pfn);
> > +
> > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > +	if (!zero_page)
> > +		return 0;
> > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > +		__free_page(zero_page);
> > +		goto retry;
> > +	}
> 
> This might break if preemption can happen here ?
> 
> The second thread might loop forever because huge_zero_refcount is 0,
> and huge_zero_pfn not zero.

I fail to see why the second thread might loop forever. A long time,
yes, but forever?

Yes, disabling preemption before alloc_pages() and enabling it after
atomic_set() looks reasonable. Thanks.

> 
> If preemption already disabled, a comment would be nice.
> 
> 
> > +
> > +	/* We take additional reference here. It will be put back by shinker */
> 
> typo : shrinker

Thx.

> > +	atomic_set(&huge_zero_refcount, 2);
> > +	return ACCESS_ONCE(huge_zero_pfn);
> > +}
> > +
> 
> 
> 

-- 
 Kirill A. Shutemov



* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 14:44     ` Kirill A. Shutemov
@ 2012-09-10 14:48       ` Eric Dumazet
  2012-09-10 14:50         ` Kirill A. Shutemov
  2012-09-10 14:57       ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2012-09-10 14:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel, Kirill A. Shutemov

On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > after the first allocation. Here's implementation of lockless refcounting
> > > for huge zero page.
> > > 
> > ...
> > 
> > > +static unsigned long get_huge_zero_page(void)
> > > +{
> > > +	struct page *zero_page;
> > > +retry:
> > > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > +		return ACCESS_ONCE(huge_zero_pfn);
> > > +
> > > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > +	if (!zero_page)
> > > +		return 0;
> > > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > +		__free_page(zero_page);
> > > +		goto retry;
> > > +	}
> > 
> > This might break if preemption can happen here ?
> > 
> > The second thread might loop forever because huge_zero_refcount is 0,
> > and huge_zero_pfn not zero.
> 
> I fail to see why the second thread might loop forever. Long time yes, but
> forever?
> 
> Yes, disabling preemption before alloc_pages() and enabling after
> atomic_set() looks reasonable. Thanks.

If you have only one online CPU, and the second thread is real-time or
something like that, it won't give the CPU back to the preempted thread.





* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 14:48       ` Eric Dumazet
@ 2012-09-10 14:50         ` Kirill A. Shutemov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 14:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, linux-mm,
	Andi Kleen, H. Peter Anvin, linux-kernel

On Mon, Sep 10, 2012 at 04:48:07PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > > after the first allocation. Here's implementation of lockless refcounting
> > > > for huge zero page.
> > > > 
> > > ...
> > > 
> > > > +static unsigned long get_huge_zero_page(void)
> > > > +{
> > > > +	struct page *zero_page;
> > > > +retry:
> > > > +	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > > +		return ACCESS_ONCE(huge_zero_pfn);
> > > > +
> > > > +	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > > +	if (!zero_page)
> > > > +		return 0;
> > > > +	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > > +		__free_page(zero_page);
> > > > +		goto retry;
> > > > +	}
> > > 
> > > This might break if preemption can happen here ?
> > > 
> > > The second thread might loop forever because huge_zero_refcount is 0,
> > > and huge_zero_pfn not zero.
> > 
> > I fail to see why the second thread might loop forever. Long time yes, but
> > forever?
> > 
> > Yes, disabling preemption before alloc_pages() and enabling after
> > atomic_set() looks reasonable. Thanks.
> 
> If you have one online cpu, and the second thread is real time or
> something like that, it wont give cpu back to preempted thread.

Okay, I see. I'll update the patch.

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 14:44     ` Kirill A. Shutemov
  2012-09-10 14:48       ` Eric Dumazet
@ 2012-09-10 14:57       ` Eric Dumazet
  2012-09-10 15:07         ` Kirill A. Shutemov
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2012-09-10 14:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel, Kirill A. Shutemov

On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:


> Yes, disabling preemption before alloc_pages() and enabling after
> atomic_set() looks reasonable. Thanks.

In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER)
might sleep, it would be better to disable preemption after calling it:

zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
if (!zero_page)
	return 0;
preempt_disable();
if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
	preempt_enable();
	__free_page(zero_page);
	goto retry;
}
atomic_set(&huge_zero_refcount, 2);
preempt_enable();




* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
  2012-09-10 14:57       ` Eric Dumazet
@ 2012-09-10 15:07         ` Kirill A. Shutemov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-10 15:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, linux-mm,
	Andi Kleen, H. Peter Anvin, linux-kernel

On Mon, Sep 10, 2012 at 04:57:59PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> 
> 
> > Yes, disabling preemption before alloc_pages() and enabling after
> > atomic_set() looks reasonable. Thanks.
> 
> In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> might sleep, it would be better to disable preemption after calling it :

Yeah, I'd already thought about that. :)

> zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> if (!zero_page)
> 	return 0;
> preempt_disable();
> if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> 	preempt_enable();
> 	__free_page(zero_page);
> 	goto retry;
> }
> atomic_set(&huge_zero_refcount, 2);
> preempt_enable();
> 
> 

-- 
 Kirill A. Shutemov


* [PATCH v3 10/10] thp: implement refcounting for huge zero page
  2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2012-09-10 13:13 ` [PATCH v2 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
@ 2012-09-12 10:07 ` Kirill A. Shutemov
  2012-09-13 17:16   ` Andrea Arcangeli
  10 siblings, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-12 10:07 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, linux-mm
  Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

H. Peter Anvin doesn't like a huge zero page which sticks in memory
forever after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We
free the page only in the shrinker callback if the counter is 1 (only
the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never
reaches zero in put_huge_zero_page() since the shrinker holds one
reference.

Freeing the huge zero page from the shrinker callback helps to avoid
frequent allocate/free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown
on parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  111 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0981b09..23d9634 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/shrinker.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -168,23 +168,13 @@ out:
 	return err;
 }
 
-static int init_huge_zero_pfn(void)
-{
-	struct page *hpage;
-	unsigned long pfn;
-
-	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
-	if (!hpage)
-		return -ENOMEM;
-	pfn = page_to_pfn(hpage);
-	if (cmpxchg(&huge_zero_pfn, 0, pfn))
-		__free_page(hpage);
-	return 0;
-}
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
 
 static inline bool is_huge_zero_pfn(unsigned long pfn)
 {
-	return huge_zero_pfn && pfn == huge_zero_pfn;
+	unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+	return zero_pfn && pfn == zero_pfn;
 }
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -192,6 +182,59 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return is_huge_zero_pfn(pmd_pfn(pmd));
 }
 
+static unsigned long get_huge_zero_page(void)
+{
+	struct page *zero_page;
+retry:
+	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+		return ACCESS_ONCE(huge_zero_pfn);
+
+	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+	if (!zero_page)
+		return 0;
+	preempt_disable();
+	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+		preempt_enable();
+		__free_page(zero_page);
+		goto retry;
+	}
+
+	/* We take additional reference here. It will be put back by shrinker */
+	atomic_set(&huge_zero_refcount, 2);
+	preempt_enable();
+	return ACCESS_ONCE(huge_zero_pfn);
+}
+
+static void put_huge_zero_page(void)
+{
+	/*
+	 * Counter should never go to zero here. Only shrinker can put
+	 * last reference.
+	 */
+	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
+}
+
+static int shrink_huge_zero_page(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	if (!sc->nr_to_scan)
+		/* we can free zero page only if last reference remains */
+		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+		unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+		BUG_ON(zero_pfn == 0);
+		__free_page(__pfn_to_page(zero_pfn));
+	}
+
+	return 0;
+}
+
+static struct shrinker huge_zero_page_shrinker = {
+	.shrink = shrink_huge_zero_page,
+	.seeks = DEFAULT_SEEKS,
+};
+
 #ifdef CONFIG_SYSFS
 
 static ssize_t double_flag_show(struct kobject *kobj,
@@ -585,6 +628,8 @@ static int __init hugepage_init(void)
 		goto out;
 	}
 
+	register_shrinker(&huge_zero_page_shrinker);
+
 	/*
 	 * By default disable transparent hugepages on smaller systems,
 	 * where the extra memory used could hurt more than TLB overhead
@@ -722,10 +767,11 @@ static inline struct page *alloc_hugepage(int defrag)
 #endif
 
 static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+		unsigned long zero_pfn)
 {
 	pmd_t entry;
-	entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+	entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	set_pmd_at(mm, haddr, pmd, entry);
@@ -748,15 +794,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		if (!(flags & FAULT_FLAG_WRITE)) {
 			pgtable_t pgtable;
-			if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
-				count_vm_event(THP_FAULT_FALLBACK);
-				goto out;
-			}
+			unsigned long zero_pfn;
 			pgtable = pte_alloc_one(mm, haddr);
 			if (unlikely(!pgtable))
 				goto out;
+			zero_pfn = get_huge_zero_page();
+			if (unlikely(!zero_pfn)) {
+				pte_free(mm, pgtable);
+				count_vm_event(THP_FAULT_FALLBACK);
+				goto out;
+			}
 			spin_lock(&mm->page_table_lock);
-			set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+			set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+					zero_pfn);
 			spin_unlock(&mm->page_table_lock);
 			return 0;
 		}
@@ -825,7 +875,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 	if (is_huge_zero_pmd(pmd)) {
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+		unsigned long zero_pfn;
+		/*
+		 * get_huge_zero_page() will never allocate a new page here,
+		 * since we already have a zero page to copy. It just takes a
+		 * reference.
+		 */
+		zero_pfn = get_huge_zero_page();
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+				zero_pfn);
 		ret = 0;
 		goto out_unlock;
 	}
@@ -926,6 +984,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 	spin_unlock(&mm->page_table_lock);
+	put_huge_zero_page();
 
 	ret |= VM_FAULT_WRITE;
 out:
@@ -1110,8 +1169,10 @@ alloc:
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, entry);
-		if (is_huge_zero_pmd(orig_pmd))
+		if (is_huge_zero_pmd(orig_pmd)) {
 			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			put_huge_zero_page();
+		}
 		if (page) {
 			VM_BUG_ON(!PageHead(page));
 			page_remove_rmap(page);
@@ -1175,6 +1236,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
+			put_huge_zero_page();
 		} else {
 			page = pmd_page(*pmd);
 			pmd_clear(pmd);
@@ -2538,6 +2600,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(vma->vm_mm, pmd, pgtable);
+	put_huge_zero_page();
 }
 
 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
  2012-09-12 10:07 ` [PATCH v3 " Kirill A. Shutemov
@ 2012-09-13 17:16   ` Andrea Arcangeli
  2012-09-13 17:37     ` Kirill A. Shutemov
  0 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2012-09-13 17:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Andi Kleen, H. Peter Anvin,
	linux-kernel, Kirill A. Shutemov

Hi Kirill,

On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);

The page is likely as large as a pageblock, so it's unlikely to create
much fragmentation even if __GFP_MOVABLE is set. That said, I guess it
would be more correct to clear __GFP_MOVABLE, i.e.
(GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
really movable (it's only reclaimable).
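
I.e. something along these lines (untested):

	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
				HPAGE_PMD_ORDER);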

The xchg vs cmpxchg locking also looks good.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
  2012-09-13 17:16   ` Andrea Arcangeli
@ 2012-09-13 17:37     ` Kirill A. Shutemov
  2012-09-13 21:17       ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2012-09-13 17:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel

On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> 
> The page is likely as large as a pageblock, so it's unlikely to create
> much fragmentation even if __GFP_MOVABLE is set. That said, I guess it
> would be more correct to clear __GFP_MOVABLE, i.e.
> (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
> really movable (it's only reclaimable).

Good point. I'll update the patchset.

> The xchg vs cmpxchg locking also looks good.
> 
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Is it for the whole patchset? :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
  2012-09-13 17:37     ` Kirill A. Shutemov
@ 2012-09-13 21:17       ` Andrea Arcangeli
  0 siblings, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2012-09-13 21:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
	H. Peter Anvin, linux-kernel

Hi Kirill,

On Thu, Sep 13, 2012 at 08:37:58PM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> > Hi Kirill,
> > 
> > On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > > -	hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > 
> > The page is likely as large as a pageblock, so it's unlikely to create
> > much fragmentation even if __GFP_MOVABLE is set. That said, I guess it
> > would be more correct to clear __GFP_MOVABLE, i.e.
> > (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
> > really movable (it's only reclaimable).
> 
> Good point. I'll update the patchset.
> 
> > The xchg vs cmpxchg locking also looks good.
> > 
> > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> Is it for the whole patchset? :)

It was meant for this one, but I reviewed the whole patchset and it
looks fine to me, so in this case it can apply to the whole patchset ;)

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-09-13 21:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-10 13:13 [PATCH v2 00/10] Introduce huge zero page Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle " Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 06/10] thp: change split_huge_page_pmd() interface Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 07/10] thp: implement splitting pmd for huge zero page Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 08/10] thp: setup huge zero page on non-write page fault Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 09/10] thp: lazy huge zero page allocation Kirill A. Shutemov
2012-09-10 13:13 ` [PATCH v2 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
2012-09-10 14:02   ` Eric Dumazet
2012-09-10 14:44     ` Kirill A. Shutemov
2012-09-10 14:48       ` Eric Dumazet
2012-09-10 14:50         ` Kirill A. Shutemov
2012-09-10 14:57       ` Eric Dumazet
2012-09-10 15:07         ` Kirill A. Shutemov
2012-09-12 10:07 ` [PATCH v3 " Kirill A. Shutemov
2012-09-13 17:16   ` Andrea Arcangeli
2012-09-13 17:37     ` Kirill A. Shutemov
2012-09-13 21:17       ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).