* [PATCH v2 01/10] thp: huge zero page: basic preparation
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
For now let's allocate the page in hugepage_init(). We'll switch to lazy
allocation later.
We are not going to map the huge zero page until we can handle it
properly on all code paths.
The is_huge_zero_{pfn,pmd}() functions will be used by the following patches
to check whether a given pfn/pmd is the huge zero page.
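A minimal caller sketch, for illustration only (the helpers are the ones
added by this patch; the surrounding function is hypothetical): code that is
about to treat a huge pmd as backed by a real compound page should test for
the zero pmd first, since there is no rmap or per-mm accounting to update
for it.

	/* Hypothetical fragment; assumes mm->page_table_lock is held. */
	static void example_account_huge_pmd(struct mm_struct *mm, pmd_t *pmd)
	{
		if (is_huge_zero_pmd(*pmd))
			return;	/* global zero page: nothing to account */
		/* a real THP backs this pmd, e.g. account it */
		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
	}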
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..88e0a7a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -167,6 +168,28 @@ out:
return err;
}
+static int init_huge_zero_page(void)
+{
+ struct page *hpage;
+
+ hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+ if (!hpage)
+ return -ENOMEM;
+
+ huge_zero_pfn = page_to_pfn(hpage);
+ return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+ return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
#ifdef CONFIG_SYSFS
static ssize_t double_flag_show(struct kobject *kobj,
@@ -550,6 +573,10 @@ static int __init hugepage_init(void)
if (err)
return err;
+ err = init_huge_zero_page();
+ if (err)
+ goto out;
+
err = khugepaged_slab_init();
if (err)
goto out;
@@ -574,6 +601,8 @@ static int __init hugepage_init(void)
return 0;
out:
+ if (huge_zero_pfn)
+ __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
--
1.7.7.6
* [PATCH v2 02/10] thp: zap_huge_pmd(): zap huge zero pmd
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
We don't have a real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.
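Condensed, the new zero-pmd branch of zap_huge_pmd() looks like this (a
sketch of the hunk below, with the omissions spelled out):

	if (is_huge_zero_pmd(*pmd)) {
		pmd_clear(pmd);
		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
		tlb->mm->nr_ptes--;
		spin_unlock(&tlb->mm->page_table_lock);
		/* deliberately no page_remove_rmap(), no MM_ANONPAGES
		 * update and no tlb_remove_page(): no real page here */
	}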
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 27 +++++++++++++++++----------
1 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88e0a7a..9dcb9e6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1071,16 +1071,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pgtable = get_pmd_huge_pte(tlb->mm);
- page = pmd_page(*pmd);
- pmd_clear(pmd);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- page_remove_rmap(page);
- VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
- spin_unlock(&tlb->mm->page_table_lock);
- tlb_remove_page(tlb, page);
+ if (is_huge_zero_pmd(*pmd)) {
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ } else {
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON(!PageHead(page));
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ tlb_remove_page(tlb, page);
+ }
pte_free(tlb->mm, pgtable);
ret = 1;
}
--
1.7.7.6
* [PATCH v2 03/10] thp: copy_huge_pmd(): copy huge zero page
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.
It's safe to copy the huge zero page, since nothing maps it yet :-p
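For illustration, this is the path a fork() exercises once the series is
complete (the huge zero page is only mapped on read faults starting with
patch 08). A user-space sketch; 2M alignment of the region is glossed over
for brevity:

	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN (2UL << 20)	/* one pmd-sized (2M on x86-64) region */

	int main(void)
	{
		char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, LEN, MADV_HUGEPAGE);
		volatile char c = p[0];	/* read fault maps the huge zero pmd */
		if (fork() == 0) {	/* copy_huge_pmd() just copies the zero pmd */
			c = p[LEN / 2];	/* child reads the same zero page */
			_exit(0);
		}
		(void)c;
		return 0;
	}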
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dcb9e6..a534f84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,18 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+ pmd_t entry;
+ entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ mm->nr_ptes++;
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -802,6 +814,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_free(dst_mm, pgtable);
goto out_unlock;
}
+ if (is_huge_zero_pmd(pmd)) {
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ ret = 0;
+ goto out_unlock;
+ }
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
spin_unlock(&src_mm->page_table_lock);
--
1.7.7.6
* [PATCH v2 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
On write access to the huge zero page we allocate a new page and clear it.
In the fallback path we create a new page table and set the pte at the fault
address to the newly allocated page. All other ptes are set to the normal
4k zero page.
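A worked example with hypothetical addresses: for a write fault at
0x7f0000201234 inside a huge-zero pmd, haddr is 0x7f0000200000, and the
fallback loop installs HPAGE_PMD_NR (512 on x86-64) ptes. Only the one
covering 0x7f0000201000 (the fault address rounded down to PAGE_MASK) points
to the freshly allocated and cleared 4k page; the other 511 are
pte_mkspecial() mappings of the normal zero page.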
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/mm.h | 8 ++++
mm/huge_memory.c | 102 ++++++++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 7 ----
3 files changed, 95 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..179a41c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}
#endif
+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ extern unsigned long zero_pfn;
+ return zero_pfn;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a534f84..f5029d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -867,6 +867,61 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
}
+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ struct page *page;
+ int i, ret = 0;
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!page) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ put_page(page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ clear_user_highpage(page, address);
+ __SetPageUptodate(page);
+
+ spin_lock(&mm->page_table_lock);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ if (haddr == (address & PAGE_MASK)) {
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ }
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+out:
+ return ret;
+}
+
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -964,17 +1019,19 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
int ret = 0;
- struct page *page, *new_page;
+ struct page *page = NULL, *new_page;
unsigned long haddr;
VM_BUG_ON(!vma->anon_vma);
+ haddr = address & HPAGE_PMD_MASK;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto alloc;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;
page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- haddr = address & HPAGE_PMD_MASK;
if (page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
@@ -986,7 +1043,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
get_page(page);
spin_unlock(&mm->page_table_lock);
-
+alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -996,28 +1053,39 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!new_page)) {
count_vm_event(THP_FAULT_FALLBACK);
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
- if (ret & VM_FAULT_OOM)
- split_huge_page(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+ address, pmd, haddr);
+ } else {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ if (ret & VM_FAULT_OOM)
+ split_huge_page(page);
+ put_page(page);
+ }
goto out;
}
count_vm_event(THP_FAULT_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
- split_huge_page(page);
- put_page(page);
+ if (page) {
+ split_huge_page(page);
+ put_page(page);
+ }
ret |= VM_FAULT_OOM;
goto out;
}
- copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ if (is_huge_zero_pmd(orig_pmd))
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ else
+ copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);
spin_lock(&mm->page_table_lock);
- put_page(page);
+ if (page)
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(&mm->page_table_lock);
mem_cgroup_uncharge_page(new_page);
@@ -1025,7 +1093,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
} else {
pmd_t entry;
- VM_BUG_ON(!PageHead(page));
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
@@ -1033,8 +1100,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- page_remove_rmap(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd))
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (page) {
+ VM_BUG_ON(!PageHead(page));
+ page_remove_rmap(page);
+ put_page(page);
+ }
ret |= VM_FAULT_WRITE;
}
out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dbd92ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
}
#endif
-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- return zero_pfn;
-}
-#endif
-
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
--
1.7.7.6
* [PATCH v2 05/10] thp: change_huge_pmd(): keep huge zero page write-protected
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
We want to get a page fault on any write attempt to the huge zero page, so
let's keep it write-protected.
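This guards an invariant the earlier patches rely on: no matter what
protections mprotect() pushes through pmd_modify(), a huge zero pmd must
never end up writable, otherwise a store would scribble over the zero page
shared by everyone instead of taking the copy-on-write fault handled in the
previous patch.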
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5029d4..4001f1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1248,6 +1248,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ if (is_huge_zero_pmd(entry))
+ entry = pmd_wrprotect(entry);
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
--
1.7.7.6
* [PATCH v2 06/10] thp: change split_huge_page_pmd() interface
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Pass the vma instead of the mm, and add an address parameter.
In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases when we have the mm but not the vma.
This change is preparation for the huge zero pmd splitting implementation.
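Caller-side, the conversion looks like this (a sketch; both forms appear in
the hunks below):

	/* common case: the vma is already at hand */
	split_huge_page_pmd(vma, addr, pmd);

	/* rare case (e.g. a bare mm in a pagewalk): let it look the vma up */
	split_huge_page_pmd_mm(mm, addr, pmd);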
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
Documentation/vm/transhuge.txt | 4 ++--
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++++++++++----
mm/huge_memory.c | 24 +++++++++++++++++++-----
mm/memory.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
10 files changed, 39 insertions(+), 19 deletions(-)
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==
Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;
pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_clear_bad(pmd))
return NULL;
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 54abcc0..22840bb 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, 0xA0000, pmd);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..766d5d7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,7 +597,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;
- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..c68e073 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,12 +92,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd) \
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__mm, ____pmd); \
+ __split_huge_page_pmd(__vma, __address, \
+ ____pmd); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
@@ -107,6 +109,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
#if HPAGE_PMD_ORDER > MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -174,10 +178,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__mm, __pmd) \
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+ do { } while (0)
#define compound_trans_head(page) compound_head(page)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4001f1a..48ecc46 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2503,19 +2503,23 @@ static int khugepaged(void *none)
return 0;
}
-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd)
{
struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
- spin_lock(&mm->page_table_lock);
+ BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+ spin_lock(&vma->vm_mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);
split_huge_page(page);
@@ -2523,6 +2527,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
BUG_ON(pmd_trans_huge(*pmd));
}
+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma == NULL);
+ split_huge_page_pmd(vma, address, pmd);
+}
+
static void split_huge_page_address(struct mm_struct *mm,
unsigned long address)
{
@@ -2547,7 +2561,7 @@ static void split_huge_page_address(struct mm_struct *mm,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, address, pmd);
}
void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index dbd92ba..312c21d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1236,7 +1236,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
@@ -1505,7 +1505,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
}
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ada3be..55ac3b6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot))
continue;
/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index cc06d0e..292ec46 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -156,7 +156,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma->vm_mm, old_pmd);
+ split_huge_page_pmd(vma, old_addr, old_pmd);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;
- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd_mm(walk->mm, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
1.7.7.6
* [PATCH v2 07/10] thp: implement splitting pmd for huge zero page
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
We can't split the huge zero page itself, but we can split a pmd which
points to it.
On splitting such a pmd we create a page table with all ptes set to the
normal 4k zero page.
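The end result is what the range would have looked like without THP in the
first place: each of the HPAGE_PMD_NR (512 on x86-64) ptes in the new page
table is a write-protected pte_special() mapping of the 4k zero page, so a
later write fault takes the ordinary zero-page copy-on-write path.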
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 48ecc46..995894f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1599,6 +1599,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;
+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
return 0;
}
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(vma->vm_mm);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(vma->vm_mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
@@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(vma, haddr, pmd);
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ return;
+ }
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
--
1.7.7.6
* [PATCH v2 08/10] thp: setup huge zero page on non-write page fault
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
All code paths seem to be covered. Now we can map the huge zero page on read
page faults.
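One way to see the effect from user space (a sketch; sizes are illustrative):
reading a large untouched anonymous mapping no longer populates it with real
pages at all, so RSS stays near zero while every pmd in the range points at
the single huge zero page.

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t sz = 1UL << 30;	/* 1G of anonymous memory */
		volatile char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		long sum = 0;
		if (p == MAP_FAILED)
			return 1;
		for (size_t off = 0; off < sz; off += 4096)
			sum += p[off];	/* read faults hit the huge zero page */
		printf("sum=%ld\n", sum);	/* 0; compare VmRSS in /proc/self/status */
		return 0;
	}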
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 995894f..c788445 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -750,6 +750,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ pgtable_t pgtable;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ goto out;
+ spin_lock(&mm->page_table_lock);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+ }
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
--
1.7.7.6
* [PATCH v2 09/10] thp: lazy huge zero page allocation
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Instead of allocating the huge zero page in hugepage_init(), we can postpone
it until the first huge zero page mapping. This saves memory if THP is not
in use.
cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
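For example, if two threads take their first read fault concurrently, both
may call alloc_pages(); the cmpxchg() lets exactly one pfn win, the loser
frees its freshly allocated page, and both end up mapping the same huge zero
page. At most one transient extra allocation, and no lock.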
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c788445..0981b09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -168,21 +168,23 @@ out:
return err;
}
-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
{
struct page *hpage;
+ unsigned long pfn;
hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
if (!hpage)
return -ENOMEM;
-
- huge_zero_pfn = page_to_pfn(hpage);
+ pfn = page_to_pfn(hpage);
+ if (cmpxchg(&huge_zero_pfn, 0, pfn))
+ __free_page(hpage);
return 0;
}
static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return pfn == huge_zero_pfn;
+ return huge_zero_pfn && pfn == huge_zero_pfn;
}
static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -573,10 +575,6 @@ static int __init hugepage_init(void)
if (err)
return err;
- err = init_huge_zero_page();
- if (err)
- goto out;
-
err = khugepaged_slab_init();
if (err)
goto out;
@@ -601,8 +599,6 @@ static int __init hugepage_init(void)
return 0;
out:
- if (huge_zero_pfn)
- __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
@@ -752,6 +748,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
+ if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
--
1.7.7.6
* [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-10 13:13 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
H. Peter Anvin doesn't like the huge zero page sticking in memory forever
after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.
We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.
If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1 (i.e.
only the shrinker holds a reference).
put_huge_zero_page() only decrements the counter. The counter never reaches
zero in put_huge_zero_page() since the shrinker holds a reference.
Freeing the huge zero page in the shrinker callback helps to avoid frequent
allocate-free cycles.
Refcounting has a cost. On a 4-socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.
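A concrete lifecycle, for illustration: the counter starts at 0; the first
get_huge_zero_page() allocates the page and sets the counter to 2 (caller
plus shrinker); every further map is a get (+1) and every unmap a put (-1),
so the counter stays at or above 1 while any mapping exists; after the last
user puts, it sits at 1 until memory pressure runs the shrinker, whose
atomic_cmpxchg() from 1 to 0 frees the page.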
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 108 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 84 insertions(+), 24 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0981b09..fa740fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/shrinker.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -168,23 +168,13 @@ out:
return err;
}
-static int init_huge_zero_pfn(void)
-{
- struct page *hpage;
- unsigned long pfn;
-
- hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
- if (!hpage)
- return -ENOMEM;
- pfn = page_to_pfn(hpage);
- if (cmpxchg(&huge_zero_pfn, 0, pfn))
- __free_page(hpage);
- return 0;
-}
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return huge_zero_pfn && pfn == huge_zero_pfn;
+ unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+ return zero_pfn && pfn == zero_pfn;
}
static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -192,6 +182,56 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
return is_huge_zero_pfn(pmd_pfn(pmd));
}
+static unsigned long get_huge_zero_page(void)
+{
+ struct page *zero_page;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return ACCESS_ONCE(huge_zero_pfn);
+
+ zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+ if (!zero_page)
+ return 0;
+ if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+ __free_page(zero_page);
+ goto retry;
+ }
+
+ /* We take additional reference here. It will be put back by shinker */
+ atomic_set(&huge_zero_refcount, 2);
+ return ACCESS_ONCE(huge_zero_pfn);
+}
+
+static void put_huge_zero_page(void)
+{
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
+}
+
+static int shrink_huge_zero_page(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ if (!sc->nr_to_scan)
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+ BUG_ON(zero_pfn == 0);
+ __free_page(__pfn_to_page(zero_pfn));
+ }
+
+ return 0;
+}
+
+static struct shrinker huge_zero_page_shrinker = {
+ .shrink = shrink_huge_zero_page,
+ .seeks = DEFAULT_SEEKS,
+};
+
#ifdef CONFIG_SYSFS
static ssize_t double_flag_show(struct kobject *kobj,
@@ -585,6 +625,8 @@ static int __init hugepage_init(void)
goto out;
}
+ register_shrinker(&huge_zero_page_shrinker);
+
/*
* By default disable transparent hugepages on smaller systems,
* where the extra memory used could hurt more than TLB overhead
@@ -722,10 +764,11 @@ static inline struct page *alloc_hugepage(int defrag)
#endif
static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+ unsigned long zero_pfn)
{
pmd_t entry;
- entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
@@ -748,15 +791,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
- if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
+ unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
spin_lock(&mm->page_table_lock);
- set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -825,7 +872,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
if (is_huge_zero_pmd(pmd)) {
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ unsigned long zero_pfn;
+ /*
+ * get_huge_zero_page() will never allocate a new page here,
+ * since we already have a zero page to copy. It just takes a
+ * reference.
+ */
+ zero_pfn = get_huge_zero_page();
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ zero_pfn);
ret = 0;
goto out_unlock;
}
@@ -926,6 +981,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
+ put_huge_zero_page();
ret |= VM_FAULT_WRITE;
out:
@@ -1110,8 +1166,10 @@ alloc:
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- if (is_huge_zero_pmd(orig_pmd))
+ if (is_huge_zero_pmd(orig_pmd)) {
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ put_huge_zero_page();
+ }
if (page) {
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
@@ -1175,6 +1233,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
+ put_huge_zero_page();
} else {
page = pmd_page(*pmd);
pmd_clear(pmd);
@@ -2538,6 +2597,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(vma->vm_mm, pmd, pgtable);
+ put_huge_zero_page();
}
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.7.6
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Eric Dumazet @ 2012-09-10 14:02 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel, Kirill A. Shutemov
On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> H. Peter Anvin doesn't like the huge zero page sticking in memory forever
> after the first allocation. Here's an implementation of lockless
> refcounting for the huge zero page.
>
...
> +static unsigned long get_huge_zero_page(void)
> +{
> + struct page *zero_page;
> +retry:
> + if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> + return ACCESS_ONCE(huge_zero_pfn);
> +
> + zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> + if (!zero_page)
> + return 0;
> + if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> + __free_page(zero_page);
> + goto retry;
> + }
This might break if preemption can happen here ?
The second thread might loop forever because huge_zero_refcount is 0,
and huge_zero_pfn not zero.
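(Spelled out: thread A fails atomic_inc_not_zero(), allocates, wins the
cmpxchg() on huge_zero_pfn, and is preempted before the atomic_set() that
makes the refcount 2. Thread B then spins: its atomic_inc_not_zero() keeps
failing because the counter is still 0, and its cmpxchg() keeps failing
because huge_zero_pfn is already set, so it frees its page and retries until
A runs again.)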
If preemption already disabled, a comment would be nice.
> +
> + /* We take additional reference here. It will be put back by shinker */
typo : shrinker
> + atomic_set(&huge_zero_refcount, 2);
> + return ACCESS_ONCE(huge_zero_pfn);
> +}
> +
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-10 14:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel, Kirill A. Shutemov
On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > H. Peter Anvin doesn't like the huge zero page sticking in memory forever
> > after the first allocation. Here's an implementation of lockless
> > refcounting for the huge zero page.
> >
> ...
>
> > +static unsigned long get_huge_zero_page(void)
> > +{
> > + struct page *zero_page;
> > +retry:
> > + if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > + return ACCESS_ONCE(huge_zero_pfn);
> > +
> > + zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > + if (!zero_page)
> > + return 0;
> > + if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > + __free_page(zero_page);
> > + goto retry;
> > + }
>
> This might break if preemption can happen here ?
>
> The second thread might loop forever because huge_zero_refcount is 0,
> and huge_zero_pfn not zero.
I fail to see why the second thread might loop forever. A long time, yes,
but forever?
Yes, disabling preemption before alloc_pages() and enabling after
atomic_set() looks reasonable. Thanks.
>
> If preemption already disabled, a comment would be nice.
>
>
> > +
> > + /* We take additional reference here. It will be put back by shinker */
>
> typo : shrinker
Thx.
> > + atomic_set(&huge_zero_refcount, 2);
> > + return ACCESS_ONCE(huge_zero_pfn);
> > +}
> > +
>
>
>
--
Kirill A. Shutemov
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Eric Dumazet @ 2012-09-10 14:48 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel, Kirill A. Shutemov
On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > >
> > > H. Peter Anvin doesn't like the huge zero page sticking in memory forever
> > > after the first allocation. Here's an implementation of lockless
> > > refcounting for the huge zero page.
> > >
> > ...
> >
> > > +static unsigned long get_huge_zero_page(void)
> > > +{
> > > + struct page *zero_page;
> > > +retry:
> > > + if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > + return ACCESS_ONCE(huge_zero_pfn);
> > > +
> > > + zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > + if (!zero_page)
> > > + return 0;
> > > + if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > + __free_page(zero_page);
> > > + goto retry;
> > > + }
> >
> > This might break if preemption can happen here ?
> >
> > The second thread might loop forever because huge_zero_refcount is 0,
> > and huge_zero_pfn not zero.
>
> I fail to see why the second thread might loop forever. Long time yes, but
> forever?
>
> Yes, disabling preemption before alloc_pages() and enabling after
> atomic_set() looks reasonable. Thanks.
If you have only one online CPU, and the second thread is real-time or
something like that, it won't give the CPU back to the preempted thread.
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-10 14:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, linux-mm,
Andi Kleen, H. Peter Anvin, linux-kernel
On Mon, Sep 10, 2012 at 04:48:07PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 10, 2012 at 04:02:39PM +0200, Eric Dumazet wrote:
> > > On Mon, 2012-09-10 at 16:13 +0300, Kirill A. Shutemov wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > >
> > > > H. Peter Anvin doesn't like the huge zero page sticking in memory forever
> > > > after the first allocation. Here's an implementation of lockless
> > > > refcounting for the huge zero page.
> > > >
> > > ...
> > >
> > > > +static unsigned long get_huge_zero_page(void)
> > > > +{
> > > > + struct page *zero_page;
> > > > +retry:
> > > > + if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
> > > > + return ACCESS_ONCE(huge_zero_pfn);
> > > > +
> > > > + zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> > > > + if (!zero_page)
> > > > + return 0;
> > > > + if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> > > > + __free_page(zero_page);
> > > > + goto retry;
> > > > + }
> > >
> > > This might break if preemption can happen here ?
> > >
> > > The second thread might loop forever because huge_zero_refcount is 0,
> > > and huge_zero_pfn not zero.
> >
> > I fail to see why the second thread might loop forever. Long time yes, but
> > forever?
> >
> > Yes, disabling preemption before alloc_pages() and enabling after
> > atomic_set() looks reasonable. Thanks.
>
> If you have one online cpu, and the second thread is real time or
> something like that, it wont give cpu back to preempted thread.
Okay, I see. I'll update the patch.
--
Kirill A. Shutemov
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Eric Dumazet @ 2012-09-10 14:57 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel, Kirill A. Shutemov
On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
> Yes, disabling preemption before alloc_pages() and enabling after
> atomic_set() looks reasonable. Thanks.
In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER)
might sleep, it would be better to disable preemption after calling it:
zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
if (!zero_page)
return 0;
preempt_disable();
if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
preempt_enable();
__free_page(zero_page);
goto retry;
}
atomic_set(&huge_zero_refcount, 2);
preempt_enable();
* Re: [PATCH v2 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-10 15:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, linux-mm,
Andi Kleen, H. Peter Anvin, linux-kernel
On Mon, Sep 10, 2012 at 04:57:59PM +0200, Eric Dumazet wrote:
> On Mon, 2012-09-10 at 17:44 +0300, Kirill A. Shutemov wrote:
>
>
> > Yes, disabling preemption before alloc_pages() and enabling after
> > atomic_set() looks reasonable. Thanks.
>
> In fact, as alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> might sleep, it would be better to disable preemption after calling it :
Yeah, I've already thought about that. :)
> zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> if (!zero_page)
> return 0;
> preempt_disable();
> if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
> preempt_enable();
> __free_page(zero_page);
> goto retry;
> }
> atomic_set(&huge_zero_refcount, 2);
> preempt_enable();
>
>
--
Kirill A. Shutemov
* [PATCH v3 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-12 10:07 UTC (permalink / raw)
To: Andrew Morton, Andrea Arcangeli, linux-mm
Cc: Andi Kleen, H. Peter Anvin, linux-kernel, Kirill A. Shutemov,
Kirill A. Shutemov
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
H. Peter Anvin doesn't like the huge zero page sticking in memory forever
after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.
We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.
If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1 (i.e.
only the shrinker holds a reference).
put_huge_zero_page() only decrements the counter. The counter never reaches
zero in put_huge_zero_page() since the shrinker holds a reference.
Freeing the huge zero page in the shrinker callback helps to avoid frequent
allocate-free cycles.
Refcounting has a cost. On a 4-socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 111 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 87 insertions(+), 24 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0981b09..23d9634 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/shrinker.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -168,23 +168,13 @@ out:
return err;
}
-static int init_huge_zero_pfn(void)
-{
- struct page *hpage;
- unsigned long pfn;
-
- hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
- if (!hpage)
- return -ENOMEM;
- pfn = page_to_pfn(hpage);
- if (cmpxchg(&huge_zero_pfn, 0, pfn))
- __free_page(hpage);
- return 0;
-}
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return huge_zero_pfn && pfn == huge_zero_pfn;
+ unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+ return zero_pfn && pfn == zero_pfn;
}
static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -192,6 +182,59 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
return is_huge_zero_pfn(pmd_pfn(pmd));
}
+static unsigned long get_huge_zero_page(void)
+{
+ struct page *zero_page;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return ACCESS_ONCE(huge_zero_pfn);
+
+ zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+ if (!zero_page)
+ return 0;
+ preempt_disable();
+ if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+ preempt_enable();
+ __free_page(zero_page);
+ goto retry;
+ }
+
+ /* We take additional reference here. It will be put back by shrinker */
+ atomic_set(&huge_zero_refcount, 2);
+ preempt_enable();
+ return ACCESS_ONCE(huge_zero_pfn);
+}
+
+static void put_huge_zero_page(void)
+{
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
+}
+
+static int shrink_huge_zero_page(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ if (!sc->nr_to_scan)
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+ BUG_ON(zero_pfn == 0);
+ __free_page(__pfn_to_page(zero_pfn));
+ }
+
+ return 0;
+}
+
+static struct shrinker huge_zero_page_shrinker = {
+ .shrink = shrink_huge_zero_page,
+ .seeks = DEFAULT_SEEKS,
+};
+
#ifdef CONFIG_SYSFS
static ssize_t double_flag_show(struct kobject *kobj,
@@ -585,6 +628,8 @@ static int __init hugepage_init(void)
goto out;
}
+ register_shrinker(&huge_zero_page_shrinker);
+
/*
* By default disable transparent hugepages on smaller systems,
* where the extra memory used could hurt more than TLB overhead
@@ -722,10 +767,11 @@ static inline struct page *alloc_hugepage(int defrag)
#endif
static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+ unsigned long zero_pfn)
{
pmd_t entry;
- entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
@@ -748,15 +794,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
- if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
+ unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
spin_lock(&mm->page_table_lock);
- set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -825,7 +875,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
if (is_huge_zero_pmd(pmd)) {
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ unsigned long zero_pfn;
+ /*
+ * get_huge_zero_page() will never allocate a new page here,
+ * since we already have a zero page to copy. It just takes a
+ * reference.
+ */
+ zero_pfn = get_huge_zero_page();
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ zero_pfn);
ret = 0;
goto out_unlock;
}
@@ -926,6 +984,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
+ put_huge_zero_page();
ret |= VM_FAULT_WRITE;
out:
@@ -1110,8 +1169,10 @@ alloc:
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- if (is_huge_zero_pmd(orig_pmd))
+ if (is_huge_zero_pmd(orig_pmd)) {
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ put_huge_zero_page();
+ }
if (page) {
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
@@ -1175,6 +1236,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
+ put_huge_zero_page();
} else {
page = pmd_page(*pmd);
pmd_clear(pmd);
@@ -2538,6 +2600,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(vma->vm_mm, pmd, pgtable);
+ put_huge_zero_page();
}
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.7.6
* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
From: Andrea Arcangeli @ 2012-09-13 17:16 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Morton, linux-mm, Andi Kleen, H. Peter Anvin,
linux-kernel, Kirill A. Shutemov
Hi Kirill,
On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> - hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
The page is likely as large as a pageblock, so it's unlikely to create
much fragmentation even if __GFP_MOVABLE is set. That said, I guess
it would be more correct if __GFP_MOVABLE were clear, like
(GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
really movable (it's only reclaimable).
The xchg vs cmpxchg locking also looks good.
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Thanks,
Andrea
* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
From: Kirill A. Shutemov @ 2012-09-13 17:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel
On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > - hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
>
> The page is likely as large as a pageblock, so it's unlikely to create
> much fragmentation even if __GFP_MOVABLE is set. That said, I guess
> it would be more correct if __GFP_MOVABLE were clear, like
> (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
> really movable (it's only reclaimable).
Good point. I'll update the patchset.
> The xchg vs cmpxchg locking also looks good.
>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Is it for the whole patchset? :)
--
Kirill A. Shutemov
* Re: [PATCH v3 10/10] thp: implement refcounting for huge zero page
From: Andrea Arcangeli @ 2012-09-13 21:17 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andi Kleen,
H. Peter Anvin, linux-kernel
Hi Kirill,
On Thu, Sep 13, 2012 at 08:37:58PM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 13, 2012 at 07:16:13PM +0200, Andrea Arcangeli wrote:
> > Hi Kirill,
> >
> > On Wed, Sep 12, 2012 at 01:07:53PM +0300, Kirill A. Shutemov wrote:
> > > - hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
> >
> > The page is likely as large as a pageblock, so it's unlikely to create
> > much fragmentation even if __GFP_MOVABLE is set. That said, I guess
> > it would be more correct if __GFP_MOVABLE were clear, like
> > (GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, because this page isn't
> > really movable (it's only reclaimable).
>
> Good point. I'll update the patchset.
>
> > The xchg vs cmpxchg locking also looks good.
> >
> > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
>
> Is it for the whole patchset? :)
It was meant for this one, but I reviewed the whole patchset and it
looks fine to me, so in this case it can apply to the whole patchset ;)