* [PATCHv8 01/32] thp, mlock: update unevictable-lru.txt
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 02/32] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
` (31 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Add description of THP handling into unevictable-lru.txt.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
Documentation/vm/unevictable-lru.txt | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index fa3b527086fa..0026a8d33fc0 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -461,6 +461,27 @@ unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code and the same work flow as described in MIGRATING
MLOCKED PAGES will apply.
+MLOCKING TRANSPARENT HUGE PAGES
+-------------------------------
+
+A transparent huge page is represented by a single entry on an LRU list.
+Therefore, we can only make unevictable an entire compound page, not
+individual subpages.
+
+If a user tries to mlock() part of a huge page, we want the rest of the
+page to be reclaimable.
+
+We cannot just split the page on partial mlock() as split_huge_page() can
+fail and new intermittent failure mode for the syscall is undesirable.
+
+We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
+PMD on border of VM_LOCKED VMA will be split into PTE table.
+
+This way the huge page is accessible for vmscan. Under memory pressure the
+page will be split, subpages which belong to VM_LOCKED VMAs will be moved
+to unevictable LRU and the rest can be reclaimed.
+
+See also comment in follow_trans_huge_pmd().
mmap(MAP_LOCKED) SYSTEM CALL HANDLING
-------------------------------------
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 02/32] mm: do not pass mm_struct into handle_mm_fault
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 01/32] thp, mlock: update unevictable-lru.txt Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 03/32] mm: introduce fault_env Kirill A. Shutemov
` (30 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
We always have vma->vm_mm around.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 ++--
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/iommu/amd_iommu_v2.c | 3 +--
drivers/iommu/intel-svm.c | 2 +-
include/linux/mm.h | 9 ++++-----
mm/gup.c | 5 ++---
mm/ksm.c | 5 ++---
mm/memory.c | 13 +++++++------
36 files changed, 48 insertions(+), 51 deletions(-)
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd667e2..83e9eee57a55 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -147,7 +147,7 @@ retry:
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a13e60..e94e5aa33985 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
if (unlikely(fatal_signal_pending(current))) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index ad5841856007..3a2e678b8d30 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
goto out;
}
- return handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, flags);
check_stack:
/* Don't allow expansion below FIRST_USER_ADDRESS */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 95df28bc875f..4dbb076b9e1f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -183,7 +183,7 @@ good_area:
goto out;
}
- return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c03533937a9f..a4b7edac8f10 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -134,7 +134,7 @@ good_area:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40a6db1..112ef26c7f2e 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -168,7 +168,7 @@ retry:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 61d99767fe16..614a46c413d2 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -164,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, flags);
+ fault = handle_mm_fault(vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c9320032..bd7c251e2bce 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -101,7 +101,7 @@ good_area:
break;
}
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1205a6..fa6ad95e992e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -159,7 +159,7 @@ retry:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 8f9875b7933d..a3785d3644c2 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -196,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, flags);
+ fault = handle_mm_fault(vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd0c830..bd66a0b20c6b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -136,7 +136,7 @@ good_area:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
pr_debug("handle_mm_fault returns %d\n", fault);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca63609..372783a67dda 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -133,7 +133,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc003643..abb678ccde6f 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -216,7 +216,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa031891..9560ad731120 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -153,7 +153,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181ed32f..f23781d6bbb3 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -254,7 +254,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b0c6b8..affc4eb3f89e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -131,7 +131,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20ae794..e94cd225e816 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -163,7 +163,7 @@ good_area:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 16dbe81c97c9..163af2c31d76 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -239,7 +239,7 @@ good_area:
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 6527882ce05e..bb0354222b11 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -75,7 +75,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
}
ret = 0;
- *flt = handle_mm_fault(mm, vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
+ *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
if (unlikely(*flt & VM_FAULT_ERROR)) {
if (*flt & VM_FAULT_OOM) {
ret = -ENOMEM;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d781c52..a4db22f65021 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -429,7 +429,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (fault & VM_FAULT_SIGSEGV)
goto bad_area;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index cce577feab1e..a9524531320e 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -455,7 +455,7 @@ retry:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* No reason to continue if interrupted by SIGKILL. */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
fault = VM_FAULT_SIGNAL;
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 37a6c2e0e969..995b71e4db4b 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -111,7 +111,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276377d1..9bf876780cef 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -487,7 +487,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b6c559cbd64d..4714061d6cd3 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -241,7 +241,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
@@ -411,7 +411,7 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, flags)) {
+ switch (handle_mm_fault(vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a33da59..6c43b924a7a2 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -436,7 +436,7 @@ good_area:
goto bad_area;
}
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto exit_exception;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..beba986589e5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -434,7 +434,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd0fa2e..ad8f206ab5e8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -73,7 +73,7 @@ good_area:
do {
int fault;
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto out_nosemaphore;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 2ec3d3adcefc..6c7f70bcaae3 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -194,7 +194,7 @@ good_area:
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ fault = handle_mm_fault(vma, addr & PAGE_MASK, flags);
return fault;
check_stack:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5ce1ed02f7e8..2915b5b7bd7e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1348,7 +1348,7 @@ good_area:
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;
/*
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 7f4a1fdb1502..2725e08ef353 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 56999d2fac07..fbdaf81ae925 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -538,8 +538,7 @@ static void do_fault(struct work_struct *work)
if (access_error(vma, fault))
goto out;
- ret = handle_mm_fault(mm, vma, address, flags);
-
+ ret = handle_mm_fault(vma, address, flags);
out:
up_read(&mm->mmap_sem);
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d9939fa9b588..8ebb3530afa7 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -583,7 +583,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (access_error(vma, req))
goto invalid;
- ret = handle_mm_fault(svm->mm, vma, address,
+ ret = handle_mm_fault(vma, address,
req->wr_req ? FAULT_FLAG_WRITE : 0);
if (ret & VM_FAULT_ERROR)
goto invalid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffcff53e3b2b..8fc68ed3c862 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1214,15 +1214,14 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
-extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
#else
-static inline int handle_mm_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static inline int handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
/* should never happen if there's no MMU */
BUG();
diff --git a/mm/gup.c b/mm/gup.c
index fb87aea9edc8..39f751e1cb33 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -351,7 +351,6 @@ unmap:
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned int fault_flags = 0;
int ret;
@@ -376,7 +375,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_TRIED;
}
- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return -ENOMEM;
@@ -691,7 +690,7 @@ retry:
if (!vma_permits_fault(vma, fault_flags))
return -EFAULT;
- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
major |= ret & VM_FAULT_MAJOR;
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
diff --git a/mm/ksm.c b/mm/ksm.c
index b99e828172f6..c967d1b8df97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -376,9 +376,8 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
- ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE |
- FAULT_FLAG_REMOTE);
+ ret = handle_mm_fault(vma, addr,
+ FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 13dcc53f8fba..0002898341a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3363,9 +3363,10 @@ unlock:
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
+ struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3452,15 +3453,15 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
int ret;
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT);
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
@@ -3472,7 +3473,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
- ret = __handle_mm_fault(mm, vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 03/32] mm: introduce fault_env
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 01/32] thp, mlock: update unevictable-lru.txt Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 02/32] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 04/32] mm: postpone page table allocation until we have page to map Kirill A. Shutemov
` (29 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:
Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.
The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.
This patch is preparation for the next one.
[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
Documentation/filesystems/Locking | 10 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 20 +-
include/linux/mm.h | 34 ++-
include/linux/userfaultfd_k.h | 8 +-
mm/filemap.c | 28 +-
mm/huge_memory.c | 280 +++++++++---------
mm/internal.h | 4 +-
mm/memory.c | 579 ++++++++++++++++++--------------------
mm/nommu.c | 3 +-
10 files changed, 473 insertions(+), 515 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..7f77837a7cf1 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -544,13 +544,13 @@ subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
->map_pages() is called when VM asks to map easy accessible pages.
-Filesystem should find and map pages associated with offsets from "pgoff"
-till "max_pgoff". ->map_pages() is called with page table locked and must
+Filesystem should find and map pages associated with offsets from "start_pgoff"
+till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
-page table entry. Pointer to entry associated with offset "pgoff" is
-passed in "pte" field in vm_fault structure. Pointers to entries for other
-offsets should be calculated relative to "pte".
+page table entry. Pointer to entry associated with the page is passed in
+"pte" field in fault_env structure. Pointers to entries for other offsets
+should be calculated relative to "pte".
->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 66cdb44616d5..588e9da27ab4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,10 +257,9 @@ out:
* fatal_signal_pending()s, and the mmap_sem must be released before
* returning it.
*/
-int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason)
+int handle_userfault(struct fault_env *fe, unsigned long reason)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = fe->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -269,7 +268,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
ret = VM_FAULT_SIGBUS;
- ctx = vma->vm_userfaultfd_ctx.ctx;
+ ctx = fe->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;
@@ -302,17 +301,17 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* without first stopping userland access to the memory. For
* VM_UFFD_MISSING userfaults this is enough for now.
*/
- if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
* Validate the invariant that nowait must allow retry
* to be sure not to return SIGBUS erroneously on
* nowait invocations.
*/
- BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+ BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", fe->flags);
dump_stack();
}
#endif
@@ -324,7 +323,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* and wait.
*/
ret = VM_FAULT_RETRY;
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;
/* take the reference before dropping the mmap_sem */
@@ -332,10 +331,11 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.msg = userfault_msg(address, flags, reason);
+ uwq.msg = userfault_msg(fe->address, fe->flags, reason);
uwq.ctx = ctx;
- return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+ return_to_userland =
+ (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
spin_lock(&ctx->fault_pending_wqh.lock);
@@ -353,7 +353,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);
- must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+ must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
up_read(&mm->mmap_sem);
if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d0367c523a0d..05efca1619b6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,20 +1,12 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H
-extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags);
+extern int do_huge_pmd_anonymous_page(struct fault_env *fe);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
-extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd);
+extern void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd);
+extern int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
@@ -134,8 +126,7 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
}
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *huge_zero_page;
@@ -195,8 +186,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
return NULL;
}
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fc68ed3c862..696ec39b616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,10 +299,27 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
- /* for ->map_pages() only */
- pgoff_t max_pgoff; /* map pages for offset from pgoff till
- * max_pgoff inclusive */
- pte_t *pte; /* pte entry associated with ->pgoff */
+};
+
+/*
+ * Page fault context: passes though page fault handler instead of endless list
+ * of function arguments.
+ */
+struct fault_env {
+ struct vm_area_struct *vma; /* Target VMA */
+ unsigned long address; /* Faulting virtual address */
+ unsigned int flags; /* FAULT_FLAG_xxx flags */
+ pmd_t *pmd; /* Pointer to pmd entry matching
+ * the 'address'
+ */
+ pte_t *pte; /* Pointer to pte entry matching
+ * the 'address'. NULL if the page
+ * table hasn't been allocated.
+ */
+ spinlock_t *ptl; /* Page table lock.
+ * Protects pte page table if 'pte'
+ * is not NULL, otherwise pmd.
+ */
};
/*
@@ -317,7 +334,8 @@ struct vm_operations_struct {
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
pmd_t *, unsigned int flags);
- void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*map_pages)(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -583,8 +601,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon);
+void do_set_pte(struct fault_env *fe, struct page *page);
#endif
/*
@@ -2119,7 +2136,8 @@ extern void truncate_inode_pages_final(struct address_space *);
/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
/* mm/page-writeback.c */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480ad41b7..dd66a952e8cd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -27,8 +27,7 @@
#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
-extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason);
+extern int handle_userfault(struct fault_env *fe, unsigned long reason);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len);
@@ -56,10 +55,7 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
#else /* CONFIG_USERFAULTFD */
/* mm helpers */
-static inline int handle_userfault(struct vm_area_struct *vma,
- unsigned long address,
- unsigned int flags,
- unsigned long reason)
+static inline int handle_userfault(struct fault_env *fe, unsigned long reason)
{
return VM_FAULT_SIGBUS;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index f2479af09da9..e32e5d70fc0c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2131,22 +2131,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
struct radix_tree_iter iter;
void **slot;
- struct file *file = vma->vm_file;
+ struct file *file = fe->vma->vm_file;
struct address_space *mapping = file->f_mapping;
+ pgoff_t last_pgoff = start_pgoff;
loff_t size;
struct page *page;
- unsigned long address = (unsigned long) vmf->virtual_address;
- unsigned long addr;
- pte_t *pte;
rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
- if (iter.index > vmf->max_pgoff)
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start_pgoff) {
+ if (iter.index > end_pgoff)
break;
+ fe->pte += iter.index - last_pgoff;
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ last_pgoff = iter.index;
+ if (!pte_none(*fe->pte))
+ goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2182,14 +2187,9 @@ repeat:
if (page->index >= size >> PAGE_SHIFT)
goto unlock;
- pte = vmf->pte + page->index - vmf->pgoff;
- if (!pte_none(*pte))
- goto unlock;
-
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
- do_set_pte(vma, addr, page, pte, false, false);
+ do_set_pte(fe, page);
unlock_page(page);
goto next;
unlock:
@@ -2197,7 +2197,7 @@ unlock:
skip:
put_page(page);
next:
- if (iter.index == vmf->max_pgoff)
+ if (iter.index == end_pgoff)
break;
}
rcu_read_unlock();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5211c59e5f26..a4014b484737 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -821,26 +821,23 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}
-static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- struct page *page, gfp_t gfp,
- unsigned int flags)
+static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
+ gfp_t gfp)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
pgtable_t pgtable;
- spinlock_t *ptl;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
VM_BUG_ON_PAGE(!PageCompound(page), page);
- if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
@@ -855,12 +852,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(ptl);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
} else {
pmd_t entry;
@@ -868,12 +865,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (userfaultfd_missing(vma)) {
int ret;
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_free(vma->vm_mm, pgtable);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
}
@@ -883,11 +879,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, haddr, pmd, entry);
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&mm->nr_ptes);
- spin_unlock(ptl);
+ pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, pgtable);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ spin_unlock(fe->ptl);
count_vm_event(THP_FAULT_ALLOC);
}
@@ -937,13 +933,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags)
+int do_huge_pmd_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
gfp_t gfp;
struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
@@ -951,42 +946,40 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
- spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
int ret;
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
zero_page = get_huge_zero_page();
if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- ptl = pmd_lock(mm, pmd);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
ret = 0;
set = false;
- if (pmd_none(*pmd)) {
+ if (pmd_none(*fe->pmd)) {
if (userfaultfd_missing(vma)) {
- spin_unlock(ptl);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ spin_unlock(fe->ptl);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
- set_huge_zero_page(pgtable, mm, vma,
- haddr, pmd,
- zero_page);
- spin_unlock(ptl);
+ set_huge_zero_page(pgtable, vma->vm_mm, vma,
+ haddr, fe->pmd, zero_page);
+ spin_unlock(fe->ptl);
set = true;
}
} else
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
if (!set) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
put_huge_zero_page();
}
return ret;
@@ -998,8 +991,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
prep_transhuge_page(page);
- return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
- flags);
+ return __do_huge_pmd_anonymous_page(fe, page, gfp);
}
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1171,38 +1163,31 @@ out:
return ret;
}
-void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- int dirty)
+void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(fe->vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto unlock;
entry = pmd_mkyoung(orig_pmd);
- haddr = address & HPAGE_PMD_MASK;
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
- update_mmu_cache_pmd(vma, address, pmd);
+ haddr = fe->address & HPAGE_PMD_MASK;
+ if (pmdp_set_access_flags(fe->vma, haddr, fe->pmd, entry,
+ fe->flags & FAULT_FLAG_WRITE))
+ update_mmu_cache_pmd(fe->vma, fe->address, fe->pmd);
unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
}
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- struct page *page,
- unsigned long haddr)
+static int do_huge_pmd_wp_page_fallback(struct fault_env *fe, pmd_t orig_pmd,
+ struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1219,11 +1204,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
for (i = 0; i < HPAGE_PMD_NR; i++) {
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
- __GFP_OTHER_NODE,
- vma, address, page_to_nid(page));
+ __GFP_OTHER_NODE, vma,
+ fe->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg, false))) {
+ mem_cgroup_try_charge(pages[i], vma->vm_mm,
+ GFP_KERNEL, &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
@@ -1249,41 +1234,41 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON_PAGE(!PageHead(page), page);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
/* leave pmd empty until pte is filled */
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
+ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, fe->pmd);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
+ pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr, false);
+ page_add_new_anon_rmap(pages[i], fe->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ fe->pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*fe->pte));
+ set_pte_at(vma->vm_mm, haddr, fe->pte, entry);
+ pte_unmap(fe->pte);
}
kfree(pages);
smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
+ pmd_populate(vma->vm_mm, fe->pmd, pgtable);
page_remove_rmap(page, true);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1292,8 +1277,8 @@ out:
return ret;
out_free_pages:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ spin_unlock(fe->ptl);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1304,25 +1289,23 @@ out_free_pages:
goto out;
}
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
- int ret = 0;
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
- unsigned long haddr;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
+ int ret = 0;
- ptl = pmd_lockptr(mm, pmd);
+ fe->ptl = pmd_lockptr(vma->vm_mm, fe->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
- haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(ptl);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_unlock;
page = pmd_page(orig_pmd);
@@ -1341,13 +1324,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, pmd);
+ if (pmdp_set_access_flags(vma, haddr, fe->pmd, entry, 1))
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
@@ -1360,13 +1343,12 @@ alloc:
prep_transhuge_page(new_page);
} else {
if (!page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
} else {
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
+ ret = do_huge_pmd_wp_page_fallback(fe, orig_pmd, page);
if (ret & VM_FAULT_OOM) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
}
put_page(page);
@@ -1375,14 +1357,12 @@ alloc:
goto out;
}
- if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg,
- true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+ huge_gfp, &memcg, true))) {
put_page(new_page);
- if (page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
+ if (page)
put_page(page);
- } else
- split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1398,13 +1378,13 @@ alloc:
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
- spin_lock(ptl);
+ spin_lock(fe->ptl);
if (page)
put_page(page);
- if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
@@ -1412,14 +1392,14 @@ alloc:
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
- set_pmd_at(mm, haddr, pmd, entry);
- update_mmu_cache_pmd(vma, address, pmd);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
if (!page) {
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -1428,13 +1408,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
return ret;
}
@@ -1494,13 +1474,12 @@ out:
}
/* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct anon_vma *anon_vma = NULL;
struct page *page;
- unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
bool page_locked;
@@ -1511,8 +1490,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* A PROT_NONE fault should not end up here */
BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(pmd, *fe->pmd)))
goto out_unlock;
/*
@@ -1520,9 +1499,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* without disrupting NUMA hinting information. Do not relock and
* check_same as the page may no longer be mapped.
*/
- if (unlikely(pmd_trans_migrating(*pmdp))) {
- page = pmd_page(*pmdp);
- spin_unlock(ptl);
+ if (unlikely(pmd_trans_migrating(*fe->pmd))) {
+ page = pmd_page(*fe->pmd);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
goto out;
}
@@ -1555,7 +1534,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migration could have started since the pmd_trans_migrating check */
if (!page_locked) {
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
page_nid = -1;
goto out;
@@ -1566,12 +1545,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* to serialises splits
*/
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
- spin_lock(ptl);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(pmd, *fe->pmd))) {
unlock_page(page);
put_page(page);
page_nid = -1;
@@ -1589,9 +1568,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
- spin_unlock(ptl);
- migrated = migrate_misplaced_transhuge_page(mm, vma,
- pmdp, pmd, addr, page, target_nid);
+ spin_unlock(fe->ptl);
+ migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
+ fe->pmd, pmd, fe->address, page, target_nid);
if (migrated) {
flags |= TNF_MIGRATED;
page_nid = target_nid;
@@ -1606,18 +1585,18 @@ clear_pmdnuma:
pmd = pmd_mkyoung(pmd);
if (was_writable)
pmd = pmd_mkwrite(pmd);
- set_pmd_at(mm, haddr, pmdp, pmd);
- update_mmu_cache_pmd(vma, addr, pmdp);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, pmd);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
unlock_page(page);
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, fe->flags);
return 0;
}
@@ -2396,29 +2375,32 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd)
{
- unsigned long _address;
- pte_t *pte, pteval;
+ pte_t pteval;
int swapped_in = 0, ret = 0;
-
- pte = pte_offset_map(pmd, address);
- for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
- pte++, _address += PAGE_SIZE) {
- pteval = *pte;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
if (!is_swap_pte(pteval))
continue;
swapped_in++;
- ret = do_swap_page(mm, vma, _address, pte, pmd,
- FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- pteval);
+ ret = do_swap_page(&fe, pteval);
if (ret & VM_FAULT_ERROR) {
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
return;
}
/* pte is unmapped now, we need to map it */
- pte = pte_offset_map(pmd, _address);
+ fe.pte = pte_offset_map(pmd, fe.address);
}
- pte--;
- pte_unmap(pte);
+ fe.pte--;
+ pte_unmap(fe.pte);
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
}
diff --git a/mm/internal.h b/mm/internal.h
index 6e9222a41c4a..1458bd904daf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,9 +35,7 @@
/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
-extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte);
+int do_swap_page(struct fault_env *fe, pte_t orig_pte);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index 0002898341a3..6c37f84212ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2013,13 +2013,11 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
* case, all we need to do here is to mark the page as writable and update
* any related book-keeping.
*/
-static inline int wp_page_reuse(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- struct page *page, int page_mkwrite,
- int dirty_shared)
- __releases(ptl)
+static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
+ struct page *page, int page_mkwrite, int dirty_shared)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
pte_t entry;
/*
* Clear the pages cpupid information as the existing
@@ -2029,12 +2027,12 @@ static inline int wp_page_reuse(struct mm_struct *mm,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, address, page_table, entry, 1))
- update_mmu_cache(vma, address, page_table);
- pte_unmap_unlock(page_table, ptl);
+ if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1))
+ update_mmu_cache(vma, fe->address, fe->pte);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (dirty_shared) {
struct address_space *mapping;
@@ -2080,30 +2078,31 @@ static inline int wp_page_reuse(struct mm_struct *mm,
* held to the old page, as well as updating the rmap.
* - In any case, unlock the PTL and drop the reference we took to the old page.
*/
-static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- pte_t orig_pte, struct page *old_page)
+static int wp_page_copy(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
{
+ struct vm_area_struct *vma = fe->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct page *new_page = NULL;
- spinlock_t *ptl = NULL;
pte_t entry;
int page_copied = 0;
- const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */
- const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */
+ const unsigned long mmun_start = fe->address & PAGE_MASK;
+ const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;
if (unlikely(anon_vma_prepare(vma)))
goto oom;
if (is_zero_pfn(pte_pfn(orig_pte))) {
- new_page = alloc_zeroed_user_highpage_movable(vma, address);
+ new_page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+ fe->address);
if (!new_page)
goto oom;
- cow_user_page(new_page, old_page, address, vma);
+ cow_user_page(new_page, old_page, fe->address, vma);
}
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
@@ -2116,8 +2115,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Re-check the pte - we dropped the lock
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte))) {
+ fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
dec_mm_counter_fast(mm,
@@ -2127,7 +2126,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/*
@@ -2136,8 +2135,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address, false);
+ ptep_clear_flush_notify(vma, fe->address, fe->pte);
+ page_add_new_anon_rmap(new_page, vma, fe->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2145,8 +2144,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
- update_mmu_cache(vma, address, page_table);
+ set_pte_at_notify(mm, fe->address, fe->pte, entry);
+ update_mmu_cache(vma, fe->address, fe->pte);
if (old_page) {
/*
* Only after switching the pte to the new page may
@@ -2183,7 +2182,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
if (new_page)
put_page(new_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
if (old_page) {
/*
@@ -2211,44 +2210,43 @@ oom:
* Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
* mapping
*/
-static int wp_pfn_shared(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- pmd_t *pmd)
+static int wp_pfn_shared(struct fault_env *fe, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
+
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf = {
.page = NULL,
- .pgoff = linear_page_index(vma, address),
- .virtual_address = (void __user *)(address & PAGE_MASK),
+ .pgoff = linear_page_index(vma, fe->address),
+ .virtual_address =
+ (void __user *)(fe->address & PAGE_MASK),
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
int ret;
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
if (ret & VM_FAULT_ERROR)
return ret;
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
/*
* We might have raced with another page fault while we
* released the pte_offset_map_lock.
*/
- if (!pte_same(*page_table, orig_pte)) {
- pte_unmap_unlock(page_table, ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
}
- return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
- NULL, 0, 0);
+ return wp_page_reuse(fe, orig_pte, NULL, 0, 0);
}
-static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table,
- pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
- struct page *old_page)
- __releases(ptl)
+static int wp_page_shared(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
int page_mkwrite = 0;
get_page(old_page);
@@ -2256,8 +2254,8 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;
- pte_unmap_unlock(page_table, ptl);
- tmp = do_page_mkwrite(vma, old_page, address);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ tmp = do_page_mkwrite(vma, old_page, fe->address);
if (unlikely(!tmp || (tmp &
(VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(old_page);
@@ -2269,19 +2267,18 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
put_page(old_page);
return 0;
}
page_mkwrite = 1;
}
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, page_mkwrite, 1);
+ return wp_page_reuse(fe, orig_pte, old_page, page_mkwrite, 1);
}
/*
@@ -2302,14 +2299,13 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte)
- __releases(ptl)
+static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *old_page;
- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, fe->address, orig_pte);
if (!old_page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -2320,12 +2316,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
- return wp_pfn_shared(mm, vma, address, page_table, ptl,
- orig_pte, pmd);
+ return wp_pfn_shared(fe, orig_pte);
- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}
/*
@@ -2335,13 +2329,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (PageAnon(old_page) && !PageKsm(old_page)) {
if (!trylock_page(old_page)) {
get_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
lock_page(old_page);
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
put_page(old_page);
return 0;
}
@@ -2353,16 +2347,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
- page_move_anon_rmap(old_page, vma, address);
+ page_move_anon_rmap(old_page, vma, fe->address);
unlock_page(old_page);
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, 0, 0);
+ return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
- return wp_page_shared(mm, vma, address, page_table, pmd,
- ptl, orig_pte, old_page);
+ return wp_page_shared(fe, orig_pte, old_page);
}
/*
@@ -2370,9 +2362,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
get_page(old_page);
- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}
static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2462,11 +2453,9 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+int do_swap_page(struct fault_env *fe, pte_t orig_pte)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct page *page, *swapcache;
struct mem_cgroup *memcg;
swp_entry_t entry;
@@ -2475,17 +2464,17 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
int exclusive = 0;
int ret = 0;
- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
+ if (!pte_unmap_same(vma->vm_mm, fe->pmd, fe->pte, orig_pte))
goto out;
entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
- print_bad_pte(vma, address, orig_pte, NULL);
+ print_bad_pte(vma, fe->address, orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
@@ -2494,14 +2483,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = lookup_swap_cache(entry);
if (!page) {
page = swapin_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vma, address);
+ GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
@@ -2510,7 +2500,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
@@ -2523,7 +2513,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
swapcache = page;
- locked = lock_page_or_retry(page, mm, flags);
+ locked = lock_page_or_retry(page, vma->vm_mm, fe->flags);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!locked) {
@@ -2540,14 +2530,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
goto out_page;
- page = ksm_might_need_to_copy(page, vma, address);
+ page = ksm_might_need_to_copy(page, vma, fe->address);
if (unlikely(!page)) {
ret = VM_FAULT_OOM;
page = swapcache;
goto out_page;
}
- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2555,8 +2546,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Back out if somebody else already faulted in this pte.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte)))
goto out_nomap;
if (unlikely(!PageUptodate(page))) {
@@ -2574,24 +2566,24 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
* must be called after the swap_free(), or it will never succeed.
*/
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+ if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- flags &= ~FAULT_FLAG_WRITE;
+ fe->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- set_pte_at(mm, address, page_table, pte);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
if (page == swapcache) {
- do_page_add_anon_rmap(page, vma, address, exclusive);
+ do_page_add_anon_rmap(page, vma, fe->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2614,22 +2606,22 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(swapcache);
}
- if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ if (fe->flags & FAULT_FLAG_WRITE) {
+ ret |= do_wp_page(fe, pte);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
}
/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
return ret;
out_nomap:
mem_cgroup_cancel_charge(page, memcg, false);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out_page:
unlock_page(page);
out_release:
@@ -2680,37 +2672,36 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
struct page *page;
- spinlock_t *ptl;
pte_t entry;
- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
/* Check if we need to add a guard page to the stack */
- if (check_stack_guard_page(vma, address) < 0)
+ if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;
/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
- entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm)) {
+ entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
vma->vm_page_prot));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
goto setpte;
}
@@ -2718,11 +2709,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, address);
+ page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!page)
goto oom;
- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;
/*
@@ -2736,30 +2727,30 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto release;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
- set_pte_at(mm, address, page_table, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
release:
mem_cgroup_cancel_charge(page, memcg, false);
@@ -2776,16 +2767,16 @@ oom:
* released depending on flags and vma->vm_ops->fault() return value.
* See filemap_fault() and __lock_page_retry().
*/
-static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
+ struct page *cow_page, struct page **page)
{
+ struct vm_area_struct *vma = fe->vma;
struct vm_fault vmf;
int ret;
- vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
vmf.pgoff = pgoff;
- vmf.flags = flags;
+ vmf.flags = fe->flags;
vmf.page = NULL;
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;
@@ -2816,38 +2807,36 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
/**
* do_set_pte - setup new PTE entry for given page and add reverse page mapping.
*
- * @vma: virtual memory area
- * @address: user virtual address
+ * @fe: fault environment
* @page: page to map
- * @pte: pointer to target page table entry
- * @write: true, if new entry is writable
- * @anon: true, if it's anonymous page
*
- * Caller must hold page table lock relevant for @pte.
+ * Caller must hold page table lock relevant for @fe->pte.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon)
+void do_set_pte(struct fault_env *fe, struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
+ /* copy-on-write page */
+ if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
}
- set_pte_at(vma->vm_mm, address, pte, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
/* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
}
static unsigned long fault_around_bytes __read_mostly =
@@ -2914,57 +2903,53 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pgoff_t pgoff, unsigned int flags)
+static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long start_addr, nr_pages, mask;
- pgoff_t max_pgoff;
- struct vm_fault vmf;
+ unsigned long address = fe->address, start_addr, nr_pages, mask;
+ pte_t *pte = fe->pte;
+ pgoff_t end_pgoff;
int off;
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
- start_addr = max(address & mask, vma->vm_start);
- off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- pte -= off;
- pgoff -= off;
+ start_addr = max(fe->address & mask, fe->vma->vm_start);
+ off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ fe->pte -= off;
+ start_pgoff -= off;
/*
- * max_pgoff is either end of page table or end of vma
- * or fault_around_pages() from pgoff, depending what is nearest.
+ * end_pgoff is either end of page table or end of vma
+ * or fault_around_pages() from start_pgoff, depending what is nearest.
*/
- max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ end_pgoff = start_pgoff -
+ ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
- max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
- pgoff + nr_pages - 1);
+ end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+ start_pgoff + nr_pages - 1);
/* Check if it makes any sense to call ->map_pages */
- while (!pte_none(*pte)) {
- if (++pgoff > max_pgoff)
- return;
- start_addr += PAGE_SIZE;
- if (start_addr >= vma->vm_end)
- return;
- pte++;
+ fe->address = start_addr;
+ while (!pte_none(*fe->pte)) {
+ if (++start_pgoff > end_pgoff)
+ goto out;
+ fe->address += PAGE_SIZE;
+ if (fe->address >= fe->vma->vm_end)
+ goto out;
+ fe->pte++;
}
- vmf.virtual_address = (void __user *) start_addr;
- vmf.pte = pte;
- vmf.pgoff = pgoff;
- vmf.max_pgoff = max_pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = __get_fault_gfp_mask(vma);
- vma->vm_ops->map_pages(vma, &vmf);
+ fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+out:
+ /* restore fault_env */
+ fe->pte = pte;
+ fe->address = address;
}
-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
- spinlock_t *ptl;
- pte_t *pte;
int ret = 0;
/*
@@ -2973,64 +2958,64 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
- if (!pte_same(*pte, orig_pte))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ do_fault_around(fe, pgoff);
+ if (!pte_same(*fe->pte, orig_pte))
goto unlock_out;
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
}
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, false, false);
+ do_set_pte(fe, fault_page);
unlock_page(fault_page);
unlock_out:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return ret;
}
-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
- pte_t *pte;
int ret;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!new_page)
return VM_FAULT_OOM;
- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
put_page(new_page);
return VM_FAULT_OOM;
}
- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(fe, pgoff, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
if (fault_page)
- copy_user_highpage(new_page, fault_page, address, vma);
+ copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3043,10 +3028,10 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
+ do_set_pte(fe, new_page);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3064,18 +3049,15 @@ uncharge_out:
return ret;
}
-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
struct address_space *mapping;
- spinlock_t *ptl;
- pte_t *pte;
int dirtied = 0;
int ret, tmp;
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
@@ -3085,7 +3067,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops->page_mkwrite) {
unlock_page(fault_page);
- tmp = do_page_mkwrite(vma, fault_page, address);
+ tmp = do_page_mkwrite(vma, fault_page, fe->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(fault_page);
@@ -3093,15 +3075,16 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, true, false);
- pte_unmap_unlock(pte, ptl);
+ do_set_pte(fe, fault_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3133,23 +3116,20 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+static int do_fault(struct fault_env *fe, pte_t orig_pte)
{
- pgoff_t pgoff = linear_page_index(vma, address);
+ struct vm_area_struct *vma = fe->vma;
+ pgoff_t pgoff = linear_page_index(vma, fe->address);
- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
- if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
+ if (!(fe->flags & FAULT_FLAG_WRITE))
+ return do_read_fault(fe, pgoff, orig_pte);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return do_cow_fault(fe, pgoff, orig_pte);
+ return do_shared_fault(fe, pgoff, orig_pte);
}
static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3167,11 +3147,10 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}
-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe, pte_t pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL;
- spinlock_t *ptl;
int page_nid = -1;
int last_cpupid;
int target_nid;
@@ -3191,10 +3170,10 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* page table entry is not accessible, so there would be no
* concurrent hardware modifications to the PTE.
*/
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, pte))) {
- pte_unmap_unlock(ptep, ptl);
+ fe->ptl = pte_lockptr(vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
goto out;
}
@@ -3203,18 +3182,18 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
- set_pte_at(mm, addr, ptep, pte);
- update_mmu_cache(vma, addr, ptep);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, fe->address, pte);
if (!page) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
/* TODO: handle PTE-mapped THP */
if (PageCompound(page)) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
@@ -3238,8 +3217,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
- pte_unmap_unlock(ptep, ptl);
+ target_nid = numa_migrate_prep(page, vma, fe->address, page_nid,
+ &flags);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (target_nid == -1) {
put_page(page);
goto out;
@@ -3259,24 +3239,24 @@ out:
return 0;
}
-static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+static int create_huge_pmd(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
if (vma_is_anonymous(vma))
- return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ return do_huge_pmd_anonymous_page(fe);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return vma->vm_ops->pmd_fault(vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}
-static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
- unsigned int flags)
+static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
{
- if (vma_is_anonymous(vma))
- return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
- if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_huge_pmd_wp_page(fe, orig_pmd);
+ if (fe->vma->vm_ops->pmd_fault)
+ return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}
@@ -3296,12 +3276,9 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- spinlock_t *ptl;
/*
* some architectures can have larger ptes than wordsize,
@@ -3311,37 +3288,34 @@ static int handle_pte_fault(struct mm_struct *mm,
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
- entry = *pte;
+ entry = *fe->pte;
barrier();
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (vma_is_anonymous(vma))
- return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
else
- return do_fault(mm, vma, address, pte, pmd,
- flags, entry);
+ return do_fault(fe, entry);
}
- return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
+ return do_swap_page(fe, entry);
}
if (pte_protnone(entry))
- return do_numa_page(mm, vma, address, entry, pte, pmd);
+ return do_numa_page(fe, entry);
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*pte, entry)))
+ fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, entry)))
goto unlock;
- if (flags & FAULT_FLAG_WRITE) {
+ if (fe->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
- return do_wp_page(mm, vma, address,
- pte, pmd, ptl, entry);
+ return do_wp_page(fe, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
- if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
- update_mmu_cache(vma, address, pte);
+ if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
+ fe->flags & FAULT_FLAG_WRITE)) {
+ update_mmu_cache(fe->vma, fe->address, fe->pte);
} else {
/*
* This is needed only for protection faults but the arch code
@@ -3349,11 +3323,11 @@ static int handle_pte_fault(struct mm_struct *mm,
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
- if (flags & FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ if (fe->flags & FAULT_FLAG_WRITE)
+ flush_tlb_fix_spurious_fault(fe->vma, fe->address);
}
unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
@@ -3366,51 +3340,42 @@ unlock:
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = flags,
+ };
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
- flags & FAULT_FLAG_INSTRUCTION,
- flags & FAULT_FLAG_REMOTE))
- return VM_FAULT_SIGSEGV;
-
- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, flags);
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
return VM_FAULT_OOM;
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
+ fe.pmd = pmd_alloc(mm, pud, address);
+ if (!fe.pmd)
return VM_FAULT_OOM;
- if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+ if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = create_huge_pmd(&fe);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pmd_t orig_pmd = *pmd;
+ pmd_t orig_pmd = *fe.pmd;
int ret;
barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
-
if (pmd_protnone(orig_pmd))
- return do_huge_pmd_numa_page(mm, vma, address,
- orig_pmd, pmd);
+ return do_huge_pmd_numa_page(&fe, orig_pmd);
- if (dirty && !pmd_write(orig_pmd)) {
- ret = wp_huge_pmd(mm, vma, address, pmd,
- orig_pmd, flags);
+ if ((fe.flags & FAULT_FLAG_WRITE) &&
+ !pmd_write(orig_pmd)) {
+ ret = wp_huge_pmd(&fe, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pmd_set_accessed(mm, vma, address, pmd,
- orig_pmd, dirty);
+ huge_pmd_set_accessed(&fe, orig_pmd);
return 0;
}
}
@@ -3421,7 +3386,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(pte_alloc(mm, pmd, address)))
+ if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
return VM_FAULT_OOM;
/*
* If a huge pmd materialized under us just retry later. Use
@@ -3434,7 +3399,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* through an atomic read in C, which is what pmd_trans_unstable()
* provides.
*/
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
@@ -3442,9 +3407,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().
*/
- pte = pte_offset_map(pmd, address);
+ fe.pte = pte_offset_map(fe.pmd, fe.address);
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return handle_pte_fault(&fe);
}
/*
@@ -3473,7 +3438,15 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
- ret = __handle_mm_fault(vma, address, flags);
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ flags & FAULT_FLAG_REMOTE))
+ return VM_FAULT_SIGSEGV;
+
+ if (unlikely(is_vm_hugetlb_page(vma)))
+ ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ else
+ ret = __handle_mm_fault(vma, address, flags);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
diff --git a/mm/nommu.c b/mm/nommu.c
index 102e257cc6c3..146c2139f506 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1811,7 +1811,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
BUG();
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 04/32] mm: postpone page table allocation until we have page to map
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (2 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 03/32] mm: introduce fault_env Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 05/32] rmap: support file thp Kirill A. Shutemov
` (28 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
The idea (and most of code) is borrowed again: from Hugh's patchset on
huge tmpfs[1].
Instead of allocation pte page table upfront, we postpone this until we
have page to map in hands. This approach opens possibility to map the
page as huge if filesystem supports this.
Comparing to Hugh's patch I've pushed page table allocation a bit
further: into do_set_pte(). This way we can postpone allocation even in
faultaround case without moving do_fault_around() after __do_fault().
do_set_pte() got renamed to alloc_set_pte() as it can allocate page
table if required.
[1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/mm.h | 10 +-
mm/filemap.c | 16 +--
mm/memory.c | 297 +++++++++++++++++++++++++++++++----------------------
3 files changed, 196 insertions(+), 127 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 696ec39b616b..afe79bb75c47 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -320,6 +320,13 @@ struct fault_env {
* Protects pte page table if 'pte'
* is not NULL, otherwise pmd.
*/
+ pgtable_t prealloc_pte; /* Pre-allocated pte page table.
+ * vm_ops->map_pages() calls
+ * alloc_set_pte() from atomic context.
+ * do_fault_around() pre-allocates
+ * page table to avoid allocation from
+ * atomic context.
+ */
};
/*
@@ -601,7 +608,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}
-void do_set_pte(struct fault_env *fe, struct page *page);
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page);
#endif
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index e32e5d70fc0c..7e982835d4ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2147,11 +2147,6 @@ void filemap_map_pages(struct fault_env *fe,
start_pgoff) {
if (iter.index > end_pgoff)
break;
- fe->pte += iter.index - last_pgoff;
- fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
- last_pgoff = iter.index;
- if (!pte_none(*fe->pte))
- goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2189,7 +2184,13 @@ repeat:
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- do_set_pte(fe, page);
+
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ if (fe->pte)
+ fe->pte += iter.index - last_pgoff;
+ last_pgoff = iter.index;
+ if (alloc_set_pte(fe, NULL, page))
+ goto unlock;
unlock_page(page);
goto next;
unlock:
@@ -2197,6 +2198,9 @@ unlock:
skip:
put_page(page);
next:
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*fe->pmd))
+ break;
if (iter.index == end_pgoff)
break;
}
diff --git a/mm/memory.c b/mm/memory.c
index 6c37f84212ee..c31c52507956 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2679,8 +2679,6 @@ static int do_anonymous_page(struct fault_env *fe)
struct page *page;
pte_t entry;
- pte_unmap(fe->pte);
-
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2689,6 +2687,23 @@ static int do_anonymous_page(struct fault_env *fe)
if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;
+ /*
+ * Use pte_alloc() instead of pte_alloc_map(). We can't run
+ * pte_offset_map() on pmds where a huge pmd might be created
+ * from a different thread.
+ *
+ * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+ * parallel threads are excluded by other means.
+ *
+ * Here we only have down_read(mmap_sem).
+ */
+ if (pte_alloc(vma->vm_mm, fe->pmd, fe->address))
+ return VM_FAULT_OOM;
+
+ /* See the comment in pte_alloc_one_map() */
+ if (unlikely(pmd_trans_unstable(fe->pmd)))
+ return 0;
+
/* Use the zero-page for reads */
if (!(fe->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -2804,23 +2819,76 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
return ret;
}
+static int pte_alloc_one_map(struct fault_env *fe)
+{
+ struct vm_area_struct *vma = fe->vma;
+
+ if (!pmd_none(*fe->pmd))
+ goto map_pte;
+ if (fe->prealloc_pte) {
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
+ goto map_pte;
+ }
+
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+ spin_unlock(fe->ptl);
+ fe->prealloc_pte = 0;
+ } else if (unlikely(pte_alloc(vma->vm_mm, fe->pmd, fe->address))) {
+ return VM_FAULT_OOM;
+ }
+map_pte:
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return VM_FAULT_NOPAGE;
+
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ return 0;
+}
+
/**
- * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ * alloc_set_pte - setup new PTE entry for given page and add reverse page
+ * mapping. If needed, the fucntion allocates page table or use pre-allocated.
*
* @fe: fault environment
+ * @memcg: memcg to charge page (only for private mappings)
* @page: page to map
*
- * Caller must hold page table lock relevant for @fe->pte.
+ * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct fault_env *fe, struct page *page)
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page)
{
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ if (!fe->pte) {
+ int ret = pte_alloc_one_map(fe);
+ if (ret)
+ return ret;
+ }
+
+ /* Re-check under ptl */
+ if (unlikely(!pte_none(*fe->pte)))
+ return VM_FAULT_NOPAGE;
+
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
@@ -2829,6 +2897,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, fe->address, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
@@ -2837,6 +2907,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, fe->address, fe->pte);
+
+ return 0;
}
static unsigned long fault_around_bytes __read_mostly =
@@ -2903,19 +2975,17 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
+static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long address = fe->address, start_addr, nr_pages, mask;
- pte_t *pte = fe->pte;
+ unsigned long address = fe->address, nr_pages, mask;
pgoff_t end_pgoff;
- int off;
+ int off, ret = 0;
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
- start_addr = max(fe->address & mask, fe->vma->vm_start);
- off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- fe->pte -= off;
+ fe->address = max(address & mask, fe->vma->vm_start);
+ off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;
/*
@@ -2923,30 +2993,45 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
* or fault_around_pages() from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);
- /* Check if it makes any sense to call ->map_pages */
- fe->address = start_addr;
- while (!pte_none(*fe->pte)) {
- if (++start_pgoff > end_pgoff)
- goto out;
- fe->address += PAGE_SIZE;
- if (fe->address >= fe->vma->vm_end)
- goto out;
- fe->pte++;
+ if (pmd_none(*fe->pmd)) {
+ fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
+ smp_wmb(); /* See comment in __pte_alloc() */
}
fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+
+ /* preallocated pagetable is unused: free it */
+ if (fe->prealloc_pte) {
+ pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+ fe->prealloc_pte = 0;
+ }
+ /* Huge page is mapped? Page fault is solved */
+ if (pmd_trans_huge(*fe->pmd)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }
+
+ /* ->map_pages() haven't done anything useful. Cold page cache? */
+ if (!fe->pte)
+ goto out;
+
+ /* check if the page fault is solved */
+ fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
+ if (!pte_none(*fe->pte))
+ ret = VM_FAULT_NOPAGE;
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
- /* restore fault_env */
- fe->pte = pte;
fe->address = address;
+ fe->pte = NULL;
+ return ret;
}
-static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -2958,33 +3043,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- do_fault_around(fe, pgoff);
- if (!pte_same(*fe->pte, orig_pte))
- goto unlock_out;
- pte_unmap_unlock(fe->pte, fe->ptl);
+ ret = do_fault_around(fe, pgoff);
+ if (ret)
+ return ret;
}
ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- unlock_page(fault_page);
- put_page(fault_page);
- return ret;
- }
- do_set_pte(fe, fault_page);
unlock_page(fault_page);
-unlock_out:
- pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ put_page(fault_page);
return ret;
}
-static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
@@ -3012,26 +3089,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, memcg, new_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- if (fault_page) {
- unlock_page(fault_page);
- put_page(fault_page);
- } else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
- }
- goto uncharge_out;
- }
- do_set_pte(fe, new_page);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3042,6 +3102,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
*/
i_mmap_unlock_read(vma->vm_file->f_mapping);
}
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
@@ -3049,7 +3111,7 @@ uncharge_out:
return ret;
}
-static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -3075,16 +3137,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
}
}
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY))) {
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(fe, fault_page);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3116,20 +3177,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct fault_env *fe, pte_t orig_pte)
+static int do_fault(struct fault_env *fe)
{
struct vm_area_struct *vma = fe->vma;
pgoff_t pgoff = linear_page_index(vma, fe->address);
- pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(fe->flags & FAULT_FLAG_WRITE))
- return do_read_fault(fe, pgoff, orig_pte);
+ return do_read_fault(fe, pgoff);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(fe, pgoff, orig_pte);
- return do_shared_fault(fe, pgoff, orig_pte);
+ return do_cow_fault(fe, pgoff);
+ return do_shared_fault(fe, pgoff);
}
static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3269,37 +3329,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
+ * concurrent faults).
*
- * The mmap_sem may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * The mmap_sem may have been released depending on flags and our return value.
+ * See filemap_fault() and __lock_page_or_retry().
*/
static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
- * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
- * The code below just needs a consistent view for the ifs and
- * we later double check anyway with the ptl lock held. So here
- * a barrier will do.
- */
- entry = *fe->pte;
- barrier();
- if (!pte_present(entry)) {
+ if (unlikely(pmd_none(*fe->pmd))) {
+ /*
+ * Leave __pte_alloc() until later: because vm_ops->fault may
+ * want to allocate huge page, and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ } else {
+ /* See comment in pte_alloc_one_map() */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_sem read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ fe->pte = pte_offset_map(fe->pmd, fe->address);
+
+ entry = *fe->pte;
+
+ /*
+ * some architectures can have larger ptes than wordsize,
+ * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
+ * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
+ * atomic accesses. The code below just needs a consistent
+ * view for the ifs and we later double check anyway with the
+ * ptl lock held. So here a barrier will do.
+ */
+ barrier();
if (pte_none(entry)) {
- if (vma_is_anonymous(fe->vma))
- return do_anonymous_page(fe);
- else
- return do_fault(fe, entry);
+ pte_unmap(fe->pte);
+ fe->pte = NULL;
}
- return do_swap_page(fe, entry);
}
+ if (!fe->pte) {
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
+ else
+ return do_fault(fe);
+ }
+
+ if (!pte_present(entry))
+ return do_swap_page(fe, entry);
+
if (pte_protnone(entry))
return do_numa_page(fe, entry);
@@ -3381,34 +3466,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
}
- /*
- * Use pte_alloc() instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- fe.pte = pte_offset_map(fe.pmd, fe.address);
-
return handle_pte_fault(&fe);
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 05/32] rmap: support file thp
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (3 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 04/32] mm: postpone page table allocation until we have page to map Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 06/32] mm: introduce do_set_pmd() Kirill A. Shutemov
` (27 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k page. That's not efficient, but it's not obvious
how we can optimize this. We can look into optimization later.
PG_double_map optimization doesn't work for file pages since lifecycle
of file pages is different comparing to anon pages: file page can be
mapped again at any time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/rmap.h | 2 +-
mm/huge_memory.c | 10 +++++++---
mm/memory.c | 4 ++--
mm/migrate.c | 2 +-
mm/rmap.c | 48 +++++++++++++++++++++++++++++++++++-------------
mm/util.c | 6 ++++++
6 files changed, 52 insertions(+), 20 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 49eb4f8ebac9..5704f101b52e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -165,7 +165,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
-void page_add_file_rmap(struct page *);
+void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4014b484737..aab10c81de12 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3262,18 +3262,22 @@ static void __split_huge_page(struct page *page, struct list_head *list)
int total_mapcount(struct page *page)
{
- int i, ret;
+ int i, compound, ret;
VM_BUG_ON_PAGE(PageTail(page), page);
if (likely(!PageCompound(page)))
return atomic_read(&page->_mapcount) + 1;
- ret = compound_mapcount(page);
+ compound = compound_mapcount(page);
if (PageHuge(page))
- return ret;
+ return compound;
+ ret = compound;
for (i = 0; i < HPAGE_PMD_NR; i++)
ret += atomic_read(&page[i]._mapcount) + 1;
+ /* File pages has compound_mapcount included in _mapcount */
+ if (!PageAnon(page))
+ return ret - compound * HPAGE_PMD_NR;
if (PageDoubleMap(page))
ret -= HPAGE_PMD_NR;
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index c31c52507956..49c55446576a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1438,7 +1438,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
retval = 0;
@@ -2901,7 +2901,7 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3eafd17fb398..b8f8363df8da 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,7 +170,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, false);
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
diff --git a/mm/rmap.c b/mm/rmap.c
index 7f8652720a25..76d8c9269ac6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1271,18 +1271,34 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
- if (atomic_inc_and_test(&page->_mapcount)) {
- __inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_inc_and_test(&page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_inc_and_test(&page->_mapcount))
+ goto out;
}
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+out:
unlock_page_memcg(page);
}
-static void page_remove_file_rmap(struct page *page)
+static void page_remove_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
/* Hugepages are not counted in NR_FILE_MAPPED for now. */
@@ -1293,15 +1309,24 @@ static void page_remove_file_rmap(struct page *page)
}
/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ goto out;
+ }
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
if (unlikely(PageMlocked(page)))
@@ -1357,11 +1382,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- if (!PageAnon(page)) {
- VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
- page_remove_file_rmap(page);
- return;
- }
+ if (!PageAnon(page))
+ return page_remove_file_rmap(page, compound);
if (compound)
return page_remove_anon_compound_rmap(page);
diff --git a/mm/util.c b/mm/util.c
index 6cc81e7b8705..b7ac1d708cb0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -386,6 +386,12 @@ int __page_mapcount(struct page *page)
int ret;
ret = atomic_read(&page->_mapcount) + 1;
+ /*
+ * For file THP page->_mapcount contains total number of mapping
+ * of the page: no need to look into compound_mapcount.
+ */
+ if (!PageAnon(page) && !PageHuge(page))
+ return ret;
page = compound_head(page);
ret += atomic_read(compound_mapcount_ptr(page)) + 1;
if (PageDoubleMap(page))
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 06/32] mm: introduce do_set_pmd()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (4 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 05/32] rmap: support file thp Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 07/32] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
` (26 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
With postponed page table allocation we have chance to setup huge pages.
do_set_pte() calls do_set_pmd() if following criteria met:
- page is compound;
- pmd entry in pmd_none();
- vma has suitable size and alignment;
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 8 ------
mm/memory.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/migrate.c | 3 +--
4 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 05efca1619b6..95f83770443c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -142,6 +142,8 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
struct page *get_huge_zero_page(void);
+#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aab10c81de12..cece53a94192 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -793,14 +793,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}
-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static inline struct list_head *page_deferred_list(struct page *page)
{
/*
diff --git a/mm/memory.c b/mm/memory.c
index 49c55446576a..ca45e9b19ad9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2859,6 +2859,66 @@ map_pte:
return 0;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long haddr)
+{
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return false;
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return false;
+ return true;
+}
+
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
+ pmd_t entry;
+ int i, ret;
+
+ if (!transhuge_vma_suitable(vma, haddr))
+ return VM_FAULT_FALLBACK;
+
+ ret = VM_FAULT_FALLBACK;
+ page = compound_head(page);
+
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd)))
+ goto out;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ flush_icache_page(vma, page + i);
+
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+
+ add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page, true);
+
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+
+ update_mmu_cache_pmd(vma, haddr, fe->pmd);
+
+ /* fault is handled */
+ ret = 0;
+out:
+ spin_unlock(fe->ptl);
+ return ret;
+}
+#else
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ BUILD_BUG();
+ return 0;
+}
+#endif
+
/**
* alloc_set_pte - setup new PTE entry for given page and add reverse page
* mapping. If needed, the fucntion allocates page table or use pre-allocated.
@@ -2878,9 +2938,19 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ int ret;
+
+ if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+ /* THP on COW? */
+ VM_BUG_ON_PAGE(memcg, page);
+
+ ret = do_set_pmd(fe, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
if (!fe->pte) {
- int ret = pte_alloc_one_map(fe);
+ ret = pte_alloc_one_map(fe);
if (ret)
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b8f8363df8da..f3f65f00448f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,8 +1820,7 @@ fail_putback:
}
orig_entry = *pmd;
- entry = mk_pmd(new_page, vma->vm_page_prot);
- entry = pmd_mkhuge(entry);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 07/32] thp, vmstats: add counters for huge file pages
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (5 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 06/32] mm: introduce do_set_pmd() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 08/32] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
` (25 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
THP_FILE_ALLOC: how many times huge page was allocated and put page
cache.
THP_FILE_MAPPED: how many times file huge page was mapped.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/vm_event_item.h | 7 +++++++
mm/memory.c | 1 +
mm/vmstat.c | 2 ++
3 files changed, 10 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ec084321fe09..42604173f122 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_FILE_ALLOC,
+ THP_FILE_MAPPED,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
@@ -100,4 +102,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
+#endif
+
#endif /* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index ca45e9b19ad9..23de0567db18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2907,6 +2907,7 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
/* fault is handled */
ret = 0;
+ count_vm_event(THP_FILE_MAPPED);
out:
spin_unlock(fe->ptl);
return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 38ad49532b6f..904ac95fbf00 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -846,6 +846,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_file_alloc",
+ "thp_file_mapped",
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 08/32] thp: support file pages in zap_huge_pmd()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (6 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 07/32] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 09/32] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
` (24 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
split_huge_pmd() for file mappings (and DAX too) is implemented by just
clearing pmd entry as we can re-fill this area from page cache on pte
level later.
This means we don't need deposit page tables when file THP is mapped.
Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
file THP PMD.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cece53a94192..29ba922a9c26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1696,10 +1696,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page = pmd_page(orig_pmd);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
- pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
- atomic_long_dec(&tlb->mm->nr_ptes);
+ if (PageAnon(page)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ pte_free(tlb->mm, pgtable);
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ } else {
+ add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ }
spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 09/32] thp: handle file pages in split_huge_pmd()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (7 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 08/32] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 10/32] thp: handle file COW faults Kirill A. Shutemov
` (23 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Splitting THP PMD is simple: just unmap it as in DAX case. This way we
can avoid memory overhead on page table allocation to deposit.
It's probably a good idea to try to allocation page table with
GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
but clearing pmd should be good enough for now.
Unlike DAX, we also remove the page from rmap and drop reference.
pmd_young() is transfered to PageReferenced().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29ba922a9c26..df7b620afd7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2940,10 +2940,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
count_vm_event(THP_SPLIT_PMD);
- if (vma_is_dax(vma)) {
- pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ if (!vma_is_anonymous(vma)) {
+ _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
if (is_huge_zero_pmd(_pmd))
put_huge_zero_page();
+ if (vma_is_dax(vma))
+ return;
+ page = pmd_page(_pmd);
+ if (!PageReferenced(page) && pmd_young(_pmd))
+ SetPageReferenced(page);
+ page_remove_rmap(page, true);
+ put_page(page);
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
return;
} else if (is_huge_zero_pmd(*pmd)) {
return __split_huge_zero_page_pmd(vma, haddr, pmd);
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 10/32] thp: handle file COW faults
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (8 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 09/32] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 11/32] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
` (22 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
File COW for THP is handled on pte level: just split the pmd.
It's not clear how benefitial would be allocation of huge pages on COW
faults. And it would require some code to make them work.
I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably
overkill.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 23de0567db18..cba29c033702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3388,6 +3388,11 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
if (fe->vma->vm_ops->pmd_fault)
return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
fe->flags);
+
+ /* COW handled on pte level: split pmd */
+ VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
+ split_huge_pmd(fe->vma, fe->pmd, fe->address);
+
return VM_FAULT_FALLBACK;
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 11/32] thp: skip file huge pmd on copy_huge_pmd()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (9 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 10/32] thp: handle file COW faults Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 12/32] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
` (21 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
copy_page_range() has a check for "Don't copy ptes where a page fault
will fill them correctly." It works on VMA level. We still copy all page
table entries from private mappings, even if they map page cache.
We can simplify copy_huge_pmd() a bit by skipping file PMDs.
We don't map file private pages with PMDs, so they only can map page
cache. It's safe to skip them as they can be re-faulted later.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index df7b620afd7f..6acb64e6ce79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1094,14 +1094,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable = NULL;
- int ret;
+ int ret = -ENOMEM;
- if (!vma_is_dax(vma)) {
- ret = -ENOMEM;
- pgtable = pte_alloc_one(dst_mm, addr);
- if (unlikely(!pgtable))
- goto out;
- }
+ /* Skip if can be re-fill on fault */
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1109,7 +1110,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) {
+ if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
}
@@ -1132,16 +1133,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
- if (!vma_is_dax(vma)) {
- /* thp accounting separate from pmd_devmap accounting */
- src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
- get_page(src_page);
- page_dup_rmap(src_page, true);
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&dst_mm->nr_ptes);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ get_page(src_page);
+ page_dup_rmap(src_page, true);
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&dst_mm->nr_ptes);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 12/32] thp: prepare change_huge_pmd() for file thp
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (10 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 11/32] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 13/32] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
` (20 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
change_huge_pmd() has assert which is not relvant for file page.
For shared mapping it's perfectly fine to have page table entry
writable, without explicit mkwrite.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6acb64e6ce79..cff5bc64d006 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,7 +1795,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(!preserve_write && pmd_write(entry));
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
+ pmd_write(entry));
}
spin_unlock(ptl);
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 13/32] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (11 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 12/32] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 14/32] thp: file pages support for split_huge_page() Kirill A. Shutemov
` (19 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary.
During split we munlock the huge page which requires rmap walk.
rmap wants to take the lock on its own.
Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index bd2e1a533bc1..df3976af9302 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -678,6 +678,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ vma_adjust_trans_huge(vma, start, end, adjust_next);
+
if (file) {
mapping = file->f_mapping;
root = &mapping->i_mmap;
@@ -698,8 +700,6 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
- vma_adjust_trans_huge(vma, start, end, adjust_next);
-
anon_vma = vma->anon_vma;
if (!anon_vma && adjust_next)
anon_vma = next->anon_vma;
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 14/32] thp: file pages support for split_huge_page()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (12 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 13/32] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 15/32] thp, mlock: do not mlock PTE-mapped file huge pages Kirill A. Shutemov
` (18 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Basic scheme is the same as for anon THP.
Main differences:
- File pages are on radix-tree, so we have head->_count offset by
HPAGE_PMD_NR. The count got distributed to small pages during split.
- mapping->tree_lock prevents non-lockless access to pages under split
over radix-tree;
- Lockless access is prevented by setting the head->_count to 0 during
split;
- After split, some pages can be beyond i_size. We drop them from
radix-tree.
- We don't setup migration entries. Just unmap pages. It helps
handling cases when i_size is in the middle of the page: no need
handle unmap pages beyond i_size manually.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/gup.c | 2 +
mm/huge_memory.c | 160 +++++++++++++++++++++++++++++++++++++++----------------
mm/mempolicy.c | 2 +
3 files changed, 119 insertions(+), 45 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 39f751e1cb33..52f091e28f76 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -287,6 +287,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
+ if (pmd_none(*pmd))
+ return no_page_table(vma, flags);
}
return ret ? ERR_PTR(ret) :
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cff5bc64d006..5daad812d71e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -31,6 +31,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -3145,12 +3146,15 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
static void freeze_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
- TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
+ enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+ TTU_RMAP_LOCKED;
int i, ret;
VM_BUG_ON_PAGE(!PageHead(page), page);
+ if (PageAnon(page))
+ ttu_flags |= TTU_MIGRATION;
+
/* We only need TTU_SPLIT_HUGE_PMD once */
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
for (i = 1; !ret && i < HPAGE_PMD_NR; i++) {
@@ -3160,7 +3164,7 @@ static void freeze_page(struct page *page)
ret = try_to_unmap(page + i, ttu_flags);
}
- VM_BUG_ON(ret);
+ VM_BUG_ON_PAGE(ret, page + i - 1);
}
static void unfreeze_page(struct page *page)
@@ -3182,15 +3186,20 @@ static void __split_huge_page_tail(struct page *head, int tail,
/*
* tail_page->_count is zero and not changing from under us. But
* get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc(), we
- * would then run atomic_set() concurrently with
+ * tail_page. If we used atomic_set() below instead of atomic_inc() or
+ * atomic_add(), we would then run atomic_set() concurrently with
* get_page_unless_zero(), and atomic_set() is implemented in C not
* using locked ops. spin_unlock on x86 sometime uses locked ops
* because of PPro errata 66, 92, so unless somebody can guarantee
* atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc().
+ * it's safer to use atomic_inc()/atomic_add().
*/
- page_ref_inc(page_tail);
+ if (PageAnon(head)) {
+ page_ref_inc(page_tail);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(page_tail, 2);
+ }
page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
@@ -3226,25 +3235,44 @@ static void __split_huge_page_tail(struct page *head, int tail,
lru_add_page_tail(head, page_tail, lruvec, list);
}
-static void __split_huge_page(struct page *page, struct list_head *list)
+static void __split_huge_page(struct page *page, struct list_head *list,
+ unsigned long flags)
{
struct page *head = compound_head(page);
struct zone *zone = page_zone(head);
struct lruvec *lruvec;
+ pgoff_t end = -1;
int i;
- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(head, zone);
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
- for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+ if (!PageAnon(page))
+ end = DIV_ROUND_UP(i_size_read(head->mapping->host), PAGE_SIZE);
+
+ for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
+ /* Some pages can be beyond i_size: drop them from page cache */
+ if (head[i].index >= end) {
+ __ClearPageDirty(head + i);
+ __delete_from_page_cache(head + i, NULL);
+ put_page(head + i);
+ }
+ }
ClearPageCompound(head);
- spin_unlock_irq(&zone->lru_lock);
+ /* See comment in __split_huge_page_tail() */
+ if (PageAnon(head)) {
+ page_ref_inc(head);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(head, 2);
+ spin_unlock(&head->mapping->tree_lock);
+ }
+
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
@@ -3311,36 +3339,54 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
- struct anon_vma *anon_vma;
- int count, mapcount, ret;
+ struct anon_vma *anon_vma = NULL;
+ struct address_space *mapping = NULL;
+ int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);
- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(head);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (PageAnon(head)) {
+ /*
+ * The caller does not necessarily hold an mmap_sem that would
+ * prevent the anon_vma disappearing so we first we take a
+ * reference to it and then lock the anon_vma for write. This
+ * is similar to page_lock_anon_vma_read except the write lock
+ * is taken to serialise against parallel split or collapse
+ * operations.
+ */
+ anon_vma = page_get_anon_vma(head);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ extra_pins = 0;
+ mapping = NULL;
+ anon_vma_lock_write(anon_vma);
+ } else {
+ mapping = head->mapping;
+
+ /* Truncated ? */
+ if (!mapping) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* Addidional pins from radix tree */
+ extra_pins = HPAGE_PMD_NR;
+ anon_vma = NULL;
+ i_mmap_lock_read(mapping);
}
- anon_vma_lock_write(anon_vma);
/*
* Racy check if we can split the page, before freeze_page() will
* split PMDs
*/
- if (total_mapcount(head) != page_count(head) - 1) {
+ if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3353,35 +3399,60 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mlocked)
lru_add_drain();
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+
+ if (mapping) {
+ void **pslot;
+
+ spin_lock(&mapping->tree_lock);
+ pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ page_index(head));
+ /*
+ * Check if the head page is present in radix tree.
+ * We assume all tail are present too, if head is there.
+ */
+ if (radix_tree_deref_slot_protected(pslot,
+ &mapping->tree_lock) != head)
+ goto fail;
+ }
+
/* Prevent deferred_split_scan() touching ->_count */
- spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+ spin_lock(&pgdata->split_queue_lock);
count = page_count(head);
mapcount = total_mapcount(head);
- if (!mapcount && count == 1) {
+ if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- __split_huge_page(page, list);
+ spin_unlock(&pgdata->split_queue_lock);
+ __split_huge_page(page, list, flags);
ret = 0;
- } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- pr_alert("total_mapcount: %u, page_count(): %u\n",
- mapcount, count);
- if (PageTail(page))
- dump_page(head, NULL);
- dump_page(page, "total_mapcount(head) > 0");
- BUG();
} else {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+ pr_alert("total_mapcount: %u, page_count(): %u\n",
+ mapcount, count);
+ if (PageTail(page))
+ dump_page(head, NULL);
+ dump_page(page, "total_mapcount(head) > 0");
+ BUG();
+ }
+ spin_unlock(&pgdata->split_queue_lock);
+fail: if (mapping)
+ spin_unlock(&mapping->tree_lock);
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
ret = -EBUSY;
}
out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
+ if (anon_vma) {
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+ }
+ if (mapping)
+ i_mmap_unlock_read(mapping);
out:
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
@@ -3504,8 +3575,7 @@ static int split_huge_pages_set(void *data, u64 val)
if (zone != page_zone(page))
goto next;
- if (!PageHead(page) || !PageAnon(page) ||
- PageHuge(page))
+ if (!PageHead(page) || PageHuge(page) || !PageLRU(page))
goto next;
total++;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cc01bc950a..148143974c5b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -515,6 +515,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
}
}
+ if (pmd_trans_unstable(pmd))
+ return 0;
retry:
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 15/32] thp, mlock: do not mlock PTE-mapped file huge pages
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (13 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 14/32] thp: file pages support for split_huge_page() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 16/32] vmscan: split file huge pages before paging them out Kirill A. Shutemov
` (17 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTE. This way we can avoid mlock leak into
non-mlocked vma on split.
We rely on PageDoubleMap() under lock_page() to check if the the page
may be PTE mapped. PG_double_map is set by page_add_file_rmap() when the
page mapped with PTEs.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/page-flags.h | 13 ++++++++++++-
mm/huge_memory.c | 27 ++++++++++++++++++++-------
mm/mmap.c | 6 ++++++
mm/page_alloc.c | 2 ++
mm/rmap.c | 16 ++++++++++++++--
5 files changed, 54 insertions(+), 10 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..517707ae8cd1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -544,6 +544,17 @@ static inline int PageDoubleMap(struct page *page)
return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
}
+static inline void SetPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline void ClearPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ clear_bit(PG_double_map, &page[1].flags);
+}
static inline int TestSetPageDoubleMap(struct page *page)
{
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -560,7 +571,7 @@ static inline int TestClearPageDoubleMap(struct page *page)
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
TESTPAGEFLAG_FALSE(TransTail)
-TESTPAGEFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(DoubleMap)
TESTSETFLAG_FALSE(DoubleMap)
TESTCLEARFLAG_FALSE(DoubleMap)
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5daad812d71e..001687987777 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1439,6 +1439,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* We don't mlock() pte-mapped THPs. This way we can avoid
* leaking mlocked pages into non-VM_LOCKED VMAs.
*
+ * For anon THP:
+ *
* In most cases the pmd is the only mapping of the page as we
* break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
* writable private mappings in populate_vma_page_range().
@@ -1446,15 +1448,26 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* The only scenario when we have the page shared here is if we
* mlocking read-only mapping shared over fork(). We skip
* mlocking such pages.
+ *
+ * For file THP:
+ *
+ * We can expect PageDoubleMap() to be stable under page lock:
+ * for file pages we set it in page_add_file_rmap(), which
+ * requires page to be locked.
*/
- if (compound_mapcount(page) == 1 && !PageDoubleMap(page) &&
- page->mapping && trylock_page(page)) {
- lru_add_drain();
- if (page->mapping)
- mlock_vma_page(page);
- unlock_page(page);
- }
+
+ if (PageAnon(page) && compound_mapcount(page) != 1)
+ goto skip_mlock;
+ if (PageDoubleMap(page) || !page->mapping)
+ goto skip_mlock;
+ if (!trylock_page(page))
+ goto skip_mlock;
+ lru_add_drain();
+ if (page->mapping && !PageDoubleMap(page))
+ mlock_vma_page(page);
+ unlock_page(page);
}
+skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
diff --git a/mm/mmap.c b/mm/mmap.c
index df3976af9302..0a68cf0b37e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2584,6 +2584,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
/* drop PG_Mlocked flag for over-mapped range */
for (tmp = vma; tmp->vm_start >= start + size;
tmp = tmp->vm_next) {
+ /*
+ * Split pmd and munlock page on the border
+ * of the range.
+ */
+ vma_adjust_trans_huge(tmp, start, start + size, 0);
+
munlock_vma_pages_range(tmp,
max(tmp->vm_start, start),
min(tmp->vm_end, start + size));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d5d3a3..4243b400b331 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1036,6 +1036,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (PageAnon(page))
page->mapping = NULL;
+ if (compound)
+ ClearPageDoubleMap(page);
bad += free_pages_check(page);
for (i = 1; i < (1 << order); i++) {
if (compound)
diff --git a/mm/rmap.c b/mm/rmap.c
index 76d8c9269ac6..f67e765869f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1285,6 +1285,12 @@ void page_add_file_rmap(struct page *page, bool compound)
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
} else {
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ SetPageDoubleMap(compound_head(page));
+ if (PageMlocked(page))
+ clear_page_mlock(compound_head(page));
+ }
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
}
@@ -1458,8 +1464,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
- /* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_page(page);
+ /* PTE-mapped THP are never mlocked */
+ if (!PageTransCompound(page)) {
+ /*
+ * Holding pte lock, we do *not* need
+ * mmap_sem here
+ */
+ mlock_vma_page(page);
+ }
ret = SWAP_MLOCK;
goto out_unmap;
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 16/32] vmscan: split file huge pages before paging them out
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (14 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 15/32] thp, mlock: do not mlock PTE-mapped file huge pages Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 17/32] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
` (16 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
This is preparation of vmscan for file huge pages. We cannot write out
huge pages, so we need to split them on the way out.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/vmscan.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e9fe17c96ef8..df56f6a2dbfe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1055,8 +1055,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/* Adding to swap updated mapping */
mapping = page_mapping(page);
+ } else if (unlikely(PageTransHuge(page))) {
+ /* Split file THP */
+ if (split_huge_page_to_list(page, page_list))
+ goto keep_locked;
}
+ VM_BUG_ON_PAGE(PageTransHuge(page), page);
+
/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 17/32] page-flags: relax policy for PG_mappedtodisk and PG_reclaim
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (15 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 16/32] vmscan: split file huge pages before paging them out Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 18/32] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
` (15 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
These flags are in use for file THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/page-flags.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 517707ae8cd1..9d518876dd6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -292,11 +292,11 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
*/
TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
/* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
+ TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 18/32] radix-tree: implement radix_tree_maybe_preload_order()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (16 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 17/32] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:40 ` [PATCHv8 19/32] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
` (14 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
The new helper is similar to radix_tree_maybe_preload(), but tries to
preload number of nodes required to insert (1 << order) continuous
naturally-aligned elements.
This is required to push huge pages into pagecache.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/radix-tree.h | 1 +
lib/radix-tree.c | 68 ++++++++++++++++++++++++++++++++++++++++------
2 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 51a97ac8bfbf..b6e73ea314d9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -296,6 +296,7 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1624c4117961..448abbf04e87 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -42,6 +42,9 @@
*/
static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+/* Number of nodes in fully populated tree of given height */
+static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+
/*
* Radix tree node cache.
*/
@@ -296,7 +299,7 @@ radix_tree_node_free(struct radix_tree_node *node)
* To make use of this facility, the radix tree must be initialised without
* __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload(gfp_t gfp_mask, int nr)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -304,14 +307,14 @@ static int __radix_tree_preload(gfp_t gfp_mask)
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- while (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ while (rtp->nr < nr) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- if (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ if (rtp->nr < nr) {
node->private_data = rtp->nodes;
rtp->nodes = node;
rtp->nr++;
@@ -337,7 +340,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
}
EXPORT_SYMBOL(radix_tree_preload);
@@ -349,7 +352,7 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfpflags_allow_blocking(gfp_mask))
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
@@ -357,6 +360,51 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
EXPORT_SYMBOL(radix_tree_maybe_preload);
/*
+ * The same as function above, but preload number of nodes required to insert
+ * (1 << order) continuous naturally-aligned elements.
+ */
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
+{
+ unsigned long nr_subtrees;
+ int nr_nodes, subtree_height;
+
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ if (!gfpflags_allow_blocking(gfp_mask)) {
+ preempt_disable();
+ return 0;
+ }
+
+ /*
+ * Calculate number and height of fully populated subtrees it takes to
+ * store (1 << order) elements.
+ */
+ nr_subtrees = 1 << order;
+ for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
+ subtree_height++)
+ nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
+
+ /*
+ * The worst case is zero height tree with a single item at index 0 and
+ * then inserting items starting at ULONG_MAX - (1 << order).
+ *
+ * This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
+ * 0-index item.
+ */
+ nr_nodes = RADIX_TREE_MAX_PATH;
+
+ /* Plus branch to fully populated subtrees. */
+ nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
+
+ /* Root node is shared. */
+ nr_nodes--;
+
+ /* Plus nodes required to build subtrees. */
+ nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
+
+ return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+
+/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
*/
@@ -1576,12 +1624,16 @@ static __init unsigned long __maxindex(unsigned int height)
return ~0UL >> shift;
}
-static __init void radix_tree_init_maxindex(void)
+static __init void radix_tree_init_arrays(void)
{
- unsigned int i;
+ unsigned int i, j;
for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
height_to_maxindex[i] = __maxindex(i);
+ for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
+ for (j = i; j > 0; j--)
+ height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
+ }
}
static int radix_tree_callback(struct notifier_block *nfb,
@@ -1611,6 +1663,6 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
- radix_tree_init_maxindex();
+ radix_tree_init_arrays();
hotcpu_notifier(radix_tree_callback, 0);
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 19/32] filemap: prepare find and delete operations for huge pages
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (17 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 18/32] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
@ 2016-05-12 15:40 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 20/32] truncate: handle file thp Kirill A. Shutemov
` (13 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:40 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
For now, we would have HPAGE_PMD_NR entries in radix tree for every huge
page. That's suboptimal and it will be changed to use Matthew's
multi-order entries later.
'add' operation is not changed, because we don't need it to implement
hugetmpfs: shmem uses its own implementation.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/filemap.c | 187 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 134 insertions(+), 53 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7e982835d4ec..bf29ab4f87dc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,43 +110,18 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/
-static void page_cache_tree_delete(struct address_space *mapping,
- struct page *page, void *shadow)
+static void __page_cache_tree_delete(struct address_space *mapping,
+ struct radix_tree_node *node, void **slot, unsigned long index,
+ void *shadow)
{
- struct radix_tree_node *node;
- unsigned long index;
- unsigned int offset;
unsigned int tag;
- void **slot;
-
- VM_BUG_ON(!PageLocked(page));
- __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
-
- if (shadow) {
- mapping->nrexceptional++;
- /*
- * Make sure the nrexceptional update is committed before
- * the nrpages update so that final truncate racing
- * with reclaim does not see both counters 0 at the
- * same time and miss a shadow entry.
- */
- smp_wmb();
- }
- mapping->nrpages--;
-
- if (!node) {
- /* Clear direct pointer tags in root node */
- mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
- radix_tree_replace_slot(slot, shadow);
- return;
- }
+ VM_BUG_ON(node == NULL);
+ VM_BUG_ON(*slot == NULL);
/* Clear tree tags for the removed page */
- index = page->index;
- offset = index & RADIX_TREE_MAP_MASK;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (test_bit(offset, node->tags[tag]))
+ if (test_bit(index & RADIX_TREE_MAP_MASK, node->tags[tag]))
radix_tree_tag_clear(&mapping->page_tree, index, tag);
}
@@ -173,6 +148,54 @@ static void page_cache_tree_delete(struct address_space *mapping,
}
}
+static void page_cache_tree_delete(struct address_space *mapping,
+ struct page *page, void *shadow)
+{
+ struct radix_tree_node *node;
+ unsigned long index;
+ void **slot;
+ int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+ if (shadow) {
+ mapping->nrexceptional += nr;
+ /*
+ * Make sure the nrexceptional update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ }
+ mapping->nrpages -= nr;
+
+ if (!node) {
+ /* Clear direct pointer tags in root node */
+ mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+ VM_BUG_ON(nr != 1);
+ radix_tree_replace_slot(slot, shadow);
+ return;
+ }
+
+ index = page->index;
+ VM_BUG_ON_PAGE(index & (nr - 1), page);
+ for (i = 0; i < nr; i++) {
+ /* Cross node border */
+ if (i && ((index + i) & RADIX_TREE_MAP_MASK) == 0) {
+ __radix_tree_lookup(&mapping->page_tree,
+ page->index + i, &node, &slot);
+ }
+
+ __page_cache_tree_delete(mapping, node,
+ slot + (i & RADIX_TREE_MAP_MASK), index + i,
+ shadow);
+ }
+}
+
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
@@ -181,6 +204,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
void __delete_from_page_cache(struct page *page, void *shadow)
{
struct address_space *mapping = page->mapping;
+ int nr = hpage_nr_pages(page);
trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -193,6 +217,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
else
cleancache_invalidate_page(mapping, page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(page_mapped(page), page);
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
int mapcount;
@@ -224,9 +249,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
/*
* At this point page must be either written or cleaned by truncate.
@@ -250,9 +275,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
*/
void delete_from_page_cache(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_mapping(page);
unsigned long flags;
-
void (*freepage)(struct page *);
BUG_ON(!PageLocked(page));
@@ -265,7 +289,13 @@ void delete_from_page_cache(struct page *page)
if (freepage)
freepage(page);
- put_page(page);
+
+ if (PageTransHuge(page) && !PageHuge(page)) {
+ page_ref_sub(page, HPAGE_PMD_NR);
+ VM_BUG_ON_PAGE(page_count(page) <= 0, page);
+ } else {
+ put_page(page);
+ }
}
EXPORT_SYMBOL(delete_from_page_cache);
@@ -1054,7 +1084,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
void **pagep;
- struct page *page;
+ struct page *head, *page;
rcu_read_lock();
repeat:
@@ -1074,9 +1104,17 @@ repeat:
*/
goto out;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/*
* Has the page moved?
* This is part of the lockless pagecache protocol. See
@@ -1119,12 +1157,12 @@ repeat:
if (page && !radix_tree_exception(page)) {
lock_page(page);
/* Has the page been truncated? */
- if (unlikely(page->mapping != mapping)) {
+ if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto repeat;
}
- VM_BUG_ON_PAGE(page->index != offset, page);
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
}
return page;
}
@@ -1256,7 +1294,7 @@ unsigned find_get_entries(struct address_space *mapping,
rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1273,8 +1311,16 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1319,7 +1365,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1338,9 +1384,16 @@ repeat:
continue;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
put_page(page);
@@ -1380,7 +1433,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
rcu_read_lock();
radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
/* The hole, there no reason to continue */
@@ -1400,8 +1453,14 @@ repeat:
break;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1414,7 +1473,7 @@ repeat:
* otherwise we can get both false positives and false
* negatives, which is just confusing to the caller.
*/
- if (page->mapping == NULL || page->index != iter.index) {
+ if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
put_page(page);
break;
}
@@ -1452,7 +1511,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
rcu_read_lock();
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, *index, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1477,8 +1536,15 @@ repeat:
continue;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1526,7 +1592,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
rcu_read_lock();
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, start, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1544,9 +1610,17 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
put_page(page);
@@ -2140,7 +2214,7 @@ void filemap_map_pages(struct fault_env *fe,
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
loff_t size;
- struct page *page;
+ struct page *head, *page;
rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
@@ -2159,8 +2233,15 @@ repeat:
goto next;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 20/32] truncate: handle file thp
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (18 preceding siblings ...)
2016-05-12 15:40 ` [PATCHv8 19/32] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 21/32] mm, rmap: account shmem thp pages Kirill A. Shutemov
` (12 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/truncate.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index b00272810871..4f931ca9333b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -157,10 +157,14 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
int truncate_inode_page(struct address_space *mapping, struct page *page)
{
+ loff_t holelen;
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ holelen = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_SIZE;
if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index << PAGE_SHIFT,
- PAGE_SIZE, 0);
+ holelen, 0);
}
return truncate_complete_page(mapping, page);
}
@@ -489,7 +493,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (!trylock_page(page))
continue;
- WARN_ON(page->index != index);
+
+ WARN_ON(page_to_pgoff(page) != index);
+
+ /* Middle of THP: skip */
+ if (PageTransTail(page)) {
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ /* 'end' is in the middle of THP */
+ if (index == round_down(end, HPAGE_PMD_NR))
+ continue;
+ }
+
ret = invalidate_inode_page(page);
unlock_page(page);
/*
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 21/32] mm, rmap: account shmem thp pages
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (19 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 20/32] truncate: handle file thp Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 22/32] shmem: prepare huge= mount option and sysfs knob Kirill A. Shutemov
` (11 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps. It indicates how many times we allocate and map shmem THP.
NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
drivers/base/node.c | 13 +++++++++----
fs/proc/meminfo.c | 7 +++++--
fs/proc/task_mmu.c | 10 +++++++++-
include/linux/mmzone.h | 4 +++-
mm/huge_memory.c | 4 +++-
mm/page_alloc.c | 19 +++++++++++++++++++
mm/rmap.c | 14 ++++++++------
mm/vmstat.c | 2 ++
8 files changed, 58 insertions(+), 15 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..51c7db2c4ee2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d ShmemHugePages: %8lu kB\n"
+ "Node %d ShmemPmdMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +133,13 @@ static ssize_t node_read_meminfo(struct device *dev,
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
+ nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(node_page_state(nid, NR_ANON_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+ HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..cf301a9ef512 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "ShmemHugePages: %8lu kB\n"
+ "ShmemPmdMapped: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -162,8 +164,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 229cb546bee0..b51651d9d8ae 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -448,6 +448,7 @@ struct mem_size_stats {
unsigned long referenced;
unsigned long anonymous;
unsigned long anonymous_thp;
+ unsigned long shmem_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -576,7 +577,12 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
if (IS_ERR_OR_NULL(page))
return;
- mss->anonymous_thp += HPAGE_PMD_SIZE;
+ if (PageAnon(page))
+ mss->anonymous_thp += HPAGE_PMD_SIZE;
+ else if (PageSwapBacked(page))
+ mss->shmem_thp += HPAGE_PMD_SIZE;
+ else
+ VM_BUG_ON_PAGE(1, page);
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
@@ -770,6 +776,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "ShmemPmdMapped: %8lu kB\n"
"Shared_Hugetlb: %8lu kB\n"
"Private_Hugetlb: %7lu kB\n"
"Swap: %8lu kB\n"
@@ -787,6 +794,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.shmem_thp >> 10,
mss.shared_hugetlb >> 10,
mss.private_hugetlb >> 10,
mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..13be8a13bbc1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,9 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_THPS,
+ NR_SHMEM_THPS,
+ NR_SHMEM_PMDMAPPED,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 001687987777..b88b84e2f497 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3020,7 +3020,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
@@ -3439,6 +3439,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
+ if (mapping)
+ __dec_zone_page_state(page, NR_SHMEM_THPS);
spin_unlock(&pgdata->split_queue_lock);
__split_huge_page(page, list, flags);
ret = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4243b400b331..106c38f59e48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3888,6 +3888,9 @@ void show_free_areas(unsigned int filter)
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
+#endif
" free:%lu free_pcp:%lu free_cma:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_INACTIVE_ANON),
@@ -3905,6 +3908,11 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR,
+#endif
global_page_state(NR_FREE_PAGES),
free_pcp,
global_page_state(NR_FREE_CMA_PAGES));
@@ -3939,6 +3947,11 @@ void show_free_areas(unsigned int filter)
" writeback:%lukB"
" mapped:%lukB"
" shmem:%lukB"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ " shmem_thp: %lukB"
+ " shmem_pmdmapped: %lukB"
+ " anon_thp: %lukB"
+#endif
" slab_reclaimable:%lukB"
" slab_unreclaimable:%lukB"
" kernel_stack:%lukB"
@@ -3971,6 +3984,12 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_WRITEBACK)),
K(zone_page_state(zone, NR_FILE_MAPPED)),
K(zone_page_state(zone, NR_SHMEM)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_SHMEM_PMDMAPPED)
+ * HPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_ANON_THPS) * HPAGE_PMD_NR),
+#endif
K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
zone_page_state(zone, NR_KERNEL_STACK) *
diff --git a/mm/rmap.c b/mm/rmap.c
index f67e765869f3..ba0539074ada 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1213,10 +1213,8 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_THPS);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1254,7 +1252,7 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1284,6 +1282,8 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ __inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (PageTransCompound(page)) {
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1322,6 +1322,8 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ __dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
@@ -1355,7 +1357,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 904ac95fbf00..4d7921f560b5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,8 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_shmem_hugepages",
+ "nr_shmem_pmdmapped",
"nr_free_cma",
/* enum writeback_stat_item counters */
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 22/32] shmem: prepare huge= mount option and sysfs knob
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (20 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 21/32] mm, rmap: account shmem thp pages Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 23/32] shmem: get_unmapped_area align huge page Kirill A. Shutemov
` (10 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
This patch adds new mount option "huge=". It can have following values:
- "always":
Attempt to allocate huge pages every time we need a new page;
- "never":
Do not allocate huge pages;
- "within_size":
Only allocate huge page if it will be fully within i_size.
Also respect fadvise()/madvise() hints;
- "advise:
Only allocate huge pages if requested with fadvise()/madvise();
Default is "never" for now.
"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.
No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.
Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).
Allow enabling THP only if the machine has_transparent_hugepage().
But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem. Though unlikely to suit all usages, provide
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to
experiment with huge on those.
And allow shmem_enabled two further values:
- "deny":
For use in emergencies, to force the huge option off from
all mounts;
- "force":
Force the huge option on for all - very useful for testing;
Based on patch by Hugh Dickins.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/huge_mm.h | 2 +
include/linux/shmem_fs.h | 3 +-
mm/huge_memory.c | 3 +
mm/shmem.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 168 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 95f83770443c..278ade85e224 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -41,6 +41,8 @@ enum transparent_hugepage_flag {
#endif
};
+extern struct kobj_attribute shmem_enabled_attr;
+
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d4780c00d34..466f18c73a49 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -28,9 +28,10 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ umode_t mode; /* Mount mode for root directory */
+ unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
- umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b88b84e2f497..1b0b24f0b013 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -442,6 +442,9 @@ static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
+#ifdef CONFIG_SHMEM
+ &shmem_enabled_attr.attr,
+#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 5f3b9af0d495..3a6349e9c186 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -289,6 +289,87 @@ static bool shmem_confirm_swap(struct address_space *mapping,
}
/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ * disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ * enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ * only allocate huge pages if the page will be fully within i_size,
+ * also respect fadvise()/madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ * only allocate huge pages if requested with fadvise()/madvise();
+ */
+
+#define SHMEM_HUGE_NEVER 0
+#define SHMEM_HUGE_ALWAYS 1
+#define SHMEM_HUGE_WITHIN_SIZE 2
+#define SHMEM_HUGE_ADVISE 3
+
+/*
+ * Special values.
+ * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
+ *
+ * SHMEM_HUGE_DENY:
+ * disables huge on shm_mnt and all mounts, for emergency use;
+ * SHMEM_HUGE_FORCE:
+ * enables huge on shm_mnt and all mounts, w/o needing option, for testing;
+ *
+ */
+#define SHMEM_HUGE_DENY (-1)
+#define SHMEM_HUGE_FORCE (-2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+static int shmem_parse_huge(const char *str)
+{
+ if (!strcmp(str, "never"))
+ return SHMEM_HUGE_NEVER;
+ if (!strcmp(str, "always"))
+ return SHMEM_HUGE_ALWAYS;
+ if (!strcmp(str, "within_size"))
+ return SHMEM_HUGE_WITHIN_SIZE;
+ if (!strcmp(str, "advise"))
+ return SHMEM_HUGE_ADVISE;
+ if (!strcmp(str, "deny"))
+ return SHMEM_HUGE_DENY;
+ if (!strcmp(str, "force"))
+ return SHMEM_HUGE_FORCE;
+ return -EINVAL;
+}
+
+static const char *shmem_format_huge(int huge)
+{
+ switch (huge) {
+ case SHMEM_HUGE_NEVER:
+ return "never";
+ case SHMEM_HUGE_ALWAYS:
+ return "always";
+ case SHMEM_HUGE_WITHIN_SIZE:
+ return "within_size";
+ case SHMEM_HUGE_ADVISE:
+ return "advise";
+ case SHMEM_HUGE_DENY:
+ return "deny";
+ case SHMEM_HUGE_FORCE:
+ return "force";
+ default:
+ VM_BUG_ON(1);
+ return "bad_val";
+ }
+}
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -2857,11 +2938,24 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ } else if (!strcmp(this_char, "huge")) {
+ int huge;
+ huge = shmem_parse_huge(value);
+ if (huge < 0)
+ goto bad_val;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER)
+ goto bad_val;
+ sbinfo->huge = huge;
+#endif
+#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
mpol_put(mpol);
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+#endif
} else {
pr_err("tmpfs: Bad mount option %s\n", this_char);
goto error;
@@ -2907,6 +3001,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
goto out;
error = 0;
+ sbinfo->huge = config.huge;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2940,6 +3035,11 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+ if (sbinfo->huge)
+ seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
@@ -3278,6 +3378,13 @@ int __init shmem_init(void)
pr_err("Could not kern_mount tmpfs\n");
goto out1;
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
+ else
+ shmem_huge = 0; /* just in case it was patched */
+#endif
return 0;
out1:
@@ -3289,6 +3396,60 @@ out3:
return error;
}
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+static ssize_t shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int values[] = {
+ SHMEM_HUGE_ALWAYS,
+ SHMEM_HUGE_WITHIN_SIZE,
+ SHMEM_HUGE_ADVISE,
+ SHMEM_HUGE_NEVER,
+ SHMEM_HUGE_DENY,
+ SHMEM_HUGE_FORCE,
+ };
+ int i, count;
+
+ for (i = 0, count = 0; i < ARRAY_SIZE(values); i++) {
+ const char *fmt = shmem_huge == values[i] ? "[%s] " : "%s ";
+
+ count += sprintf(buf + count, fmt,
+ shmem_format_huge(values[i]));
+ }
+ buf[count - 1] = '\n';
+ return count;
+}
+
+static ssize_t shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ char tmp[16];
+ int huge;
+
+ if (count + 1 > sizeof(tmp))
+ return -EINVAL;
+ memcpy(tmp, buf, count);
+ tmp[count] = '\0';
+ if (count && tmp[count - 1] == '\n')
+ tmp[count - 1] = '\0';
+
+ huge = shmem_parse_huge(tmp);
+ if (huge == -EINVAL)
+ return -EINVAL;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
+ return -EINVAL;
+
+ shmem_huge = huge;
+ if (shmem_huge < SHMEM_HUGE_DENY)
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
+ return count;
+}
+
+struct kobj_attribute shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
+
#else /* !CONFIG_SHMEM */
/*
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 23/32] shmem: get_unmapped_area align huge page
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (21 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 22/32] shmem: prepare huge= mount option and sysfs knob Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 24/32] shmem: add huge pages support Kirill A. Shutemov
` (9 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A . Shutemov
From: Hugh Dickins <hughd@google.com>
Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address. It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.
shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases
it just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back
to ask for a larger arena, within which to align the mapping suitably.
There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though
this only matters when /proc/sys/vm/shmem_huge has been set.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
drivers/char/mem.c | 24 ++++++++++++
include/linux/shmem_fs.h | 2 +
ipc/shm.c | 6 ++-
mm/mmap.c | 16 +++++++-
mm/shmem.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 142 insertions(+), 4 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 71025c2f6bbb..9656f1095c19 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
#include <linux/splice.h>
#include <linux/pfn.h>
#include <linux/export.h>
@@ -661,6 +662,28 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static unsigned long get_unmapped_area_zero(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+#ifdef CONFIG_MMU
+ if (flags & MAP_SHARED) {
+ /*
+ * mmap_zero() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge;
+ * and pass NULL for file as in mmap.c's get_unmapped_area(),
+ * so as not to confuse shmem with our handle on "/dev/zero".
+ */
+ return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+ }
+
+ /* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+#else
+ return -ENOSYS;
+#endif
+}
+
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -768,6 +791,7 @@ static const struct file_operations zero_fops = {
.read_iter = read_iter_zero,
.write_iter = write_iter_zero,
.mmap = mmap_zero,
+ .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
#endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 466f18c73a49..ff2de4bab61f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -50,6 +50,8 @@ extern struct file *shmem_file_setup(const char *name,
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/ipc/shm.c b/ipc/shm.c
index 331fc1b0b3c7..b1ed484eeda1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_file_operations = {
.mmap = shm_mmap,
.fsync = shm_fsync,
.release = shm_release,
-#ifndef CONFIG_MMU
.get_unmapped_area = shm_get_unmapped_area,
-#endif
.llseek = noop_llseek,
.fallocate = shm_fallocate,
};
+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
static const struct file_operations shm_file_operations_huge = {
.mmap = shm_mmap,
.fsync = shm_fsync,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0a68cf0b37e7..cc1fa2b0d14f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/profile.h>
#include <linux/export.h>
#include <linux/mount.h>
@@ -1900,8 +1901,19 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
return -ENOMEM;
get_area = current->mm->get_unmapped_area;
- if (file && file->f_op->get_unmapped_area)
- get_area = file->f_op->get_unmapped_area;
+ if (file) {
+ if (file->f_op->get_unmapped_area)
+ get_area = file->f_op->get_unmapped_area;
+ } else if (flags & MAP_SHARED) {
+ /*
+ * mmap_region() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge.
+ * do_mmap_pgoff() will clear pgoff, so match alignment.
+ */
+ pgoff = 0;
+ get_area = shmem_get_unmapped_area;
+ }
+
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/mm/shmem.c b/mm/shmem.c
index 3a6349e9c186..929fbf2ac43f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1513,6 +1513,94 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
}
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ unsigned long (*get_area)(struct file *,
+ unsigned long, unsigned long, unsigned long, unsigned long);
+ unsigned long addr;
+ unsigned long offset;
+ unsigned long inflated_len;
+ unsigned long inflated_addr;
+ unsigned long inflated_offset;
+
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ get_area = current->mm->get_unmapped_area;
+ addr = get_area(file, uaddr, len, pgoff, flags);
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return addr;
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ if (addr & ~PAGE_MASK)
+ return addr;
+ if (addr > TASK_SIZE - len)
+ return addr;
+
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return addr;
+ if (len < HPAGE_PMD_SIZE)
+ return addr;
+ if (flags & MAP_FIXED)
+ return addr;
+ /*
+ * Our priority is to support MAP_SHARED mapped hugely;
+ * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+ * But if caller specified an address hint, respect that as before.
+ */
+ if (uaddr)
+ return addr;
+
+ if (shmem_huge != SHMEM_HUGE_FORCE) {
+ struct super_block *sb;
+
+ if (file) {
+ VM_BUG_ON(file->f_op != &shmem_file_operations);
+ sb = file_inode(file)->i_sb;
+ } else {
+ /*
+ * Called directly from mm/mmap.c, or drivers/char/mem.c
+ * for "/dev/zero", to create a shared anonymous object.
+ */
+ if (IS_ERR(shm_mnt))
+ return addr;
+ sb = shm_mnt->mnt_sb;
+ }
+ if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+ return addr;
+ }
+
+ offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+ if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ return addr;
+ if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ return addr;
+
+ inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ if (inflated_len > TASK_SIZE)
+ return addr;
+ if (inflated_len < len)
+ return addr;
+
+ inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ if (IS_ERR_VALUE(inflated_addr))
+ return addr;
+ if (inflated_addr & ~PAGE_MASK)
+ return addr;
+
+ inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_addr += offset - inflated_offset;
+ if (inflated_offset > offset)
+ inflated_addr += HPAGE_PMD_SIZE;
+
+ if (inflated_addr > TASK_SIZE - len)
+ return addr;
+ return inflated_addr;
+}
+
#ifdef CONFIG_NUMA
static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
{
@@ -3257,6 +3345,7 @@ static const struct address_space_operations shmem_aops = {
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
+ .get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
.llseek = shmem_file_llseek,
.read_iter = shmem_file_read_iter,
@@ -3492,6 +3581,15 @@ void shmem_unlock_mapping(struct address_space *mapping)
{
}
+#ifdef CONFIG_MMU
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+#endif
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 24/32] shmem: add huge pages support
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (22 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 23/32] shmem: get_unmapped_area align huge page Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 25/32] shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings Kirill A. Shutemov
` (8 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Here's basic implementation of huge pages support for shmem/tmpfs.
It's all pretty streight-forward:
- shmem_getpage() allcoates huge page if it can and try to inserd into
radix tree with shmem_add_to_page_cache();
- shmem_add_to_page_cache() puts the page onto radix-tree if there's
space for it;
- shmem_undo_range() removes huge pages, if it fully within range.
Partial truncate of huge pages zero out this part of THP.
This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
behaviour. As we don't really create hole in this case,
lseek(SEEK_HOLE) may have inconsistent results depending what
pages happened to be allocated.
- no need to change shmem_fault: core-mm will map an compound page as
huge if VMA is suitable;
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/huge_mm.h | 2 +
include/linux/shmem_fs.h | 3 +
mm/filemap.c | 7 +-
mm/huge_memory.c | 2 +
mm/memory.c | 2 +-
mm/mempolicy.c | 2 +-
mm/page-writeback.c | 1 +
mm/shmem.c | 380 ++++++++++++++++++++++++++++++++++++++---------
mm/swap.c | 2 +
9 files changed, 331 insertions(+), 70 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 278ade85e224..0315e8d6db57 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -155,6 +155,8 @@ struct page *get_huge_zero_page(void);
#define transparent_hugepage_enabled(__vma) 0
+static inline void prep_transhuge_page(struct page *page) {}
+
#define transparent_hugepage_flags 0UL
static inline int
split_huge_page_to_list(struct page *page, struct list_head *list)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff2de4bab61f..94eaaa2c6ad9 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -71,6 +71,9 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
}
+extern bool shmem_charge(struct inode *inode, long pages);
+extern void shmem_uncharge(struct inode *inode, long pages);
+
#ifdef CONFIG_TMPFS
extern int shmem_add_seals(struct file *file, unsigned int seals);
diff --git a/mm/filemap.c b/mm/filemap.c
index bf29ab4f87dc..39f4a18b870b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -250,8 +250,13 @@ void __delete_from_page_cache(struct page *page, void *shadow)
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
- if (PageSwapBacked(page))
+ if (PageSwapBacked(page)) {
__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+ if (PageTransHuge(page))
+ __dec_zone_page_state(page, NR_SHMEM_THPS);
+ } else {
+ VM_BUG_ON_PAGE(PageTransHuge(page) && !PageHuge(page), page);
+ }
/*
* At this point page must be either written or cleaned by truncate.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b0b24f0b013..648f60d55759 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3274,6 +3274,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (head[i].index >= end) {
__ClearPageDirty(head + i);
__delete_from_page_cache(head + i, NULL);
+ if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
+ shmem_uncharge(head->mapping->host, 1);
put_page(head + i);
}
}
diff --git a/mm/memory.c b/mm/memory.c
index cba29c033702..8d8b1357e1d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1091,7 +1091,7 @@ again:
* unmap shared but keep private pages.
*/
if (details->check_mapping &&
- details->check_mapping != page->mapping)
+ details->check_mapping != page_rmapping(page))
continue;
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 148143974c5b..90c74344957e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -534,7 +534,7 @@ retry:
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
- if (PageTransCompound(page) && PageAnon(page)) {
+ if (PageTransCompound(page)) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d35ccc..8aded71b15a2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2554,6 +2554,7 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ page = compound_head(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 929fbf2ac43f..ec899b5ea9e6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -173,10 +173,13 @@ static inline int shmem_reacct_size(unsigned long flags,
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(unsigned long flags, long pages)
{
- return (flags & VM_NORESERVE) ?
- security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_SIZE)) : 0;
+ if (!(flags & VM_NORESERVE))
+ return 0;
+
+ return security_vm_enough_memory_mm(current->mm,
+ pages * VM_ACCT(PAGE_SIZE));
}
static inline void shmem_unacct_blocks(unsigned long flags, long pages)
@@ -249,6 +252,51 @@ static void shmem_recalc_inode(struct inode *inode)
}
}
+bool shmem_charge(struct inode *inode, long pages)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+
+ if (shmem_acct_block(info->flags, pages))
+ return false;
+ spin_lock(&info->lock);
+ info->alloced += pages;
+ inode->i_blocks += pages * BLOCKS_PER_PAGE;
+ shmem_recalc_inode(inode);
+ spin_unlock(&info->lock);
+ inode->i_mapping->nrpages += pages;
+
+ if (!sbinfo->max_blocks)
+ return true;
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks - pages) > 0) {
+ inode->i_mapping->nrpages -= pages;
+ spin_lock(&info->lock);
+ info->alloced -= pages;
+ shmem_recalc_inode(inode);
+ spin_unlock(&info->lock);
+
+ return false;
+ }
+ percpu_counter_add(&sbinfo->used_blocks, pages);
+ return true;
+}
+
+void shmem_uncharge(struct inode *inode, long pages)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+
+ spin_lock(&info->lock);
+ info->alloced -= pages;
+ inode->i_blocks -= pages * BLOCKS_PER_PAGE;
+ shmem_recalc_inode(inode);
+ spin_unlock(&info->lock);
+
+ if (sbinfo->max_blocks)
+ percpu_counter_sub(&sbinfo->used_blocks, pages);
+}
+
/*
* Replace item expected in radix tree by a new item, while holding tree lock.
*/
@@ -376,30 +424,57 @@ static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
pgoff_t index, void *expected)
{
- int error;
+ int error, nr = hpage_nr_pages(page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ VM_BUG_ON_PAGE(index != round_down(index, nr), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON(expected && PageTransHuge(page));
- get_page(page);
+ page_ref_add(page, nr);
page->mapping = mapping;
page->index = index;
spin_lock_irq(&mapping->tree_lock);
- if (!expected)
+ if (PageTransHuge(page)) {
+ void __rcu **results;
+ pgoff_t idx;
+ int i;
+
+ error = 0;
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+ &results, &idx, index, 1) &&
+ idx < index + HPAGE_PMD_NR) {
+ error = -EEXIST;
+ }
+
+ if (!error) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ index + i, page + i);
+ VM_BUG_ON(error);
+ }
+ count_vm_event(THP_FILE_ALLOC);
+ }
+ } else if (!expected) {
error = radix_tree_insert(&mapping->page_tree, index, page);
- else
+ } else {
error = shmem_radix_tree_replace(mapping, index, expected,
page);
+ }
+
if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
+ mapping->nrpages += nr;
+ if (PageTransHuge(page))
+ __inc_zone_page_state(page, NR_SHMEM_THPS);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
spin_unlock_irq(&mapping->tree_lock);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
- put_page(page);
+ page_ref_sub(page, nr);
}
return error;
}
@@ -412,6 +487,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
struct address_space *mapping = page->mapping;
int error;
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+
spin_lock_irq(&mapping->tree_lock);
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
@@ -591,10 +668,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
+
if (!trylock_page(page))
continue;
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
}
@@ -670,8 +770,36 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
lock_page(page);
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ /*
+ * Partial thp truncate due 'start' in middle
+ * of THP: don't need to look on these pages
+ * again on !pvec.nr restart.
+ */
+ if (index != round_down(end, HPAGE_PMD_NR))
+ start++;
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
} else {
@@ -929,6 +1057,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
swp_entry_t swap;
pgoff_t index;
+ VM_BUG_ON_PAGE(PageCompound(page), page);
BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
@@ -1065,24 +1194,63 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
#define vm_policy vm_private_data
#endif
+static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ /* Create a pseudo vma that just contains the policy */
+ vma->vm_start = 0;
+ /* Bias interleave by inode number to distribute better across nodes */
+ vma->vm_pgoff = index + info->vfs_inode.i_ino;
+ vma->vm_ops = NULL;
+ vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+}
+
+static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
+{
+ /* Drop reference taken by mpol_shared_policy_lookup() */
+ mpol_cond_put(vma->vm_policy);
+}
+
static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
struct page *page;
- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
+ shmem_pseudo_vma_init(&pvma, info, index);
page = swapin_readahead(swap, gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ return page;
+}
+
+static struct page *shmem_alloc_hugepage(gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ struct vm_area_struct pvma;
+ struct inode *inode = &info->vfs_inode;
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t idx, hindex = round_down(index, HPAGE_PMD_NR);
+ void __rcu **results;
+ struct page *page;
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return NULL;
+
+ rcu_read_lock();
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+ hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ rcu_read_unlock();
+
+ shmem_pseudo_vma_init(&pvma, info, hindex);
+ page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+ shmem_pseudo_vma_destroy(&pvma);
+ if (page)
+ prep_transhuge_page(page);
return page;
}
@@ -1092,23 +1260,51 @@ static struct page *shmem_alloc_page(gfp_t gfp,
struct vm_area_struct pvma;
struct page *page;
- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+ shmem_pseudo_vma_init(&pvma, info, index);
+ page = alloc_page_vma(gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);
+
+ return page;
+}
+
+static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
+ struct shmem_inode_info *info, struct shmem_sb_info *sbinfo,
+ pgoff_t index, bool huge)
+{
+ struct page *page;
+ int nr;
+ int err = -ENOSPC;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ huge = false;
+ nr = huge ? HPAGE_PMD_NR : 1;
+
+ if (shmem_acct_block(info->flags, nr))
+ goto failed;
+ if (sbinfo->max_blocks) {
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks - nr) > 0)
+ goto unacct;
+ percpu_counter_add(&sbinfo->used_blocks, nr);
+ }
- page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id(), false);
+ if (huge)
+ page = shmem_alloc_hugepage(gfp, info, index);
+ else
+ page = shmem_alloc_page(gfp, info, index);
if (page) {
__SetPageLocked(page);
__SetPageSwapBacked(page);
+ return page;
}
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
-
- return page;
+ err = -ENOMEM;
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks, -nr);
+unacct:
+ shmem_unacct_blocks(info->flags, nr);
+failed:
+ return ERR_PTR(err);
}
/*
@@ -1213,6 +1409,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ pgoff_t hindex = index;
int error;
int once = 0;
int alloced = 0;
@@ -1334,47 +1531,74 @@ repeat:
swap_free(swap);
} else {
- if (shmem_acct_block(info->flags)) {
- error = -ENOSPC;
- goto failed;
- }
- if (sbinfo->max_blocks) {
- if (percpu_counter_compare(&sbinfo->used_blocks,
- sbinfo->max_blocks) >= 0) {
- error = -ENOSPC;
- goto unacct;
- }
- percpu_counter_inc(&sbinfo->used_blocks);
+ /* shmem_symlink() */
+ if (mapping->a_ops != &shmem_aops)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ goto alloc_huge;
+ switch (sbinfo->huge) {
+ loff_t i_size;
+ pgoff_t off;
+ case SHMEM_HUGE_NEVER:
+ goto alloc_nohuge;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(index, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >= HPAGE_PMD_SIZE &&
+ i_size >> PAGE_SHIFT >= off)
+ goto alloc_huge;
+ /* fallthrough */
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: wire up fadvise()/madvise() */
+ goto alloc_nohuge;
}
- page = shmem_alloc_page(gfp, info, index);
- if (!page) {
- error = -ENOMEM;
- goto decused;
+alloc_huge:
+ page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, true);
+ if (IS_ERR(page)) {
+alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, false);
+ }
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ page = NULL;
+ goto failed;
}
+
+ if (PageTransHuge(page))
+ hindex = round_down(index, HPAGE_PMD_NR);
+ else
+ hindex = index;
+
if (sgp == SGP_WRITE)
__SetPageReferenced(page);
error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
- false);
+ PageTransHuge(page));
if (error)
- goto decused;
- error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+ goto unacct;
+ error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
+ compound_order(page));
if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
+ error = shmem_add_to_page_cache(page, mapping, hindex,
NULL);
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg, false);
- goto decused;
+ mem_cgroup_cancel_charge(page, memcg,
+ PageTransHuge(page));
+ goto unacct;
}
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false,
+ PageTransHuge(page));
lru_cache_add_anon(page);
spin_lock(&info->lock);
- info->alloced++;
- inode->i_blocks += BLOCKS_PER_PAGE;
+ info->alloced += 1 << compound_order(page);
+ inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
alloced = true;
@@ -1390,10 +1614,15 @@ clear:
* but SGP_FALLOC on a page fallocated earlier must initialize
* it now, lest undo on failure cancel our earlier guarantee.
*/
- if (sgp != SGP_WRITE) {
- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
+ if (sgp != SGP_WRITE && !PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ int i;
+
+ for (i = 0; i < (1 << compound_order(head)); i++) {
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ SetPageUptodate(head);
}
}
@@ -1410,17 +1639,23 @@ clear:
error = -EINVAL;
goto unlock;
}
- *pagep = page;
+ *pagep = page + index - hindex;
return 0;
/*
* Error recovery.
*/
-decused:
- if (sbinfo->max_blocks)
- percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
- shmem_unacct_blocks(info->flags, 1);
+ if (sbinfo->max_blocks)
+ percpu_counter_sub(&sbinfo->used_blocks,
+ 1 << compound_order(page));
+ shmem_unacct_blocks(info->flags, 1 << compound_order(page));
+
+ if (PageTransHuge(page)) {
+ unlock_page(page);
+ put_page(page);
+ goto alloc_nohuge;
+ }
failed:
if (swap.val && !shmem_confirm_swap(mapping, index, swap))
error = -EEXIST;
@@ -1758,12 +1993,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
i_size_write(inode, pos + copied);
if (!PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ if (PageTransCompound(page)) {
+ int i;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ if (head + i == page)
+ continue;
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ }
if (copied < PAGE_SIZE) {
unsigned from = pos & (PAGE_SIZE - 1);
zero_user_segments(page, 0, from,
from + copied, PAGE_SIZE);
}
- SetPageUptodate(page);
+ SetPageUptodate(head);
}
set_page_dirty(page);
unlock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index a0bc206b4ac6..9d777cb75b87 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,6 +291,7 @@ static bool need_activate_page_drain(int cpu)
void activate_page(struct page *page)
{
+ page = compound_head(page);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
@@ -315,6 +316,7 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);
+ page = compound_head(page);
spin_lock_irq(&zone->lru_lock);
__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
spin_unlock_irq(&zone->lru_lock);
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 25/32] shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (23 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 24/32] shmem: add huge pages support Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt Kirill A. Shutemov
` (7 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Let's wire up existing madvise() hugepage hints for file mappings.
MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the
VMA. It only has effect if the filesystem is mounted with huge=advise or
huge=within_size.
MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in
the VMA. It doesn't prevent a huge page from being allocated by other
means, i.e. page fault into different mapping or write(2) into file.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/huge_memory.c | 19 +++++--------------
mm/shmem.c | 20 +++++++++++++++++---
2 files changed, 22 insertions(+), 17 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 648f60d55759..3658b7d2c424 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1837,7 +1837,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return NULL;
}
-#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
@@ -1853,11 +1853,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
if (mm_has_pgste(vma->vm_mm))
return 0;
#endif
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
@@ -1865,15 +1860,11 @@ int hugepage_madvise(struct vm_area_struct *vma,
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
+ if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
return -ENOMEM;
break;
case MADV_NOHUGEPAGE:
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
/*
@@ -1984,7 +1975,7 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
if (vma->vm_ops)
/* khugepaged not yet working on file or special mappings */
return 0;
- VM_BUG_ON_VMA(vm_flags & VM_NO_THP, vma);
+ VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
hend = vma->vm_end & HPAGE_PMD_MASK;
if (hstart < hend)
@@ -2373,7 +2364,7 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return false;
if (is_vma_temporary_stack(vma))
return false;
- VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma);
+ VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
return true;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index ec899b5ea9e6..872d658d37c0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -101,6 +101,8 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
};
@@ -1409,6 +1411,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
int error;
int once = 0;
@@ -1416,6 +1419,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
+ if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
+ sgp = SGP_CACHE;
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
@@ -1534,7 +1539,7 @@ repeat:
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
- if (shmem_huge == SHMEM_HUGE_DENY)
+ if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
@@ -1551,7 +1556,9 @@ repeat:
goto alloc_huge;
/* fallthrough */
case SHMEM_HUGE_ADVISE:
- /* TODO: wire up fadvise()/madvise() */
+ if (sgp_huge == SGP_HUGE)
+ goto alloc_huge;
+ /* TODO: implement fadvise() hints */
goto alloc_nohuge;
}
@@ -1680,6 +1687,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+ enum sgp_type sgp;
int error;
int ret = VM_FAULT_LOCKED;
@@ -1741,7 +1749,13 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
spin_unlock(&inode->i_lock);
}
- error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
+ sgp = SGP_CACHE;
+ if (vma->vm_flags & VM_HUGEPAGE)
+ sgp = SGP_HUGE;
+ else if (vma->vm_flags & VM_NOHUGEPAGE)
+ sgp = SGP_NOHUGE;
+
+ error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
gfp, vma->vm_mm, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (24 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 25/32] shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-19 16:20 ` Julien Grall
2016-05-12 15:41 ` [PATCHv8 27/32] thp: extract khugepaged from mm/huge_memory.c Kirill A. Shutemov
` (6 subsequent siblings)
32 siblings, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Add info about tmpfs/shmem with huge pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
Documentation/vm/transhuge.txt | 130 +++++++++++++++++++++++++++++------------
1 file changed, 93 insertions(+), 37 deletions(-)
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index d9cb65cf5cfd..96a49f123cac 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.
-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Currently it only works for anonymous memory mappings and tmpfs/shmem.
+But in the future it can expand to other filesystems.
The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
@@ -48,7 +48,7 @@ miss is going to run faster.
- if some task quits and more hugepages become available (either
immediately in the buddy or through the VM), guest physical memory
backed by regular pages should be relocated on hugepages
- automatically (with khugepaged)
+ automatically (with khugepaged, limited to anonymous huge pages for now)
- it doesn't require memory reservation and in turn it uses hugepages
whenever possible (the only possible reservation here is kernelcore=
@@ -57,10 +57,6 @@ miss is going to run faster.
feature that applies to all dynamic high order allocations in the
kernel)
-- this initial support only offers the feature in the anonymous memory
- regions but it'd be ideal to move it to tmpfs and the pagecache
- later
-
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable
@@ -94,21 +90,21 @@ madvise(MADV_HUGEPAGE) on their critical mmapped regions.
== sysfs ==
-Transparent Hugepage Support can be entirely disabled (mostly for
-debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
-avoid the risk of consuming more memory resources) or enabled system
-wide. This can be achieved with one of:
+Transparent Hugepage Support for anonymous memory can be entirely disabled
+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
+regions (to avoid the risk of consuming more memory resources) or enabled
+system wide. This can be achieved with one of:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
It's also possible to limit defrag efforts in the VM to generate
-hugepages in case they're not immediately free to madvise regions or
-to never try to defrag memory and simply fallback to regular pages
-unless hugepages are immediately available. Clearly if we spend CPU
-time to defrag memory, we would expect to gain even more by the fact
-we use hugepages later instead of regular pages. This isn't always
+anonymous hugepages in case they're not immediately free to madvise
+regions or to never try to defrag memory and simply fallback to regular
+pages unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact we
+use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.
@@ -133,9 +129,9 @@ that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
"never" should be self-explanatory.
-By default kernel tries to use huge zero page on read page fault.
-It's possible to disable huge zero page by writing 0 or enable it
-back by writing 1:
+By default kernel tries to use huge zero page on read page fault to
+anonymous mapping. It's possible to disable huge zero page by writing 0
+or enable it back by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
@@ -204,21 +200,67 @@ Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without "") to the kernel command line.
+== Hugepages in tmpfs/shmem ==
+
+You can control hugepage allocation policy in tmpfs with mount option
+"huge=". It can have following values:
+
+ - "always":
+ Attempt to allocate huge pages every time we need a new page;
+
+ - "never":
+ Do not allocate huge pages;
+
+ - "within_size":
+ Only allocate huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+ - "advise:
+ Only allocate huge pages if requested with fadvise()/madvise();
+
+The default policy is "never".
+
+"mount -o remount,huge= /mountpoint" works fine after mount: remounting
+huge=never will not attempt to break up huge pages at all, just stop more
+from being allocated.
+
+There's also sysfs knob to control hugepage allocation policy for internal
+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+In addition to policies listed above, shmem_enabled allows two further
+values:
+
+ - "deny":
+ For use in emergencies, to force the huge option off from
+ all mounts;
+ - "force":
+ Force the huge option on for all - very useful for testing;
+
== Need of application restart ==
-The transparent_hugepage/enabled values only affect future
-behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to
-the regions registered in khugepaged.
+The transparent_hugepage/enabled values and tmpfs mount option only affect
+future behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to the
+regions registered in khugepaged.
== Monitoring usage ==
-The number of transparent huge pages currently used by the system is
-available by reading the AnonHugePages field in /proc/meminfo. To
-identify what applications are using transparent huge pages, it is
-necessary to read /proc/PID/smaps and count the AnonHugePages fields
-for each mapping. Note that reading the smaps file is expensive and
-reading it frequently will incur overhead.
+The number of anonymous transparent huge pages currently used by the
+system is available by reading the AnonHugePages field in /proc/meminfo.
+To identify what applications are using anonymous transparent huge pages,
+it is necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping.
+
+The number of file transparent huge pages mapped to userspace is available
+by reading the FileHugeMapped field in /proc/meminfo. To identify what
+applications are mapping file transparent huge pages, it is necessary
+to read /proc/PID/smaps and count the FileHugeMapped fields for each
+mapping.
+
+Note that reading the smaps file is expensive and reading it
+frequently will incur overhead.
There are a number of counters in /proc/vmstat that may be used to
monitor how successfully the system is providing huge pages for use.
@@ -238,6 +280,12 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.
+thp_file_alloc is incremented every time a file huge page is successfully
+i allocated.
+
+thp_file_mapped is incremented every time a file huge page is mapped into
+ user address space.
+
thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
@@ -403,19 +451,27 @@ pages:
on relevant sub-page of the compound page.
- map/unmap of the whole compound page accounted in compound_mapcount
- (stored in first tail page).
+ (stored in first tail page). For file huge pages, we also increment
+ ->_mapcount of all sub-pages in order to have race-free detection of
+ last unmap of subpages.
-PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
-This additional reference is required to get race-free detection of unmap of
-subpages when we have them mapped with both PMDs and PTEs.
+PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
+
+For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
+subpages is offset up by one. This additional reference is required to
+get race-free detection of unmap of subpages when we have them mapped with
+both PMDs and PTEs.
This is optimization required to lower overhead of per-subpage mapcount
tracking. The alternative is alter ->_mapcount in all subpages on each
map/unmap of the whole compound page.
-We set PG_double_map when a PMD of the page got split for the first time,
-but still have PMD mapping. The addtional references go away with last
-compound_mapcount.
+For anonymous pages, we set PG_double_map when a PMD of the page got split
+for the first time, but still have PMD mapping. The additional references
+go away with last compound_mapcount.
+
+File pages get PG_double_map set on first map of the page with PTE and
+goes away when the page gets evicted from page cache.
split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
@@ -427,7 +483,7 @@ sum of mapcount of all sub-pages plus one (split_huge_page caller must
have reference for head page).
split_huge_page uses migration entries to stabilize page->_count and
-page->_mapcount.
+page->_mapcount of anonymous pages. File pages just got unmapped.
We safe against physical memory scanners too: the only legitimate way
scanner can get reference to a page is get_page_unless_zero().
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt
2016-05-12 15:41 ` [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt Kirill A. Shutemov
@ 2016-05-19 16:20 ` Julien Grall
2016-05-20 10:33 ` Kirill A. Shutemov
0 siblings, 1 reply; 39+ messages in thread
From: Julien Grall @ 2016-05-19 16:20 UTC (permalink / raw)
To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel, Steve Capper
Hello Kirill,
On 12/05/16 16:41, Kirill A. Shutemov wrote:
> Add info about tmpfs/shmem with huge pages.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> Documentation/vm/transhuge.txt | 130 +++++++++++++++++++++++++++++------------
> 1 file changed, 93 insertions(+), 37 deletions(-)
>
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index d9cb65cf5cfd..96a49f123cac 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
> that supports the automatic promotion and demotion of page sizes and
> without the shortcomings of hugetlbfs.
>
> -Currently it only works for anonymous memory mappings but in the
> -future it can expand over the pagecache layer starting with tmpfs.
> +Currently it only works for anonymous memory mappings and tmpfs/shmem.
> +But in the future it can expand to other filesystems.
>
> The reason applications are running faster is because of two
> factors. The first factor is almost completely irrelevant and it's not
> @@ -48,7 +48,7 @@ miss is going to run faster.
> - if some task quits and more hugepages become available (either
> immediately in the buddy or through the VM), guest physical memory
> backed by regular pages should be relocated on hugepages
> - automatically (with khugepaged)
> + automatically (with khugepaged, limited to anonymous huge pages for now)
Is it still relevant? I think the patch #30 at the support for tmpfs/shmem.
[...]
> == Need of application restart ==
>
> -The transparent_hugepage/enabled values only affect future
> -behavior. So to make them effective you need to restart any
> -application that could have been using hugepages. This also applies to
> -the regions registered in khugepaged.
> +The transparent_hugepage/enabled values and tmpfs mount option only affect
> +future behavior. So to make them effective you need to restart any
> +application that could have been using hugepages. This also applies to the
> +regions registered in khugepaged.
>
> == Monitoring usage ==
>
> -The number of transparent huge pages currently used by the system is
> -available by reading the AnonHugePages field in /proc/meminfo. To
> -identify what applications are using transparent huge pages, it is
> -necessary to read /proc/PID/smaps and count the AnonHugePages fields
> -for each mapping. Note that reading the smaps file is expensive and
> -reading it frequently will incur overhead.
> +The number of anonymous transparent huge pages currently used by the
> +system is available by reading the AnonHugePages field in /proc/meminfo.
> +To identify what applications are using anonymous transparent huge pages,
> +it is necessary to read /proc/PID/smaps and count the AnonHugePages fields
> +for each mapping.
> +
> +The number of file transparent huge pages mapped to userspace is available
> +by reading the FileHugeMapped field in /proc/meminfo. To identify what
> +applications are mapping file transparent huge pages, it is necessary
> +to read /proc/PID/smaps and count the FileHugeMapped fields for each
> +mapping.
I cannot find the field FileHugeMapped in /proc/meminfo and
/proc/PID/smaps. However, there are 2 new fields ShmemHugePages and
ShmemPmdMapped.
Also I guess that filesystems/proc.txt has to be updated to explain the
new fields.
Regards,
--
Julien Grall
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt
2016-05-19 16:20 ` Julien Grall
@ 2016-05-20 10:33 ` Kirill A. Shutemov
0 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-20 10:33 UTC (permalink / raw)
To: Julien Grall
Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
Andrew Morton, Dave Hansen, Vlastimil Babka, Christoph Lameter,
Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
Andres Lagar-Cavilla, Ning Qu, linux-kernel, linux-mm,
linux-fsdevel, Steve Capper
On Thu, May 19, 2016 at 05:20:01PM +0100, Julien Grall wrote:
> Hello Kirill,
>
> On 12/05/16 16:41, Kirill A. Shutemov wrote:
> >Add info about tmpfs/shmem with huge pages.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> > Documentation/vm/transhuge.txt | 130 +++++++++++++++++++++++++++++------------
> > 1 file changed, 93 insertions(+), 37 deletions(-)
> >
> >diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> >index d9cb65cf5cfd..96a49f123cac 100644
> >--- a/Documentation/vm/transhuge.txt
> >+++ b/Documentation/vm/transhuge.txt
> >@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
> > that supports the automatic promotion and demotion of page sizes and
> > without the shortcomings of hugetlbfs.
> >
> >-Currently it only works for anonymous memory mappings but in the
> >-future it can expand over the pagecache layer starting with tmpfs.
> >+Currently it only works for anonymous memory mappings and tmpfs/shmem.
> >+But in the future it can expand to other filesystems.
> >
> > The reason applications are running faster is because of two
> > factors. The first factor is almost completely irrelevant and it's not
> >@@ -48,7 +48,7 @@ miss is going to run faster.
> > - if some task quits and more hugepages become available (either
> > immediately in the buddy or through the VM), guest physical memory
> > backed by regular pages should be relocated on hugepages
> >- automatically (with khugepaged)
> >+ automatically (with khugepaged, limited to anonymous huge pages for now)
>
> Is it still relevant? I think the patch #30 at the support for tmpfs/shmem.
I forgot to update documentation. I'll do for the next round when rebase
to v4.7-rc1.
Thanks for noticing this.
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCHv8 27/32] thp: extract khugepaged from mm/huge_memory.c
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (25 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 26/32] thp: update Documentation/vm/transhuge.txt Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 28/32] khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() Kirill A. Shutemov
` (5 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
khugepaged implementation grew to the point when it deserve separate
file in source.
Let's move it to mm/khugepaged.c.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/huge_mm.h | 10 +
include/linux/khugepaged.h | 6 +
mm/Makefile | 2 +-
mm/huge_memory.c | 1428 +-------------------------------------------
mm/khugepaged.c | 1428 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1453 insertions(+), 1421 deletions(-)
create mode 100644 mm/khugepaged.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0315e8d6db57..bff34b8e2950 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -41,6 +41,16 @@ enum transparent_hugepage_flag {
#endif
};
+struct kobject;
+struct kobj_attribute;
+
+extern ssize_t single_hugepage_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag flag);
+extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eeb307985715..f7b4b7dc97bc 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,12 @@
#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct mutex khugepaged_mutex;
+extern struct attribute_group khugepaged_attr_group;
+
+extern int khugepaged_init(void);
+extern void khugepaged_destroy(void);
+extern int start_stop_khugepaged(void);
extern int __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
diff --git a/mm/Makefile b/mm/Makefile
index deb467edca2d..298f34cc4071 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -74,7 +74,7 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3658b7d2c424..c08d3d300f4c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,7 +18,6 @@
#include <linux/mm_inline.h>
#include <linux/swapops.h>
#include <linux/dax.h>
-#include <linux/kthread.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/pfn_t.h>
@@ -37,35 +36,6 @@
#include <asm/pgalloc.h>
#include "internal.h"
-enum scan_result {
- SCAN_FAIL,
- SCAN_SUCCEED,
- SCAN_PMD_NULL,
- SCAN_EXCEED_NONE_PTE,
- SCAN_PTE_NON_PRESENT,
- SCAN_PAGE_RO,
- SCAN_NO_REFERENCED_PAGE,
- SCAN_PAGE_NULL,
- SCAN_SCAN_ABORT,
- SCAN_PAGE_COUNT,
- SCAN_PAGE_LRU,
- SCAN_PAGE_LOCK,
- SCAN_PAGE_ANON,
- SCAN_PAGE_COMPOUND,
- SCAN_ANY_PROCESS,
- SCAN_VMA_NULL,
- SCAN_VMA_CHECK,
- SCAN_ADDRESS_RANGE,
- SCAN_SWAP_CACHE_PAGE,
- SCAN_DEL_PAGE_LRU,
- SCAN_ALLOC_HUGE_PAGE_FAIL,
- SCAN_CGROUP_CHARGE_FAIL,
- SCAN_EXCEED_SWAP_PTE
-};
-
-#define CREATE_TRACE_POINTS
-#include <trace/events/huge_memory.h>
-
/*
* By default transparent hugepage support is disabled in order that avoid
* to risk increase the memory footprint of applications without a guaranteed
@@ -85,127 +55,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
-/* default scan 8*512 pte (or vmas) every 30 second */
-static unsigned int khugepaged_pages_to_scan __read_mostly;
-static unsigned int khugepaged_pages_collapsed;
-static unsigned int khugepaged_full_scans;
-static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
-/* during fragmentation poll the hugepage allocator once every minute */
-static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
-static struct task_struct *khugepaged_thread __read_mostly;
-static DEFINE_MUTEX(khugepaged_mutex);
-static DEFINE_SPINLOCK(khugepaged_mm_lock);
-static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
-/*
- * default collapse hugepages if there is at least one pte mapped like
- * it would have happened if the vma was large enough during page
- * fault.
- */
-static unsigned int khugepaged_max_ptes_none __read_mostly;
-static unsigned int khugepaged_max_ptes_swap __read_mostly = HPAGE_PMD_NR/8;
-
-static int khugepaged(void *none);
-static int khugepaged_slab_init(void);
-static void khugepaged_slab_exit(void);
-
-#define MM_SLOTS_HASH_BITS 10
-static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
-
-static struct kmem_cache *mm_slot_cache __read_mostly;
-
-/**
- * struct mm_slot - hash lookup from mm to mm_slot
- * @hash: hash collision list
- * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
- * @mm: the mm that this information is valid for
- */
-struct mm_slot {
- struct hlist_node hash;
- struct list_head mm_node;
- struct mm_struct *mm;
-};
-
-/**
- * struct khugepaged_scan - cursor for scanning
- * @mm_head: the head of the mm list to scan
- * @mm_slot: the current mm_slot we are scanning
- * @address: the next address inside that to be scanned
- *
- * There is only the one khugepaged_scan instance of this cursor structure.
- */
-struct khugepaged_scan {
- struct list_head mm_head;
- struct mm_slot *mm_slot;
- unsigned long address;
-};
-static struct khugepaged_scan khugepaged_scan = {
- .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
-};
-
static struct shrinker deferred_split_shrinker;
-static void set_recommended_min_free_kbytes(void)
-{
- struct zone *zone;
- int nr_zones = 0;
- unsigned long recommended_min;
-
- for_each_populated_zone(zone)
- nr_zones++;
-
- /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
- recommended_min = pageblock_nr_pages * nr_zones * 2;
-
- /*
- * Make sure that on average at least two pageblocks are almost free
- * of another type, one for a migratetype to fall back to and a
- * second to avoid subsequent fallbacks of other types There are 3
- * MIGRATE_TYPES we care about.
- */
- recommended_min += pageblock_nr_pages * nr_zones *
- MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
-
- /* don't ever allow to reserve more than 5% of the lowmem */
- recommended_min = min(recommended_min,
- (unsigned long) nr_free_buffer_pages() / 20);
- recommended_min <<= (PAGE_SHIFT-10);
-
- if (recommended_min > min_free_kbytes) {
- if (user_min_free_kbytes >= 0)
- pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
- min_free_kbytes, recommended_min);
-
- min_free_kbytes = recommended_min;
- }
- setup_per_zone_wmarks();
-}
-
-static int start_stop_khugepaged(void)
-{
- int err = 0;
- if (khugepaged_enabled()) {
- if (!khugepaged_thread)
- khugepaged_thread = kthread_run(khugepaged, NULL,
- "khugepaged");
- if (IS_ERR(khugepaged_thread)) {
- pr_err("khugepaged: kthread_run(khugepaged) failed\n");
- err = PTR_ERR(khugepaged_thread);
- khugepaged_thread = NULL;
- goto fail;
- }
-
- if (!list_empty(&khugepaged_scan.mm_head))
- wake_up_interruptible(&khugepaged_wait);
-
- set_recommended_min_free_kbytes();
- } else if (khugepaged_thread) {
- kthread_stop(khugepaged_thread);
- khugepaged_thread = NULL;
- }
-fail:
- return err;
-}
-
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
@@ -346,7 +197,7 @@ static ssize_t enabled_store(struct kobject *kobj,
static struct kobj_attribute enabled_attr =
__ATTR(enabled, 0644, enabled_show, enabled_store);
-static ssize_t single_flag_show(struct kobject *kobj,
+ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag)
{
@@ -354,7 +205,7 @@ static ssize_t single_flag_show(struct kobject *kobj,
!!test_bit(flag, &transparent_hugepage_flags));
}
-static ssize_t single_flag_store(struct kobject *kobj,
+ssize_t single_hugepage_flag_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count,
enum transparent_hugepage_flag flag)
@@ -409,13 +260,13 @@ static struct kobj_attribute defrag_attr =
static ssize_t use_zero_page_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return single_flag_show(kobj, attr, buf,
+ return single_hugepage_flag_show(kobj, attr, buf,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
}
static ssize_t use_zero_page_store(struct kobject *kobj,
struct kobj_attribute *attr, const char *buf, size_t count)
{
- return single_flag_store(kobj, attr, buf, count,
+ return single_hugepage_flag_store(kobj, attr, buf, count,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
}
static struct kobj_attribute use_zero_page_attr =
@@ -424,14 +275,14 @@ static struct kobj_attribute use_zero_page_attr =
static ssize_t debug_cow_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return single_flag_show(kobj, attr, buf,
+ return single_hugepage_flag_show(kobj, attr, buf,
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
}
static ssize_t debug_cow_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
- return single_flag_store(kobj, attr, buf, count,
+ return single_hugepage_flag_store(kobj, attr, buf, count,
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
}
static struct kobj_attribute debug_cow_attr =
@@ -455,197 +306,6 @@ static struct attribute_group hugepage_attr_group = {
.attrs = hugepage_attr,
};
-static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
-}
-
-static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- unsigned long msecs;
- int err;
-
- err = kstrtoul(buf, 10, &msecs);
- if (err || msecs > UINT_MAX)
- return -EINVAL;
-
- khugepaged_scan_sleep_millisecs = msecs;
- wake_up_interruptible(&khugepaged_wait);
-
- return count;
-}
-static struct kobj_attribute scan_sleep_millisecs_attr =
- __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
- scan_sleep_millisecs_store);
-
-static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
-}
-
-static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- unsigned long msecs;
- int err;
-
- err = kstrtoul(buf, 10, &msecs);
- if (err || msecs > UINT_MAX)
- return -EINVAL;
-
- khugepaged_alloc_sleep_millisecs = msecs;
- wake_up_interruptible(&khugepaged_wait);
-
- return count;
-}
-static struct kobj_attribute alloc_sleep_millisecs_attr =
- __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
- alloc_sleep_millisecs_store);
-
-static ssize_t pages_to_scan_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
-}
-static ssize_t pages_to_scan_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long pages;
-
- err = kstrtoul(buf, 10, &pages);
- if (err || !pages || pages > UINT_MAX)
- return -EINVAL;
-
- khugepaged_pages_to_scan = pages;
-
- return count;
-}
-static struct kobj_attribute pages_to_scan_attr =
- __ATTR(pages_to_scan, 0644, pages_to_scan_show,
- pages_to_scan_store);
-
-static ssize_t pages_collapsed_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
-}
-static struct kobj_attribute pages_collapsed_attr =
- __ATTR_RO(pages_collapsed);
-
-static ssize_t full_scans_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_full_scans);
-}
-static struct kobj_attribute full_scans_attr =
- __ATTR_RO(full_scans);
-
-static ssize_t khugepaged_defrag_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
-{
- return single_flag_show(kobj, attr, buf,
- TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
-}
-static ssize_t khugepaged_defrag_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- return single_flag_store(kobj, attr, buf, count,
- TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
-}
-static struct kobj_attribute khugepaged_defrag_attr =
- __ATTR(defrag, 0644, khugepaged_defrag_show,
- khugepaged_defrag_store);
-
-/*
- * max_ptes_none controls if khugepaged should collapse hugepages over
- * any unmapped ptes in turn potentially increasing the memory
- * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
- * reduce the available free memory in the system as it
- * runs. Increasing max_ptes_none will instead potentially reduce the
- * free memory in the system during the khugepaged scan.
- */
-static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
-}
-static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long max_ptes_none;
-
- err = kstrtoul(buf, 10, &max_ptes_none);
- if (err || max_ptes_none > HPAGE_PMD_NR-1)
- return -EINVAL;
-
- khugepaged_max_ptes_none = max_ptes_none;
-
- return count;
-}
-static struct kobj_attribute khugepaged_max_ptes_none_attr =
- __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
- khugepaged_max_ptes_none_store);
-
-static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_max_ptes_swap);
-}
-
-static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long max_ptes_swap;
-
- err = kstrtoul(buf, 10, &max_ptes_swap);
- if (err || max_ptes_swap > HPAGE_PMD_NR-1)
- return -EINVAL;
-
- khugepaged_max_ptes_swap = max_ptes_swap;
-
- return count;
-}
-
-static struct kobj_attribute khugepaged_max_ptes_swap_attr =
- __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
- khugepaged_max_ptes_swap_store);
-
-static struct attribute *khugepaged_attr[] = {
- &khugepaged_defrag_attr.attr,
- &khugepaged_max_ptes_none_attr.attr,
- &pages_to_scan_attr.attr,
- &pages_collapsed_attr.attr,
- &full_scans_attr.attr,
- &scan_sleep_millisecs_attr.attr,
- &alloc_sleep_millisecs_attr.attr,
- &khugepaged_max_ptes_swap_attr.attr,
- NULL,
-};
-
-static struct attribute_group khugepaged_attr_group = {
- .attrs = khugepaged_attr,
- .name = "khugepaged",
-};
-
static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
{
int err;
@@ -704,8 +364,6 @@ static int __init hugepage_init(void)
return -EINVAL;
}
- khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
- khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
/*
* hugepages can't be allocated by the buddy allocator
*/
@@ -720,7 +378,7 @@ static int __init hugepage_init(void)
if (err)
goto err_sysfs;
- err = khugepaged_slab_init();
+ err = khugepaged_init();
if (err)
goto err_slab;
@@ -751,7 +409,7 @@ err_khugepaged:
err_split_shrinker:
unregister_shrinker(&huge_zero_page_shrinker);
err_hzp_shrinker:
- khugepaged_slab_exit();
+ khugepaged_destroy();
err_slab:
hugepage_exit_sysfs(hugepage_kobj);
err_sysfs:
@@ -906,12 +564,6 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
return GFP_TRANSHUGE | reclaim_flags;
}
-/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
-static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
-{
- return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
-}
-
/* Caller must hold page table lock. */
static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
@@ -1837,1070 +1489,6 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return NULL;
}
-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
-
-int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
-{
- switch (advice) {
- case MADV_HUGEPAGE:
-#ifdef CONFIG_S390
- /*
- * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
- * can't handle this properly after s390_enable_sie, so we simply
- * ignore the madvise to prevent qemu from causing a SIGSEGV.
- */
- if (mm_has_pgste(vma->vm_mm))
- return 0;
-#endif
- *vm_flags &= ~VM_NOHUGEPAGE;
- *vm_flags |= VM_HUGEPAGE;
- /*
- * If the vma become good for khugepaged to scan,
- * register it here without waiting a page fault that
- * may not happen any time soon.
- */
- if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
- khugepaged_enter_vma_merge(vma, *vm_flags))
- return -ENOMEM;
- break;
- case MADV_NOHUGEPAGE:
- *vm_flags &= ~VM_HUGEPAGE;
- *vm_flags |= VM_NOHUGEPAGE;
- /*
- * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
- * this vma even if we leave the mm registered in khugepaged if
- * it got registered before VM_NOHUGEPAGE was set.
- */
- break;
- }
-
- return 0;
-}
-
-static int __init khugepaged_slab_init(void)
-{
- mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
- sizeof(struct mm_slot),
- __alignof__(struct mm_slot), 0, NULL);
- if (!mm_slot_cache)
- return -ENOMEM;
-
- return 0;
-}
-
-static void __init khugepaged_slab_exit(void)
-{
- kmem_cache_destroy(mm_slot_cache);
-}
-
-static inline struct mm_slot *alloc_mm_slot(void)
-{
- if (!mm_slot_cache) /* initialization failed */
- return NULL;
- return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
-}
-
-static inline void free_mm_slot(struct mm_slot *mm_slot)
-{
- kmem_cache_free(mm_slot_cache, mm_slot);
-}
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
-
- hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
- if (mm == mm_slot->mm)
- return mm_slot;
-
- return NULL;
-}
-
-static void insert_to_mm_slots_hash(struct mm_struct *mm,
- struct mm_slot *mm_slot)
-{
- mm_slot->mm = mm;
- hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
-}
-
-static inline int khugepaged_test_exit(struct mm_struct *mm)
-{
- return atomic_read(&mm->mm_users) == 0;
-}
-
-int __khugepaged_enter(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
- int wakeup;
-
- mm_slot = alloc_mm_slot();
- if (!mm_slot)
- return -ENOMEM;
-
- /* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
- if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
- free_mm_slot(mm_slot);
- return 0;
- }
-
- spin_lock(&khugepaged_mm_lock);
- insert_to_mm_slots_hash(mm, mm_slot);
- /*
- * Insert just behind the scanning cursor, to let the area settle
- * down a little.
- */
- wakeup = list_empty(&khugepaged_scan.mm_head);
- list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
- spin_unlock(&khugepaged_mm_lock);
-
- atomic_inc(&mm->mm_count);
- if (wakeup)
- wake_up_interruptible(&khugepaged_wait);
-
- return 0;
-}
-
-int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- unsigned long hstart, hend;
- if (!vma->anon_vma)
- /*
- * Not yet faulted in so we will register later in the
- * page fault if needed.
- */
- return 0;
- if (vma->vm_ops)
- /* khugepaged not yet working on file or special mappings */
- return 0;
- VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (hstart < hend)
- return khugepaged_enter(vma, vm_flags);
- return 0;
-}
-
-void __khugepaged_exit(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
- int free = 0;
-
- spin_lock(&khugepaged_mm_lock);
- mm_slot = get_mm_slot(mm);
- if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
- hash_del(&mm_slot->hash);
- list_del(&mm_slot->mm_node);
- free = 1;
- }
- spin_unlock(&khugepaged_mm_lock);
-
- if (free) {
- clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
- free_mm_slot(mm_slot);
- mmdrop(mm);
- } else if (mm_slot) {
- /*
- * This is required to serialize against
- * khugepaged_test_exit() (which is guaranteed to run
- * under mmap sem read mode). Stop here (after we
- * return all pagetables will be destroyed) until
- * khugepaged has finished working on the pagetables
- * under the mmap_sem.
- */
- down_write(&mm->mmap_sem);
- up_write(&mm->mmap_sem);
- }
-}
-
-static void release_pte_page(struct page *page)
-{
- /* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
- unlock_page(page);
- putback_lru_page(page);
-}
-
-static void release_pte_pages(pte_t *pte, pte_t *_pte)
-{
- while (--_pte >= pte) {
- pte_t pteval = *_pte;
- if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
- release_pte_page(pte_page(pteval));
- }
-}
-
-static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *pte)
-{
- struct page *page = NULL;
- pte_t *_pte;
- int none_or_zero = 0, result = 0;
- bool referenced = false, writable = false;
-
- for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
- _pte++, address += PAGE_SIZE) {
- pte_t pteval = *_pte;
- if (pte_none(pteval) || (pte_present(pteval) &&
- is_zero_pfn(pte_pfn(pteval)))) {
- if (!userfaultfd_armed(vma) &&
- ++none_or_zero <= khugepaged_max_ptes_none) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- goto out;
- }
- }
- if (!pte_present(pteval)) {
- result = SCAN_PTE_NON_PRESENT;
- goto out;
- }
- page = vm_normal_page(vma, address, pteval);
- if (unlikely(!page)) {
- result = SCAN_PAGE_NULL;
- goto out;
- }
-
- VM_BUG_ON_PAGE(PageCompound(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
- VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-
- /*
- * We can do it before isolate_lru_page because the
- * page can't be freed from under us. NOTE: PG_lock
- * is needed to serialize against split_huge_page
- * when invoked from the VM.
- */
- if (!trylock_page(page)) {
- result = SCAN_PAGE_LOCK;
- goto out;
- }
-
- /*
- * cannot use mapcount: can't collapse if there's a gup pin.
- * The page must only be referenced by the scanned process
- * and page swap cache.
- */
- if (page_count(page) != 1 + !!PageSwapCache(page)) {
- unlock_page(page);
- result = SCAN_PAGE_COUNT;
- goto out;
- }
- if (pte_write(pteval)) {
- writable = true;
- } else {
- if (PageSwapCache(page) && !reuse_swap_page(page)) {
- unlock_page(page);
- result = SCAN_SWAP_CACHE_PAGE;
- goto out;
- }
- /*
- * Page is not in the swap cache. It can be collapsed
- * into a THP.
- */
- }
-
- /*
- * Isolate the page to avoid collapsing an hugepage
- * currently in use by the VM.
- */
- if (isolate_lru_page(page)) {
- unlock_page(page);
- result = SCAN_DEL_PAGE_LRU;
- goto out;
- }
- /* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(PageLRU(page), page);
-
- /* If there is no mapped pte young don't collapse the page */
- if (pte_young(pteval) ||
- page_is_young(page) || PageReferenced(page) ||
- mmu_notifier_test_young(vma->vm_mm, address))
- referenced = true;
- }
- if (likely(writable)) {
- if (likely(referenced)) {
- result = SCAN_SUCCEED;
- trace_mm_collapse_huge_page_isolate(page, none_or_zero,
- referenced, writable, result);
- return 1;
- }
- } else {
- result = SCAN_PAGE_RO;
- }
-
-out:
- release_pte_pages(pte, _pte);
- trace_mm_collapse_huge_page_isolate(page, none_or_zero,
- referenced, writable, result);
- return 0;
-}
-
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
- struct vm_area_struct *vma,
- unsigned long address,
- spinlock_t *ptl)
-{
- pte_t *_pte;
- for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
- pte_t pteval = *_pte;
- struct page *src_page;
-
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- clear_user_highpage(page, address);
- add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
- if (is_zero_pfn(pte_pfn(pteval))) {
- /*
- * ptl mostly unnecessary.
- */
- spin_lock(ptl);
- /*
- * paravirt calls inside pte_clear here are
- * superfluous.
- */
- pte_clear(vma->vm_mm, address, _pte);
- spin_unlock(ptl);
- }
- } else {
- src_page = pte_page(pteval);
- copy_user_highpage(page, src_page, address, vma);
- VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
- release_pte_page(src_page);
- /*
- * ptl mostly unnecessary, but preempt has to
- * be disabled to update the per-cpu stats
- * inside page_remove_rmap().
- */
- spin_lock(ptl);
- /*
- * paravirt calls inside pte_clear here are
- * superfluous.
- */
- pte_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page, false);
- spin_unlock(ptl);
- free_page_and_swap_cache(src_page);
- }
-
- address += PAGE_SIZE;
- page++;
- }
-}
-
-static void khugepaged_alloc_sleep(void)
-{
- DEFINE_WAIT(wait);
-
- add_wait_queue(&khugepaged_wait, &wait);
- freezable_schedule_timeout_interruptible(
- msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
- remove_wait_queue(&khugepaged_wait, &wait);
-}
-
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
-{
- int i;
-
- /*
- * If zone_reclaim_mode is disabled, then no extra effort is made to
- * allocate memory locally.
- */
- if (!zone_reclaim_mode)
- return false;
-
- /* If there is a count for this node already, it must be acceptable */
- if (khugepaged_node_load[nid])
- return false;
-
- for (i = 0; i < MAX_NUMNODES; i++) {
- if (!khugepaged_node_load[i])
- continue;
- if (node_distance(nid, i) > RECLAIM_DISTANCE)
- return true;
- }
- return false;
-}
-
-#ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
-{
- static int last_khugepaged_target_node = NUMA_NO_NODE;
- int nid, target_node = 0, max_value = 0;
-
- /* find first node with max normal pages hit */
- for (nid = 0; nid < MAX_NUMNODES; nid++)
- if (khugepaged_node_load[nid] > max_value) {
- max_value = khugepaged_node_load[nid];
- target_node = nid;
- }
-
- /* do some balance if several nodes have the same hit record */
- if (target_node <= last_khugepaged_target_node)
- for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
- nid++)
- if (max_value == khugepaged_node_load[nid]) {
- target_node = nid;
- break;
- }
-
- last_khugepaged_target_node = target_node;
- return target_node;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- if (IS_ERR(*hpage)) {
- if (!*wait)
- return false;
-
- *wait = false;
- *hpage = NULL;
- khugepaged_alloc_sleep();
- } else if (*hpage) {
- put_page(*hpage);
- *hpage = NULL;
- }
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
-{
- VM_BUG_ON_PAGE(*hpage, *hpage);
-
- /*
- * Before allocating the hugepage, release the mmap_sem read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_sem during
- * that. We will recheck the vma after taking it again in write mode.
- */
- up_read(&mm->mmap_sem);
-
- *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
- if (unlikely(!*hpage)) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- *hpage = ERR_PTR(-ENOMEM);
- return NULL;
- }
-
- prep_transhuge_page(*hpage);
- count_vm_event(THP_COLLAPSE_ALLOC);
- return *hpage;
-}
-#else
-static int khugepaged_find_target_node(void)
-{
- return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
- struct page *page;
-
- page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
- HPAGE_PMD_ORDER);
- if (page)
- prep_transhuge_page(page);
- return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
- struct page *hpage;
-
- do {
- hpage = alloc_khugepaged_hugepage();
- if (!hpage) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- if (!*wait)
- return NULL;
-
- *wait = false;
- khugepaged_alloc_sleep();
- } else
- count_vm_event(THP_COLLAPSE_ALLOC);
- } while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
- return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- if (!*hpage)
- *hpage = khugepaged_alloc_hugepage(wait);
-
- if (unlikely(!*hpage))
- return false;
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
-{
- up_read(&mm->mmap_sem);
- VM_BUG_ON(!*hpage);
-
- return *hpage;
-}
-#endif
-
-static bool hugepage_vma_check(struct vm_area_struct *vma)
-{
- if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
- (vma->vm_flags & VM_NOHUGEPAGE))
- return false;
- if (!vma->anon_vma || vma->vm_ops)
- return false;
- if (is_vma_temporary_stack(vma))
- return false;
- VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
- return true;
-}
-
-/*
- * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
- *
- * Called and returns without pte mapped or spinlocks held,
- * but with mmap_sem held to protect against vma changes.
- */
-
-static void __collapse_huge_page_swapin(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd)
-{
- pte_t pteval;
- int swapped_in = 0, ret = 0;
- struct fault_env fe = {
- .vma = vma,
- .address = address,
- .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- .pmd = pmd,
- };
-
- fe.pte = pte_offset_map(pmd, address);
- for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
- fe.pte++, fe.address += PAGE_SIZE) {
- pteval = *fe.pte;
- if (!is_swap_pte(pteval))
- continue;
- swapped_in++;
- ret = do_swap_page(&fe, pteval);
- if (ret & VM_FAULT_ERROR) {
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
- return;
- }
- /* pte is unmapped now, we need to map it */
- fe.pte = pte_offset_map(pmd, fe.address);
- }
- fe.pte--;
- pte_unmap(fe.pte);
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
-}
-
-static void collapse_huge_page(struct mm_struct *mm,
- unsigned long address,
- struct page **hpage,
- struct vm_area_struct *vma,
- int node)
-{
- pmd_t *pmd, _pmd;
- pte_t *pte;
- pgtable_t pgtable;
- struct page *new_page;
- spinlock_t *pmd_ptl, *pte_ptl;
- int isolated = 0, result = 0;
- unsigned long hstart, hend;
- struct mem_cgroup *memcg;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
- gfp_t gfp;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- /* Only allocate from the target node */
- gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
-
- /* release the mmap_sem read lock. */
- new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
- if (!new_page) {
- result = SCAN_ALLOC_HUGE_PAGE_FAIL;
- goto out_nolock;
- }
-
- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
- result = SCAN_CGROUP_CHARGE_FAIL;
- goto out_nolock;
- }
-
- /*
- * Prevent all access to pagetables with the exception of
- * gup_fast later hanlded by the ptep_clear_flush and the VM
- * handled by the anon_vma lock + PG_lock.
- */
- down_write(&mm->mmap_sem);
- if (unlikely(khugepaged_test_exit(mm))) {
- result = SCAN_ANY_PROCESS;
- goto out;
- }
-
- vma = find_vma(mm, address);
- if (!vma) {
- result = SCAN_VMA_NULL;
- goto out;
- }
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (address < hstart || address + HPAGE_PMD_SIZE > hend) {
- result = SCAN_ADDRESS_RANGE;
- goto out;
- }
- if (!hugepage_vma_check(vma)) {
- result = SCAN_VMA_CHECK;
- goto out;
- }
- pmd = mm_find_pmd(mm, address);
- if (!pmd) {
- result = SCAN_PMD_NULL;
- goto out;
- }
-
- __collapse_huge_page_swapin(mm, vma, address, pmd);
-
- anon_vma_lock_write(vma->anon_vma);
-
- pte = pte_offset_map(pmd, address);
- pte_ptl = pte_lockptr(mm, pmd);
-
- mmun_start = address;
- mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
- /*
- * After this gup_fast can't run anymore. This also removes
- * any huge TLB entry from the CPU so we won't allow
- * huge and small TLB entries for the same virtual address
- * to avoid the risk of CPU bugs in that area.
- */
- _pmd = pmdp_collapse_flush(vma, address, pmd);
- spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- spin_lock(pte_ptl);
- isolated = __collapse_huge_page_isolate(vma, address, pte);
- spin_unlock(pte_ptl);
-
- if (unlikely(!isolated)) {
- pte_unmap(pte);
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- /*
- * We can only use set_pmd_at when establishing
- * hugepmds and never for establishing regular pmds that
- * points to regular pagetables. Use pmd_populate for that
- */
- pmd_populate(mm, pmd, pmd_pgtable(_pmd));
- spin_unlock(pmd_ptl);
- anon_vma_unlock_write(vma->anon_vma);
- result = SCAN_FAIL;
- goto out;
- }
-
- /*
- * All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
- */
- anon_vma_unlock_write(vma->anon_vma);
-
- __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
- pte_unmap(pte);
- __SetPageUptodate(new_page);
- pgtable = pmd_pgtable(_pmd);
-
- _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
- _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
- /*
- * spin_lock() below is not the equivalent of smp_wmb(), so
- * this is needed to avoid the copy_huge_page writes to become
- * visible after the set_pmd_at() write.
- */
- smp_wmb();
-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false, true);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- spin_unlock(pmd_ptl);
-
- *hpage = NULL;
-
- khugepaged_pages_collapsed++;
- result = SCAN_SUCCEED;
-out_up_write:
- up_write(&mm->mmap_sem);
-out_nolock:
- trace_mm_collapse_huge_page(mm, isolated, result);
- return;
-out:
- mem_cgroup_cancel_charge(new_page, memcg, true);
- goto out_up_write;
-}
-
-static int khugepaged_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- struct page **hpage)
-{
- pmd_t *pmd;
- pte_t *pte, *_pte;
- int ret = 0, none_or_zero = 0, result = 0;
- struct page *page = NULL;
- unsigned long _address;
- spinlock_t *ptl;
- int node = NUMA_NO_NODE, unmapped = 0;
- bool writable = false, referenced = false;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- pmd = mm_find_pmd(mm, address);
- if (!pmd) {
- result = SCAN_PMD_NULL;
- goto out;
- }
-
- memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
- _pte++, _address += PAGE_SIZE) {
- pte_t pteval = *_pte;
- if (is_swap_pte(pteval)) {
- if (++unmapped <= khugepaged_max_ptes_swap) {
- continue;
- } else {
- result = SCAN_EXCEED_SWAP_PTE;
- goto out_unmap;
- }
- }
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- if (!userfaultfd_armed(vma) &&
- ++none_or_zero <= khugepaged_max_ptes_none) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- goto out_unmap;
- }
- }
- if (!pte_present(pteval)) {
- result = SCAN_PTE_NON_PRESENT;
- goto out_unmap;
- }
- if (pte_write(pteval))
- writable = true;
-
- page = vm_normal_page(vma, _address, pteval);
- if (unlikely(!page)) {
- result = SCAN_PAGE_NULL;
- goto out_unmap;
- }
-
- /* TODO: teach khugepaged to collapse THP mapped with pte */
- if (PageCompound(page)) {
- result = SCAN_PAGE_COMPOUND;
- goto out_unmap;
- }
-
- /*
- * Record which node the original page is from and save this
- * information to khugepaged_node_load[].
- * Khupaged will allocate hugepage from the node has the max
- * hit record.
- */
- node = page_to_nid(page);
- if (khugepaged_scan_abort(node)) {
- result = SCAN_SCAN_ABORT;
- goto out_unmap;
- }
- khugepaged_node_load[node]++;
- if (!PageLRU(page)) {
- result = SCAN_PAGE_LRU;
- goto out_unmap;
- }
- if (PageLocked(page)) {
- result = SCAN_PAGE_LOCK;
- goto out_unmap;
- }
- if (!PageAnon(page)) {
- result = SCAN_PAGE_ANON;
- goto out_unmap;
- }
-
- /*
- * cannot use mapcount: can't collapse if there's a gup pin.
- * The page must only be referenced by the scanned process
- * and page swap cache.
- */
- if (page_count(page) != 1 + !!PageSwapCache(page)) {
- result = SCAN_PAGE_COUNT;
- goto out_unmap;
- }
- if (pte_young(pteval) ||
- page_is_young(page) || PageReferenced(page) ||
- mmu_notifier_test_young(vma->vm_mm, address))
- referenced = true;
- }
- if (writable) {
- if (referenced) {
- result = SCAN_SUCCEED;
- ret = 1;
- } else {
- result = SCAN_NO_REFERENCED_PAGE;
- }
- } else {
- result = SCAN_PAGE_RO;
- }
-out_unmap:
- pte_unmap_unlock(pte, ptl);
- if (ret) {
- node = khugepaged_find_target_node();
- /* collapse_huge_page will return with the mmap_sem released */
- collapse_huge_page(mm, address, hpage, vma, node);
- }
-out:
- trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
- none_or_zero, result, unmapped);
- return ret;
-}
-
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
- struct mm_struct *mm = mm_slot->mm;
-
- VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
- if (khugepaged_test_exit(mm)) {
- /* free mm_slot */
- hash_del(&mm_slot->hash);
- list_del(&mm_slot->mm_node);
-
- /*
- * Not strictly needed because the mm exited already.
- *
- * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
- */
-
- /* khugepaged_mm_lock actually not necessary for the below */
- free_mm_slot(mm_slot);
- mmdrop(mm);
- }
-}
-
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
- struct page **hpage)
- __releases(&khugepaged_mm_lock)
- __acquires(&khugepaged_mm_lock)
-{
- struct mm_slot *mm_slot;
- struct mm_struct *mm;
- struct vm_area_struct *vma;
- int progress = 0;
-
- VM_BUG_ON(!pages);
- VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
- if (khugepaged_scan.mm_slot)
- mm_slot = khugepaged_scan.mm_slot;
- else {
- mm_slot = list_entry(khugepaged_scan.mm_head.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- khugepaged_scan.mm_slot = mm_slot;
- }
- spin_unlock(&khugepaged_mm_lock);
-
- mm = mm_slot->mm;
- down_read(&mm->mmap_sem);
- if (unlikely(khugepaged_test_exit(mm)))
- vma = NULL;
- else
- vma = find_vma(mm, khugepaged_scan.address);
-
- progress++;
- for (; vma; vma = vma->vm_next) {
- unsigned long hstart, hend;
-
- cond_resched();
- if (unlikely(khugepaged_test_exit(mm))) {
- progress++;
- break;
- }
- if (!hugepage_vma_check(vma)) {
-skip:
- progress++;
- continue;
- }
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (hstart >= hend)
- goto skip;
- if (khugepaged_scan.address > hend)
- goto skip;
- if (khugepaged_scan.address < hstart)
- khugepaged_scan.address = hstart;
- VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
-
- while (khugepaged_scan.address < hend) {
- int ret;
- cond_resched();
- if (unlikely(khugepaged_test_exit(mm)))
- goto breakouterloop;
-
- VM_BUG_ON(khugepaged_scan.address < hstart ||
- khugepaged_scan.address + HPAGE_PMD_SIZE >
- hend);
- ret = khugepaged_scan_pmd(mm, vma,
- khugepaged_scan.address,
- hpage);
- /* move to next address */
- khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
- if (ret)
- /* we released mmap_sem so break loop */
- goto breakouterloop_mmap_sem;
- if (progress >= pages)
- goto breakouterloop;
- }
- }
-breakouterloop:
- up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
-breakouterloop_mmap_sem:
-
- spin_lock(&khugepaged_mm_lock);
- VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
- /*
- * Release the current mm_slot if this mm is about to die, or
- * if we scanned all vmas of this mm.
- */
- if (khugepaged_test_exit(mm) || !vma) {
- /*
- * Make sure that if mm_users is reaching zero while
- * khugepaged runs here, khugepaged_exit will find
- * mm_slot not pointing to the exiting mm.
- */
- if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
- khugepaged_scan.mm_slot = list_entry(
- mm_slot->mm_node.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- } else {
- khugepaged_scan.mm_slot = NULL;
- khugepaged_full_scans++;
- }
-
- collect_mm_slot(mm_slot);
- }
-
- return progress;
-}
-
-static int khugepaged_has_work(void)
-{
- return !list_empty(&khugepaged_scan.mm_head) &&
- khugepaged_enabled();
-}
-
-static int khugepaged_wait_event(void)
-{
- return !list_empty(&khugepaged_scan.mm_head) ||
- kthread_should_stop();
-}
-
-static void khugepaged_do_scan(void)
-{
- struct page *hpage = NULL;
- unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
- bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
-
- while (progress < pages) {
- if (!khugepaged_prealloc_page(&hpage, &wait))
- break;
-
- cond_resched();
-
- if (unlikely(kthread_should_stop() || try_to_freeze()))
- break;
-
- spin_lock(&khugepaged_mm_lock);
- if (!khugepaged_scan.mm_slot)
- pass_through_head++;
- if (khugepaged_has_work() &&
- pass_through_head < 2)
- progress += khugepaged_scan_mm_slot(pages - progress,
- &hpage);
- else
- progress = pages;
- spin_unlock(&khugepaged_mm_lock);
- }
-
- if (!IS_ERR_OR_NULL(hpage))
- put_page(hpage);
-}
-
-static void khugepaged_wait_work(void)
-{
- if (khugepaged_has_work()) {
- if (!khugepaged_scan_sleep_millisecs)
- return;
-
- wait_event_freezable_timeout(khugepaged_wait,
- kthread_should_stop(),
- msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
- return;
- }
-
- if (khugepaged_enabled())
- wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
-}
-
-static int khugepaged(void *none)
-{
- struct mm_slot *mm_slot;
-
- set_freezable();
- set_user_nice(current, MAX_NICE);
-
- while (!kthread_should_stop()) {
- khugepaged_do_scan();
- khugepaged_wait_work();
- }
-
- spin_lock(&khugepaged_mm_lock);
- mm_slot = khugepaged_scan.mm_slot;
- khugepaged_scan.mm_slot = NULL;
- if (mm_slot)
- collect_mm_slot(mm_slot);
- spin_unlock(&khugepaged_mm_lock);
- return 0;
-}
-
static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
new file mode 100644
index 000000000000..ef229cdfe5b7
--- /dev/null
+++ b/mm/khugepaged.c
@@ -0,0 +1,1428 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
+#include <linux/freezer.h>
+#include <linux/mman.h>
+#include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/page_idle.h>
+#include <linux/swapops.h>
+
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+enum scan_result {
+ SCAN_FAIL,
+ SCAN_SUCCEED,
+ SCAN_PMD_NULL,
+ SCAN_EXCEED_NONE_PTE,
+ SCAN_PTE_NON_PRESENT,
+ SCAN_PAGE_RO,
+ SCAN_NO_REFERENCED_PAGE,
+ SCAN_PAGE_NULL,
+ SCAN_SCAN_ABORT,
+ SCAN_PAGE_COUNT,
+ SCAN_PAGE_LRU,
+ SCAN_PAGE_LOCK,
+ SCAN_PAGE_ANON,
+ SCAN_PAGE_COMPOUND,
+ SCAN_ANY_PROCESS,
+ SCAN_VMA_NULL,
+ SCAN_VMA_CHECK,
+ SCAN_ADDRESS_RANGE,
+ SCAN_SWAP_CACHE_PAGE,
+ SCAN_DEL_PAGE_LRU,
+ SCAN_ALLOC_HUGE_PAGE_FAIL,
+ SCAN_CGROUP_CHARGE_FAIL,
+ SCAN_EXCEED_SWAP_PTE
+};
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/huge_memory.h>
+
+/* default scan 8*512 pte (or vmas) every 30 second */
+static unsigned int khugepaged_pages_to_scan __read_mostly;
+static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_full_scans;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+/*
+ * default collapse hugepages if there is at least one pte mapped like
+ * it would have happened if the vma was large enough during page
+ * fault.
+ */
+static unsigned int khugepaged_max_ptes_none __read_mostly;
+static unsigned int khugepaged_max_ptes_swap __read_mostly;
+
+#define MM_SLOTS_HASH_BITS 10
+static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
+
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+ struct hlist_node hash;
+ struct list_head mm_node;
+ struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+ struct list_head mm_head;
+ struct mm_slot *mm_slot;
+ unsigned long address;
+};
+static struct khugepaged_scan khugepaged_scan = {
+ .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = kstrtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_scan_sleep_millisecs = msecs;
+ wake_up_interruptible(&khugepaged_wait);
+
+ return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+ __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+ scan_sleep_millisecs_store);
+
+static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
+}
+
+static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = kstrtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_alloc_sleep_millisecs = msecs;
+ wake_up_interruptible(&khugepaged_wait);
+
+ return count;
+}
+static struct kobj_attribute alloc_sleep_millisecs_attr =
+ __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
+ alloc_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long pages;
+
+ err = kstrtoul(buf, 10, &pages);
+ if (err || !pages || pages > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_pages_to_scan = pages;
+
+ return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+ __ATTR(pages_to_scan, 0644, pages_to_scan_show,
+ pages_to_scan_store);
+
+static ssize_t pages_collapsed_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
+}
+static struct kobj_attribute pages_collapsed_attr =
+ __ATTR_RO(pages_collapsed);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_full_scans);
+}
+static struct kobj_attribute full_scans_attr =
+ __ATTR_RO(full_scans);
+
+static ssize_t khugepaged_defrag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_hugepage_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static ssize_t khugepaged_defrag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return single_hugepage_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static struct kobj_attribute khugepaged_defrag_attr =
+ __ATTR(defrag, 0644, khugepaged_defrag_show,
+ khugepaged_defrag_store);
+
+/*
+ * max_ptes_none controls if khugepaged should collapse hugepages over
+ * any unmapped ptes in turn potentially increasing the memory
+ * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
+ * reduce the available free memory in the system as it
+ * runs. Increasing max_ptes_none will instead potentially reduce the
+ * free memory in the system during the khugepaged scan.
+ */
+static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
+}
+static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long max_ptes_none;
+
+ err = kstrtoul(buf, 10, &max_ptes_none);
+ if (err || max_ptes_none > HPAGE_PMD_NR-1)
+ return -EINVAL;
+
+ khugepaged_max_ptes_none = max_ptes_none;
+
+ return count;
+}
+static struct kobj_attribute khugepaged_max_ptes_none_attr =
+ __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
+ khugepaged_max_ptes_none_store);
+
+static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_max_ptes_swap);
+}
+
+static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long max_ptes_swap;
+
+ err = kstrtoul(buf, 10, &max_ptes_swap);
+ if (err || max_ptes_swap > HPAGE_PMD_NR-1)
+ return -EINVAL;
+
+ khugepaged_max_ptes_swap = max_ptes_swap;
+
+ return count;
+}
+
+static struct kobj_attribute khugepaged_max_ptes_swap_attr =
+ __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
+ khugepaged_max_ptes_swap_store);
+
+static struct attribute *khugepaged_attr[] = {
+ &khugepaged_defrag_attr.attr,
+ &khugepaged_max_ptes_none_attr.attr,
+ &pages_to_scan_attr.attr,
+ &pages_collapsed_attr.attr,
+ &full_scans_attr.attr,
+ &scan_sleep_millisecs_attr.attr,
+ &alloc_sleep_millisecs_attr.attr,
+ &khugepaged_max_ptes_swap_attr.attr,
+ NULL,
+};
+
+struct attribute_group khugepaged_attr_group = {
+ .attrs = khugepaged_attr,
+ .name = "khugepaged",
+};
+
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+
+int hugepage_madvise(struct vm_area_struct *vma,
+ unsigned long *vm_flags, int advice)
+{
+ switch (advice) {
+ case MADV_HUGEPAGE:
+#ifdef CONFIG_S390
+ /*
+ * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+ * can't handle this properly after s390_enable_sie, so we simply
+ * ignore the madvise to prevent qemu from causing a SIGSEGV.
+ */
+ if (mm_has_pgste(vma->vm_mm))
+ return 0;
+#endif
+ *vm_flags &= ~VM_NOHUGEPAGE;
+ *vm_flags |= VM_HUGEPAGE;
+ /*
+ * If the vma become good for khugepaged to scan,
+ * register it here without waiting a page fault that
+ * may not happen any time soon.
+ */
+ if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
+ return -ENOMEM;
+ break;
+ case MADV_NOHUGEPAGE:
+ *vm_flags &= ~VM_HUGEPAGE;
+ *vm_flags |= VM_NOHUGEPAGE;
+ /*
+ * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
+ * this vma even if we leave the mm registered in khugepaged if
+ * it got registered before VM_NOHUGEPAGE was set.
+ */
+ break;
+ }
+
+ return 0;
+}
+
+int __init khugepaged_init(void)
+{
+ mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+ sizeof(struct mm_slot),
+ __alignof__(struct mm_slot), 0, NULL);
+ if (!mm_slot_cache)
+ return -ENOMEM;
+
+ khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
+ khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
+ khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
+ return 0;
+}
+
+void __init khugepaged_destroy(void)
+{
+ kmem_cache_destroy(mm_slot_cache);
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+ if (!mm_slot_cache) /* initialization failed */
+ return NULL;
+ return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+ kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+
+ hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
+ if (mm == mm_slot->mm)
+ return mm_slot;
+
+ return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+ struct mm_slot *mm_slot)
+{
+ mm_slot->mm = mm;
+ hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int wakeup;
+
+ mm_slot = alloc_mm_slot();
+ if (!mm_slot)
+ return -ENOMEM;
+
+ /* __khugepaged_exit() must not run from under us */
+ VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+ if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
+ free_mm_slot(mm_slot);
+ return 0;
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ insert_to_mm_slots_hash(mm, mm_slot);
+ /*
+ * Insert just behind the scanning cursor, to let the area settle
+ * down a little.
+ */
+ wakeup = list_empty(&khugepaged_scan.mm_head);
+ list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+ spin_unlock(&khugepaged_mm_lock);
+
+ atomic_inc(&mm->mm_count);
+ if (wakeup)
+ wake_up_interruptible(&khugepaged_wait);
+
+ return 0;
+}
+
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ unsigned long hstart, hend;
+ if (!vma->anon_vma)
+ /*
+ * Not yet faulted in so we will register later in the
+ * page fault if needed.
+ */
+ return 0;
+ if (vma->vm_ops)
+ /* khugepaged not yet working on file or special mappings */
+ return 0;
+ VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (hstart < hend)
+ return khugepaged_enter(vma, vm_flags);
+ return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int free = 0;
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = get_mm_slot(mm);
+ if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+ hash_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+ free = 1;
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ if (free) {
+ clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ } else if (mm_slot) {
+ /*
+ * This is required to serialize against
+ * khugepaged_test_exit() (which is guaranteed to run
+ * under mmap sem read mode). Stop here (after we
+ * return all pagetables will be destroyed) until
+ * khugepaged has finished working on the pagetables
+ * under the mmap_sem.
+ */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+}
+
+static void release_pte_page(struct page *page)
+{
+ /* 0 stands for page_is_file_cache(page) == false */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ unlock_page(page);
+ putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+ while (--_pte >= pte) {
+ pte_t pteval = *_pte;
+ if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
+ release_pte_page(pte_page(pteval));
+ }
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *pte)
+{
+ struct page *page = NULL;
+ pte_t *_pte;
+ int none_or_zero = 0, result = 0;
+ bool referenced = false, writable = false;
+
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (pte_none(pteval) || (pte_present(pteval) &&
+ is_zero_pfn(pte_pfn(pteval)))) {
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto out;
+ }
+ }
+ if (!pte_present(pteval)) {
+ result = SCAN_PTE_NON_PRESENT;
+ goto out;
+ }
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page)) {
+ result = SCAN_PAGE_NULL;
+ goto out;
+ }
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+
+ /*
+ * We can do it before isolate_lru_page because the
+ * page can't be freed from under us. NOTE: PG_lock
+ * is needed to serialize against split_huge_page
+ * when invoked from the VM.
+ */
+ if (!trylock_page(page)) {
+ result = SCAN_PAGE_LOCK;
+ goto out;
+ }
+
+ /*
+ * cannot use mapcount: can't collapse if there's a gup pin.
+ * The page must only be referenced by the scanned process
+ * and page swap cache.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ unlock_page(page);
+ result = SCAN_PAGE_COUNT;
+ goto out;
+ }
+ if (pte_write(pteval)) {
+ writable = true;
+ } else {
+ if (PageSwapCache(page) && !reuse_swap_page(page)) {
+ unlock_page(page);
+ result = SCAN_SWAP_CACHE_PAGE;
+ goto out;
+ }
+ /*
+ * Page is not in the swap cache. It can be collapsed
+ * into a THP.
+ */
+ }
+
+ /*
+ * Isolate the page to avoid collapsing an hugepage
+ * currently in use by the VM.
+ */
+ if (isolate_lru_page(page)) {
+ unlock_page(page);
+ result = SCAN_DEL_PAGE_LRU;
+ goto out;
+ }
+ /* 0 stands for page_is_file_cache(page) == false */
+ inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ /* If there is no mapped pte young don't collapse the page */
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
+ mmu_notifier_test_young(vma->vm_mm, address))
+ referenced = true;
+ }
+ if (likely(writable)) {
+ if (likely(referenced)) {
+ result = SCAN_SUCCEED;
+ trace_mm_collapse_huge_page_isolate(page, none_or_zero,
+ referenced, writable, result);
+ return 1;
+ }
+ } else {
+ result = SCAN_PAGE_RO;
+ }
+
+out:
+ release_pte_pages(pte, _pte);
+ trace_mm_collapse_huge_page_isolate(page, none_or_zero,
+ referenced, writable, result);
+ return 0;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ spinlock_t *ptl)
+{
+ pte_t *_pte;
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+ pte_t pteval = *_pte;
+ struct page *src_page;
+
+ if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ clear_user_highpage(page, address);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+ if (is_zero_pfn(pte_pfn(pteval))) {
+ /*
+ * ptl mostly unnecessary.
+ */
+ spin_lock(ptl);
+ /*
+ * paravirt calls inside pte_clear here are
+ * superfluous.
+ */
+ pte_clear(vma->vm_mm, address, _pte);
+ spin_unlock(ptl);
+ }
+ } else {
+ src_page = pte_page(pteval);
+ copy_user_highpage(page, src_page, address, vma);
+ VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
+ release_pte_page(src_page);
+ /*
+ * ptl mostly unnecessary, but preempt has to
+ * be disabled to update the per-cpu stats
+ * inside page_remove_rmap().
+ */
+ spin_lock(ptl);
+ /*
+ * paravirt calls inside pte_clear here are
+ * superfluous.
+ */
+ pte_clear(vma->vm_mm, address, _pte);
+ page_remove_rmap(src_page, false);
+ spin_unlock(ptl);
+ free_page_and_swap_cache(src_page);
+ }
+
+ address += PAGE_SIZE;
+ page++;
+ }
+}
+
+static void khugepaged_alloc_sleep(void)
+{
+ DEFINE_WAIT(wait);
+
+ add_wait_queue(&khugepaged_wait, &wait);
+ freezable_schedule_timeout_interruptible(
+ msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+ remove_wait_queue(&khugepaged_wait, &wait);
+}
+
+static int khugepaged_node_load[MAX_NUMNODES];
+
+static bool khugepaged_scan_abort(int nid)
+{
+ int i;
+
+ /*
+ * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * allocate memory locally.
+ */
+ if (!zone_reclaim_mode)
+ return false;
+
+ /* If there is a count for this node already, it must be acceptable */
+ if (khugepaged_node_load[nid])
+ return false;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!khugepaged_node_load[i])
+ continue;
+ if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ return true;
+ }
+ return false;
+}
+
+/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
+static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
+{
+ return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+}
+
+#ifdef CONFIG_NUMA
+static int khugepaged_find_target_node(void)
+{
+ static int last_khugepaged_target_node = NUMA_NO_NODE;
+ int nid, target_node = 0, max_value = 0;
+
+ /* find first node with max normal pages hit */
+ for (nid = 0; nid < MAX_NUMNODES; nid++)
+ if (khugepaged_node_load[nid] > max_value) {
+ max_value = khugepaged_node_load[nid];
+ target_node = nid;
+ }
+
+ /* do some balance if several nodes have the same hit record */
+ if (target_node <= last_khugepaged_target_node)
+ for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
+ nid++)
+ if (max_value == khugepaged_node_load[nid]) {
+ target_node = nid;
+ break;
+ }
+
+ last_khugepaged_target_node = target_node;
+ return target_node;
+}
+
+static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+{
+ if (IS_ERR(*hpage)) {
+ if (!*wait)
+ return false;
+
+ *wait = false;
+ *hpage = NULL;
+ khugepaged_alloc_sleep();
+ } else if (*hpage) {
+ put_page(*hpage);
+ *hpage = NULL;
+ }
+
+ return true;
+}
+
+static struct page *
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
+ unsigned long address, int node)
+{
+ VM_BUG_ON_PAGE(*hpage, *hpage);
+
+ /*
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ up_read(&mm->mmap_sem);
+
+ *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+ if (unlikely(!*hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ *hpage = ERR_PTR(-ENOMEM);
+ return NULL;
+ }
+
+ prep_transhuge_page(*hpage);
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ return *hpage;
+}
+#else
+static int khugepaged_find_target_node(void)
+{
+ return 0;
+}
+
+static inline struct page *alloc_khugepaged_hugepage(void)
+{
+ struct page *page;
+
+ page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
+ HPAGE_PMD_ORDER);
+ if (page)
+ prep_transhuge_page(page);
+ return page;
+}
+
+static struct page *khugepaged_alloc_hugepage(bool *wait)
+{
+ struct page *hpage;
+
+ do {
+ hpage = alloc_khugepaged_hugepage();
+ if (!hpage) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (!*wait)
+ return NULL;
+
+ *wait = false;
+ khugepaged_alloc_sleep();
+ } else
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ } while (unlikely(!hpage) && likely(khugepaged_enabled()));
+
+ return hpage;
+}
+
+static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+{
+ if (!*hpage)
+ *hpage = khugepaged_alloc_hugepage(wait);
+
+ if (unlikely(!*hpage))
+ return false;
+
+ return true;
+}
+
+static struct page *
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
+ unsigned long address, int node)
+{
+ up_read(&mm->mmap_sem);
+ VM_BUG_ON(!*hpage);
+
+ return *hpage;
+}
+#endif
+
+static bool hugepage_vma_check(struct vm_area_struct *vma)
+{
+ if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
+ (vma->vm_flags & VM_NOHUGEPAGE))
+ return false;
+ if (!vma->anon_vma || vma->vm_ops)
+ return false;
+ if (is_vma_temporary_stack(vma))
+ return false;
+ VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
+ return true;
+}
+
+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */
+
+static void __collapse_huge_page_swapin(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ pte_t pteval;
+ int swapped_in = 0, ret = 0;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
+ if (!is_swap_pte(pteval))
+ continue;
+ swapped_in++;
+ ret = do_swap_page(&fe, pteval);
+ if (ret & VM_FAULT_ERROR) {
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
+ return;
+ }
+ /* pte is unmapped now, we need to map it */
+ fe.pte = pte_offset_map(pmd, fe.address);
+ }
+ fe.pte--;
+ pte_unmap(fe.pte);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+ unsigned long address,
+ struct page **hpage,
+ struct vm_area_struct *vma,
+ int node)
+{
+ pmd_t *pmd, _pmd;
+ pte_t *pte;
+ pgtable_t pgtable;
+ struct page *new_page;
+ spinlock_t *pmd_ptl, *pte_ptl;
+ int isolated = 0, result = 0;
+ unsigned long hstart, hend;
+ struct mem_cgroup *memcg;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+ gfp_t gfp;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ /* Only allocate from the target node */
+ gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
+
+ /* release the mmap_sem read lock. */
+ new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
+ if (!new_page) {
+ result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+ goto out_nolock;
+ }
+
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ result = SCAN_CGROUP_CHARGE_FAIL;
+ goto out_nolock;
+ }
+
+ /*
+ * Prevent all access to pagetables with the exception of
+ * gup_fast later hanlded by the ptep_clear_flush and the VM
+ * handled by the anon_vma lock + PG_lock.
+ */
+ down_write(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm))) {
+ result = SCAN_ANY_PROCESS;
+ goto out;
+ }
+
+ vma = find_vma(mm, address);
+ if (!vma) {
+ result = SCAN_VMA_NULL;
+ goto out;
+ }
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (address < hstart || address + HPAGE_PMD_SIZE > hend) {
+ result = SCAN_ADDRESS_RANGE;
+ goto out;
+ }
+ if (!hugepage_vma_check(vma)) {
+ result = SCAN_VMA_CHECK;
+ goto out;
+ }
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
+ __collapse_huge_page_swapin(mm, vma, address, pmd);
+
+ anon_vma_lock_write(vma->anon_vma);
+
+ pte = pte_offset_map(pmd, address);
+ pte_ptl = pte_lockptr(mm, pmd);
+
+ mmun_start = address;
+ mmun_end = address + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+ /*
+ * After this gup_fast can't run anymore. This also removes
+ * any huge TLB entry from the CPU so we won't allow
+ * huge and small TLB entries for the same virtual address
+ * to avoid the risk of CPU bugs in that area.
+ */
+ _pmd = pmdp_collapse_flush(vma, address, pmd);
+ spin_unlock(pmd_ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+ spin_lock(pte_ptl);
+ isolated = __collapse_huge_page_isolate(vma, address, pte);
+ spin_unlock(pte_ptl);
+
+ if (unlikely(!isolated)) {
+ pte_unmap(pte);
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ /*
+ * We can only use set_pmd_at when establishing
+ * hugepmds and never for establishing regular pmds that
+ * points to regular pagetables. Use pmd_populate for that
+ */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ anon_vma_unlock_write(vma->anon_vma);
+ result = SCAN_FAIL;
+ goto out;
+ }
+
+ /*
+ * All pages are isolated and locked so anon_vma rmap
+ * can't run anymore.
+ */
+ anon_vma_unlock_write(vma->anon_vma);
+
+ __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
+ pte_unmap(pte);
+ __SetPageUptodate(new_page);
+ pgtable = pmd_pgtable(_pmd);
+
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the copy_huge_page writes to become
+ * visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ page_add_new_anon_rmap(new_page, vma, address, true);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
+ lru_cache_add_active_or_unevictable(new_page, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ spin_unlock(pmd_ptl);
+
+ *hpage = NULL;
+
+ khugepaged_pages_collapsed++;
+ result = SCAN_SUCCEED;
+out_up_write:
+ up_write(&mm->mmap_sem);
+out_nolock:
+ trace_mm_collapse_huge_page(mm, isolated, result);
+ return;
+out:
+ mem_cgroup_cancel_charge(new_page, memcg, true);
+ goto out_up_write;
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ struct page **hpage)
+{
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ int ret = 0, none_or_zero = 0, result = 0;
+ struct page *page = NULL;
+ unsigned long _address;
+ spinlock_t *ptl;
+ int node = NUMA_NO_NODE, unmapped = 0;
+ bool writable = false, referenced = false;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
+ memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, _address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (is_swap_pte(pteval)) {
+ if (++unmapped <= khugepaged_max_ptes_swap) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out_unmap;
+ }
+ }
+ if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto out_unmap;
+ }
+ }
+ if (!pte_present(pteval)) {
+ result = SCAN_PTE_NON_PRESENT;
+ goto out_unmap;
+ }
+ if (pte_write(pteval))
+ writable = true;
+
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page)) {
+ result = SCAN_PAGE_NULL;
+ goto out_unmap;
+ }
+
+ /* TODO: teach khugepaged to collapse THP mapped with pte */
+ if (PageCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ goto out_unmap;
+ }
+
+ /*
+ * Record which node the original page is from and save this
+ * information to khugepaged_node_load[].
+ * Khupaged will allocate hugepage from the node has the max
+ * hit record.
+ */
+ node = page_to_nid(page);
+ if (khugepaged_scan_abort(node)) {
+ result = SCAN_SCAN_ABORT;
+ goto out_unmap;
+ }
+ khugepaged_node_load[node]++;
+ if (!PageLRU(page)) {
+ result = SCAN_PAGE_LRU;
+ goto out_unmap;
+ }
+ if (PageLocked(page)) {
+ result = SCAN_PAGE_LOCK;
+ goto out_unmap;
+ }
+ if (!PageAnon(page)) {
+ result = SCAN_PAGE_ANON;
+ goto out_unmap;
+ }
+
+ /*
+ * cannot use mapcount: can't collapse if there's a gup pin.
+ * The page must only be referenced by the scanned process
+ * and page swap cache.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ result = SCAN_PAGE_COUNT;
+ goto out_unmap;
+ }
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
+ mmu_notifier_test_young(vma->vm_mm, address))
+ referenced = true;
+ }
+ if (writable) {
+ if (referenced) {
+ result = SCAN_SUCCEED;
+ ret = 1;
+ } else {
+ result = SCAN_NO_REFERENCED_PAGE;
+ }
+ } else {
+ result = SCAN_PAGE_RO;
+ }
+out_unmap:
+ pte_unmap_unlock(pte, ptl);
+ if (ret) {
+ node = khugepaged_find_target_node();
+ /* collapse_huge_page will return with the mmap_sem released */
+ collapse_huge_page(mm, address, hpage, vma, node);
+ }
+out:
+ trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
+ none_or_zero, result, unmapped);
+ return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+ struct mm_struct *mm = mm_slot->mm;
+
+ VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_test_exit(mm)) {
+ /* free mm_slot */
+ hash_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+
+ /*
+ * Not strictly needed because the mm exited already.
+ *
+ * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ */
+
+ /* khugepaged_mm_lock actually not necessary for the below */
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ }
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+ struct page **hpage)
+ __releases(&khugepaged_mm_lock)
+ __acquires(&khugepaged_mm_lock)
+{
+ struct mm_slot *mm_slot;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int progress = 0;
+
+ VM_BUG_ON(!pages);
+ VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_scan.mm_slot)
+ mm_slot = khugepaged_scan.mm_slot;
+ else {
+ mm_slot = list_entry(khugepaged_scan.mm_head.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ khugepaged_scan.mm_slot = mm_slot;
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ mm = mm_slot->mm;
+ down_read(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm)))
+ vma = NULL;
+ else
+ vma = find_vma(mm, khugepaged_scan.address);
+
+ progress++;
+ for (; vma; vma = vma->vm_next) {
+ unsigned long hstart, hend;
+
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm))) {
+ progress++;
+ break;
+ }
+ if (!hugepage_vma_check(vma)) {
+skip:
+ progress++;
+ continue;
+ }
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (hstart >= hend)
+ goto skip;
+ if (khugepaged_scan.address > hend)
+ goto skip;
+ if (khugepaged_scan.address < hstart)
+ khugepaged_scan.address = hstart;
+ VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+ while (khugepaged_scan.address < hend) {
+ int ret;
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm)))
+ goto breakouterloop;
+
+ VM_BUG_ON(khugepaged_scan.address < hstart ||
+ khugepaged_scan.address + HPAGE_PMD_SIZE >
+ hend);
+ ret = khugepaged_scan_pmd(mm, vma,
+ khugepaged_scan.address,
+ hpage);
+ /* move to next address */
+ khugepaged_scan.address += HPAGE_PMD_SIZE;
+ progress += HPAGE_PMD_NR;
+ if (ret)
+ /* we released mmap_sem so break loop */
+ goto breakouterloop_mmap_sem;
+ if (progress >= pages)
+ goto breakouterloop;
+ }
+ }
+breakouterloop:
+ up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+ spin_lock(&khugepaged_mm_lock);
+ VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+ /*
+ * Release the current mm_slot if this mm is about to die, or
+ * if we scanned all vmas of this mm.
+ */
+ if (khugepaged_test_exit(mm) || !vma) {
+ /*
+ * Make sure that if mm_users is reaching zero while
+ * khugepaged runs here, khugepaged_exit will find
+ * mm_slot not pointing to the exiting mm.
+ */
+ if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+ khugepaged_scan.mm_slot = list_entry(
+ mm_slot->mm_node.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ } else {
+ khugepaged_scan.mm_slot = NULL;
+ khugepaged_full_scans++;
+ }
+
+ collect_mm_slot(mm_slot);
+ }
+
+ return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) &&
+ khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) ||
+ kthread_should_stop();
+}
+
+static void khugepaged_do_scan(void)
+{
+ struct page *hpage = NULL;
+ unsigned int progress = 0, pass_through_head = 0;
+ unsigned int pages = khugepaged_pages_to_scan;
+ bool wait = true;
+
+ barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+ while (progress < pages) {
+ if (!khugepaged_prealloc_page(&hpage, &wait))
+ break;
+
+ cond_resched();
+
+ if (unlikely(kthread_should_stop() || try_to_freeze()))
+ break;
+
+ spin_lock(&khugepaged_mm_lock);
+ if (!khugepaged_scan.mm_slot)
+ pass_through_head++;
+ if (khugepaged_has_work() &&
+ pass_through_head < 2)
+ progress += khugepaged_scan_mm_slot(pages - progress,
+ &hpage);
+ else
+ progress = pages;
+ spin_unlock(&khugepaged_mm_lock);
+ }
+
+ if (!IS_ERR_OR_NULL(hpage))
+ put_page(hpage);
+}
+
+static void khugepaged_wait_work(void)
+{
+ if (khugepaged_has_work()) {
+ if (!khugepaged_scan_sleep_millisecs)
+ return;
+
+ wait_event_freezable_timeout(khugepaged_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
+ return;
+ }
+
+ if (khugepaged_enabled())
+ wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
+}
+
+static int khugepaged(void *none)
+{
+ struct mm_slot *mm_slot;
+
+ set_freezable();
+ set_user_nice(current, MAX_NICE);
+
+ while (!kthread_should_stop()) {
+ khugepaged_do_scan();
+ khugepaged_wait_work();
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = khugepaged_scan.mm_slot;
+ khugepaged_scan.mm_slot = NULL;
+ if (mm_slot)
+ collect_mm_slot(mm_slot);
+ spin_unlock(&khugepaged_mm_lock);
+ return 0;
+}
+
+static void set_recommended_min_free_kbytes(void)
+{
+ struct zone *zone;
+ int nr_zones = 0;
+ unsigned long recommended_min;
+
+ for_each_populated_zone(zone)
+ nr_zones++;
+
+ /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
+ recommended_min = pageblock_nr_pages * nr_zones * 2;
+
+ /*
+ * Make sure that on average at least two pageblocks are almost free
+ * of another type, one for a migratetype to fall back to and a
+ * second to avoid subsequent fallbacks of other types There are 3
+ * MIGRATE_TYPES we care about.
+ */
+ recommended_min += pageblock_nr_pages * nr_zones *
+ MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
+
+ /* don't ever allow to reserve more than 5% of the lowmem */
+ recommended_min = min(recommended_min,
+ (unsigned long) nr_free_buffer_pages() / 20);
+ recommended_min <<= (PAGE_SHIFT-10);
+
+ if (recommended_min > min_free_kbytes) {
+ if (user_min_free_kbytes >= 0)
+ pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
+ min_free_kbytes, recommended_min);
+
+ min_free_kbytes = recommended_min;
+ }
+ setup_per_zone_wmarks();
+}
+
+int start_stop_khugepaged(void)
+{
+ int err = 0;
+ if (khugepaged_enabled()) {
+ if (!khugepaged_thread)
+ khugepaged_thread = kthread_run(khugepaged, NULL,
+ "khugepaged");
+ if (IS_ERR(khugepaged_thread)) {
+ pr_err("khugepaged: kthread_run(khugepaged) failed\n");
+ err = PTR_ERR(khugepaged_thread);
+ khugepaged_thread = NULL;
+ goto fail;
+ }
+
+ if (!list_empty(&khugepaged_scan.mm_head))
+ wake_up_interruptible(&khugepaged_wait);
+
+ set_recommended_min_free_kbytes();
+ } else if (khugepaged_thread) {
+ kthread_stop(khugepaged_thread);
+ khugepaged_thread = NULL;
+ }
+fail:
+ return err;
+}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 28/32] khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (26 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 27/32] thp: extract khugepaged from mm/huge_memory.c Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 29/32] shmem: make shmem_inode_info::lock irq-safe Kirill A. Shutemov
` (4 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem)
first: no point keep it inside the function.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/khugepaged.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ef229cdfe5b7..c4693fd12f76 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -736,19 +736,10 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
}
static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
VM_BUG_ON_PAGE(*hpage, *hpage);
- /*
- * Before allocating the hugepage, release the mmap_sem read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_sem during
- * that. We will recheck the vma after taking it again in write mode.
- */
- up_read(&mm->mmap_sem);
-
*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -809,10 +800,8 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
}
static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
- up_read(&mm->mmap_sem);
VM_BUG_ON(!*hpage);
return *hpage;
@@ -896,8 +885,14 @@ static void collapse_huge_page(struct mm_struct *mm,
/* Only allocate from the target node */
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
- /* release the mmap_sem read lock. */
- new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
+ /*
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ up_read(&mm->mmap_sem);
+ new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page) {
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
goto out_nolock;
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 29/32] shmem: make shmem_inode_info::lock irq-safe
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (27 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 28/32] khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 30/32] khugepaged: add support of collapse for tmpfs/shmem pages Kirill A. Shutemov
` (3 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
We are going to need to call shmem_charge() under tree_lock to get
accoutning right on collapse of small tmpfs pages into a huge one.
The problem is that tree_lock is irq-safe and lockdep is not happy, that
we take irq-unsafe lock under irq-safe[1].
Let's convert the lock to irq-safe.
[1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc
---
ipc/shm.c | 4 ++--
mm/shmem.c | 50 ++++++++++++++++++++++++++------------------------
2 files changed, 28 insertions(+), 26 deletions(-)
diff --git a/ipc/shm.c b/ipc/shm.c
index b1ed484eeda1..6eb6f0416b70 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -766,10 +766,10 @@ static void shm_add_rss_swap(struct shmid_kernel *shp,
} else {
#ifdef CONFIG_SHMEM
struct shmem_inode_info *info = SHMEM_I(inode);
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
*rss_add += inode->i_mapping->nrpages;
*swp_add += info->swapped;
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
#else
*rss_add += inode->i_mapping->nrpages;
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 872d658d37c0..17339c12b48f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -258,14 +258,15 @@ bool shmem_charge(struct inode *inode, long pages)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ unsigned long flags;
if (shmem_acct_block(info->flags, pages))
return false;
- spin_lock(&info->lock);
+ spin_lock_irqsave(&info->lock, flags);
info->alloced += pages;
inode->i_blocks += pages * BLOCKS_PER_PAGE;
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irqrestore(&info->lock, flags);
inode->i_mapping->nrpages += pages;
if (!sbinfo->max_blocks)
@@ -273,10 +274,10 @@ bool shmem_charge(struct inode *inode, long pages)
if (percpu_counter_compare(&sbinfo->used_blocks,
sbinfo->max_blocks - pages) > 0) {
inode->i_mapping->nrpages -= pages;
- spin_lock(&info->lock);
+ spin_lock_irqsave(&info->lock, flags);
info->alloced -= pages;
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irqrestore(&info->lock, flags);
return false;
}
@@ -288,12 +289,13 @@ void shmem_uncharge(struct inode *inode, long pages)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ unsigned long flags;
- spin_lock(&info->lock);
+ spin_lock_irqsave(&info->lock, flags);
info->alloced -= pages;
inode->i_blocks -= pages * BLOCKS_PER_PAGE;
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irqrestore(&info->lock, flags);
if (sbinfo->max_blocks)
percpu_counter_sub(&sbinfo->used_blocks, pages);
@@ -818,10 +820,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index++;
}
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
info->swapped -= nr_swaps_freed;
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
}
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
@@ -838,9 +840,9 @@ static int shmem_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct shmem_inode_info *info = SHMEM_I(inode);
if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
}
generic_fillattr(inode, stat);
return 0;
@@ -984,9 +986,9 @@ static int shmem_unuse_inode(struct shmem_inode_info *info,
delete_from_swap_cache(*pagep);
set_page_dirty(*pagep);
if (!error) {
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
info->swapped--;
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
swap_free(swap);
}
}
@@ -1134,10 +1136,10 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add_tail(&info->swaplist, &shmem_swaplist);
if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
info->swapped++;
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
swap_shmem_alloc(swap);
shmem_delete_from_page_cache(page, swp_to_radix_entry(swap));
@@ -1523,10 +1525,10 @@ repeat:
mem_cgroup_commit_charge(page, memcg, true, false);
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
info->swapped--;
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
if (sgp == SGP_WRITE)
mark_page_accessed(page);
@@ -1603,11 +1605,11 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
PageTransHuge(page));
lru_cache_add_anon(page);
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
info->alloced += 1 << compound_order(page);
inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
alloced = true;
/*
@@ -1639,9 +1641,9 @@ clear:
if (alloced) {
ClearPageDirty(page);
delete_from_page_cache(page);
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
}
error = -EINVAL;
goto unlock;
@@ -1673,9 +1675,9 @@ unlock:
}
if (error == -ENOSPC && !once++) {
info = SHMEM_I(inode);
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
goto repeat;
}
if (error == -EEXIST) /* from above or from radix_tree_insert */
@@ -1874,7 +1876,7 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
struct shmem_inode_info *info = SHMEM_I(inode);
int retval = -ENOMEM;
- spin_lock(&info->lock);
+ spin_lock_irq(&info->lock);
if (lock && !(info->flags & VM_LOCKED)) {
if (!user_shm_lock(inode->i_size, user))
goto out_nomem;
@@ -1889,7 +1891,7 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
retval = 0;
out_nomem:
- spin_unlock(&info->lock);
+ spin_unlock_irq(&info->lock);
return retval;
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 30/32] khugepaged: add support of collapse for tmpfs/shmem pages
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (28 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 29/32] shmem: make shmem_inode_info::lock irq-safe Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 31/32] thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE Kirill A. Shutemov
` (2 subsequent siblings)
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
This patch extends khugepaged to support collapse of tmpfs/shmem pages.
We share fair amount of infrastructure with anon-THP collapse.
Few design points:
- First we are looking for VMA which can be suitable for mapping huge
page;
- If the VMA maps shmem file, the rest scan/collapse operations
operates on page cache, not on page tables as in anon VMA case.
- khugepaged_scan_shmem() finds a range which is suitable for huge
page. The scan is lockless and shouldn't disturb system too much.
- once the candidate for collapse is found, collapse_shmem() attempts
to create a huge page:
+ scan over radix tree, making the range point to new huge page;
+ new huge page is not-uptodate, locked and freezed (refcount
is 0), so nobody can touch them until we say so.
+ we swap in pages during the scan. khugepaged_scan_shmem()
filters out ranges with more than khugepaged_max_ptes_swap
swapped out pages. It's HPAGE_PMD_NR/8 by default.
+ old pages are isolated, unmapped and put to local list in case
to be restored back if collapse failed.
- if collapse succeed, we retract pte page tables from VMAs where huge
pages mapping is possible. The huge page will be mapped as PMD on
next minor fault into the range.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/shmem_fs.h | 23 ++
include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 435 ++++++++++++++++++++++++++++++++++++-
mm/shmem.c | 56 ++++-
4 files changed, 500 insertions(+), 17 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 94eaaa2c6ad9..0890f700a546 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -54,6 +54,7 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
+extern bool shmem_huge_enabled(struct vm_area_struct *vma);
extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
@@ -64,6 +65,19 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);
+/* Flag allocation requirements to shmem_getpage */
+enum sgp_type {
+ SGP_READ, /* don't exceed i_size, don't allocate page */
+ SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
+ SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
+ SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
+};
+
+extern int shmem_getpage(struct inode *inode, pgoff_t index,
+ struct page **pagep, enum sgp_type sgp);
+
static inline struct page *shmem_read_mapping_page(
struct address_space *mapping, pgoff_t index)
{
@@ -71,6 +85,15 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
}
+static inline bool shmem_file(struct file *file)
+{
+ if (!IS_ENABLED(CONFIG_SHMEM))
+ return false;
+ if (!file || !file->f_mapping)
+ return false;
+ return shmem_mapping(file->f_mapping);
+}
+
extern bool shmem_charge(struct inode *inode, long pages);
extern void shmem_uncharge(struct inode *inode, long pages);
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index bda21183eb05..830d47d5ca41 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -29,7 +29,8 @@
EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\
EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
- EMe( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte")
+ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
+ EMe(SCAN_TRUNCATED, "truncated") \
#undef EM
#undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c4693fd12f76..f8c971468d90 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -14,6 +14,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -42,7 +43,8 @@ enum scan_result {
SCAN_DEL_PAGE_LRU,
SCAN_ALLOC_HUGE_PAGE_FAIL,
SCAN_CGROUP_CHARGE_FAIL,
- SCAN_EXCEED_SWAP_PTE
+ SCAN_EXCEED_SWAP_PTE,
+ SCAN_TRUNCATED,
};
#define CREATE_TRACE_POINTS
@@ -292,7 +294,7 @@ struct attribute_group khugepaged_attr_group = {
.name = "khugepaged",
};
-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
@@ -813,6 +815,10 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
+ if (shmem_file(vma->vm_file)) {
+ return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+ HPAGE_PMD_NR);
+ }
if (!vma->anon_vma || vma->vm_ops)
return false;
if (is_vma_temporary_stack(vma))
@@ -1146,6 +1152,412 @@ out:
return ret;
}
+#ifdef CONFIG_SHMEM
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
+{
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ pmd_t *pmd, _pmd;
+
+ i_mmap_lock_write(mapping);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ /* probably overkill */
+ if (vma->anon_vma)
+ continue;
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ if (addr & ~HPAGE_PMD_MASK)
+ continue;
+ if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+ continue;
+ pmd = mm_find_pmd(vma->vm_mm, addr);
+ if (!pmd)
+ continue;
+ /*
+ * We need exclusive mmap_sem to retract page table.
+ * If trylock fails we would end up with pte-mapped THP after
+ * re-fault. Not ideal, but it's more important to not disturb
+ * the system too much.
+ */
+ if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
+ spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
+ /* assume page table is clear */
+ _pmd = pmdp_collapse_flush(vma, addr, pmd);
+ spin_unlock(ptl);
+ up_write(&vma->vm_mm->mmap_sem);
+ atomic_long_dec(&vma->vm_mm->nr_ptes);
+ pte_free(vma->vm_mm, pmd_pgtable(_pmd));
+ }
+ }
+ i_mmap_unlock_write(mapping);
+}
+
+/**
+ * collapse_shmem - collapse small tmpfs/shmem pages into huge one.
+ *
+ * Basic scheme is simple, details are more complex:
+ * - allocate and freeze a new huge page;
+ * - scan over radix tree replacing old pages the new one
+ * + swap in pages if necessary;
+ * + fill in gaps;
+ * + keep old pages around in case if rollback is required;
+ * - if replacing succeed:
+ * + copy data over;
+ * + free old pages;
+ * + unfreeze huge page;
+ * - if replacing failed;
+ * + put all pages back and unfreeze them;
+ * + restore gaps in the radix-tree;
+ * + free huge page;
+ */
+static void collapse_shmem(struct mm_struct *mm,
+ struct address_space *mapping, pgoff_t start,
+ struct page **hpage, int node)
+{
+ gfp_t gfp;
+ struct page *page, *new_page, *tmp;
+ struct mem_cgroup *memcg;
+ pgoff_t index, end = start + HPAGE_PMD_NR;
+ LIST_HEAD(pagelist);
+ struct radix_tree_iter iter;
+ void **slot;
+ int nr_none = 0, result = SCAN_SUCCEED;
+
+ VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
+
+ /* Only allocate from the target node */
+ gfp = alloc_hugepage_khugepaged_gfpmask() |
+ __GFP_OTHER_NODE | __GFP_THISNODE;
+
+ new_page = khugepaged_alloc_page(hpage, gfp, node);
+ if (!new_page) {
+ result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+ goto out;
+ }
+
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ result = SCAN_CGROUP_CHARGE_FAIL;
+ goto out;
+ }
+
+ new_page->index = start;
+ new_page->mapping = mapping;
+ __SetPageSwapBacked(new_page);
+ __SetPageLocked(new_page);
+ BUG_ON(!page_ref_freeze(new_page, 1));
+
+
+ /*
+ * At this point the new_page is 'frozen' (page_count() is zero), locked
+ * and not up-to-date. It's safe to insert it into radix tree, because
+ * nobody would be able to map it or use it in other way until we
+ * unfreeze it.
+ */
+
+ index = start;
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ int n = min(iter.index, end) - index;
+
+ /*
+ * Handle holes in the radix tree: charge it from shmem and
+ * insert relevant subpage of new_page into the radix-tree.
+ */
+ if (n && !shmem_charge(mapping->host, n)) {
+ result = SCAN_FAIL;
+ break;
+ }
+ nr_none += n;
+ for (; index < min(iter.index, end); index++) {
+ radix_tree_insert(&mapping->page_tree, index,
+ new_page + (index % HPAGE_PMD_NR));
+ }
+
+ /* We are done. */
+ if (index >= end)
+ break;
+
+ page = radix_tree_deref_slot_protected(slot,
+ &mapping->tree_lock);
+ if (radix_tree_exceptional_entry(page)) {
+ spin_unlock_irq(&mapping->tree_lock);
+ /* swap in the page */
+ if (shmem_getpage(mapping->host, index, &page,
+ SGP_NOHUGE)) {
+ result = SCAN_FAIL;
+ goto tree_unlocked;
+ }
+ spin_lock_irq(&mapping->tree_lock);
+ } else if (trylock_page(page)) {
+ get_page(page);
+ } else {
+ result = SCAN_PAGE_LOCK;
+ break;
+ }
+
+ /*
+ * The page must be locked, so we can drop the tree_lock
+ * without racing with truncate.
+ */
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageUptodate(page), page);
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (page_mapping(page) != mapping) {
+ result = SCAN_TRUNCATED;
+ goto out_unlock;
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+
+ if (isolate_lru_page(page)) {
+ result = SCAN_DEL_PAGE_LRU;
+ goto out_isolate_failed;
+ }
+
+ if (page_mapped(page))
+ unmap_mapping_range(mapping, index << PAGE_SHIFT,
+ PAGE_SIZE, 0);
+
+ spin_lock_irq(&mapping->tree_lock);
+
+ VM_BUG_ON_PAGE(page_mapped(page), page);
+
+ /*
+ * The page is expected to have page_count() == 3:
+ * - we hold a pin on it;
+ * - one reference from radix tree;
+ * - one from isolate_lru_page;
+ */
+ if (!page_ref_freeze(page, 3)) {
+ result = SCAN_PAGE_COUNT;
+ goto out_lru;
+ }
+
+ /*
+ * Add the page to the list to be able to undo the collapse if
+ * something go wrong.
+ */
+ list_add_tail(&page->lru, &pagelist);
+
+ /* Finally, replace with the new page. */
+ radix_tree_replace_slot(slot,
+ new_page + (index % HPAGE_PMD_NR));
+
+ index++;
+ continue;
+out_lru:
+ spin_unlock_irq(&mapping->tree_lock);
+ putback_lru_page(page);
+out_isolate_failed:
+ unlock_page(page);
+ put_page(page);
+ goto tree_unlocked;
+out_unlock:
+ unlock_page(page);
+ put_page(page);
+ break;
+ }
+
+ /*
+ * Handle hole in radix tree at the end of the range.
+ * This code only triggers if there's nothing in radix tree
+ * beyond 'end'.
+ */
+ if (result == SCAN_SUCCEED && index < end) {
+ int n = end - index;
+
+ if (!shmem_charge(mapping->host, n)) {
+ result = SCAN_FAIL;
+ goto tree_locked;
+ }
+
+ for (; index < end; index++) {
+ radix_tree_insert(&mapping->page_tree, index,
+ new_page + (index % HPAGE_PMD_NR));
+ }
+ nr_none += n;
+ }
+
+tree_locked:
+ spin_unlock_irq(&mapping->tree_lock);
+tree_unlocked:
+
+ if (result == SCAN_SUCCEED) {
+ unsigned long flags;
+ struct zone *zone = page_zone(new_page);
+
+ /*
+ * Replacing old pages with new one has succeed, now we need to
+ * copy the content and free old pages.
+ */
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
+ page);
+ list_del(&page->lru);
+ unlock_page(page);
+ page_ref_unfreeze(page, 1);
+ page->mapping = NULL;
+ ClearPageActive(page);
+ ClearPageUnevictable(page);
+ put_page(page);
+ }
+
+ local_irq_save(flags);
+ __inc_zone_page_state(new_page, NR_SHMEM_THPS);
+ if (nr_none) {
+ __mod_zone_page_state(zone, NR_FILE_PAGES, nr_none);
+ __mod_zone_page_state(zone, NR_SHMEM, nr_none);
+ }
+ local_irq_restore(flags);
+
+ /*
+ * Remove pte page tables, so we can re-faulti
+ * the page as huge.
+ */
+ retract_page_tables(mapping, start);
+
+ /* Everything is ready, let's unfreeze the new_page */
+ set_page_dirty(new_page);
+ SetPageUptodate(new_page);
+ page_ref_unfreeze(new_page, HPAGE_PMD_NR);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
+ lru_cache_add_anon(new_page);
+ unlock_page(new_page);
+
+ *hpage = NULL;
+ } else {
+ /* Something went wrong: rollback changes to the radix-tree */
+ shmem_uncharge(mapping->host, nr_none);
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start) {
+ if (iter.index >= end)
+ break;
+ page = list_first_entry_or_null(&pagelist,
+ struct page, lru);
+ if (!page || iter.index < page->index) {
+ if (!nr_none)
+ break;
+ /* Put holes back where they were */
+ radix_tree_replace_slot(slot, NULL);
+ nr_none--;
+ continue;
+ }
+
+ VM_BUG_ON_PAGE(page->index != iter.index, page);
+
+ /* Unfreeze the page. */
+ list_del(&page->lru);
+ page_ref_unfreeze(page, 2);
+ radix_tree_replace_slot(slot, page);
+ spin_unlock_irq(&mapping->tree_lock);
+ putback_lru_page(page);
+ unlock_page(page);
+ spin_lock_irq(&mapping->tree_lock);
+ }
+ VM_BUG_ON(nr_none);
+ spin_unlock_irq(&mapping->tree_lock);
+
+ /* Unfreeze new_page, caller would take care about freeing it */
+ page_ref_unfreeze(new_page, 1);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
+ unlock_page(new_page);
+ new_page->mapping = NULL;
+ }
+out:
+ VM_BUG_ON(!list_empty(&pagelist));
+ /* TODO: tracepoints */
+}
+
+static void khugepaged_scan_shmem(struct mm_struct *mm,
+ struct address_space *mapping,
+ pgoff_t start, struct page **hpage)
+{
+ struct page *page = NULL;
+ struct radix_tree_iter iter;
+ void **slot;
+ int present, swap;
+ int node = NUMA_NO_NODE;
+ int result = SCAN_SUCCEED;
+
+ present = 0;
+ swap = 0;
+ memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+ rcu_read_lock();
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ if (iter.index >= start + HPAGE_PMD_NR)
+ break;
+
+ page = radix_tree_deref_slot(slot);
+ if (radix_tree_deref_retry(page)) {
+ slot = radix_tree_iter_retry(&iter);
+ continue;
+ }
+
+ if (radix_tree_exception(page)) {
+ if (++swap > khugepaged_max_ptes_swap) {
+ result = SCAN_EXCEED_SWAP_PTE;
+ break;
+ }
+ continue;
+ }
+
+ if (PageTransCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ break;
+ }
+
+ node = page_to_nid(page);
+ if (khugepaged_scan_abort(node)) {
+ result = SCAN_SCAN_ABORT;
+ break;
+ }
+ khugepaged_node_load[node]++;
+
+ if (!PageLRU(page)) {
+ result = SCAN_PAGE_LRU;
+ break;
+ }
+
+ if (page_count(page) != 1 + page_mapcount(page)) {
+ result = SCAN_PAGE_COUNT;
+ break;
+ }
+
+ /*
+ * We probably should check if the page is referenced here, but
+ * nobody would transfer pte_young() to PageReferenced() for us.
+ * And rmap walk here is just too costly...
+ */
+
+ present++;
+
+ if (need_resched()) {
+ cond_resched_rcu();
+ slot = radix_tree_iter_next(&iter);
+ }
+ }
+ rcu_read_unlock();
+
+ if (result == SCAN_SUCCEED) {
+ if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ result = SCAN_EXCEED_NONE_PTE;
+ } else {
+ node = khugepaged_find_target_node();
+ collapse_shmem(mm, mapping, start, hpage, node);
+ }
+ }
+
+ /* TODO: tracepoints */
+}
+#else
+static void khugepaged_scan_shmem(struct mm_struct *mm,
+ struct address_space *mapping,
+ pgoff_t start, struct page **hpage)
+{
+ BUILD_BUG();
+}
+#endif
+
static void collect_mm_slot(struct mm_slot *mm_slot)
{
struct mm_struct *mm = mm_slot->mm;
@@ -1222,6 +1634,8 @@ skip:
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+ if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
+ goto skip;
while (khugepaged_scan.address < hend) {
int ret;
@@ -1232,9 +1646,20 @@ skip:
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- ret = khugepaged_scan_pmd(mm, vma,
- khugepaged_scan.address,
- hpage);
+ if (shmem_file(vma->vm_file)) {
+ struct file *file = get_file(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma,
+ khugepaged_scan.address);
+ up_read(&mm->mmap_sem);
+ ret = 1;
+ khugepaged_scan_shmem(mm, file->f_mapping,
+ pgoff, hpage);
+ fput(file);
+ } else {
+ ret = khugepaged_scan_pmd(mm, vma,
+ khugepaged_scan.address,
+ hpage);
+ }
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
diff --git a/mm/shmem.c b/mm/shmem.c
index 17339c12b48f..1f942c1b4e61 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -32,6 +32,7 @@
#include <linux/export.h>
#include <linux/swap.h>
#include <linux/uio.h>
+#include <linux/khugepaged.h>
static struct vfsmount *shm_mnt;
@@ -97,16 +98,6 @@ struct shmem_falloc {
pgoff_t nr_unswapped; /* how often writepage refused to swap out */
};
-/* Flag allocation requirements to shmem_getpage */
-enum sgp_type {
- SGP_READ, /* don't exceed i_size, don't allocate page */
- SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
- SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
- SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
- SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
-};
-
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -126,7 +117,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);
-static inline int shmem_getpage(struct inode *inode, pgoff_t index,
+int shmem_getpage(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp)
{
return shmem_getpage_gfp(inode, index, pagep, sgp,
@@ -1899,6 +1890,11 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
file_accessed(file);
vma->vm_ops = &shmem_vm_ops;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
+ (vma->vm_end & HPAGE_PMD_MASK)) {
+ khugepaged_enter(vma, vma->vm_flags);
+ }
return 0;
}
@@ -3799,6 +3795,37 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
struct kobj_attribute shmem_enabled_attr =
__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ loff_t i_size;
+ pgoff_t off;
+
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return true;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return false;
+ switch (sbinfo->huge) {
+ case SHMEM_HUGE_NEVER:
+ return false;
+ case SHMEM_HUGE_ALWAYS:
+ return true;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >= HPAGE_PMD_SIZE &&
+ i_size >> PAGE_SHIFT >= off)
+ return true;
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: implement fadvise() hints */
+ return (vma->vm_flags & VM_HUGEPAGE);
+ default:
+ VM_BUG_ON(1);
+ return false;
+ }
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
#else /* !CONFIG_SHMEM */
@@ -3978,6 +4005,13 @@ int shmem_zero_setup(struct vm_area_struct *vma)
fput(vma->vm_file);
vma->vm_file = file;
vma->vm_ops = &shmem_vm_ops;
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
+ (vma->vm_end & HPAGE_PMD_MASK)) {
+ khugepaged_enter(vma, vma->vm_flags);
+ }
+
return 0;
}
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 31/32] thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (29 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 30/32] khugepaged: add support of collapse for tmpfs/shmem pages Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
2016-05-12 15:41 ` [PATCHv8 32/32] shmem: split huge pages beyond i_size under memory pressure Kirill A. Shutemov
[not found] ` <CADf8yx+YMM7DZ8icem2RMQMgtJ8TfGCjGc56xUrBpeY1xLZ4SQ@mail.gmail.com>
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov, Aneesh Kumar K . V
For file mappings, we don't deposit page tables on THP allocation
because it's not strictly required to implement split_huge_pmd(): we
can just clear pmd and let following page faults to reconstruct the
page table.
But Power makes use of deposited page table to address MMU quirk.
Let's hide THP page cache, including huge tmpfs, under separate config
option, so it can be forbidden on Power.
We can revert the patch later once solution for Power found.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
include/linux/shmem_fs.h | 10 +++++++++-
mm/Kconfig | 8 ++++++++
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 11 +++++++----
mm/memory.c | 5 +++--
mm/shmem.c | 26 +++++++++++++-------------
6 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0890f700a546..54fa28dfbd89 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -54,7 +54,6 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
-extern bool shmem_huge_enabled(struct vm_area_struct *vma);
extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
@@ -112,4 +111,13 @@ static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
#endif
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+extern bool shmem_huge_enabled(struct vm_area_struct *vma);
+#else
+static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+ return false;
+}
+#endif
+
#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 989f8f3d77e0..084e7cb1e259 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -428,6 +428,14 @@ choice
endchoice
#
+# We don't deposit page tables on file THP mapping,
+# but Power makes use of them to address MMU quirk.
+#
+config TRANSPARENT_HUGE_PAGECACHE
+ def_bool y
+ depends on TRANSPARENT_HUGEPAGE && !PPC
+
+#
# UP and nommu archs use km based percpu allocator
#
config NEED_PER_CPU_KM
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c08d3d300f4c..9de18dbfd1f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -293,7 +293,7 @@ static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
-#ifdef CONFIG_SHMEM
+#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
&shmem_enabled_attr.attr,
#endif
#ifdef CONFIG_DEBUG_VM
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f8c971468d90..72b048e31aac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -816,6 +816,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
if (shmem_file(vma->vm_file)) {
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+ return false;
return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
HPAGE_PMD_NR);
}
@@ -1152,7 +1154,7 @@ out:
return ret;
}
-#ifdef CONFIG_SHMEM
+#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
@@ -1634,8 +1636,6 @@ skip:
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
- if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
- goto skip;
while (khugepaged_scan.address < hend) {
int ret;
@@ -1647,9 +1647,12 @@ skip:
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
if (shmem_file(vma->vm_file)) {
- struct file *file = get_file(vma->vm_file);
+ struct file *file;
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);
+ if (!shmem_huge_enabled(vma))
+ goto skip;
+ file = get_file(vma->vm_file);
up_read(&mm->mmap_sem);
ret = 1;
khugepaged_scan_shmem(mm, file->f_mapping,
diff --git a/mm/memory.c b/mm/memory.c
index 8d8b1357e1d3..de2e66d8acb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2859,7 +2859,7 @@ map_pte:
return 0;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
@@ -2941,7 +2941,8 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
pte_t entry;
int ret;
- if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+ if (pmd_none(*fe->pmd) && PageTransCompound(page) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
/* THP on COW? */
VM_BUG_ON_PAGE(memcg, page);
diff --git a/mm/shmem.c b/mm/shmem.c
index 1f942c1b4e61..a12867157dc3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -363,7 +363,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
#define SHMEM_HUGE_DENY (-1)
#define SHMEM_HUGE_FORCE (-2)
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
/* ifdef here to avoid bloating shmem.o when not necessary */
int shmem_huge __read_mostly;
@@ -406,11 +406,11 @@ static const char *shmem_format_huge(int huge)
}
}
-#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+#else /* !CONFIG_TRANSPARENT_HUGE_PAGECACHE */
#define shmem_huge SHMEM_HUGE_DENY
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE */
/*
* Like add_to_page_cache_locked, but error if expected item has gone.
@@ -1229,7 +1229,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
void __rcu **results;
struct page *page;
- if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
return NULL;
rcu_read_lock();
@@ -1270,7 +1270,7 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
int nr;
int err = -ENOSPC;
- if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
huge = false;
nr = huge ? HPAGE_PMD_NR : 1;
@@ -1773,7 +1773,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
get_area = current->mm->get_unmapped_area;
addr = get_area(file, uaddr, len, pgoff, flags);
- if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
return addr;
if (IS_ERR_VALUE(addr))
return addr;
@@ -1890,7 +1890,7 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
file_accessed(file);
vma->vm_ops = &shmem_vm_ops;
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
(vma->vm_end & HPAGE_PMD_MASK)) {
khugepaged_enter(vma, vma->vm_flags);
@@ -3284,7 +3284,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
} else if (!strcmp(this_char, "huge")) {
int huge;
huge = shmem_parse_huge(value);
@@ -3381,7 +3381,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
if (sbinfo->huge)
seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
@@ -3726,7 +3726,7 @@ int __init shmem_init(void)
goto out1;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
else
@@ -3743,7 +3743,7 @@ out3:
return error;
}
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+#if defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && defined(CONFIG_SYSFS)
static ssize_t shmem_enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -3826,7 +3826,7 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
return false;
}
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
+#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */
#else /* !CONFIG_SHMEM */
@@ -4006,7 +4006,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
vma->vm_file = file;
vma->vm_ops = &shmem_vm_ops;
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
(vma->vm_end & HPAGE_PMD_MASK)) {
khugepaged_enter(vma, vma->vm_flags);
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCHv8 32/32] shmem: split huge pages beyond i_size under memory pressure
2016-05-12 15:40 [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
` (30 preceding siblings ...)
2016-05-12 15:41 ` [PATCHv8 31/32] thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE Kirill A. Shutemov
@ 2016-05-12 15:41 ` Kirill A. Shutemov
[not found] ` <CADf8yx+YMM7DZ8icem2RMQMgtJ8TfGCjGc56xUrBpeY1xLZ4SQ@mail.gmail.com>
32 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-12 15:41 UTC (permalink / raw)
To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
Jerome Marchand, Yang Shi, Sasha Levin, Andres Lagar-Cavilla,
Ning Qu, linux-kernel, linux-mm, linux-fsdevel,
Kirill A. Shutemov
Even if user asked to allocate huge pages always (huge=always), we
should be able to free up some memory by splitting pages which are
partly byound i_size if memory presure comes or once we hit limit on
filesystem size (-o size=).
In order to do this we maintain per-superblock list of inodes, which
potentially have huge pages on the border of file size.
Per-fs shrinker can reclaim memory by splitting such pages.
If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to
free up space on the filesystem and retry allocation if it succeed.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/shmem_fs.h | 6 +-
mm/shmem.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 180 insertions(+), 1 deletion(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 54fa28dfbd89..ff078e7043b6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -16,8 +16,9 @@ struct shmem_inode_info {
unsigned long flags;
unsigned long alloced; /* data pages alloced to file */
unsigned long swapped; /* subtotal assigned to swap */
- struct shared_policy policy; /* NUMA memory alloc policy */
+ struct list_head shrinklist; /* shrinkable hpage inodes */
struct list_head swaplist; /* chain of maybes on swap */
+ struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
struct inode vfs_inode;
};
@@ -33,6 +34,9 @@ struct shmem_sb_info {
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
+ spinlock_t shrinklist_lock; /* Protects shrinklist */
+ struct list_head shrinklist; /* List of shinkable inodes */
+ unsigned long shrinklist_len; /* Length of shrinklist */
};
static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
diff --git a/mm/shmem.c b/mm/shmem.c
index a12867157dc3..37bd59176435 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -188,6 +188,7 @@ static const struct inode_operations shmem_inode_operations;
static const struct inode_operations shmem_dir_inode_operations;
static const struct inode_operations shmem_special_inode_operations;
static const struct vm_operations_struct shmem_vm_ops;
+static struct file_system_type shmem_fs_type;
static LIST_HEAD(shmem_swaplist);
static DEFINE_MUTEX(shmem_swaplist_mutex);
@@ -406,10 +407,122 @@ static const char *shmem_format_huge(int huge)
}
}
+static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
+ struct shrink_control *sc, unsigned long nr_to_split)
+{
+ LIST_HEAD(list), *pos, *next;
+ struct inode *inode;
+ struct shmem_inode_info *info;
+ struct page *page;
+ unsigned long batch = sc ? sc->nr_to_scan : 128;
+ int failed = 0, split = 0;
+
+ if (list_empty(&sbinfo->shrinklist))
+ return SHRINK_STOP;
+
+ spin_lock(&sbinfo->shrinklist_lock);
+ list_for_each_safe(pos, next, &sbinfo->shrinklist) {
+ info = list_entry(pos, struct shmem_inode_info, shrinklist);
+ sbinfo->shrinklist_len--;
+
+ /* pin the inode */
+ inode = igrab(&info->vfs_inode);
+
+ /* inode is about to be evicted */
+ if (!inode) {
+ list_del_init(&info->shrinklist);
+ goto next;
+ }
+
+ /* Check if there's anything to gain */
+ if (round_up(inode->i_size, PAGE_SIZE) ==
+ round_up(inode->i_size, HPAGE_PMD_SIZE)) {
+ list_del_init(&info->shrinklist);
+ iput(inode);
+ goto next;
+ }
+
+ list_move(&info->shrinklist, &list);
+next:
+ if (!--batch)
+ break;
+ }
+ spin_unlock(&sbinfo->shrinklist_lock);
+
+ list_for_each_safe(pos, next, &list) {
+ int ret;
+
+ info = list_entry(pos, struct shmem_inode_info, shrinklist);
+ inode = &info->vfs_inode;
+
+ if (nr_to_split && split >= nr_to_split) {
+ iput(inode);
+ continue;
+ }
+
+ page = find_lock_page(inode->i_mapping,
+ inode->i_size >> PAGE_SHIFT);
+
+ if (!page)
+ goto drop;
+
+ if (!PageTransHuge(page)) {
+ unlock_page(page);
+ put_page(page);
+ goto drop;
+ }
+
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+
+ if (ret) {
+ /* split failed: leave it on the list */
+ failed++;
+ iput(inode);
+ continue;
+ }
+
+ split++;
+drop:
+ list_del_init(&info->shrinklist);
+ iput(inode);
+ }
+
+ spin_lock(&sbinfo->shrinklist_lock);
+ list_splice_tail(&list, &sbinfo->shrinklist);
+ sbinfo->shrinklist_len += failed;
+ spin_unlock(&sbinfo->shrinklist_lock);
+
+ return split;
+}
+
+static long shmem_unused_huge_scan(struct super_block *sb,
+ struct shrink_control *sc)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
+ if (!READ_ONCE(sbinfo->shrinklist_len))
+ return SHRINK_STOP;
+
+ return shmem_unused_huge_shrink(sbinfo, sc, 0);
+}
+
+static long shmem_unused_huge_count(struct super_block *sb,
+ struct shrink_control *sc)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ return READ_ONCE(sbinfo->shrinklist_len);
+}
#else /* !CONFIG_TRANSPARENT_HUGE_PAGECACHE */
#define shmem_huge SHMEM_HUGE_DENY
+static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
+ struct shrink_control *sc, unsigned long nr_to_split)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE */
/*
@@ -843,6 +956,7 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = d_inode(dentry);
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
int error;
error = inode_change_ok(inode, attr);
@@ -878,6 +992,20 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
if (oldsize > holebegin)
unmap_mapping_range(inode->i_mapping,
holebegin, 0, 1);
+
+ /*
+ * Part of the huge page can be beyond i_size: subject
+ * to shrink under memory pressure.
+ */
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
+ spin_lock(&sbinfo->shrinklist_lock);
+ if (list_empty(&info->shrinklist)) {
+ list_add_tail(&info->shrinklist,
+ &sbinfo->shrinklist);
+ sbinfo->shrinklist_len++;
+ }
+ spin_unlock(&sbinfo->shrinklist_lock);
+ }
}
}
@@ -890,11 +1018,20 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
if (inode->i_mapping->a_ops == &shmem_aops) {
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
shmem_truncate_range(inode, 0, (loff_t)-1);
+ if (!list_empty(&info->shrinklist)) {
+ spin_lock(&sbinfo->shrinklist_lock);
+ if (!list_empty(&info->shrinklist)) {
+ list_del_init(&info->shrinklist);
+ sbinfo->shrinklist_len--;
+ }
+ spin_unlock(&sbinfo->shrinklist_lock);
+ }
if (!list_empty(&info->swaplist)) {
mutex_lock(&shmem_swaplist_mutex);
list_del_init(&info->swaplist);
@@ -1563,8 +1700,23 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
index, false);
}
if (IS_ERR(page)) {
+ int retry = 5;
error = PTR_ERR(page);
page = NULL;
+ if (error != -ENOSPC)
+ goto failed;
+ /*
+ * Try to reclaim some spece by splitting a huge page
+ * beyond i_size on the filesystem.
+ */
+ while (retry--) {
+ int ret;
+ ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
+ if (ret == SHRINK_STOP)
+ break;
+ if (ret)
+ goto alloc_nohuge;
+ }
goto failed;
}
@@ -1603,6 +1755,22 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
spin_unlock_irq(&info->lock);
alloced = true;
+ if (PageTransHuge(page) &&
+ DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
+ hindex + HPAGE_PMD_NR - 1) {
+ /*
+ * Part of the huge page is beyond i_size: subject
+ * to shrink under memory pressure.
+ */
+ spin_lock(&sbinfo->shrinklist_lock);
+ if (list_empty(&info->shrinklist)) {
+ list_add_tail(&info->shrinklist,
+ &sbinfo->shrinklist);
+ sbinfo->shrinklist_len++;
+ }
+ spin_unlock(&sbinfo->shrinklist_lock);
+ }
+
/*
* Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
*/
@@ -1920,6 +2088,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
spin_lock_init(&info->lock);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
+ INIT_LIST_HEAD(&info->shrinklist);
INIT_LIST_HEAD(&info->swaplist);
simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
@@ -3515,6 +3684,8 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL))
goto failed;
sbinfo->free_inodes = sbinfo->max_inodes;
+ spin_lock_init(&sbinfo->shrinklist_lock);
+ INIT_LIST_HEAD(&sbinfo->shrinklist);
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize = PAGE_SIZE;
@@ -3676,6 +3847,10 @@ static const struct super_operations shmem_ops = {
.evict_inode = shmem_evict_inode,
.drop_inode = generic_delete_inode,
.put_super = shmem_put_super,
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+ .nr_cached_objects = shmem_unused_huge_count,
+ .free_cached_objects = shmem_unused_huge_scan,
+#endif
};
static const struct vm_operations_struct shmem_vm_ops = {
--
2.8.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
[parent not found: <CADf8yx+YMM7DZ8icem2RMQMgtJ8TfGCjGc56xUrBpeY1xLZ4SQ@mail.gmail.com>]
* Re: [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages
[not found] ` <CADf8yx+YMM7DZ8icem2RMQMgtJ8TfGCjGc56xUrBpeY1xLZ4SQ@mail.gmail.com>
@ 2016-05-25 20:03 ` Kirill A. Shutemov
[not found] ` <CADf8yx+_EEwys7mip0HspKGMGpacws93afX1zKtHLOmF6-Lj1g@mail.gmail.com>
2016-06-06 13:51 ` Kirill A. Shutemov
1 sibling, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-05-25 20:03 UTC (permalink / raw)
To: neha agarwal
Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
Andrew Morton, Dave Hansen, Vlastimil Babka, Christoph Lameter,
Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
Andres Lagar-Cavilla, Ning Qu, linux-kernel, linux-mm,
linux-fsdevel
On Wed, May 25, 2016 at 03:11:55PM -0400, neha agarwal wrote:
> Hi All,
>
> I have been testing Hugh's and Kirill's huge tmpfs patch sets with
> Cassandra (NoSQL database). I am seeing significant performance gap between
> these two implementations (~30%). Hugh's implementation performs better
> than Kirill's implementation. I am surprised why I am seeing this
> performance gap. Following is my test setup.
Thanks for the report. I'll look into it.
> Patchsets
> ========
> - For Hugh's:
> I checked out 4.6-rc3, applied Hugh's preliminary patches (01 to 10
> patches) from here: https://lkml.org/lkml/2016/4/5/792 and then applied the
> THP patches posted on April 16 (01 to 29 patches).
>
> - For Kirill's:
> I am using his branch "git://
> git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8", which
> is based off of 4.6-rc3, posted on May 12.
>
>
> Khugepaged settings
> ================
> cd /sys/kernel/mm/transparent_hugepage
> echo 10 >khugepaged/alloc_sleep_millisecs
> echo 10 >khugepaged/scan_sleep_millisecs
> echo 511 >khugepaged/max_ptes_none
Do you make this for both setup?
It's not really nessesary for Hugh's, but it makes sense to have this
idenatical for testing.
Do you have swap in the system. Is it in use during testing?
> Mount options
> ===========
> - For Hugh's:
> sudo sysctl -w vm/shmem_huge=2
> sudo mount -o remount,huge=1 /hugetmpfs
>
> - For Kirill's:
> sudo mount -o remount,huge=always /hugetmpfs
> echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> echo 511 >khugepaged/max_ptes_swap
>
>
> Workload Setting
> =============
> Please look at the attached setup document for Cassandra (NoSQL database):
> cassandra-setup.txt
>
>
> Machine setup
> ===========
> 36-core (72 hardware thread) dual-socket x86 server with 512 GB RAM running
> Ubuntu. I use control groups for resource isolation. Server and client
> threads run on different sockets. Frequency governor set to "performance"
> to remove any performance fluctuations due to frequency variation.
>
>
> Throughput numbers
> ================
> Hugh's implementation: 74522.08 ops/sec
> Kirill's implementation: 54919.10 ops/sec
>
>
> I am not sure if something is fishy with my test environment or if there is
> actually a performance gap between the two implementations. I have run this
> test 5-6 times so I am certain that this experiment is repeatable. I will
> appreciate if someone can help me understand the reason for this
> performance gap.
>
> On Thu, May 12, 2016 at 11:40 AM, Kirill A. Shutemov <
> kirill.shutemov@linux.intel.com> wrote:
>
> > This update aimed to address my todo list from lsf/mm summit:
> >
> > - we now able to recovery memory by splitting huge pages partly beyond
> > i_size. This should address concern about small files.
> >
> > - bunch of bug fixes for khugepaged, including fix for data corruption
> > reported by Hugh.
> >
> > - Disabled for Power as it requires deposited page table to get THP
> > mapped and we don't do deposit/withdraw for file THP.
> >
> > The main part of patchset (up to khugepaged stuff) is relatively stable --
> > I fixed few minor bugs there, but nothing major.
> >
> > I would appreciate rigorous review of khugepaged and code to split huge
> > pages under memory pressure.
> >
> > The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
> > THPagecache" and Ebru's khugepaged swapin patches form -mm tree.
> >
> > Git tree:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8
> >
> > == Changelog ==
> >
> > v8:
> > - khugepaged updates:
> > + mark collapsed page dirty, otherwise vmscan would discard it;
> > + account pages to mapping->nrpages on shmem_charge;
> > + fix a situation when not all tail pages put on radix tree on
> > collapse;
> > + fix off-by-one in loop-exit condition in khugepaged_scan_shmem();
> > + use radix_tree_iter_next/radix_tree_iter_retry instead of gotos;
> > + fix build withount CONFIG_SHMEM (again);
> > - split huge pages beyond i_size under memory pressure;
> > - disable huge tmpfs on Power, as it makes use of deposited page tables,
> > we don't have;
> > - fix filesystem size limit accouting;
> > - mark page referenced on split_huge_pmd() if the pmd is young;
> > - uncharge pages from shmem, removed during split_huge_page();
> > - make shmem_inode_info::lock irq-safe -- required by khugepaged;
> >
> > v7:
> > - khugepaged updates:
> > + fix page leak/page cache corruption on collapse fail;
> > + filter out VMAs not suitable for huge pages due misaligned vm_pgoff;
> > + fix build without CONFIG_SHMEM;
> > + drop few over-protective checks;
> > - fix bogus VM_BUG_ON() in __delete_from_page_cache();
> >
> > v6:
> > - experimental collapse support;
> > - fix swapout mapped huge pages;
> > - fix page leak in faularound code;
> > - fix exessive huge page allocation with huge=within_size;
> > - rename VM_NO_THP to VM_NO_KHUGEPAGED;
> > - fix condition in hugepage_madvise();
> > - accounting reworked again;
> >
> > v5:
> > - add FileHugeMapped to /proc/PID/smaps;
> > - make FileHugeMapped in meminfo aligned with other fields;
> > - Documentation/vm/transhuge.txt updated;
> >
> > v4:
> > - first four patch were applied to -mm tree;
> > - drop pages beyond i_size on split_huge_pages;
> > - few small random bugfixes;
> >
> > v3:
> > - huge= mountoption now can have values always, within_size, advice and
> > never;
> > - sysctl handle is replaced with sysfs knob;
> > - MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via
> > page fault;
> > - mlock() handling had been fixed;
> > - bunch of smaller bugfixes and cleanups.
> >
> > == Design overview ==
> >
> > Huge pages are allocated by shmem when it's allowed (by mount option) and
> > there's no entries for the range in radix-tree. Huge page is represented by
> > HPAGE_PMD_NR entries in radix-tree.
> >
> > MM core maps a page with PMD if ->fault() returns huge page and the VMA is
> > suitable for huge pages (size, alignment). There's no need into two
> > requests to file system: filesystem returns huge page if it can,
> > graceful fallback to small pages otherwise.
> >
> > As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
> > re-fault the page with PTEs later.
> >
> > Basic scheme for split_huge_page() is the same as for anon-THP.
> > Few differences:
> >
> > - File pages are on radix-tree, so we have head->_count offset by
> > HPAGE_PMD_NR. The count got distributed to small pages during split.
> >
> > - mapping->tree_lock prevents non-lockless access to pages under split
> > over radix-tree;
> >
> > - Lockless access is prevented by setting the head->_count to 0 during
> > split, so get_page_unless_zero() would fail;
> >
> > - After split, some pages can be beyond i_size. We drop them from
> > radix-tree.
> >
> > - We don't setup migration entries. Just unmap pages. It helps
> > handling cases when i_size is in the middle of the page: no need
> > handle unmap pages beyond i_size manually.
> >
> > COW mapping handled on PTE-level. It's not clear how beneficial would be
> > allocation of huge pages on COW faults. And it would require some code to
> > make them work.
> >
> > I think at some point we can consider teaching khugepaged to collapse
> > pages in COW mappings, but allocating huge on fault is probably overkill.
> >
> > As with anon THP, we mlock file huge page only if it mapped with PMD.
> > PTE-mapped THPs are never mlocked. This way we can avoid all sorts of
> > scenarios when we can leak mlocked page.
> >
> > As with anon THP, we split huge page on swap out.
> >
> > Truncate and punch hole that only cover part of THP range is implemented
> > by zero out this part of THP.
> >
> > This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > inconsistent results depending what pages happened to be allocated.
> > I don't think this will be a problem.
> >
> > We track per-super_block list of inodes which potentially have huge page
> > partly beyond i_size. Under memory pressure or if we hit -ENOSPC, we split
> > such pages in order to recovery memory.
> >
> > The list is per-sb, as we need to split a page from our filesystem if hit
> > -ENOSPC (-o size= limit) during shmem_getpage_gfp() to free some space.
> >
> > Hugh Dickins (1):
> > shmem: get_unmapped_area align huge page
> >
> > Kirill A. Shutemov (31):
> > thp, mlock: update unevictable-lru.txt
> > mm: do not pass mm_struct into handle_mm_fault
> > mm: introduce fault_env
> > mm: postpone page table allocation until we have page to map
> > rmap: support file thp
> > mm: introduce do_set_pmd()
> > thp, vmstats: add counters for huge file pages
> > thp: support file pages in zap_huge_pmd()
> > thp: handle file pages in split_huge_pmd()
> > thp: handle file COW faults
> > thp: skip file huge pmd on copy_huge_pmd()
> > thp: prepare change_huge_pmd() for file thp
> > thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
> > thp: file pages support for split_huge_page()
> > thp, mlock: do not mlock PTE-mapped file huge pages
> > vmscan: split file huge pages before paging them out
> > page-flags: relax policy for PG_mappedtodisk and PG_reclaim
> > radix-tree: implement radix_tree_maybe_preload_order()
> > filemap: prepare find and delete operations for huge pages
> > truncate: handle file thp
> > mm, rmap: account shmem thp pages
> > shmem: prepare huge= mount option and sysfs knob
> > shmem: add huge pages support
> > shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
> > thp: update Documentation/vm/transhuge.txt
> > thp: extract khugepaged from mm/huge_memory.c
> > khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
> > shmem: make shmem_inode_info::lock irq-safe
> > khugepaged: add support of collapse for tmpfs/shmem pages
> > thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
> > shmem: split huge pages beyond i_size under memory pressure
> >
> > Documentation/filesystems/Locking | 10 +-
> > Documentation/vm/transhuge.txt | 130 ++-
> > Documentation/vm/unevictable-lru.txt | 21 +
> > arch/alpha/mm/fault.c | 2 +-
> > arch/arc/mm/fault.c | 2 +-
> > arch/arm/mm/fault.c | 2 +-
> > arch/arm64/mm/fault.c | 2 +-
> > arch/avr32/mm/fault.c | 2 +-
> > arch/cris/mm/fault.c | 2 +-
> > arch/frv/mm/fault.c | 2 +-
> > arch/hexagon/mm/vm_fault.c | 2 +-
> > arch/ia64/mm/fault.c | 2 +-
> > arch/m32r/mm/fault.c | 2 +-
> > arch/m68k/mm/fault.c | 2 +-
> > arch/metag/mm/fault.c | 2 +-
> > arch/microblaze/mm/fault.c | 2 +-
> > arch/mips/mm/fault.c | 2 +-
> > arch/mn10300/mm/fault.c | 2 +-
> > arch/nios2/mm/fault.c | 2 +-
> > arch/openrisc/mm/fault.c | 2 +-
> > arch/parisc/mm/fault.c | 2 +-
> > arch/powerpc/mm/copro_fault.c | 2 +-
> > arch/powerpc/mm/fault.c | 2 +-
> > arch/s390/mm/fault.c | 2 +-
> > arch/score/mm/fault.c | 2 +-
> > arch/sh/mm/fault.c | 2 +-
> > arch/sparc/mm/fault_32.c | 4 +-
> > arch/sparc/mm/fault_64.c | 2 +-
> > arch/tile/mm/fault.c | 2 +-
> > arch/um/kernel/trap.c | 2 +-
> > arch/unicore32/mm/fault.c | 2 +-
> > arch/x86/mm/fault.c | 2 +-
> > arch/xtensa/mm/fault.c | 2 +-
> > drivers/base/node.c | 13 +-
> > drivers/char/mem.c | 24 +
> > drivers/iommu/amd_iommu_v2.c | 3 +-
> > drivers/iommu/intel-svm.c | 2 +-
> > fs/proc/meminfo.c | 7 +-
> > fs/proc/task_mmu.c | 10 +-
> > fs/userfaultfd.c | 22 +-
> > include/linux/huge_mm.h | 36 +-
> > include/linux/khugepaged.h | 6 +
> > include/linux/mm.h | 51 +-
> > include/linux/mmzone.h | 4 +-
> > include/linux/page-flags.h | 19 +-
> > include/linux/radix-tree.h | 1 +
> > include/linux/rmap.h | 2 +-
> > include/linux/shmem_fs.h | 45 +-
> > include/linux/userfaultfd_k.h | 8 +-
> > include/linux/vm_event_item.h | 7 +
> > include/trace/events/huge_memory.h | 3 +-
> > ipc/shm.c | 10 +-
> > lib/radix-tree.c | 68 +-
> > mm/Kconfig | 8 +
> > mm/Makefile | 2 +-
> > mm/filemap.c | 226 ++--
> > mm/gup.c | 7 +-
> > mm/huge_memory.c | 2032
> > ++++++----------------------------
> > mm/internal.h | 4 +-
> > mm/khugepaged.c | 1851
> > +++++++++++++++++++++++++++++++
> > mm/ksm.c | 5 +-
> > mm/memory.c | 860 +++++++-------
> > mm/mempolicy.c | 4 +-
> > mm/migrate.c | 5 +-
> > mm/mmap.c | 26 +-
> > mm/nommu.c | 3 +-
> > mm/page-writeback.c | 1 +
> > mm/page_alloc.c | 21 +
> > mm/rmap.c | 78 +-
> > mm/shmem.c | 918 +++++++++++++--
> > mm/swap.c | 2 +
> > mm/truncate.c | 22 +-
> > mm/util.c | 6 +
> > mm/vmscan.c | 6 +
> > mm/vmstat.c | 4 +
> > 75 files changed, 4240 insertions(+), 2415 deletions(-)
> > create mode 100644 mm/khugepaged.c
> >
> > --
> > 2.8.1
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
>
>
>
> --
> Thanks and Regards,
> Neha Agarwal
> University of Michigan
> 1. Download and extract Cassandra
> http://archive.apache.org/dist/cassandra/2.0.16/apache-cassandra-2.0.16-bin.tar.gz
>
> Note that my test version is Cassandra-2.0.16.
> We will denote the path to which the file is extracted as CASSANDRA_BIN
>
> 2. Setup environment for cassandra
> mkdir -p run_cassandra/cassandra_conf/triggers
>
> - Download cassandra-env.sh, cassandra.yaml, log4j-server.properties from my mail
> attachement and then copy those files in run_cassandra/cassandra_conf
> - Search for /home/nehaag/hugetmpfs in these files and change this to a local
> directory mounted as tmpfs. Let’s say that is CASSANDRA_DATA. A folder named
> "cassandra" will be automatically created (For example:
> CASSANDRA_DATA/cassandra) when running Cassandra.
> - Please note that these scripts will need modifications if you use Cassandra
> version other that 2.0.16
>
> - Download create-ycsb-table.cql.j2 from my email attachment and copy it in
> run_cassandra/
>
> 3. JAVA setup, get JRE: openjdk v1.7.0_101 (sudo apt-get install openjdk-7-jre
> for Ubuntu)
>
> 4. Setup YCSB Load generator:
> - Clone ycsb from: https://github.com/brianfrankcooper/YCSB.git. Let’s say this is
> downloaded to YCSB_ROOT
> - You need to have maven 3 installed (`sudo apt-get install maven’ in ubuntu)
> - Create a script (say run-cassandra.sh) in run_cassandra as follows:
>
> input_file=run_cassandra/create-ycsb-table.cql.j2
> cassandra_cli=${CASSANDRA_BIN}/bin/cassandra-cli
> host=”127.0.0.1” #Ip address of the machine running cassasndra server
> $cassandra_cli -h $host --jmxport 7199 -f create-ycsb-table.cql
> cd ${YCSB_ROOT}
>
> # Load dataset
> ${YCSB_ROOT}/bin/ycsb -cp ${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar:${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar load cassandra-10 -p hosts=$host -threads 20 -p fieldcount=20 -p recordcount=5000000 -P ${YCSB_ROOT}/workloads/workloadb -s
>
> # Run benchmark
> ${YCSB_ROOT}/bin/ycsb -cp ${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar:${YCSB_ROOT}/cassandra/target/dependency/slf4j-simple-1.7.12.jar run cassandra-10 -p hosts=$host -threads 20 -p fieldcount=20 -p operationcount=50000000 -p recordcount=5000000 -p readproportion=0.05 -p updateproportion=0.95 -P ${YCSB_ROOT}/workloads/workloadb -s
>
> 5. Run the cassandra server on host machine:
> rm -r ${CASSANDRA_DATA}/cassandra && CASSANDRA_CONF=run_cassandra/cassandra_conf JRE_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre ${CASSANDRA_BIN}/bin/cassandra -f
>
> 6. Run load generator on same/some other machine:
> ./run-cassandra.sh
>
> YCSB periodcally spits out the throughput and latency number
> At the end overall throughput and latency will be printed out
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages
[not found] ` <CADf8yx+YMM7DZ8icem2RMQMgtJ8TfGCjGc56xUrBpeY1xLZ4SQ@mail.gmail.com>
2016-05-25 20:03 ` [PATCHv8 00/32] THP-enabled tmpfs/shmem using compound pages Kirill A. Shutemov
@ 2016-06-06 13:51 ` Kirill A. Shutemov
[not found] ` <CADf8yxLnAajuffiTPP+-5PsgDfDBtp9H-sVVXPrZQqg4P4D2mw@mail.gmail.com>
1 sibling, 1 reply; 39+ messages in thread
From: Kirill A. Shutemov @ 2016-06-06 13:51 UTC (permalink / raw)
To: neha agarwal
Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
Andrew Morton, Dave Hansen, Vlastimil Babka, Christoph Lameter,
Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
Andres Lagar-Cavilla, Ning Qu, linux-kernel, linux-mm,
linux-fsdevel
On Wed, May 25, 2016 at 03:11:55PM -0400, neha agarwal wrote:
> Hi All,
>
> I have been testing Hugh's and Kirill's huge tmpfs patch sets with
> Cassandra (NoSQL database). I am seeing significant performance gap between
> these two implementations (~30%). Hugh's implementation performs better
> than Kirill's implementation. I am surprised why I am seeing this
> performance gap. Following is my test setup.
>
> Patchsets
> ========
> - For Hugh's:
> I checked out 4.6-rc3, applied Hugh's preliminary patches (01 to 10
> patches) from here: https://lkml.org/lkml/2016/4/5/792 and then applied the
> THP patches posted on April 16 (01 to 29 patches).
>
> - For Kirill's:
> I am using his branch "git://
> git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v8", which
> is based off of 4.6-rc3, posted on May 12.
>
>
> Khugepaged settings
> ================
> cd /sys/kernel/mm/transparent_hugepage
> echo 10 >khugepaged/alloc_sleep_millisecs
> echo 10 >khugepaged/scan_sleep_millisecs
> echo 511 >khugepaged/max_ptes_none
>
>
> Mount options
> ===========
> - For Hugh's:
> sudo sysctl -w vm/shmem_huge=2
> sudo mount -o remount,huge=1 /hugetmpfs
>
> - For Kirill's:
> sudo mount -o remount,huge=always /hugetmpfs
> echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> echo 511 >khugepaged/max_ptes_swap
>
>
> Workload Setting
> =============
> Please look at the attached setup document for Cassandra (NoSQL database):
> cassandra-setup.txt
>
>
> Machine setup
> ===========
> 36-core (72 hardware thread) dual-socket x86 server with 512 GB RAM running
> Ubuntu. I use control groups for resource isolation. Server and client
> threads run on different sockets. Frequency governor set to "performance"
> to remove any performance fluctuations due to frequency variation.
>
>
> Throughput numbers
> ================
> Hugh's implementation: 74522.08 ops/sec
> Kirill's implementation: 54919.10 ops/sec
In my setup I don't see the difference:
v4.7-rc1 + my implementation:
[OVERALL], RunTime(ms), 822862.0
[OVERALL], Throughput(ops/sec), 60763.53021527304
ShmemPmdMapped: 4999168 kB
v4.6-rc2 + Hugh's implementation:
[OVERALL], RunTime(ms), 833157.0
[OVERALL], Throughput(ops/sec), 60012.698687042175
ShmemPmdMapped: 5021696 kB
It's basically within measuarment error. 'ShmemPmdMapped' indicate how
much memory is mapped with huge pages by the end of test.
It's on dual-socket 24-core machine with 64G of RAM.
I guess we have some configuration difference or something, but so far I
don't see the drastic performance difference you've pointed to.
May be my implementation behaves slower on bigger machines, I don't know..
There's no architectural reason for this.
I'll post my updated patchset today.
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 39+ messages in thread