* [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault()
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-07-10 17:44 ` Johannes Weiner
2019-07-10 17:51 ` Johannes Weiner
2019-06-25 0:12 ` [PATCH v9 2/6] filemap: update offset check " Song Liu
` (5 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
Currently, filemap_fault() avoids a race condition with truncate by
checking page->mapping == mapping. This does not work for compound
pages. This patch lets it check compound_head(page)->mapping instead.
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa..f5b79a43946d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2517,7 +2517,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
goto out_retry;
/* Did it get truncated? */
- if (unlikely(page->mapping != mapping)) {
+ if (unlikely(compound_head(page)->mapping != mapping)) {
unlock_page(page);
put_page(page);
goto retry_find;
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault()
2019-06-25 0:12 ` [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault() Song Liu
@ 2019-07-10 17:44 ` Johannes Weiner
2019-07-10 17:51 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 17:44 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:41PM -0700, Song Liu wrote:
> Currently, filemap_fault() avoids a race condition with truncate by
> checking page->mapping == mapping. This does not work for compound
> pages. This patch lets it check compound_head(page)->mapping instead.
>
> Acked-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault()
2019-06-25 0:12 ` [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault() Song Liu
2019-07-10 17:44 ` Johannes Weiner
@ 2019-07-10 17:51 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 17:51 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:41PM -0700, Song Liu wrote:
> Currently, filemap_fault() avoids a race condition with truncate by
> checking page->mapping == mapping. This does not work for compound
> pages. This patch lets it check compound_head(page)->mapping instead.
>
> Acked-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> mm/filemap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index df2006ba0cfa..f5b79a43946d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2517,7 +2517,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> goto out_retry;
>
> /* Did it get truncated? */
> - if (unlikely(page->mapping != mapping)) {
> + if (unlikely(compound_head(page)->mapping != mapping)) {
There is another check like this in pagecache_get_page(), which is
used by find_lock_page() and thus by the truncate code (partial page
truncate calls it, but this could happen against a read-only cache).
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v9 2/6] filemap: update offset check in filemap_fault()
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
2019-06-25 0:12 ` [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault() Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-07-10 17:52 ` Johannes Weiner
2019-06-25 0:12 ` [PATCH v9 3/6] mm,thp: stats for file backed THP Song Liu
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
With THP, current check of offset:
VM_BUG_ON_PAGE(page->index != offset, page);
is no longer accurate. Update it to:
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f5b79a43946d..5f072a113535 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2522,7 +2522,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
put_page(page);
goto retry_find;
}
- VM_BUG_ON_PAGE(page->index != offset, page);
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
/*
* We have a locked page in the page cache, now we need to check
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 2/6] filemap: update offset check in filemap_fault()
2019-06-25 0:12 ` [PATCH v9 2/6] filemap: update offset check " Song Liu
@ 2019-07-10 17:52 ` Johannes Weiner
0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 17:52 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:42PM -0700, Song Liu wrote:
> With THP, current check of offset:
>
> VM_BUG_ON_PAGE(page->index != offset, page);
>
> is no longer accurate. Update it to:
>
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
>
> Acked-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v9 3/6] mm,thp: stats for file backed THP
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
2019-06-25 0:12 ` [PATCH v9 1/6] filemap: check compound_head(page)->mapping in filemap_fault() Song Liu
2019-06-25 0:12 ` [PATCH v9 2/6] filemap: update offset check " Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-07-10 17:59 ` Johannes Weiner
2019-06-25 0:12 ` [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem() Song Liu
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
In preparation for non-shmem THP, this patch adds a few stats and exposes
them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
/proc/<pid>/task/<tid>/smaps.
This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/base/node.c | 6 ++++++
fs/proc/meminfo.c | 4 ++++
fs/proc/task_mmu.c | 4 +++-
include/linux/mmzone.h | 2 ++
mm/vmstat.c | 2 ++
5 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8598fcbd2a17..71ae2dc93489 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -426,6 +426,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d AnonHugePages: %8lu kB\n"
"Node %d ShmemHugePages: %8lu kB\n"
"Node %d ShmemPmdMapped: %8lu kB\n"
+ "Node %d FileHugePages: %8lu kB\n"
+ "Node %d FilePmdMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -451,6 +453,10 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
HPAGE_PMD_NR),
nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(pgdat, NR_FILE_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
HPAGE_PMD_NR)
#endif
);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 568d90e17c17..bac395fc11f9 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -136,6 +136,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
show_val_kb(m, "ShmemPmdMapped: ",
global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
+ show_val_kb(m, "FileHugePages: ",
+ global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR);
+ show_val_kb(m, "FilePmdMapped: ",
+ global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
#endif
#ifdef CONFIG_CMA
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 01d4eb0e6bd1..0360e3b2ba89 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -413,6 +413,7 @@ struct mem_size_stats {
unsigned long lazyfree;
unsigned long anonymous_thp;
unsigned long shmem_thp;
+ unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -563,7 +564,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
else if (is_zone_device_page(page))
/* pass */;
else
- VM_BUG_ON_PAGE(1, page);
+ mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), locked);
}
#else
@@ -767,6 +768,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss)
SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree);
SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp);
SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
+ SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp);
SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
mss->private_hugetlb >> 10, 7);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394cabaf4e..827f9b777938 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -234,6 +234,8 @@ enum node_stat_item {
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
+ NR_FILE_THPS,
+ NR_FILE_PMDMAPPED,
NR_ANON_THPS,
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_VMSCAN_WRITE,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd7e16ca6996..6afc892a148a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1158,6 +1158,8 @@ const char * const vmstat_text[] = {
"nr_shmem",
"nr_shmem_hugepages",
"nr_shmem_pmdmapped",
+ "nr_file_hugepages",
+ "nr_file_pmdmapped",
"nr_anon_transparent_hugepages",
"nr_unstable",
"nr_vmscan_write",
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 3/6] mm,thp: stats for file backed THP
2019-06-25 0:12 ` [PATCH v9 3/6] mm,thp: stats for file backed THP Song Liu
@ 2019-07-10 17:59 ` Johannes Weiner
0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 17:59 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:43PM -0700, Song Liu wrote:
> @@ -413,6 +413,7 @@ struct mem_size_stats {
> unsigned long lazyfree;
> unsigned long anonymous_thp;
> unsigned long shmem_thp;
> + unsigned long file_thp;
This appears to be unused.
Other than that, this looks good to me. It's a bit of a shame that
it's not symmetrical with the anon THP stats, but that already
diverged on shmem pages, so not your fault... Ah well.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem()
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
` (2 preceding siblings ...)
2019-06-25 0:12 ` [PATCH v9 3/6] mm,thp: stats for file backed THP Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-06-27 13:19 ` Rik van Riel
2019-07-10 18:21 ` Johannes Weiner
2019-06-25 0:12 ` [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS Song Liu
` (2 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
The next patch will add khugepaged support for non-shmem files. This patch
renames these two functions to reflect the new functionality:
collapse_shmem() => collapse_file()
khugepaged_scan_shmem() => khugepaged_scan_file()
Signed-off-by: Song Liu <songliubraving@fb.com>
---
mm/khugepaged.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0f7419938008..158cad542627 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1287,7 +1287,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
}
/**
- * collapse_shmem - collapse small tmpfs/shmem pages into huge one.
+ * collapse_file - collapse small tmpfs/shmem pages into huge one.
*
* Basic scheme is simple, details are more complex:
* - allocate and lock a new huge page;
@@ -1304,10 +1304,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
* + restore gaps in the page cache;
* + unlock and free huge page;
*/
-static void collapse_shmem(struct mm_struct *mm,
- struct address_space *mapping, pgoff_t start,
+static void collapse_file(struct mm_struct *mm,
+ struct file *file, pgoff_t start,
struct page **hpage, int node)
{
+ struct address_space *mapping = file->f_mapping;
gfp_t gfp;
struct page *new_page;
struct mem_cgroup *memcg;
@@ -1563,11 +1564,11 @@ static void collapse_shmem(struct mm_struct *mm,
/* TODO: tracepoints */
}
-static void khugepaged_scan_shmem(struct mm_struct *mm,
- struct address_space *mapping,
- pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm,
+ struct file *file, pgoff_t start, struct page **hpage)
{
struct page *page = NULL;
+ struct address_space *mapping = file->f_mapping;
XA_STATE(xas, &mapping->i_pages, start);
int present, swap;
int node = NUMA_NO_NODE;
@@ -1631,16 +1632,15 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
result = SCAN_EXCEED_NONE_PTE;
} else {
node = khugepaged_find_target_node();
- collapse_shmem(mm, mapping, start, hpage, node);
+ collapse_file(mm, file, start, hpage, node);
}
}
/* TODO: tracepoints */
}
#else
-static void khugepaged_scan_shmem(struct mm_struct *mm,
- struct address_space *mapping,
- pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm,
+ struct file *file, pgoff_t start, struct page **hpage)
{
BUILD_BUG();
}
@@ -1722,8 +1722,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
file = get_file(vma->vm_file);
up_read(&mm->mmap_sem);
ret = 1;
- khugepaged_scan_shmem(mm, file->f_mapping,
- pgoff, hpage);
+ khugepaged_scan_file(mm, file, pgoff, hpage);
fput(file);
} else {
ret = khugepaged_scan_pmd(mm, vma,
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem()
2019-06-25 0:12 ` [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem() Song Liu
@ 2019-06-27 13:19 ` Rik van Riel
2019-07-10 18:21 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2019-06-27 13:19 UTC (permalink / raw)
To: Song Liu, linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton
[-- Attachment #1: Type: text/plain, Size: 432 bytes --]
On Mon, 2019-06-24 at 17:12 -0700, Song Liu wrote:
> Next patch will add khugepaged support of non-shmem files. This patch
> renames these two functions to reflect the new functionality:
>
> collapse_shmem() => collapse_file()
> khugepaged_scan_shmem() => khugepaged_scan_file()
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
--
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem()
2019-06-25 0:12 ` [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem() Song Liu
2019-06-27 13:19 ` Rik van Riel
@ 2019-07-10 18:21 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 18:21 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:44PM -0700, Song Liu wrote:
> Next patch will add khugepaged support of non-shmem files. This patch
> renames these two functions to reflect the new functionality:
>
> collapse_shmem() => collapse_file()
> khugepaged_scan_shmem() => khugepaged_scan_file()
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
` (3 preceding siblings ...)
2019-06-25 0:12 ` [PATCH v9 4/6] khugepaged: rename collapse_shmem() and khugepaged_scan_shmem() Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-07-10 18:48 ` Johannes Weiner
2019-07-23 23:59 ` Huang, Kai
2019-06-25 0:12 ` [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache Song Liu
2019-06-27 12:46 ` [PATCH v9 0/6] Enable THP for text section of non-shmem files Kirill A. Shutemov
6 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.
This patch enables an application to put part of its text sections to THP
via madvise, for example:
madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
We tried to reuse the logic for THP on tmpfs.
Currently, write is not supported for non-shmem THP. khugepaged will only
process vmas with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff()). The only way to create a vma with VM_DENYWRITE is
execve(). This requirement limits non-shmem THP to text sections.
The next patch will handle writes, which would only happen after all
the vmas with VM_DENYWRITE are unmapped.
An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
mm/Kconfig | 11 ++++++
mm/filemap.c | 4 +--
mm/khugepaged.c | 94 +++++++++++++++++++++++++++++++++++++++++--------
mm/rmap.c | 12 ++++---
4 files changed, 100 insertions(+), 21 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..0a8fd589406d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -762,6 +762,17 @@ config GUP_BENCHMARK
See tools/testing/selftests/vm/gup_benchmark.c
+config READ_ONLY_THP_FOR_FS
+ bool "Read-only THP for filesystems (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
+
+ help
+ Allow khugepaged to put read-only file-backed pages in THP.
+
+ This is marked experimental because it is a new feature. Write
+ support of file THPs will be developed in the next few release
+ cycles.
+
config ARCH_HAS_PTE_SPECIAL
bool
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f072a113535..e79ceccdc6df 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -203,8 +203,8 @@ static void unaccount_page_cache_page(struct address_space *mapping,
__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
if (PageTransHuge(page))
__dec_node_page_state(page, NR_SHMEM_THPS);
- } else {
- VM_BUG_ON_PAGE(PageTransHuge(page), page);
+ } else if (PageTransHuge(page)) {
+ __dec_node_page_state(page, NR_FILE_THPS);
}
/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 158cad542627..acbbbeaa083c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -48,6 +48,7 @@ enum scan_result {
SCAN_CGROUP_CHARGE_FAIL,
SCAN_EXCEED_SWAP_PTE,
SCAN_TRUNCATED,
+ SCAN_PAGE_HAS_PRIVATE,
};
#define CREATE_TRACE_POINTS
@@ -404,7 +405,11 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
(vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
return false;
- if (shmem_file(vma->vm_file)) {
+
+ if (shmem_file(vma->vm_file) ||
+ (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+ vma->vm_file &&
+ (vm_flags & VM_DENYWRITE))) {
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
return false;
return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
@@ -456,8 +461,9 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
unsigned long hstart, hend;
/*
- * khugepaged does not yet work on non-shmem files or special
- * mappings. And file-private shmem THP is not supported.
+ * khugepaged only supports read-only files for non-shmem files.
+ * khugepaged does not yet work on special mappings. And
+ * file-private shmem THP is not supported.
*/
if (!hugepage_vma_check(vma, vm_flags))
return 0;
@@ -1287,12 +1293,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
}
/**
- * collapse_file - collapse small tmpfs/shmem pages into huge one.
+ * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
*
* Basic scheme is simple, details are more complex:
* - allocate and lock a new huge page;
* - scan page cache replacing old pages with the new one
- * + swap in pages if necessary;
+ * + swap/gup in pages if necessary;
* + fill in gaps;
* + keep old pages around in case rollback is required;
* - if replacing succeeds:
@@ -1316,7 +1322,9 @@ static void collapse_file(struct mm_struct *mm,
LIST_HEAD(pagelist);
XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
int nr_none = 0, result = SCAN_SUCCEED;
+ bool is_shmem = shmem_file(file);
+ VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
/* Only allocate from the target node */
@@ -1348,7 +1356,8 @@ static void collapse_file(struct mm_struct *mm,
} while (1);
__SetPageLocked(new_page);
- __SetPageSwapBacked(new_page);
+ if (is_shmem)
+ __SetPageSwapBacked(new_page);
new_page->index = start;
new_page->mapping = mapping;
@@ -1363,7 +1372,7 @@ static void collapse_file(struct mm_struct *mm,
struct page *page = xas_next(&xas);
VM_BUG_ON(index != xas.xa_index);
- if (!page) {
+ if (is_shmem && !page) {
/*
* Stop if extent has been truncated or hole-punched,
* and is now completely empty.
@@ -1384,7 +1393,7 @@ static void collapse_file(struct mm_struct *mm,
continue;
}
- if (xa_is_value(page) || !PageUptodate(page)) {
+ if (is_shmem && (xa_is_value(page) || !PageUptodate(page))) {
xas_unlock_irq(&xas);
/* swap in or instantiate fallocated page */
if (shmem_getpage(mapping->host, index, &page,
@@ -1392,6 +1401,29 @@ static void collapse_file(struct mm_struct *mm,
result = SCAN_FAIL;
goto xa_unlocked;
}
+ } else if (!page || xa_is_value(page)) {
+ xas_unlock_irq(&xas);
+ page_cache_sync_readahead(mapping, &file->f_ra, file,
+ index, PAGE_SIZE);
+ /* drain pagevecs to help isolate_lru_page() */
+ lru_add_drain();
+ page = find_lock_page(mapping, index);
+ if (unlikely(page == NULL)) {
+ result = SCAN_FAIL;
+ goto xa_unlocked;
+ }
+ } else if (!PageUptodate(page)) {
+ VM_BUG_ON(is_shmem);
+ xas_unlock_irq(&xas);
+ wait_on_page_locked(page);
+ if (!trylock_page(page)) {
+ result = SCAN_PAGE_LOCK;
+ goto xa_unlocked;
+ }
+ get_page(page);
+ } else if (!is_shmem && PageDirty(page)) {
+ result = SCAN_FAIL;
+ goto xa_locked;
} else if (trylock_page(page)) {
get_page(page);
xas_unlock_irq(&xas);
@@ -1426,6 +1458,12 @@ static void collapse_file(struct mm_struct *mm,
goto out_unlock;
}
+ if (page_has_private(page) &&
+ !try_to_release_page(page, GFP_KERNEL)) {
+ result = SCAN_PAGE_HAS_PRIVATE;
+ break;
+ }
+
if (page_mapped(page))
unmap_mapping_pages(mapping, index, 1, false);
@@ -1463,12 +1501,18 @@ static void collapse_file(struct mm_struct *mm,
goto xa_unlocked;
}
- __inc_node_page_state(new_page, NR_SHMEM_THPS);
+ if (is_shmem)
+ __inc_node_page_state(new_page, NR_SHMEM_THPS);
+ else
+ __inc_node_page_state(new_page, NR_FILE_THPS);
+
if (nr_none) {
struct zone *zone = page_zone(new_page);
__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
- __mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
+ if (is_shmem)
+ __mod_node_page_state(zone->zone_pgdat,
+ NR_SHMEM, nr_none);
}
xa_locked:
@@ -1506,10 +1550,15 @@ static void collapse_file(struct mm_struct *mm,
SetPageUptodate(new_page);
page_ref_add(new_page, HPAGE_PMD_NR - 1);
- set_page_dirty(new_page);
mem_cgroup_commit_charge(new_page, memcg, false, true);
+
+ if (is_shmem) {
+ set_page_dirty(new_page);
+ lru_cache_add_anon(new_page);
+ } else {
+ lru_cache_add_file(new_page);
+ }
count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
- lru_cache_add_anon(new_page);
/*
* Remove pte page tables, so we can re-fault the page as huge.
@@ -1524,7 +1573,9 @@ static void collapse_file(struct mm_struct *mm,
/* Something went wrong: roll back page cache changes */
xas_lock_irq(&xas);
mapping->nrpages -= nr_none;
- shmem_uncharge(mapping->host, nr_none);
+
+ if (is_shmem)
+ shmem_uncharge(mapping->host, nr_none);
xas_set(&xas, start);
xas_for_each(&xas, page, end - 1) {
@@ -1607,6 +1658,17 @@ static void khugepaged_scan_file(struct mm_struct *mm,
break;
}
+ if (page_has_private(page) && trylock_page(page)) {
+ int ret;
+
+ ret = try_to_release_page(page, GFP_KERNEL);
+ unlock_page(page);
+ if (!ret) {
+ result = SCAN_PAGE_HAS_PRIVATE;
+ break;
+ }
+ }
+
if (page_count(page) != 1 + page_mapcount(page)) {
result = SCAN_PAGE_COUNT;
break;
@@ -1713,11 +1775,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- if (shmem_file(vma->vm_file)) {
+ if (vma->vm_file) {
struct file *file;
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);
- if (!shmem_huge_enabled(vma))
+
+ if (shmem_file(vma->vm_file)
+ && !shmem_huge_enabled(vma))
goto skip;
file = get_file(vma->vm_file);
up_read(&mm->mmap_sem);
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..87cfa2c19eda 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1192,8 +1192,10 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
- VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
- __inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
+ if (PageSwapBacked(page))
+ __inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
+ else
+ __inc_node_page_state(page, NR_FILE_PMDMAPPED);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
VM_WARN_ON_ONCE(!PageLocked(page));
@@ -1232,8 +1234,10 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
- VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
- __dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
+ if (PageSwapBacked(page))
+ __dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
+ else
+ __dec_node_page_state(page, NR_FILE_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS
2019-06-25 0:12 ` [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS Song Liu
@ 2019-07-10 18:48 ` Johannes Weiner
2019-07-22 23:41 ` Song Liu
2019-07-23 23:59 ` Huang, Kai
1 sibling, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 18:48 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:45PM -0700, Song Liu wrote:
> This patch is (hopefully) the first step to enable THP for non-shmem
> filesystems.
>
> This patch enables an application to put part of its text sections to THP
> via madvise, for example:
>
> madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
>
> We tried to reuse the logic for THP on tmpfs.
>
> Currently, write is not supported for non-shmem THP. khugepaged will only
> process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
> (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
> execve(). This requirement limits non-shmem THP to text sections.
>
> The next patch will handle writes, which would only happen after all
> the vmas with VM_DENYWRITE are unmapped.
>
> An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
> feature.
>
> Acked-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
This is really cool, and less invasive than I anticipated. Nice work.
I only have one concern and one question:
> @@ -1392,6 +1401,29 @@ static void collapse_file(struct mm_struct *mm,
> result = SCAN_FAIL;
> goto xa_unlocked;
> }
> + } else if (!page || xa_is_value(page)) {
> + xas_unlock_irq(&xas);
> + page_cache_sync_readahead(mapping, &file->f_ra, file,
> + index, PAGE_SIZE);
> + /* drain pagevecs to help isolate_lru_page() */
> + lru_add_drain();
> + page = find_lock_page(mapping, index);
> + if (unlikely(page == NULL)) {
> + result = SCAN_FAIL;
> + goto xa_unlocked;
> + }
> + } else if (!PageUptodate(page)) {
> + VM_BUG_ON(is_shmem);
> + xas_unlock_irq(&xas);
> + wait_on_page_locked(page);
> + if (!trylock_page(page)) {
> + result = SCAN_PAGE_LOCK;
> + goto xa_unlocked;
> + }
> + get_page(page);
> + } else if (!is_shmem && PageDirty(page)) {
> + result = SCAN_FAIL;
> + goto xa_locked;
> } else if (trylock_page(page)) {
> get_page(page);
> xas_unlock_irq(&xas);
The many else ifs here check fairly complex page state and are hard to
follow and verify mentally. In fact, it's a bit easier now in the
patch when you see how it *used* to work with just shmem, but the end
result is fragile from a maintenance POV.
The shmem and file cases have little in common - basically only the
trylock_page(). Can you please make one big 'if (is_shmem) {} {}'
structure instead that keeps those two scenarios separate?
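One possible shape for that split (a pseudocode sketch of the
suggestion above, not a tested patch):

```
/* pseudocode sketch: inside the xas loop of collapse_file() */
if (is_shmem) {
        if (!page) {
                /* truncated or hole-punched: fill the gap or stop */
        } else if (xa_is_value(page) || !PageUptodate(page)) {
                /* swap in or instantiate fallocated page */
        } else if (trylock_page(page)) {
                /* locked an uptodate shmem page */
        }
} else {
        if (!page || xa_is_value(page)) {
                /* read the page in via readahead, then lock it */
        } else if (!PageUptodate(page)) {
                /* wait for the read to finish, then try to lock */
        } else if (PageDirty(page)) {
                /* dirty file page: give up (read-only support only) */
        } else if (trylock_page(page)) {
                /* locked a clean, uptodate file page */
        }
}
```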
> @@ -1426,6 +1458,12 @@ static void collapse_file(struct mm_struct *mm,
> goto out_unlock;
> }
>
> + if (page_has_private(page) &&
> + !try_to_release_page(page, GFP_KERNEL)) {
> + result = SCAN_PAGE_HAS_PRIVATE;
> + break;
> + }
> +
> if (page_mapped(page))
> unmap_mapping_pages(mapping, index, 1, false);
> @@ -1607,6 +1658,17 @@ static void khugepaged_scan_file(struct mm_struct *mm,
> break;
> }
>
> + if (page_has_private(page) && trylock_page(page)) {
> + int ret;
> +
> + ret = try_to_release_page(page, GFP_KERNEL);
> + unlock_page(page);
> + if (!ret) {
> + result = SCAN_PAGE_HAS_PRIVATE;
> + break;
> + }
> + }
> +
> if (page_count(page) != 1 + page_mapcount(page)) {
> result = SCAN_PAGE_COUNT;
> break;
There is already a try_to_release_page() inside the page lock section in
collapse_file(). I'm assuming you added this one because private data
affects the refcount. But it seems a bit overkill just for that; we
could also still fail the check, in which case we'd have dropped the
buffers in vain. Can you fix the check instead?
There is an is_page_cache_freeable() function in vmscan.c that handles
private fs references:
static inline int is_page_cache_freeable(struct page *page)
{
	/*
	 * A freeable page cache page is referenced only by the caller
	 * that isolated the page, the page cache and optional buffer
	 * heads at page->private.
	 */
	int page_cache_pins = PageTransHuge(page) && PageSwapCache(page) ?
		HPAGE_PMD_NR : 1;
	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}
Wouldn't this work here as well?
The rest looks great to me.
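The arithmetic behind is_page_cache_freeable() can be sanity-checked with a tiny userspace model. Everything below is hypothetical scaffolding (the struct and helper names are invented); it only mirrors the invariant quoted above: a page is freeable when its refcount, minus the buffer-head reference at page->private, equals the caller's isolation pin plus the page-cache pins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the check quoted above. A THP in the
 * swap cache holds HPAGE_PMD_NR page-cache pins; a base page holds 1. */
struct page_model {
	int refcount;     /* page_count() */
	bool has_private; /* page_has_private(): buffer heads hold one ref */
	int cache_pins;   /* 1, or HPAGE_PMD_NR for THP + swap cache */
};

static bool cache_freeable(const struct page_model *p)
{
	/* caller's isolation pin (the "1") plus the page-cache pins,
	 * after discounting the reference held by private buffer heads */
	return p->refcount - (p->has_private ? 1 : 0) == 1 + p->cache_pins;
}
```

A page with buffer heads and refcount 3 passes the same check that a bare page with refcount 2 does, which is exactly why fixing the check avoids dropping the buffers in vain.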
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS
2019-07-10 18:48 ` Johannes Weiner
@ 2019-07-22 23:41 ` Song Liu
0 siblings, 0 replies; 22+ messages in thread
From: Song Liu @ 2019-07-22 23:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: Linux-MM, linux-fsdevel, LKML, Matthew Wilcox,
Kirill A. Shutemov, Kernel Team, William Kucharski,
Andrew Morton, hdanton
> On Jul 10, 2019, at 11:48 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jun 24, 2019 at 05:12:45PM -0700, Song Liu wrote:
>> This patch is (hopefully) the first step to enable THP for non-shmem
>> filesystems.
>>
>> This patch enables an application to put part of its text sections to THP
>> via madvise, for example:
>>
>> madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
>>
>> We tried to reuse the logic for THP on tmpfs.
>>
>> Currently, write is not supported for non-shmem THP. khugepaged will only
>> process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
>> (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
>> execve(). This requirement limits non-shmem THP to text sections.
>>
> The next patch will handle writes, which would only happen when all
>> the vmas with VM_DENYWRITE are unmapped.
>>
>> An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
>> feature.
>>
>> Acked-by: Rik van Riel <riel@surriel.com>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>
> This is really cool, and less invasive than I anticipated. Nice work.
>
> I only have one concern and one question:
>
>> @@ -1392,6 +1401,29 @@ static void collapse_file(struct mm_struct *mm,
>> result = SCAN_FAIL;
>> goto xa_unlocked;
>> }
>> + } else if (!page || xa_is_value(page)) {
>> + xas_unlock_irq(&xas);
>> + page_cache_sync_readahead(mapping, &file->f_ra, file,
>> + index, PAGE_SIZE);
>> + /* drain pagevecs to help isolate_lru_page() */
>> + lru_add_drain();
>> + page = find_lock_page(mapping, index);
>> + if (unlikely(page == NULL)) {
>> + result = SCAN_FAIL;
>> + goto xa_unlocked;
>> + }
>> + } else if (!PageUptodate(page)) {
>> + VM_BUG_ON(is_shmem);
>> + xas_unlock_irq(&xas);
>> + wait_on_page_locked(page);
>> + if (!trylock_page(page)) {
>> + result = SCAN_PAGE_LOCK;
>> + goto xa_unlocked;
>> + }
>> + get_page(page);
>> + } else if (!is_shmem && PageDirty(page)) {
>> + result = SCAN_FAIL;
>> + goto xa_locked;
>> } else if (trylock_page(page)) {
>> get_page(page);
>> xas_unlock_irq(&xas);
>
> The many else ifs here check fairly complex page state and are hard to
> follow and verify mentally. In fact, it's a bit easier now in the
> patch when you see how it *used* to work with just shmem, but the end
> result is fragile from a maintenance POV.
>
> The shmem and file cases have little in common - basically only the
> trylock_page(). Can you please make one big 'if (is_shmem) {} else {}'
> structure instead that keeps those two scenarios separate?
Good point! Will fix in next version.
>
>> @@ -1426,6 +1458,12 @@ static void collapse_file(struct mm_struct *mm,
>> goto out_unlock;
>> }
>>
>> + if (page_has_private(page) &&
>> + !try_to_release_page(page, GFP_KERNEL)) {
>> + result = SCAN_PAGE_HAS_PRIVATE;
>> + break;
>> + }
>> +
>> if (page_mapped(page))
>> unmap_mapping_pages(mapping, index, 1, false);
>
>> @@ -1607,6 +1658,17 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>> break;
>> }
>>
>> + if (page_has_private(page) && trylock_page(page)) {
>> + int ret;
>> +
>> + ret = try_to_release_page(page, GFP_KERNEL);
>> + unlock_page(page);
>> + if (!ret) {
>> + result = SCAN_PAGE_HAS_PRIVATE;
>> + break;
>> + }
>> + }
>> +
>> if (page_count(page) != 1 + page_mapcount(page)) {
>> result = SCAN_PAGE_COUNT;
>> break;
>
> There is already a try_to_release_page() inside the page lock section in
> collapse_file(). I'm assuming you added this one because private data
> affects the refcount. But it seems a bit overkill just for that; we
> could also still fail the check, in which case we'd have dropped the
> buffers in vain. Can you fix the check instead?
>
> There is an is_page_cache_freeable() function in vmscan.c that handles
> private fs references:
>
> static inline int is_page_cache_freeable(struct page *page)
> {
> 	/*
> 	 * A freeable page cache page is referenced only by the caller
> 	 * that isolated the page, the page cache and optional buffer
> 	 * heads at page->private.
> 	 */
> 	int page_cache_pins = PageTransHuge(page) && PageSwapCache(page) ?
> 		HPAGE_PMD_NR : 1;
> 	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
> }
>
> Wouldn't this work here as well?
Good point! Let me try fix this.
Thanks,
Song
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS
2019-06-25 0:12 ` [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS Song Liu
2019-07-10 18:48 ` Johannes Weiner
@ 2019-07-23 23:59 ` Huang, Kai
2019-07-28 6:41 ` Song Liu
1 sibling, 1 reply; 22+ messages in thread
From: Huang, Kai @ 2019-07-23 23:59 UTC (permalink / raw)
To: linux-kernel, linux-mm, songliubraving, linux-fsdevel
Cc: kirill.shutemov, matthew.wilcox, hdanton, kernel-team, akpm,
william.kucharski
On Mon, 2019-06-24 at 17:12 -0700, Song Liu wrote:
> This patch is (hopefully) the first step to enable THP for non-shmem
> filesystems.
>
> This patch enables an application to put part of its text sections to THP
> via madvise, for example:
>
> madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
>
> We tried to reuse the logic for THP on tmpfs.
>
> Currently, write is not supported for non-shmem THP. khugepaged will only
> process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
> (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
> execve(). This requirement limits non-shmem THP to text sections.
>
> The next patch will handle writes, which would only happen when all
> the vmas with VM_DENYWRITE are unmapped.
>
> An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
> feature.
>
> Acked-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> mm/Kconfig | 11 ++++++
> mm/filemap.c | 4 +--
> mm/khugepaged.c | 94 +++++++++++++++++++++++++++++++++++++++++--------
> mm/rmap.c | 12 ++++---
> 4 files changed, 100 insertions(+), 21 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f0c76ba47695..0a8fd589406d 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -762,6 +762,17 @@ config GUP_BENCHMARK
>
> See tools/testing/selftests/vm/gup_benchmark.c
>
> +config READ_ONLY_THP_FOR_FS
> + bool "Read-only THP for filesystems (EXPERIMENTAL)"
> + depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
Hi,
Maybe a stupid question since I am new, but why does it depend on SHMEM?
Thanks,
-Kai
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS
2019-07-23 23:59 ` Huang, Kai
@ 2019-07-28 6:41 ` Song Liu
0 siblings, 0 replies; 22+ messages in thread
From: Song Liu @ 2019-07-28 6:41 UTC (permalink / raw)
To: Huang, Kai
Cc: linux-kernel, linux-mm, linux-fsdevel, kirill.shutemov,
matthew.wilcox, hdanton, Kernel Team, akpm, william.kucharski
> On Jul 23, 2019, at 4:59 PM, Huang, Kai <kai.huang@intel.com> wrote:
>
> On Mon, 2019-06-24 at 17:12 -0700, Song Liu wrote:
>> This patch is (hopefully) the first step to enable THP for non-shmem
>> filesystems.
>>
>> This patch enables an application to put part of its text sections to THP
>> via madvise, for example:
>>
>> madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
>>
>> We tried to reuse the logic for THP on tmpfs.
>>
>> Currently, write is not supported for non-shmem THP. khugepaged will only
>> process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
>> (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
>> execve(). This requirement limits non-shmem THP to text sections.
>>
>> The next patch will handle writes, which would only happen when all
>> the vmas with VM_DENYWRITE are unmapped.
>>
>> An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
>> feature.
>>
>> Acked-by: Rik van Riel <riel@surriel.com>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> mm/Kconfig | 11 ++++++
>> mm/filemap.c | 4 +--
>> mm/khugepaged.c | 94 +++++++++++++++++++++++++++++++++++++++++--------
>> mm/rmap.c | 12 ++++---
>> 4 files changed, 100 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index f0c76ba47695..0a8fd589406d 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -762,6 +762,17 @@ config GUP_BENCHMARK
>>
>> See tools/testing/selftests/vm/gup_benchmark.c
>>
>> +config READ_ONLY_THP_FOR_FS
>> + bool "Read-only THP for filesystems (EXPERIMENTAL)"
>> + depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
>
> Hi,
>
> Maybe a stupid question since I am new, but why does it depend on SHMEM?
Not stupid at all. :)
We reuse a lot of code for shmem thp, thus the dependency. Technically, we
can remove the dependency. However, we will remove this config option when
THP for FS is more mature. So it doesn't make sense to resolve the
dependency at this stage.
Thanks,
Song
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
` (4 preceding siblings ...)
2019-06-25 0:12 ` [PATCH v9 5/6] mm,thp: add read-only THP support for (non-shmem) FS Song Liu
@ 2019-06-25 0:12 ` Song Liu
2019-06-27 13:18 ` Rik van Riel
2019-07-10 19:11 ` Johannes Weiner
2019-06-27 12:46 ` [PATCH v9 0/6] Enable THP for text section of non-shmem files Kirill A. Shutemov
6 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2019-06-25 0:12 UTC (permalink / raw)
To: linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton, Song Liu
In the previous patch, an application could put part of its text section
in THP via madvise(). These THPs are protected from writes while the
application is still running (TXTBSY). However, after the application
exits, the file becomes available for writes.
This patch avoids writes to file-backed THPs by dropping the page cache
for the file when it is opened for write. A new counter, nr_thps, is
added to struct address_space. In do_last(), if the file is open for
write and nr_thps is non-zero, we drop the page cache for the whole file.
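The gate this patch adds can be modeled in a few lines of userspace C. The names below are illustrative stand-ins for inode_is_open_for_write(), filemap_nr_thps() and truncate_pagecache(); the sketch only shows the intended behavior: opening for write a file whose mapping still holds THPs drops the whole file from the page cache, while THP-free files are untouched.

```c
#include <assert.h>

/* Hypothetical model of the write-open path: if the mapping still has
 * THPs (tracked by the nr_thps counter added in this patch), drop the
 * entire page cache for the file, sidestepping writeback of file THPs. */
struct mapping_model {
	int nr_thps;  /* models the atomic_t nr_thps counter */
	int nr_pages; /* total pages cached for the file */
};

static void open_for_write(struct mapping_model *m)
{
	if (m->nr_thps > 0) {
		/* models truncate_pagecache(inode, 0) */
		m->nr_pages = 0;
		m->nr_thps = 0;
	}
}
```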
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
fs/inode.c | 3 +++
fs/namei.c | 23 ++++++++++++++++++++++-
include/linux/fs.h | 32 ++++++++++++++++++++++++++++++++
mm/filemap.c | 1 +
mm/khugepaged.c | 4 +++-
5 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index df6542ec3b88..518113a4e219 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->flags = 0;
mapping->wb_err = 0;
atomic_set(&mapping->i_mmap_writable, 0);
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+ atomic_set(&mapping->nr_thps, 0);
+#endif
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
mapping->private_data = NULL;
mapping->writeback_index = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 20831c2fbb34..3d95e94029cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3249,6 +3249,23 @@ static int lookup_open(struct nameidata *nd, struct path *path,
return error;
}
+/*
+ * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
+ * it still has THP in page cache, drop the whole file from pagecache
+ * before processing writes. This helps us avoid handling write back of
+ * THP for now.
+ */
+static inline void release_file_thp(struct file *file)
+{
+ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) {
+ struct inode *inode = file_inode(file);
+
+ if (inode_is_open_for_write(inode) &&
+ filemap_nr_thps(inode->i_mapping))
+ truncate_pagecache(inode, 0);
+ }
+}
+
/*
* Handle the last step of open()
*/
@@ -3418,7 +3435,11 @@ static int do_last(struct nameidata *nd,
goto out;
opened:
error = ima_file_check(file, op->acc_mode);
- if (!error && will_truncate)
+ if (error)
+ goto out;
+
+ release_file_thp(file);
+ if (will_truncate)
error = handle_truncate(file);
out:
if (unlikely(error > 0)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..082fc581c7fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -427,6 +427,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
* @i_pages: Cached pages.
* @gfp_mask: Memory allocation flags to use for allocating pages.
* @i_mmap_writable: Number of VM_SHARED mappings.
+ * @nr_thps: Number of THPs in the pagecache (non-shmem only).
* @i_mmap: Tree of private and shared mappings.
* @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
* @nrpages: Number of page entries, protected by the i_pages lock.
@@ -444,6 +445,10 @@ struct address_space {
struct xarray i_pages;
gfp_t gfp_mask;
atomic_t i_mmap_writable;
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+ /* number of thp, only for non-shmem files */
+ atomic_t nr_thps;
+#endif
struct rb_root_cached i_mmap;
struct rw_semaphore i_mmap_rwsem;
unsigned long nrpages;
@@ -2790,6 +2795,33 @@ static inline errseq_t filemap_sample_wb_err(struct address_space *mapping)
return errseq_sample(&mapping->wb_err);
}
+static inline int filemap_nr_thps(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+ return atomic_read(&mapping->nr_thps);
+#else
+ return 0;
+#endif
+}
+
+static inline void filemap_nr_thps_inc(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+ atomic_inc(&mapping->nr_thps);
+#else
+ WARN_ON_ONCE(1);
+#endif
+}
+
+static inline void filemap_nr_thps_dec(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+ atomic_dec(&mapping->nr_thps);
+#else
+ WARN_ON_ONCE(1);
+#endif
+}
+
extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
extern int vfs_fsync(struct file *file, int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index e79ceccdc6df..a8e86c136381 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -205,6 +205,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
__dec_node_page_state(page, NR_SHMEM_THPS);
} else if (PageTransHuge(page)) {
__dec_node_page_state(page, NR_FILE_THPS);
+ filemap_nr_thps_dec(mapping);
}
/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index acbbbeaa083c..0bbc6be51197 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1503,8 +1503,10 @@ static void collapse_file(struct mm_struct *mm,
if (is_shmem)
__inc_node_page_state(new_page, NR_SHMEM_THPS);
- else
+ else {
__inc_node_page_state(new_page, NR_FILE_THPS);
+ filemap_nr_thps_inc(mapping);
+ }
if (nr_none) {
struct zone *zone = page_zone(new_page);
--
2.17.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache
2019-06-25 0:12 ` [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache Song Liu
@ 2019-06-27 13:18 ` Rik van Riel
2019-07-10 19:11 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2019-06-27 13:18 UTC (permalink / raw)
To: Song Liu, linux-mm, linux-fsdevel, linux-kernel
Cc: matthew.wilcox, kirill.shutemov, kernel-team, william.kucharski,
akpm, hdanton
On Mon, 2019-06-24 at 17:12 -0700, Song Liu wrote:
> In previous patch, an application could put part of its text section
> in
> THP via madvise(). These THPs will be protected from writes when the
> application is still running (TXTBSY). However, after the application
> exits, the file is available for writes.
>
> This patch avoids writes to file THP by dropping page cache for the
> file
> when the file is open for write. A new counter nr_thps is added to
> struct
> address_space. In do_last(), if the file is open for write and
> nr_thps
> is non-zero, we drop page cache for the whole file.
>
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache
2019-06-25 0:12 ` [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache Song Liu
2019-06-27 13:18 ` Rik van Riel
@ 2019-07-10 19:11 ` Johannes Weiner
1 sibling, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2019-07-10 19:11 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:46PM -0700, Song Liu wrote:
> In previous patch, an application could put part of its text section in
> THP via madvise(). These THPs will be protected from writes when the
> application is still running (TXTBSY). However, after the application
> exits, the file is available for writes.
>
> This patch avoids writes to file THP by dropping page cache for the file
> when the file is open for write. A new counter nr_thps is added to struct
> address_space. In do_last(), if the file is open for write and nr_thps
> is non-zero, we drop page cache for the whole file.
>
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> fs/inode.c | 3 +++
> fs/namei.c | 23 ++++++++++++++++++++++-
> include/linux/fs.h | 32 ++++++++++++++++++++++++++++++++
> mm/filemap.c | 1 +
> mm/khugepaged.c | 4 +++-
> 5 files changed, 61 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index df6542ec3b88..518113a4e219 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> mapping->flags = 0;
> mapping->wb_err = 0;
> atomic_set(&mapping->i_mmap_writable, 0);
> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> + atomic_set(&mapping->nr_thps, 0);
> +#endif
> mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> mapping->private_data = NULL;
> mapping->writeback_index = 0;
> diff --git a/fs/namei.c b/fs/namei.c
> index 20831c2fbb34..3d95e94029cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3249,6 +3249,23 @@ static int lookup_open(struct nameidata *nd, struct path *path,
> return error;
> }
>
> +/*
> + * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
> + * it still has THP in page cache, drop the whole file from pagecache
> + * before processing writes. This helps us avoid handling write back of
> + * THP for now.
> + */
> +static inline void release_file_thp(struct file *file)
> +{
> + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) {
> + struct inode *inode = file_inode(file);
> +
> + if (inode_is_open_for_write(inode) &&
> + filemap_nr_thps(inode->i_mapping))
> + truncate_pagecache(inode, 0);
> + }
> +}
> +
> /*
> * Handle the last step of open()
> */
> @@ -3418,7 +3435,11 @@ static int do_last(struct nameidata *nd,
> goto out;
> opened:
> error = ima_file_check(file, op->acc_mode);
> - if (!error && will_truncate)
> + if (error)
> + goto out;
> +
> + release_file_thp(file);
> + if (will_truncate)
> error = handle_truncate(file);
This would seem better placed in do_dentry_open(), where we're done
with the namespace operation and actually work against the inode.
Something roughly like this?
diff --git a/fs/open.c b/fs/open.c
index b5b80469b93d..cae893edbab6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -799,6 +799,11 @@ static int do_dentry_open(struct file *f,
 		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
 	}
+
+	/* XXX: Huge page cache doesn't support writing yet */
+	if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping))
+		truncate_pagecache(inode, 0);
+
 	return 0;
 cleanup_all:
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v9 0/6] Enable THP for text section of non-shmem files
2019-06-25 0:12 [PATCH v9 0/6] Enable THP for text section of non-shmem files Song Liu
` (5 preceding siblings ...)
2019-06-25 0:12 ` [PATCH v9 6/6] mm,thp: avoid writes to file with THP in pagecache Song Liu
@ 2019-06-27 12:46 ` Kirill A. Shutemov
6 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2019-06-27 12:46 UTC (permalink / raw)
To: Song Liu
Cc: linux-mm, linux-fsdevel, linux-kernel, matthew.wilcox,
kirill.shutemov, kernel-team, william.kucharski, akpm, hdanton
On Mon, Jun 24, 2019 at 05:12:40PM -0700, Song Liu wrote:
> Please share your comments and suggestions on this.
Looks like a great first step to THP in page cache. Thanks!
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
THP allocation in the fault path and write support are next goals.
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread