* [PATCH v2 0/2] mm,thp: Add filemap_huge_fault() for THP
@ 2019-07-29 21:09 William Kucharski
  2019-07-29 21:09 ` [PATCH v2 1/2] mm: Allow the page cache to allocate large pages William Kucharski
  2019-07-29 21:09 ` [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
  0 siblings, 2 replies; 11+ messages in thread
From: William Kucharski @ 2019-07-29 21:09 UTC (permalink / raw)
  To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
	netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
	Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
	Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
	Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
	Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
	John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
	Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
	Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
	David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
	Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
	Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
	Bartlomiej Zolnierkiewicz, Ross Zwisler

This set of patches is the first step towards a mechanism for automatically
mapping read-only text areas of appropriate size and alignment to THPs whenever
possible.

For now, the central routine, filemap_huge_fault(), and various support
routines are only included if the experimental kernel configuration option

        RO_EXEC_FILEMAP_HUGE_FAULT_THP

is enabled.

This is because filemap_huge_fault() is dependent upon the
address_space_operations vector readpage() pointing to a routine that
will read and fill an entire large page at a time, without polluting the
page cache with PAGESIZE entries for the large page being mapped and
without performing readahead that would pollute the page cache entries
for succeeding large pages. Unfortunately, there is no good way to
determine how many bytes were read by readpage(). At present, if
filemap_huge_fault() were to call a conventional readpage() routine, it
would only fill the first PAGESIZE bytes of the large page, which is
definitely NOT the desired behavior.
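
As an illustration of that contract (this sketch is not part of the patch
series, and fs_read_extent() is a made-up stand-in for a filesystem's real
I/O path), a readpage() implementation able to back filemap_huge_fault()
would need to look roughly like:

	static int fs_readpage(struct file *file, struct page *page)
	{
		loff_t off = page_offset(page);
		/* 0 for a normal page, HPAGE_PMD_ORDER for a THP */
		size_t len = PAGE_SIZE << compound_order(page);

		/*
		 * Issue one read covering the whole (possibly compound)
		 * page; the filesystem must mark the page Uptodate and
		 * unlock it when the I/O completes, just as it would for
		 * a PAGESIZE page.
		 */
		return fs_read_extent(file, page, off, len);
	}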

However, by making the code available now it is hoped that filesystem
maintainers who have pledged to provide such a mechanism will do so more
rapidly.

The first part of the patch adds an order parameter to __page_cache_alloc(),
allowing callers to directly request page cache pages of various sizes.
This code was provided by Matthew Wilcox.
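
With the new argument in place, a caller that wants a PMD-sized page cache
page can do roughly what filemap_huge_fault() in the second patch does
(here "mapping" stands for whichever address_space is being populated):

	struct page *page;

	page = __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_COMP |
		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
	if (page)
		prep_transhuge_page(page);

Existing callers simply pass an order of 0 and behave exactly as before.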

The second part of the patch implements the filemap_huge_fault() mechanism as
described above.
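
No filesystem is converted by this series; a filesystem whose readpage()
meets the requirement described above would be expected to opt in through
its vm_operations_struct, along these lines (illustrative only, the
structure name is invented):

	static const struct vm_operations_struct fs_file_vm_ops = {
		.fault		= filemap_fault,
		.huge_fault	= filemap_huge_fault,	/* added by this series */
		.map_pages	= filemap_map_pages,
		.page_mkwrite	= filemap_page_mkwrite,
	};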

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  mm: Allow the page cache to allocate large pages

William Kucharski (2):
  mm: Allow the page cache to allocate large pages
  mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/huge_mm.h |  16 +-
 include/linux/mm.h      |   6 +
 include/linux/pagemap.h |  13 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 322 ++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  36 ++++-
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   8 +
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 16 files changed, 404 insertions(+), 35 deletions(-)

-- 
2.21.0



* [PATCH v2 1/2] mm: Allow the page cache to allocate large pages
  2019-07-29 21:09 [PATCH v2 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
@ 2019-07-29 21:09 ` William Kucharski
  2019-07-29 22:03   ` Song Liu
  2019-07-29 21:09 ` [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
  1 sibling, 1 reply; 11+ messages in thread
From: William Kucharski @ 2019-07-29 21:09 UTC (permalink / raw)
  To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
	netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
	Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
	Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
	Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
	Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
	John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
	Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
	Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
	David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
	Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
	Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
	Bartlomiej Zolnierkiewicz, Ross Zwisler, kbuild test robot

Add an order parameter to __page_cache_alloc() to allow for the allocation
of large memory pages as page cache entries.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reported-by: kbuild test robot <lkp@intel.com>
---
 fs/afs/dir.c            |  2 +-
 fs/btrfs/compression.c  |  2 +-
 fs/cachefiles/rdwr.c    |  4 ++--
 fs/ceph/addr.c          |  2 +-
 fs/ceph/file.c          |  2 +-
 include/linux/pagemap.h | 13 +++++++++----
 mm/filemap.c            | 25 +++++++++++++------------
 mm/readahead.c          |  2 +-
 net/ceph/pagelist.c     |  4 ++--
 net/ceph/pagevec.c      |  2 +-
 10 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index e640d67274be..0a392214f71e 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key)
 				afs_stat_v(dvnode, n_inval);
 
 			ret = -ENOMEM;
-			req->pages[i] = __page_cache_alloc(gfp);
+			req->pages[i] = __page_cache_alloc(gfp, 0);
 			if (!req->pages[i])
 				goto error;
 			ret = add_to_page_cache_lru(req->pages[i],
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..5280e7477b7e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		}
 
 		page = __page_cache_alloc(mapping_gfp_constraint(mapping,
-								 ~__GFP_FS));
+								 ~__GFP_FS), 0);
 		if (!page)
 			break;
 
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 44a3ce1e4ce4..11d30212745f 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 			goto backing_page_already_present;
 
 		if (!newpage) {
-			newpage = __page_cache_alloc(cachefiles_gfp);
+			newpage = __page_cache_alloc(cachefiles_gfp, 0);
 			if (!newpage)
 				goto nomem_monitor;
 		}
@@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 				goto backing_page_already_present;
 
 			if (!newpage) {
-				newpage = __page_cache_alloc(cachefiles_gfp);
+				newpage = __page_cache_alloc(cachefiles_gfp, 0);
 				if (!newpage)
 					goto nomem;
 			}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index e078cc55b989..bcb41fbee533 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1707,7 +1707,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
 	} else {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page) {
 			err = -ENOMEM;
 			goto out;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..ae58d7c31aa4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		struct page *page = NULL;
 		loff_t i_size;
 		if (retry_op == READ_INLINE) {
-			page = __page_cache_alloc(GFP_KERNEL);
+			page = __page_cache_alloc(GFP_KERNEL, 0);
 			if (!page)
 				return -ENOMEM;
 		}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c7552459a15f..e9004e3cb6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -208,17 +208,17 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 #endif
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), 0);
 }
 
 static inline gfp_t readahead_gfp_mask(struct address_space *x)
@@ -240,6 +240,11 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
+/* If you add more flags, increment FGP_ORDER_SHIFT */
+#define	FGP_ORDER_SHIFT		7
+#define	FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+#define	FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+#define	fgp_get_order(fgp)	((fgp) >> FGP_ORDER_SHIFT)
 
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		int fgp_flags, gfp_t cache_gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..a96092243fc4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -954,7 +954,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
@@ -964,12 +964,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -1597,12 +1597,12 @@ EXPORT_SYMBOL(find_lock_entry);
  * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
- * @fgp_flags: PCG flags
+ * @fgp_flags: FGP flags
  * @gfp_mask: gfp mask to use for the page cache data page allocation
  *
  * Looks up the page cache slot at @mapping & @offset.
  *
- * PCG flags modify how the page is returned.
+ * FGP flags modify how the page is returned.
  *
  * @fgp_flags can be:
  *
@@ -1615,6 +1615,7 @@ EXPORT_SYMBOL(find_lock_entry);
  * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
  *   its own locking dance if the page is already in cache, or unlock the page
  *   before returning if we had to add the page to pagecache.
+ * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
  *
  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  * if the GFP flags specified for FGP_CREAT are atomic.
@@ -1660,12 +1661,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 no_page:
 	if (!page && (fgp_flags & FGP_CREAT)) {
 		int err;
-		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+		if ((fgp_flags & FGP_WRITE) &&
+			mapping_cap_account_dirty(mapping))
 			gfp_mask |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
 			gfp_mask &= ~__GFP_FS;
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(gfp_mask, fgp_get_order(fgp_flags));
 		if (!page)
 			return NULL;
 
@@ -2802,15 +2804,14 @@ static struct page *wait_on_page_read(struct page *page)
 static struct page *do_read_cache_page(struct address_space *mapping,
 				pgoff_t index,
 				int (*filler)(void *, struct page *),
-				void *data,
-				gfp_t gfp)
+				void *data, unsigned int order, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		page = __page_cache_alloc(gfp);
+		page = __page_cache_alloc(gfp, order);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2917,7 +2918,7 @@ struct page *read_cache_page(struct address_space *mapping,
 				int (*filler)(void *, struct page *),
 				void *data)
 {
-	return do_read_cache_page(mapping, index, filler, data,
+	return do_read_cache_page(mapping, index, filler, data, 0,
 			mapping_gfp_mask(mapping));
 }
 EXPORT_SYMBOL(read_cache_page);
@@ -2939,7 +2940,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index,
 				gfp_t gfp)
 {
-	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
+	return do_read_cache_page(mapping, index, NULL, NULL, 0, gfp);
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..954760a612ea 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -193,7 +193,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 			continue;
 		}
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(gfp_mask, 0);
 		if (!page)
 			break;
 		page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 65e34f78b05d..0c3face908dc 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -56,7 +56,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -107,7 +107,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 64305e7056a1..1d07e639216d 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -45,7 +45,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, 0);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
2.21.0



* [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-29 21:09 [PATCH v2 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
  2019-07-29 21:09 ` [PATCH v2 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-07-29 21:09 ` William Kucharski
  2019-07-29 22:47   ` Dan Williams
  2019-07-29 22:51   ` Song Liu
  1 sibling, 2 replies; 11+ messages in thread
From: William Kucharski @ 2019-07-29 21:09 UTC (permalink / raw)
  To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
	netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
	Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
	Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
	Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
	Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
	John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
	Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
	Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
	David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
	Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
	Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
	Bartlomiej Zolnierkiewicz, Ross Zwisler

Add filemap_huge_fault() to attempt to satisfy page faults on
memory-mapped read-only text pages using THP when possible.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
---
 include/linux/huge_mm.h |  16 ++-
 include/linux/mm.h      |   6 +
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 299 +++++++++++++++++++++++++++++++++++++++-
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  36 ++++-
 mm/rmap.c               |   8 ++
 7 files changed, 373 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 45ede62aa85b..34723f7e75d0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -79,13 +79,15 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
-#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
-
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
-#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PMD_SHIFT		PMD_SHIFT
+#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
+#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
+#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
+
+#define HPAGE_PUD_SHIFT		PUD_SHIFT
+#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
+#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
+#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))
 
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..ba24b515468a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2433,6 +2433,12 @@ extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern vm_fault_t filemap_fault(struct vm_fault *vmf);
+
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+			enum page_entry_size pe_size);
+#endif
+
 extern void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..2debaded0e4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
+
+	help
+	    Introduce filemap_huge_fault() to automatically map executable
+	    read-only pages of mapped files of suitable size and alignment
+	    using THP if possible.
+
+	    This is marked experimental because it is a new feature and is
+	    dependent upon filesystems implementing readpage() in a way
+	    that will recognize large THP pages and read file content into
+	    them without polluting the pagecache with PAGESIZE pages due
+	    to readahead.
+
 endmenu
diff --git a/mm/filemap.c b/mm/filemap.c
index a96092243fc4..4e7287db0d8e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -199,6 +199,8 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	nr = hpage_nr_pages(page);
 
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+
+#ifndef	CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
 	if (PageSwapBacked(page)) {
 		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
 		if (PageTransHuge(page))
@@ -206,6 +208,13 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	} else {
 		VM_BUG_ON_PAGE(PageTransHuge(page), page);
 	}
+#else
+	if (PageSwapBacked(page))
+		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
+
+	if (PageTransHuge(page))
+		__dec_node_page_state(page, NR_SHMEM_THPS);
+#endif
 
 	/*
 	 * At this point page must be either written or cleaned by
@@ -1615,7 +1624,7 @@ EXPORT_SYMBOL(find_lock_entry);
  * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
  *   its own locking dance if the page is already in cache, or unlock the page
  *   before returning if we had to add the page to pagecache.
- * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
+ * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page
  *
  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  * if the GFP flags specified for FGP_CREAT are atomic.
@@ -2642,6 +2651,291 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+/*
+ * Check for an entry in the page cache which would conflict with the address
+ * range we wish to map using a THP or is otherwise unusable to map a large
+ * cached page.
+ *
+ * The routine will return true if a usable page is found in the page cache
+ * (and *pagep will be set to the address of the cached page), or if no
+ * cached page is found (and *pagep will be set to NULL).
+ */
+static bool
+filemap_huge_check_pagecache_usable(struct xa_state *xasp,
+	struct page **pagep, pgoff_t hindex, pgoff_t hindex_max)
+{
+	struct page *page;
+
+	while (1) {
+		page = xas_find(xasp, hindex_max);
+
+		if (xas_retry(xasp, page)) {
+			xas_set(xasp, hindex);
+			continue;
+		}
+
+		/*
+		 * A found entry is unusable if:
+		 *	+ the entry is an Xarray value, not a pointer
+		 *	+ the entry is an internal Xarray node
+		 *	+ the entry is not a Transparent Huge Page
+		 *	+ the entry is not a compound page
+		 *	+ the entry is not the head of a compound page
+		 *	+ the entry is a page with an order other than
+		 *	  HPAGE_PMD_ORDER
+		 *	+ the page's index is not what we expect it to be
+		 *	+ the page is not up-to-date
+		 *	+ the page is unlocked
+		 */
+		if ((page) && (xa_is_value(page) || xa_is_internal(page) ||
+			(!PageCompound(page)) || (PageHuge(page)) ||
+			(!PageTransCompound(page)) ||
+			page != compound_head(page) ||
+			compound_order(page) != HPAGE_PMD_ORDER ||
+			page->index != hindex || (!PageUptodate(page)) ||
+			(!PageLocked(page))))
+			return false;
+
+		break;
+	}
+
+	xas_set(xasp, hindex);
+	*pagep = page;
+	return true;
+}
+
+/**
+ * filemap_huge_fault - read in file data for page fault handling to THP
+ * @vmf:	struct vm_fault containing details of the fault
+ * @pe_size:	large page size to map, currently this must be PE_SIZE_PMD
+ *
+ * filemap_huge_fault() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data to a transparent huge page during
+ * a page fault.
+ *
+ * If for any reason we can't allocate a THP, map it or add it to the page
+ * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
+ * handler to try mapping the page using a PAGESIZE page, usually via
+ * filemap_fault() if so specified in the vma operations vector.
+ *
+ * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
+ * to map the new THP.
+ *
+ * NOTE: This routine depends upon the file system's readpage routine as
+ *       specified in the address space operations vector to recognize when it
+ *	 is being passed a large page and to read the appropriate amount of data
+ *	 in full and without polluting the page cache for the large page itself
+ *	 with PAGESIZE pages to perform a buffered read or to pollute what
+ *	 would be the page cache space for any succeeding pages with PAGESIZE
+ *	 pages due to readahead.
+ *
+ *	 It is VITAL that this routine not be enabled without such filesystem
+ *	 support. As there is no way to determine how many bytes were read by
+ *	 the readpage() operation, if only a PAGESIZE page is read, this routine
+ *	 will map the THP containing only the first PAGESIZE bytes of file data
+ *	 to satisfy the fault, which is never the result desired.
+ */
+vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+		enum page_entry_size pe_size)
+{
+	struct file *filp = vmf->vma->vm_file;
+	struct address_space *mapping = filp->f_mapping;
+	struct vm_area_struct *vma = vmf->vma;
+
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR);
+	pgoff_t hindex_max = hindex + HPAGE_PMD_NR;
+
+	struct page *cached_page, *hugepage;
+	struct page *new_page = NULL;
+
+	vm_fault_t ret = VM_FAULT_FALLBACK;
+	int error;
+
+	XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER);
+
+	/*
+	 * Return VM_FAULT_FALLBACK if:
+	 *
+	 *	+ pe_size != PE_SIZE_PMD
+	 *	+ FAULT_FLAG_WRITE is set in vmf->flags
+	 *	+ vma isn't aligned to allow a PMD mapping
+	 *	+ PMD would extend beyond the end of the vma
+	 */
+	if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) ||
+		(haddr < vma->vm_start ||
+		(haddr + HPAGE_PMD_SIZE > vma->vm_end)))
+		return ret;
+
+	xas_lock_irq(&xas);
+
+retry_xas_locked:
+	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+		hindex_max)) {
+		/* found a conflicting entry in the page cache, so fallback */
+		goto unlock;
+	} else if (cached_page) {
+		/* found a valid cached page, so map it */
+		hugepage = cached_page;
+		goto map_huge;
+	}
+
+	xas_unlock_irq(&xas);
+
+	/* allocate huge THP page in VMA */
+	new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
+		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
+
+	if (unlikely(!new_page))
+		return ret;
+
+	if (unlikely(!(PageCompound(new_page)))) {
+		put_page(new_page);
+		return ret;
+	}
+
+	prep_transhuge_page(new_page);
+	new_page->index = hindex;
+	new_page->mapping = mapping;
+
+	__SetPageLocked(new_page);
+
+	/*
+	 * The readpage() operation below is expected to fill the large
+	 * page with data without polluting the page cache with
+	 * PAGESIZE entries due to a buffered read and/or readahead().
+	 *
+	 * A filesystem's vm_operations_struct huge_fault field should
+	 * never point to this routine without such a capability, and
+	 * without it a call to this routine would eventually just
+	 * fall through to the normal fault op anyway.
+	 */
+	error = mapping->a_ops->readpage(vmf->vma->vm_file, new_page);
+
+	if (unlikely(error)) {
+		put_page(new_page);
+		return ret;
+	}
+
+	/* XXX - use wait_on_page_locked_killable() instead? */
+	wait_on_page_locked(new_page);
+
+	if (!PageUptodate(new_page)) {
+		/* EIO */
+		new_page->mapping = NULL;
+		put_page(new_page);
+		return ret;
+	}
+
+	do {
+		xas_lock_irq(&xas);
+		xas_set(&xas, hindex);
+		xas_create_range(&xas);
+
+		if (!(xas_error(&xas)))
+			break;
+
+		if (!xas_nomem(&xas, GFP_KERNEL)) {
+			if (new_page) {
+				new_page->mapping = NULL;
+				put_page(new_page);
+			}
+
+			goto unlock;
+		}
+
+		xas_unlock_irq(&xas);
+	} while (1);
+
+	/*
+	 * Double check that an entry did not sneak into the page cache while
+	 * creating Xarray entries for the new page.
+	 */
+	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+		hindex_max)) {
+		/*
+		 * An unusable entry was found, so delete the newly allocated
+		 * page and fallback.
+		 */
+		new_page->mapping = NULL;
+		put_page(new_page);
+		goto unlock;
+	} else if (cached_page) {
+		/*
+		 * A valid large page was found in the page cache, so free the
+		 * newly allocated page and map the cached page instead.
+		 */
+		new_page->mapping = NULL;
+		put_page(new_page);
+		new_page = NULL;
+		hugepage = cached_page;
+		goto map_huge;
+	}
+
+	__SetPageLocked(new_page);
+
+	/* did it get truncated? */
+	if (unlikely(new_page->mapping != mapping)) {
+		unlock_page(new_page);
+		put_page(new_page);
+		goto retry_xas_locked;
+	}
+
+	hugepage = new_page;
+
+map_huge:
+	/* map hugepage at the PMD level */
+	ret = alloc_set_pte(vmf, NULL, hugepage);
+
+	VM_BUG_ON_PAGE((!(pmd_trans_huge(*vmf->pmd))), hugepage);
+
+	if (likely(!(ret & VM_FAULT_ERROR))) {
+		/*
+		 * The alloc_set_pte() succeeded without error, so
+		 * add the page to the page cache if it is new, and
+		 * increment page statistics accordingly.
+		 */
+		if (new_page) {
+			unsigned long nr;
+
+			xas_set(&xas, hindex);
+
+			for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
+#ifndef	COMPOUND_PAGES_HEAD_ONLY
+				xas_store(&xas, new_page + nr);
+#else
+				xas_store(&xas, new_page);
+#endif
+				xas_next(&xas);
+			}
+
+			count_vm_event(THP_FILE_ALLOC);
+			__inc_node_page_state(new_page, NR_SHMEM_THPS);
+			__mod_node_page_state(page_pgdat(new_page),
+				NR_FILE_PAGES, HPAGE_PMD_NR);
+			__mod_node_page_state(page_pgdat(new_page),
+				NR_SHMEM, HPAGE_PMD_NR);
+		}
+
+		vmf->address = haddr;
+		vmf->page = hugepage;
+
+		page_ref_add(hugepage, HPAGE_PMD_NR);
+		count_vm_event(THP_FILE_MAPPED);
+	} else if (new_page) {
+		/* there was an error mapping the new page, so release it */
+		new_page->mapping = NULL;
+		put_page(new_page);
+	}
+
+unlock:
+	xas_unlock_irq(&xas);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_huge_fault);
+#endif
+
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -2924,7 +3218,8 @@ struct page *read_cache_page(struct address_space *mapping,
 EXPORT_SYMBOL(read_cache_page);
 
 /**
- * read_cache_page_gfp - read into page cache, using specified page allocation flags.
+ * read_cache_page_gfp - read into page cache, using specified page allocation
+ *			 flags.
  * @mapping:	the page's address_space
  * @index:	the page index
  * @gfp:	the page allocator flags to use if allocating
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1334ede667a8..26d74466d1f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -543,8 +543,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 
 	if (addr)
 		goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
+#endif
 
 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
 	if (addr)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..96ff80d2a8fb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1391,6 +1391,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	unsigned long vm_maywrite = VM_MAYWRITE;
+#endif
+
 	*populate = 0;
 
 	if (!len)
@@ -1429,7 +1433,33 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	/*
+	 * If THP is enabled, it's a read-only executable that is
+	 * MAP_PRIVATE mapped, the length is larger than a PMD page
+	 * and either it's not a MAP_FIXED mapping or the passed address is
+	 * properly aligned for a PMD page, attempt to get an appropriate
+	 * address at which to map a PMD-sized THP page, otherwise call the
+	 * normal routine.
+	 */
+	if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
+		(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
+		(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE &&
+		(!(addr & HPAGE_PMD_OFFSET))) {
+		addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
+
+		if (addr && (!(addr & HPAGE_PMD_OFFSET)))
+			vm_maywrite = 0;
+		else
+			addr = get_unmapped_area(file, addr, len, pgoff, flags);
+	} else {
+#endif
+		addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	}
+#endif
+
 	if (offset_in_page(addr))
 		return addr;
 
@@ -1451,7 +1481,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * of the memory object, so we don't do any here.
 	 */
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+			mm->def_flags | VM_MAYREAD | vm_maywrite | VM_MAYEXEC;
+#else
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+#endif
 
 	if (flags & MAP_LOCKED)
 		if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..503612d3b52b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1192,7 +1192,11 @@ void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
+
 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
@@ -1232,7 +1236,11 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
+
 		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
-- 
2.21.0



* Re: [PATCH v2 1/2] mm: Allow the page cache to allocate large pages
  2019-07-29 21:09 ` [PATCH v2 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-07-29 22:03   ` Song Liu
  2019-07-30 20:26     ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Song Liu @ 2019-07-29 22:03 UTC (permalink / raw)
  To: William Kucharski
  Cc: ceph-devel, linux-afs, linux-btrfs, lkml, Linux-MM, Networking,
	Chris Mason, David S. Miller, David Sterba, Josef Bacik,
	Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Matthew Wilcox, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler, kbuild test robot



> On Jul 29, 2019, at 2:09 PM, William Kucharski <william.kucharski@oracle.com> wrote:
> 

I guess we need "From: Matthew Wilcox <willy@infradead.org>" here?

> Add an order parameter to __page_cache_alloc() to allow for the allocation
> of large memory pages as page cache entries.
> 
> Signed-off-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> ---
> fs/afs/dir.c            |  2 +-
> fs/btrfs/compression.c  |  2 +-
> fs/cachefiles/rdwr.c    |  4 ++--
> fs/ceph/addr.c          |  2 +-
> fs/ceph/file.c          |  2 +-
> include/linux/pagemap.h | 13 +++++++++----
> mm/filemap.c            | 25 +++++++++++++------------
> mm/readahead.c          |  2 +-
> net/ceph/pagelist.c     |  4 ++--
> net/ceph/pagevec.c      |  2 +-
> 10 files changed, 32 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index e640d67274be..0a392214f71e 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key)
> 				afs_stat_v(dvnode, n_inval);
> 
> 			ret = -ENOMEM;
> -			req->pages[i] = __page_cache_alloc(gfp);
> +			req->pages[i] = __page_cache_alloc(gfp, 0);
> 			if (!req->pages[i])
> 				goto error;
> 			ret = add_to_page_cache_lru(req->pages[i],
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 60c47b417a4b..5280e7477b7e 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
> 		}
> 
> 		page = __page_cache_alloc(mapping_gfp_constraint(mapping,
> -								 ~__GFP_FS));
> +								 ~__GFP_FS), 0);
> 		if (!page)
> 			break;
> 
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index 44a3ce1e4ce4..11d30212745f 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
> 			goto backing_page_already_present;
> 
> 		if (!newpage) {
> -			newpage = __page_cache_alloc(cachefiles_gfp);
> +			newpage = __page_cache_alloc(cachefiles_gfp, 0);
> 			if (!newpage)
> 				goto nomem_monitor;
> 		}
> @@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
> 				goto backing_page_already_present;
> 
> 			if (!newpage) {
> -				newpage = __page_cache_alloc(cachefiles_gfp);
> +				newpage = __page_cache_alloc(cachefiles_gfp, 0);
> 				if (!newpage)
> 					goto nomem;
> 			}
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index e078cc55b989..bcb41fbee533 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1707,7 +1707,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
> 		if (len > PAGE_SIZE)
> 			len = PAGE_SIZE;
> 	} else {
> -		page = __page_cache_alloc(GFP_NOFS);
> +		page = __page_cache_alloc(GFP_NOFS, 0);
> 		if (!page) {
> 			err = -ENOMEM;
> 			goto out;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 685a03cc4b77..ae58d7c31aa4 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 		struct page *page = NULL;
> 		loff_t i_size;
> 		if (retry_op == READ_INLINE) {
> -			page = __page_cache_alloc(GFP_KERNEL);
> +			page = __page_cache_alloc(GFP_KERNEL, 0);
> 			if (!page)
> 				return -ENOMEM;
> 		}
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c7552459a15f..e9004e3cb6a3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -208,17 +208,17 @@ static inline int page_cache_add_speculative(struct page *page, int count)
> }
> 
> #ifdef CONFIG_NUMA
> -extern struct page *__page_cache_alloc(gfp_t gfp);
> +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
> #else
> -static inline struct page *__page_cache_alloc(gfp_t gfp)
> +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
> {
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
> }
> #endif
> 
> static inline struct page *page_cache_alloc(struct address_space *x)
> {
> -	return __page_cache_alloc(mapping_gfp_mask(x));
> +	return __page_cache_alloc(mapping_gfp_mask(x), 0);
> }
> 
> static inline gfp_t readahead_gfp_mask(struct address_space *x)
> @@ -240,6 +240,11 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> #define FGP_NOFS		0x00000010
> #define FGP_NOWAIT		0x00000020
> #define FGP_FOR_MMAP		0x00000040
> +/* If you add more flags, increment FGP_ORDER_SHIFT */
> +#define	FGP_ORDER_SHIFT		7
> +#define	FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> +#define	FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> +#define	fgp_get_order(fgp)	((fgp) >> FGP_ORDER_SHIFT)

This looks like we want to support orders up to 25 (32 - 7). I guess we don't
need that many. How about we specify the highest order to support here?

Also, fgp_flags is a signed int, so we need to make sure fgp_flags is not
negative.
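
One possible shape for that, purely as an illustration (the _BITS/_MASK
names are invented):

	#define FGP_ORDER_SHIFT		7
	#define FGP_ORDER_BITS		5	/* orders 0..31 */
	#define FGP_ORDER_MASK		(((1 << FGP_ORDER_BITS) - 1) << FGP_ORDER_SHIFT)
	/* mask first so a negative fgp_flags cannot produce a bogus order */
	#define fgp_get_order(fgp)	(((unsigned int)(fgp) & FGP_ORDER_MASK) >> FGP_ORDER_SHIFT)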

[..]



* Re: [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-29 21:09 ` [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
@ 2019-07-29 22:47   ` Dan Williams
  2019-07-30 19:18     ` Matthew Wilcox
  2019-07-29 22:51   ` Song Liu
  1 sibling, 1 reply; 11+ messages in thread
From: Dan Williams @ 2019-07-29 22:47 UTC (permalink / raw)
  To: William Kucharski
  Cc: ceph-devel, linux-afs, linux-btrfs, Linux Kernel Mailing List,
	Linux MM, Netdev, Chris Mason, David S. Miller, David Sterba,
	Josef Bacik, Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	Chad Mynhier, Kirill A. Shutemov, Johannes Weiner,
	Matthew Wilcox, Dave Airlie, Vlastimil Babka, Keith Busch,
	Ralph Campbell, Steve Capper, Dave Chinner, Sean Christopherson,
	Hugh Dickins, Ilya Dryomov, Alexander Duyck, Thomas Gleixner,
	Jérôme Glisse, Amir Goldstein, Jason Gunthorpe,
	Michal Hocko, Jann Horn, David Howells, John Hubbard,
	Souptick Joarder, john.hubbard, Jan Kara, Andrey Konovalov,
	Arun KS, Aneesh Kumar K.V, Jeff Layton, Yangtao Li,
	Andrew Morton, Robin Murphy, Mike Rapoport, David Rientjes,
	Andrey Ryabinin, Yafang Shao, Huang Shijie, Yang Shi,
	Miklos Szeredi, Pavel Tatashin, Kirill Tkhai, Sage Weil,
	Ira Weiny, Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler

On Mon, Jul 29, 2019 at 2:10 PM William Kucharski
<william.kucharski@oracle.com> wrote:
>
> Add filemap_huge_fault() to attempt to satisfy page faults on
> memory-mapped read-only text pages using THP when possible.
>
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
[..]
> +/**
> + * filemap_huge_fault - read in file data for page fault handling to THP
> + * @vmf:       struct vm_fault containing details of the fault
> + * @pe_size:   large page size to map, currently this must be PE_SIZE_PMD
> + *
> + * filemap_huge_fault() is invoked via the vma operations vector for a
> + * mapped memory region to read in file data to a transparent huge page during
> + * a page fault.
> + *
> + * If for any reason we can't allocate a THP, map it or add it to the page
> + * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
> + * handler to try mapping the page using a PAGESIZE page, usually via
> + * filemap_fault() if so specified in the vma operations vector.
> + *
> + * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
> + * to map the new THP.
> + *
> + * NOTE: This routine depends upon the file system's readpage routine as
> + *       specified in the address space operations vector to recognize when it
> + *      is being passed a large page and to read the appropriate amount of data
> + *      in full and without polluting the page cache for the large page itself
> + *      with PAGESIZE pages to perform a buffered read or to pollute what
> + *      would be the page cache space for any succeeding pages with PAGESIZE
> + *      pages due to readahead.
> + *
> + *      It is VITAL that this routine not be enabled without such filesystem
> + *      support.

Rather than a hopeful comment, this wants an explicit mechanism to
prevent inadvertent mismatched ->readpage() assumptions. Either a new
->readhugepage() op, or a flags field in 'struct
address_space_operations' indicating that the address_space opts into
being careful to handle huge page arguments. I.e. something like
mmap_supported_flags that was added to 'struct file_operations'.
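
For example, a rough sketch only (the field and flag names here are
invented, not existing kernel interfaces):

	#define AOPS_HUGE_READPAGE	(1U << 0)	/* ->readpage() fills compound pages in full */

	struct address_space_operations {
		unsigned int flags;			/* new field */
		int (*readpage)(struct file *, struct page *);
		/* ... remaining ops unchanged ... */
	};

	/* ...and filemap_huge_fault() would refuse to run without the flag: */
	if (!(mapping->a_ops->flags & AOPS_HUGE_READPAGE))
		return VM_FAULT_FALLBACK;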



* Re: [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-29 21:09 ` [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
  2019-07-29 22:47   ` Dan Williams
@ 2019-07-29 22:51   ` Song Liu
  2019-07-30 14:11     ` William Kucharski
  1 sibling, 1 reply; 11+ messages in thread
From: Song Liu @ 2019-07-29 22:51 UTC (permalink / raw)
  To: William Kucharski
  Cc: ceph-devel, linux-afs, linux-btrfs, lkml, Linux-MM, Networking,
	Chris Mason, David S. Miller, David Sterba, Josef Bacik,
	Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Matthew Wilcox, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler



> On Jul 29, 2019, at 2:09 PM, William Kucharski <william.kucharski@oracle.com> wrote:
> 
> Add filemap_huge_fault() to attempt to satisfy page faults on
> memory-mapped read-only text pages using THP when possible.

I think this 2/2 doesn't need pagecache_get_page() changes in 1/2. 
Maybe we can split pagecache_get_page() related changes out?

> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> ---
> include/linux/huge_mm.h |  16 ++-
> include/linux/mm.h      |   6 +
> mm/Kconfig              |  15 ++
> mm/filemap.c            | 299 +++++++++++++++++++++++++++++++++++++++-
> mm/huge_memory.c        |   3 +
> mm/mmap.c               |  36 ++++-
> mm/rmap.c               |   8 ++
> 7 files changed, 373 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 45ede62aa85b..34723f7e75d0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -79,13 +79,15 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define HPAGE_PMD_SHIFT PMD_SHIFT
> -#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
> -#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
> -
> -#define HPAGE_PUD_SHIFT PUD_SHIFT
> -#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
> -#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))

> +#define HPAGE_PMD_SHIFT		PMD_SHIFT
> +#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
> +#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
          ^ space vs. tab difference here. 

> +#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
> +
> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
> +#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
> +#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))

Should HPAGE_PMD_OFFSET and HPAGE_PUD_OFFSET include bits for 
PAGE_OFFSET? I guess we can just keep huge_mm.h as-is and use
~HPAGE_PMD_MASK. 
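
(Both spellings expand to addr & (HPAGE_PMD_SIZE - 1), e.g.:

	bool misaligned_new = addr & HPAGE_PMD_OFFSET;	/* with the added macro */
	bool misaligned_old = addr & ~HPAGE_PMD_MASK;	/* existing idiom */

assuming addr is an unsigned long.)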

> 
> extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..ba24b515468a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,12 @@ extern void truncate_inode_pages_final(struct address_space *);
> 
> /* generic vm_area_ops exported for stackable file systems */
> extern vm_fault_t filemap_fault(struct vm_fault *vmf);
> +
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
> +			enum page_entry_size pe_size);
> +#endif
> +
> extern void filemap_map_pages(struct vm_fault *vmf,
> 		pgoff_t start_pgoff, pgoff_t end_pgoff);
> extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 56cec636a1fc..2debaded0e4d 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL
> config ARCH_HAS_HUGEPD
> 	bool
> 
> +config RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
> +
> +	help
> +	    Introduce filemap_huge_fault() to automatically map executable
> +	    read-only pages of mapped files of suitable size and alignment
> +	    using THP if possible.
> +
> +	    This is marked experimental because it is a new feature and is
> +	    dependent upon filesystems implementing readpage() in a way
> +	    that will recognize large THP pages and read file content into
> +	    them without polluting the pagecache with PAGESIZE pages due
> +	    to readahead.
> +
> endmenu
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a96092243fc4..4e7287db0d8e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -199,6 +199,8 @@ static void unaccount_page_cache_page(struct address_space *mapping,
> 	nr = hpage_nr_pages(page);
> 
> 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
> +
> +#ifndef	CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 	if (PageSwapBacked(page)) {
> 		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
> 		if (PageTransHuge(page))
> @@ -206,6 +208,13 @@ static void unaccount_page_cache_page(struct address_space *mapping,
> 	} else {
> 		VM_BUG_ON_PAGE(PageTransHuge(page), page);
> 	}
> +#else
> +	if (PageSwapBacked(page))
> +		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
> +
> +	if (PageTransHuge(page))
> +		__dec_node_page_state(page, NR_SHMEM_THPS);
> +#endif
> 
> 	/*
> 	 * At this point page must be either written or cleaned by
> @@ -1615,7 +1624,7 @@ EXPORT_SYMBOL(find_lock_entry);
>  * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
>  *   its own locking dance if the page is already in cache, or unlock the page
>  *   before returning if we had to add the page to pagecache.
> - * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
> + * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page

I think we haven't used FGP_PMD yet? 

>  *
>  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
>  * if the GFP flags specified for FGP_CREAT are atomic.
> @@ -2642,6 +2651,291 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_fault);
> 
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +/*
> + * Check for an entry in the page cache which would conflict with the address
> + * range we wish to map using a THP or is otherwise unusable to map a large
> + * cached page.
> + *
> + * The routine will return true if a usable page is found in the page cache
> + * (and *pagep will be set to the address of the cached page), or if no
> + * cached page is found (and *pagep will be set to NULL).
> + */
> +static bool
> +filemap_huge_check_pagecache_usable(struct xa_state *xasp,
> +	struct page **pagep, pgoff_t hindex, pgoff_t hindex_max)

We have been using the name "xas" for "struct xa_state *". Let's keep using it?

> +{
> +	struct page *page;
> +
> +	while (1) {
> +		page = xas_find(xasp, hindex_max);
> +
> +		if (xas_retry(xasp, page)) {
> +			xas_set(xasp, hindex);
> +			continue;
> +		}
> +
> +		/*
> +		 * A found entry is unusable if:
> +		 *	+ the entry is an Xarray value, not a pointer
> +		 *	+ the entry is an internal Xarray node
> +		 *	+ the entry is not a Transparent Huge Page
> +		 *	+ the entry is not a compound page
> +		 *	+ the entry is not the head of a compound page
> +		 *	+ the entry is a page with an order other than
> +		 *	  HPAGE_PMD_ORDER
> +		 *	+ the page's index is not what we expect it to be
> +		 *	+ the page is not up-to-date
> +		 *	+ the page is unlocked
> +		 */
> +		if ((page) && (xa_is_value(page) || xa_is_internal(page) ||
> +			(!PageCompound(page)) || (PageHuge(page)) ||
> +			(!PageTransCompound(page)) ||
> +			page != compound_head(page) ||
> +			compound_order(page) != HPAGE_PMD_ORDER ||
> +			page->index != hindex || (!PageUptodate(page)) ||
> +			(!PageLocked(page))))
> +			return false;
> +
> +		break;
> +	}
> +
> +	xas_set(xasp, hindex);
> +	*pagep = page;
> +	return true;
> +}
> +
> +/**
> + * filemap_huge_fault - read in file data for page fault handling to THP
> + * @vmf:	struct vm_fault containing details of the fault
> + * @pe_size:	large page size to map, currently this must be PE_SIZE_PMD
> + *
> + * filemap_huge_fault() is invoked via the vma operations vector for a
> + * mapped memory region to read in file data to a transparent huge page during
> + * a page fault.
> + *
> + * If for any reason we can't allocate a THP, map it or add it to the page
> + * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
> + * handler to try mapping the page using a PAGESIZE page, usually via
> + * filemap_fault() if so specified in the vma operations vector.
> + *
> + * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
> + * to map the new THP.
> + *
> + * NOTE: This routine depends upon the file system's readpage routine as
> + *       specified in the address space operations vector to recognize when it
> + *	 is being passed a large page and to read the appropriate amount of data
> + *	 in full and without polluting the page cache for the large page itself
> + *	 with PAGESIZE pages to perform a buffered read or to pollute what
> + *	 would be the page cache space for any succeeding pages with PAGESIZE
> + *	 pages due to readahead.
> + *
> + *	 It is VITAL that this routine not be enabled without such filesystem
> + *	 support. As there is no way to determine how many bytes were read by
> + *	 the readpage() operation, if only a PAGESIZE page is read, this routine
> + *	 will map the THP containing only the first PAGESIZE bytes of file data
> + *	 to satisfy the fault, which is never the result desired.
> + */
> +vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
> +		enum page_entry_size pe_size)
> +{
> +	struct file *filp = vmf->vma->vm_file;
> +	struct address_space *mapping = filp->f_mapping;
> +	struct vm_area_struct *vma = vmf->vma;
> +
> +	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR);
> +	pgoff_t hindex_max = hindex + HPAGE_PMD_NR;
> +
> +	struct page *cached_page, *hugepage;
> +	struct page *new_page = NULL;
> +
> +	vm_fault_t ret = VM_FAULT_FALLBACK;
> +	int error;
> +
> +	XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER);
> +
> +	/*
> +	 * Return VM_FAULT_FALLBACK if:
> +	 *
> +	 *	+ pe_size != PE_SIZE_PMD
> +	 *	+ FAULT_FLAG_WRITE is set in vmf->flags
> +	 *	+ vma isn't aligned to allow a PMD mapping
> +	 *	+ PMD would extend beyond the end of the vma
> +	 */
> +	if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) ||
> +		(haddr < vma->vm_start ||
> +		(haddr + HPAGE_PMD_SIZE > vma->vm_end)))
> +		return ret;
> +
> +	xas_lock_irq(&xas);
> +
> +retry_xas_locked:
> +	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
> +		hindex_max)) {
> +		/* found a conflicting entry in the page cache, so fallback */
> +		goto unlock;
> +	} else if (cached_page) {
> +		/* found a valid cached page, so map it */
> +		hugepage = cached_page;
> +		goto map_huge;
> +	}
> +
> +	xas_unlock_irq(&xas);
> +
> +	/* allocate huge THP page in VMA */
> +	new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
> +		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
> +
> +	if (unlikely(!new_page))
> +		return ret;
> +
> +	if (unlikely(!(PageCompound(new_page)))) {

   What condition triggers this case? 

> +		put_page(new_page);
> +		return ret;
> +	}
> +
> +	prep_transhuge_page(new_page);
> +	new_page->index = hindex;
> +	new_page->mapping = mapping;
> +
> +	__SetPageLocked(new_page);
> +
> +	/*
> +	 * The readpage() operation below is expected to fill the large
> +	 * page with data without polluting the page cache with
> +	 * PAGESIZE entries due to a buffered read and/or readahead().
> +	 *
> +	 * A filesystem's vm_operations_struct huge_fault field should
> +	 * never point to this routine without such a capability, and
> +	 * without it a call to this routine would eventually just
> +	 * fall through to the normal fault op anyway.
> +	 */
> +	error = mapping->a_ops->readpage(vmf->vma->vm_file, new_page);
> +
> +	if (unlikely(error)) {
> +		put_page(new_page);
> +		return ret;
> +	}
> +
> +	/* XXX - use wait_on_page_locked_killable() instead? */
> +	wait_on_page_locked(new_page);
> +
> +	if (!PageUptodate(new_page)) {
> +		/* EIO */
> +		new_page->mapping = NULL;
> +		put_page(new_page);
> +		return ret;
> +	}
> +
> +	do {
> +		xas_lock_irq(&xas);
> +		xas_set(&xas, hindex);
> +		xas_create_range(&xas);
> +
> +		if (!(xas_error(&xas)))
> +			break;
> +
> +		if (!xas_nomem(&xas, GFP_KERNEL)) {
> +			if (new_page) {
> +				new_page->mapping = NULL;
> +				put_page(new_page);
> +			}
> +
> +			goto unlock;
> +		}
> +
> +		xas_unlock_irq(&xas);
> +	} while (1);
> +
> +	/*
> +	 * Double check that an entry did not sneak into the page cache while
> +	 * creating Xarray entries for the new page.
> +	 */
> +	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
> +		hindex_max)) {
> +		/*
> +		 * An unusable entry was found, so delete the newly allocated
> +		 * page and fallback.
> +		 */
> +		new_page->mapping = NULL;
> +		put_page(new_page);
> +		goto unlock;
> +	} else if (cached_page) {
> +		/*
> +		 * A valid large page was found in the page cache, so free the
> +		 * newly allocated page and map the cached page instead.
> +		 */
> +		new_page->mapping = NULL;
> +		put_page(new_page);
> +		new_page = NULL;
> +		hugepage = cached_page;
> +		goto map_huge;
> +	}
> +
> +	__SetPageLocked(new_page);
> +
> +	/* did it get truncated? */
> +	if (unlikely(new_page->mapping != mapping)) {
> +		unlock_page(new_page);
> +		put_page(new_page);
> +		goto retry_xas_locked;
> +	}
> +
> +	hugepage = new_page;
> +
> +map_huge:
> +	/* map hugepage at the PMD level */
> +	ret = alloc_set_pte(vmf, NULL, hugepage);
> +
> +	VM_BUG_ON_PAGE((!(pmd_trans_huge(*vmf->pmd))), hugepage);
> +
> +	if (likely(!(ret & VM_FAULT_ERROR))) {
> +		/*
> +		 * The alloc_set_pte() succeeded without error, so
> +		 * add the page to the page cache if it is new, and
> +		 * increment page statistics accordingly.
> +		 */
> +		if (new_page) {
> +			unsigned long nr;
> +
> +			xas_set(&xas, hindex);
> +
> +			for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
> +#ifndef	COMPOUND_PAGES_HEAD_ONLY

Where do we define COMPOUND_PAGES_HEAD_ONLY? 

> +				xas_store(&xas, new_page + nr);
> +#else
> +				xas_store(&xas, new_page);
> +#endif
> +				xas_next(&xas);
> +			}
> +
> +			count_vm_event(THP_FILE_ALLOC);
> +			__inc_node_page_state(new_page, NR_SHMEM_THPS);
> +			__mod_node_page_state(page_pgdat(new_page),
> +				NR_FILE_PAGES, HPAGE_PMD_NR);
> +			__mod_node_page_state(page_pgdat(new_page),
> +				NR_SHMEM, HPAGE_PMD_NR);
> +		}
> +
> +		vmf->address = haddr;
> +		vmf->page = hugepage;
> +
> +		page_ref_add(hugepage, HPAGE_PMD_NR);
> +		count_vm_event(THP_FILE_MAPPED);
> +	} else if (new_page) {
> +		/* there was an error mapping the new page, so release it */
> +		new_page->mapping = NULL;
> +		put_page(new_page);
> +	}
> +
> +unlock:
> +	xas_unlock_irq(&xas);
> +	return ret;
> +}
> +EXPORT_SYMBOL(filemap_huge_fault);
> +#endif
> +
> void filemap_map_pages(struct vm_fault *vmf,
> 		pgoff_t start_pgoff, pgoff_t end_pgoff)
> {
> @@ -2924,7 +3218,8 @@ struct page *read_cache_page(struct address_space *mapping,
> EXPORT_SYMBOL(read_cache_page);
> 
> /**
> - * read_cache_page_gfp - read into page cache, using specified page allocation flags.
> + * read_cache_page_gfp - read into page cache, using specified page allocation
> + *			 flags.
>  * @mapping:	the page's address_space
>  * @index:	the page index
>  * @gfp:	the page allocator flags to use if allocating
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1334ede667a8..26d74466d1f7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -543,8 +543,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> 
> 	if (addr)
> 		goto out;
> +
> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> 		goto out;
> +#endif
> 
> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> 	if (addr)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7e8c3e8ae75f..96ff80d2a8fb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1391,6 +1391,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	struct mm_struct *mm = current->mm;
> 	int pkey = 0;
> 
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	unsigned long vm_maywrite = VM_MAYWRITE;
> +#endif
> +
> 	*populate = 0;
> 
> 	if (!len)
> @@ -1429,7 +1433,33 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	/* Obtain the address to map to. we verify (or select) it and ensure
> 	 * that it represents a valid section of the address space.
> 	 */
> -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	/*
> +	 * If THP is enabled, it's a read-only executable that is
> +	 * MAP_PRIVATE mapped, the length is larger than a PMD page
> +	 * and either it's not a MAP_FIXED mapping or the passed address is
> +	 * properly aligned for a PMD page, attempt to get an appropriate
> +	 * address at which to map a PMD-sized THP page, otherwise call the
> +	 * normal routine.
> +	 */
> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
> +		(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> +		(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE &&
> +		(!(addr & HPAGE_PMD_OFFSET))) {
> +		addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +		if (addr && (!(addr & HPAGE_PMD_OFFSET)))
> +			vm_maywrite = 0;
> +		else
> +			addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +	} else {
> +#endif
> +		addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	}
> +#endif
> +
> 	if (offset_in_page(addr))
> 		return addr;
> 
> @@ -1451,7 +1481,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	 * of the memory object, so we don't do any here.
> 	 */
> 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +			mm->def_flags | VM_MAYREAD | vm_maywrite | VM_MAYEXEC;
> +#else
> 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +#endif
> 
> 	if (flags & MAP_LOCKED)
> 		if (!can_do_mlock())
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e5dfe2ae6b0d..503612d3b52b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1192,7 +1192,11 @@ void page_add_file_rmap(struct page *page, bool compound)
> 		}
> 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
> 			goto out;
> +
> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
> +
> 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
> 	} else {
> 		if (PageTransCompound(page) && page_mapping(page)) {
> @@ -1232,7 +1236,11 @@ static void page_remove_file_rmap(struct page *page, bool compound)
> 		}
> 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
> 			goto out;
> +
> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
> +
> 		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
> 	} else {
> 		if (!atomic_add_negative(-1, &page->_mapcount))
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-29 22:51   ` Song Liu
@ 2019-07-30 14:11     ` William Kucharski
  2019-07-30 19:14       ` Song Liu
  0 siblings, 1 reply; 11+ messages in thread
From: William Kucharski @ 2019-07-30 14:11 UTC (permalink / raw)
  To: Song Liu
  Cc: ceph-devel, linux-afs, linux-btrfs, lkml, Linux-MM, Networking,
	Chris Mason, David S. Miller, David Sterba, Josef Bacik,
	Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Matthew Wilcox, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler



On 7/29/19 4:51 PM, Song Liu wrote:

>
>> +#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
>            ^ space vs. tab difference here.

Thanks, good catch!

> 
>> +#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
>> +
>> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
>> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
>> +#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)

Saw this one, too.

> Should HPAGE_PMD_OFFSET and HPAGE_PUD_OFFSET include bits for
> PAGE_OFFSET? I guess we can just keep huge_mm.h as-is and use
> ~HPAGE_PMD_MASK.

That's what I had intended; would you rather see those macros
omit the bits that aren't needed for the larger page size?

>> - * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
>> + * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page

No - this came in as part of patch 1/2; I inadvertently dropped the period
at the end of the line, which is what caused this to show up as a diff,
so I will put it back. :-)

> We have been using name "xas" for "struct xa_state *". Let's keep using it?

Thanks, done.

>> +	if (unlikely(!(PageCompound(new_page)))) {
> 
>     What condition triggers this case
I wanted a check to make sure that __page_cache_alloc() returned a large
page. I don't recall if the mechanism guarantees that when you ask for
a large page, you get one, so I wanted to handle that case.

If you prefer, I could make this a VM_BUG_ON_PAGE() instead, but I
wanted it to fallback gracefully if it can't get a properly sized
page.
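
Roughly, the two alternatives being weighed look like this (an
illustrative sketch only, not part of the posted patch; the gfp flags
and the VM_FAULT_FALLBACK return are taken from the quoted hunk):

	new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
	if (unlikely(!new_page))
		return VM_FAULT_FALLBACK;

	/* Option A (as posted): fall back gracefully if the allocator
	 * somehow returned a non-compound page. */
	if (unlikely(!PageCompound(new_page))) {
		put_page(new_page);
		return VM_FAULT_FALLBACK;
	}

	/* Option B (alternative): treat it as a bug, on the assumption
	 * that __GFP_COMP with a non-zero order always yields either a
	 * compound page or NULL. */
	VM_BUG_ON_PAGE(!PageCompound(new_page), new_page);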

>> +#ifndef	COMPOUND_PAGES_HEAD_ONLY
> 
> Where do we define COMPOUND_PAGES_HEAD_ONLY?

At present, we do not.

I used this so I could include the code that would be needed once
Matthew's "store only head pages in page cache" changes go back in,
which looks like it may not be until 5.4-rc1. Matthew recommended I
include this so we didn't lose track of the code change that would be
needed then. I'll be talking to him today about this and the issues
you raised regarding patch 1/2.
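
For what it's worth, a rough sketch of what that store would collapse
to once head-only storage lands (illustrative only; it assumes
xas_set_order() is usable here and is not part of the posted patch):

	/* With head-only page cache storage, the per-subpage loop
	 * becomes a single multi-order store covering all HPAGE_PMD_NR
	 * indices. */
	xas_set_order(&xas, hindex, HPAGE_PMD_ORDER);
	xas_store(&xas, new_page);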

Thanks for going through this!!

     -- Bill


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-30 14:11     ` William Kucharski
@ 2019-07-30 19:14       ` Song Liu
  0 siblings, 0 replies; 11+ messages in thread
From: Song Liu @ 2019-07-30 19:14 UTC (permalink / raw)
  To: William Kucharski
  Cc: ceph-devel, linux-afs, linux-btrfs, lkml, Linux-MM, Networking,
	Chris Mason, David S. Miller, David Sterba, Josef Bacik,
	Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Matthew Wilcox, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler



> On Jul 30, 2019, at 7:11 AM, William Kucharski <william.kucharski@oracle.com> wrote:
> 
> 
> 
> On 7/29/19 4:51 PM, Song Liu wrote:
> 
>> 
>>> +#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
>>           ^ space vs. tab difference here.
> 
> Thanks, good catch!
> 
>>> +#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
>>> +
>>> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
>>> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
>>> +#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
> 
> Saw this one, too.
> 
>> Should HPAGE_PMD_OFFSET and HPAGE_PUD_OFFSET include bits for
>> PAGE_OFFSET? I guess we can just keep huge_mm.h as-is and use
>> ~HPAGE_PMD_MASK.
> 
> That's what I had intended; would you rather see those macros
> omit the bits that aren't needed for the larger page size?

I think using ~HPAGE_PMD_MASK is common practice. Let's keep it 
that way. 
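
Just to spell out the equivalence (an illustrative sketch; addr is an
arbitrary unsigned long, HPAGE_PMD_OFFSET is the macro proposed in
this patch, HPAGE_PMD_MASK is the existing huge_mm.h definition):

	/* The two spellings test the same bits, because
	 * HPAGE_PMD_OFFSET == HPAGE_PMD_SIZE - 1 == ~HPAGE_PMD_MASK. */
	bool aligned_patch_style  = !(addr & HPAGE_PMD_OFFSET);
	bool aligned_common_style = !(addr & ~HPAGE_PMD_MASK);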

> 
>>> - * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
>>> + * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page
> 
> No - this came in as part of patch 1/2; I inadvertently dropped the period at the end of the line, which is what caused this to show up as a diff,
> so I will put it back. :-)
> 
>> We have been using name "xas" for "struct xa_state *". Let's keep using it?
> 
> Thanks, done.
> 
>>> +	if (unlikely(!(PageCompound(new_page)))) {
>>    What condition triggers this case
> I wanted a check to make sure that __page_cache_alloc() returned a large page. I don't recall if the mechanism guarantees that when you ask for
> a large page, you get one, so I wanted to handle that case.
> 
> If you prefer, I could make this a VM_BUG_ON_PAGE() instead, but I
> wanted it to fallback gracefully if it can't get a properly sized
> page.

I think __page_cache_alloc() guarantees a compound page. If not, it
should return NULL. 

> 
>>> +#ifndef	COMPOUND_PAGES_HEAD_ONLY
>> Where do we define COMPOUND_PAGES_HEAD_ONLY?
> 
> At present, we do not.
> 
> I used this so I could include the code that would be needed once
> Matthew's "store only head pages in page cache" changes go back in,
> which looks like it may not be until 5.4-rc1. Matthew recommended I

We don't have to wait until 5.4-rc1. We could develop based on this 
patch once it lands in the mm tree. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-07-29 22:47   ` Dan Williams
@ 2019-07-30 19:18     ` Matthew Wilcox
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2019-07-30 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: William Kucharski, ceph-devel, linux-afs, linux-btrfs,
	Linux Kernel Mailing List, Linux MM, Netdev, Chris Mason,
	David S. Miller, David Sterba, Josef Bacik, Dave Hansen,
	Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Darrick J. Wong, Gao Xiang,
	Bartlomiej Zolnierkiewicz, Ross Zwisler

On Mon, Jul 29, 2019 at 03:47:18PM -0700, Dan Williams wrote:
> On Mon, Jul 29, 2019 at 2:10 PM William Kucharski
> <william.kucharski@oracle.com> wrote:
> >
> > Add filemap_huge_fault() to attempt to satisfy page faults on
> > memory-mapped read-only text pages using THP when possible.
> >
> > Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> [..]
> > +/**
> > + * filemap_huge_fault - read in file data for page fault handling to THP
> > + * @vmf:       struct vm_fault containing details of the fault
> > + * @pe_size:   large page size to map, currently this must be PE_SIZE_PMD
> > + *
> > + * filemap_huge_fault() is invoked via the vma operations vector for a
> > + * mapped memory region to read in file data to a transparent huge page during
> > + * a page fault.
> > + *
> > + * If for any reason we can't allocate a THP, map it or add it to the page
> > + * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
> > + * handler to try mapping the page using a PAGESIZE page, usually via
> > + * filemap_fault() if so specified in the vma operations vector.
> > + *
> > + * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
> > + * to map the new THP.
> > + *
> > + * NOTE: This routine depends upon the file system's readpage routine as
> > + *       specified in the address space operations vector to recognize when it
> > + *      is being passed a large page and to read the appropriate amount of data
> > + *      in full and without polluting the page cache for the large page itself
> > + *      with PAGESIZE pages to perform a buffered read or to pollute what
> > + *      would be the page cache space for any succeeding pages with PAGESIZE
> > + *      pages due to readahead.
> > + *
> > + *      It is VITAL that this routine not be enabled without such filesystem
> > + *      support.
> 
> Rather than a hopeful comment, this wants an explicit mechanism to
> prevent inadvertent mismatched ->readpage() assumptions.

Filesystems have to opt in to this.  If they add a ->huge_fault entry to
their vm_operations_struct without updating their ->readpage implementation,
they only have themselves to blame.
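
To make the opt-in concrete, a hedged sketch of what a filesystem
would have to wire up (the examplefs_* names and the
examplefs_read_extent() helper are hypothetical; the point is only
that a THP-aware ->readpage must be paired with ->huge_fault):

static int examplefs_readpage(struct file *file, struct page *page)
{
	/* Must fill the entire compound page, not just PAGE_SIZE, and
	 * must not readahead PAGESIZE pages into the range the THP
	 * covers. */
	size_t len = PAGE_SIZE << compound_order(page);

	return examplefs_read_extent(file, page, len);	/* hypothetical helper */
}

static const struct vm_operations_struct examplefs_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
	.huge_fault	= filemap_huge_fault,
#endif
};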


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] mm: Allow the page cache to allocate large pages
  2019-07-29 22:03   ` Song Liu
@ 2019-07-30 20:26     ` Matthew Wilcox
  2019-07-30 21:13       ` Song Liu
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2019-07-30 20:26 UTC (permalink / raw)
  To: Song Liu
  Cc: William Kucharski, ceph-devel, linux-afs, linux-btrfs, lkml,
	Linux-MM, Networking, Chris Mason, David S. Miller, David Sterba,
	Josef Bacik, Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler, kbuild test robot

On Mon, Jul 29, 2019 at 10:03:40PM +0000, Song Liu wrote:
> > +/* If you add more flags, increment FGP_ORDER_SHIFT */
> > +#define	FGP_ORDER_SHIFT		7
> > +#define	FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> > +#define	FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> > +#define	fgp_get_order(fgp)	((fgp) >> FGP_ORDER_SHIFT)
> 
> This looks like we want support order up to 25 (32 - 7). I guess we don't 
> need that many. How about we specify the highest order to support here? 

We can support all the way up to order 64 with just 6 bits, leaving 32 -
6 - 7 = 19 bits free.  We haven't been adding FGP flags very quickly,
so I doubt we'll need anything larger.
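
To illustrate the packing with the macros quoted above (just a
sketch):

	/* Bits 0-6 are flag bits; bits 7 and up carry the requested
	 * page order. */
	int fgp_flags = FGP_LOCK | FGP_CREAT | FGP_PMD;
	unsigned int order = fgp_get_order(fgp_flags);
	/* order == PMD_SHIFT - PAGE_SHIFT (9 on x86-64), while plain
	 * flag combinations decode to order 0. */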

> Also, fgp_flags is signed int, so we need to make sure fgp_flags is not
> negative. 

If we ever get there, I expect people to convert the parameter from signed
int to unsigned long.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] mm: Allow the page cache to allocate large pages
  2019-07-30 20:26     ` Matthew Wilcox
@ 2019-07-30 21:13       ` Song Liu
  0 siblings, 0 replies; 11+ messages in thread
From: Song Liu @ 2019-07-30 21:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, ceph-devel, linux-afs, linux-btrfs, lkml,
	Linux-MM, Networking, Chris Mason, David S. Miller, David Sterba,
	Josef Bacik, Dave Hansen, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner, Dave Airlie,
	Vlastimil Babka, Keith Busch, Ralph Campbell, Steve Capper,
	Dave Chinner, Sean Christopherson, Hugh Dickins, Ilya Dryomov,
	Alexander Duyck, Thomas Gleixner, Jérôme Glisse,
	Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
	David Howells, John Hubbard, Souptick Joarder, john.hubbard,
	Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
	Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
	Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
	Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
	Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
	Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
	Ross Zwisler, kbuild test robot



> On Jul 30, 2019, at 1:26 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Jul 29, 2019 at 10:03:40PM +0000, Song Liu wrote:
>>> +/* If you add more flags, increment FGP_ORDER_SHIFT */
>>> +#define	FGP_ORDER_SHIFT		7
>>> +#define	FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
>>> +#define	FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
>>> +#define	fgp_get_order(fgp)	((fgp) >> FGP_ORDER_SHIFT)
>> 
>> This looks like we want support order up to 25 (32 - 7). I guess we don't 
>> need that many. How about we specify the highest order to support here? 
> 
> We can support all the way up to order 64 with just 6 bits, leaving 32 -
> 6 - 7 = 19 bits free.  We haven't been adding FGP flags very quickly,
> so I doubt we'll need anything larger.

lol. I misread the bit usage. 

> 
>> Also, fgp_flags is signed int, so we need to make sure fgp_flags is not
>> negative. 
> 
> If we ever get there, I expect people to convert the parameter from signed
> int to unsigned long.

Agreed. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2019-07-29 21:09 [PATCH v2 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
2019-07-29 21:09 ` [PATCH v2 1/2] mm: Allow the page cache to allocate large pages William Kucharski
2019-07-29 22:03   ` Song Liu
2019-07-30 20:26     ` Matthew Wilcox
2019-07-30 21:13       ` Song Liu
2019-07-29 21:09 ` [PATCH v2 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
2019-07-29 22:47   ` Dan Williams
2019-07-30 19:18     ` Matthew Wilcox
2019-07-29 22:51   ` Song Liu
2019-07-30 14:11     ` William Kucharski
2019-07-30 19:14       ` Song Liu
