linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v5 0/2]  mm,thp: Add filemap_huge_fault() for THP
@ 2019-09-02  9:23 William Kucharski
  2019-09-02  9:23 ` [PATCH v5 1/2] mm: Allow the page cache to allocate large pages William Kucharski
  2019-09-02  9:23 ` [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
  0 siblings, 2 replies; 16+ messages in thread
From: William Kucharski @ 2019-09-02  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox

This set of patches is the first step towards a mechanism for automatically
mapping read-only text areas of appropriate size and alignment to THPs
whenever possible.

For now, the central routine, filemap_huge_fault(), and various support
routines are only included if the experimental kernel configuration option

        RO_EXEC_FILEMAP_HUGE_FAULT_THP

is enabled.

This is because filemap_huge_fault() is dependent upon the
address_space_operations vector readpage() pointing to a routine that will
read and fill an entire large page at a time, without polluting the page
cache with PAGESIZE entries for the large page being mapped and without
performing readahead that would pollute the page cache entries for
succeeding large pages. Unfortunately, there is no good way to determine
how many bytes were read by readpage(). At present, if filemap_huge_fault()
were to call a conventional readpage() routine, it would only fill the
first PAGESIZE bytes of the large page, which is definitely NOT the
desired behavior.
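
For illustration only (no filesystem in this series implements it), a
THP-capable readpage() would need to do something along these lines,
where foo_readpage() and foo_read_extent() are purely hypothetical
stand-ins for whatever I/O path the filesystem actually uses:

	/* Sketch only: a hypothetical THP-aware ->readpage() */
	static int foo_readpage(struct file *file, struct page *page)
	{
		/* fill the whole compound page, not just the first PAGESIZE bytes */
		size_t len = PAGE_SIZE << compound_order(page);
		int err = foo_read_extent(file, page, page_offset(page), len);

		if (!err)
			SetPageUptodate(page);
		unlock_page(page);
		return err;
	}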

However, by making the code available now it is hoped that filesystem
maintainers who have pledged to provide such a mechanism will do so more
rapidly.

The first part of the patch adds an 'order' argument to __page_cache_alloc()
and do_read_cache_page(), allowing callers to directly request page cache
pages of various sizes. This code was provided by Matthew Wilcox.
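
For example (gfp_mask standing in for whatever mask the caller already
uses), an existing caller keeps its current behavior by passing an order
of 0, while filemap_huge_fault() in the second patch requests a PMD-sized
compound page:

	/* existing callers: unchanged behavior */
	page = __page_cache_alloc(gfp_mask, 0);

	/* filemap_huge_fault() (patch 2/2): allocate a PMD-sized compound page */
	thp = __page_cache_alloc(gfp_mask | __GFP_COMP | __GFP_NOWARN |
		__GFP_NORETRY, HPAGE_PMD_ORDER);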

The second part of the patch implements the filemap_huge_fault() mechanism
as described above.
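
A filesystem that does provide a suitable readpage() would be expected to
wire the new handler into its vm_operations_struct, roughly like this
(hypothetical example, not part of this series):

	static const struct vm_operations_struct foo_file_vm_ops = {
		.fault		= filemap_fault,
		.huge_fault	= filemap_huge_fault,
		.map_pages	= filemap_map_pages,
		.page_mkwrite	= filemap_page_mkwrite,
	};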

As this code is only run when the experimental config option is set,
there are some issues that still need to be resolved, but this is a good
step that will enable further development.

Changes since v4:
1. More code review comments addressed, fixed bug in rcu logic
2. Add code to delete hugepage from page cache upon failure within
   filemap_huge_fault
3. Remove improperly crafted VM_BUG_ON when removing a THP from the
   page cache
4. Fixed mapping count issue

Changes since v3:
1. Multiple code review comments addressed
2. filemap_huge_fault() now does rcu locking when possible
3. filemap_huge_fault() now properly adds the THP to the page cache before
   calling readpage()

Changes since v2:
1. FGP changes were pulled out to enable submission as an independent
   patch
2. Inadvertent tab spacing and comment changes were reverted

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  Add an 'order' argument to __page_cache_alloc() and
    do_read_cache_page(). Ensure the allocated pages are compound pages.

William Kucharski (1):
  Add filemap_huge_fault() to attempt to satisfy page faults on
    memory-mapped read-only text pages using THP when possible.

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/mm.h      |   2 +
 include/linux/pagemap.h |  10 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 418 ++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  39 +++-
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   4 +-
 mm/vmscan.c             |   2 +-
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 16 files changed, 473 insertions(+), 40 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-02  9:23 [PATCH v5 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
@ 2019-09-02  9:23 ` William Kucharski
  2019-09-03 11:57   ` Michal Hocko
  2019-09-02  9:23 ` [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
  1 sibling, 1 reply; 16+ messages in thread
From: William Kucharski @ 2019-09-02  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox

Add an 'order' argument to __page_cache_alloc() and
do_read_cache_page(). Ensure the allocated pages are compound pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reported-by: kbuild test robot <lkp@intel.com>
---
 fs/afs/dir.c            |  2 +-
 fs/btrfs/compression.c  |  2 +-
 fs/cachefiles/rdwr.c    |  4 ++--
 fs/ceph/addr.c          |  2 +-
 fs/ceph/file.c          |  2 +-
 include/linux/pagemap.h | 10 ++++++----
 mm/filemap.c            | 20 +++++++++++---------
 mm/readahead.c          |  2 +-
 net/ceph/pagelist.c     |  4 ++--
 net/ceph/pagevec.c      |  2 +-
 10 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 139b4e3cc946..ca8f8e77e012 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key)
 				afs_stat_v(dvnode, n_inval);
 
 			ret = -ENOMEM;
-			req->pages[i] = __page_cache_alloc(gfp);
+			req->pages[i] = __page_cache_alloc(gfp, 0);
 			if (!req->pages[i])
 				goto error;
 			ret = add_to_page_cache_lru(req->pages[i],
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..5280e7477b7e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		}
 
 		page = __page_cache_alloc(mapping_gfp_constraint(mapping,
-								 ~__GFP_FS));
+								 ~__GFP_FS), 0);
 		if (!page)
 			break;
 
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 44a3ce1e4ce4..11d30212745f 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 			goto backing_page_already_present;
 
 		if (!newpage) {
-			newpage = __page_cache_alloc(cachefiles_gfp);
+			newpage = __page_cache_alloc(cachefiles_gfp, 0);
 			if (!newpage)
 				goto nomem_monitor;
 		}
@@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 				goto backing_page_already_present;
 
 			if (!newpage) {
-				newpage = __page_cache_alloc(cachefiles_gfp);
+				newpage = __page_cache_alloc(cachefiles_gfp, 0);
 				if (!newpage)
 					goto nomem;
 			}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3c8b886bf64..7c1c3857fbb9 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1708,7 +1708,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
 	} else {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page) {
 			err = -ENOMEM;
 			goto out;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..ae58d7c31aa4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		struct page *page = NULL;
 		loff_t i_size;
 		if (retry_op == READ_INLINE) {
-			page = __page_cache_alloc(GFP_KERNEL);
+			page = __page_cache_alloc(GFP_KERNEL, 0);
 			if (!page)
 				return -ENOMEM;
 		}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c7552459a15f..92e026d9a6b7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -208,17 +208,19 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	if (order > 0)
+		gfp |= __GFP_COMP;
+	return alloc_pages(gfp, order);
 }
 #endif
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), 0);
 }
 
 static inline gfp_t readahead_gfp_mask(struct address_space *x)
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..38b46fc00855 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -954,22 +954,25 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
 
+	if (order > 0)
+		gfp |= __GFP_COMP;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -1665,7 +1668,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		if (fgp_flags & FGP_NOFS)
 			gfp_mask &= ~__GFP_FS;
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(gfp_mask, 0);
 		if (!page)
 			return NULL;
 
@@ -2802,15 +2805,14 @@ static struct page *wait_on_page_read(struct page *page)
 static struct page *do_read_cache_page(struct address_space *mapping,
 				pgoff_t index,
 				int (*filler)(void *, struct page *),
-				void *data,
-				gfp_t gfp)
+				void *data, unsigned int order, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		page = __page_cache_alloc(gfp);
+		page = __page_cache_alloc(gfp, order);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2917,7 +2919,7 @@ struct page *read_cache_page(struct address_space *mapping,
 				int (*filler)(void *, struct page *),
 				void *data)
 {
-	return do_read_cache_page(mapping, index, filler, data,
+	return do_read_cache_page(mapping, index, filler, data, 0,
 			mapping_gfp_mask(mapping));
 }
 EXPORT_SYMBOL(read_cache_page);
@@ -2939,7 +2941,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index,
 				gfp_t gfp)
 {
-	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
+	return do_read_cache_page(mapping, index, NULL, NULL, 0, gfp);
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..954760a612ea 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -193,7 +193,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 			continue;
 		}
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(gfp_mask, 0);
 		if (!page)
 			break;
 		page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 65e34f78b05d..0c3face908dc 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -56,7 +56,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -107,7 +107,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 64305e7056a1..1d07e639216d 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -45,7 +45,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, 0);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-02  9:23 [PATCH v5 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
  2019-09-02  9:23 ` [PATCH v5 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-09-02  9:23 ` William Kucharski
  2019-09-03 12:14   ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: William Kucharski @ 2019-09-02  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
	William Kucharski, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox

Add filemap_huge_fault() to attempt to satisfy page
faults on memory-mapped read-only text pages using THP when possible.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
---
 include/linux/mm.h |   2 +
 mm/Kconfig         |  15 ++
 mm/filemap.c       | 398 +++++++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c   |   3 +
 mm/mmap.c          |  39 ++++-
 mm/rmap.c          |   4 +-
 mm/vmscan.c        |   2 +-
 7 files changed, 446 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..2a5311721739 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2433,6 +2433,8 @@ extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern vm_fault_t filemap_fault(struct vm_fault *vmf);
+extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+			enum page_entry_size pe_size);
 extern void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..2debaded0e4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
+
+	help
+	    Introduce filemap_huge_fault() to automatically map executable
+	    read-only pages of mapped files of suitable size and alignment
+	    using THP if possible.
+
+	    This is marked experimental because it is a new feature and is
+	    dependent upon filesystems implementing readpage() in a way
+	    that will recognize large THP pages and read file content to
+	    them without polluting the pagecache with PAGESIZE pages due
+	    to readahead.
+
 endmenu
diff --git a/mm/filemap.c b/mm/filemap.c
index 38b46fc00855..5947d432a4e6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -199,13 +199,12 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	nr = hpage_nr_pages(page);
 
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
-	if (PageSwapBacked(page)) {
+
+	if (PageSwapBacked(page))
 		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
-		if (PageTransHuge(page))
-			__dec_node_page_state(page, NR_SHMEM_THPS);
-	} else {
-		VM_BUG_ON_PAGE(PageTransHuge(page), page);
-	}
+
+	if (PageTransHuge(page))
+		__dec_node_page_state(page, NR_SHMEM_THPS);
 
 	/*
 	 * At this point page must be either written or cleaned by
@@ -303,6 +302,9 @@ static void page_cache_delete_batch(struct address_space *mapping,
 			break;
 		if (xa_is_value(page))
 			continue;
+
+		VM_BUG_ON_PAGE(xa_is_internal(page), page);
+
 		if (!tail_pages) {
 			/*
 			 * Some page got inserted in our range? Skip it. We
@@ -315,6 +317,11 @@ static void page_cache_delete_batch(struct address_space *mapping,
 				continue;
 			}
 			WARN_ON_ONCE(!PageLocked(page));
+
+			/*
+			 * If a THP is in the page cache, set the succeeding
+			 * cache entries for the PMD-sized page to NULL.
+			 */
 			if (PageTransHuge(page) && !PageHuge(page))
 				tail_pages = HPAGE_PMD_NR - 1;
 			page->mapping = NULL;
@@ -324,8 +331,6 @@ static void page_cache_delete_batch(struct address_space *mapping,
 			 */
 			i++;
 		} else {
-			VM_BUG_ON_PAGE(page->index + HPAGE_PMD_NR - tail_pages
-					!= pvec->pages[i]->index, page);
 			tail_pages--;
 		}
 		xas_store(&xas, NULL);
@@ -881,7 +886,10 @@ static int __add_to_page_cache_locked(struct page *page,
 		mapping->nrpages++;
 
 		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge)
+		if (PageTransHuge(page) && !huge)
+			__mod_node_page_state(page_pgdat(page),
+				NR_FILE_PAGES, HPAGE_PMD_NR);
+		else
 			__inc_node_page_state(page, NR_FILE_PAGES);
 unlock:
 		xas_unlock_irq(&xas);
@@ -1663,7 +1671,8 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 no_page:
 	if (!page && (fgp_flags & FGP_CREAT)) {
 		int err;
-		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+		if ((fgp_flags & FGP_WRITE) &&
+			mapping_cap_account_dirty(mapping))
 			gfp_mask |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
 			gfp_mask &= ~__GFP_FS;
@@ -2643,6 +2652,372 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+/*
+ * There is a change coming to store only the head page of a compound page in
+ * the page cache.
+ *
+ * When that change is present in the kernel, remove this #define
+ */
+#define	PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
+
+/*
+ * Check for an entry in the page cache which would conflict with the address
+ * range we wish to map using a THP or is otherwise unusable to map a large
+ * cached page.
+ *
+ * The routine will return true if a usable page is found in the page cache
+ * (and *pagep will be set to the address of the cached page), or if no
+ * cached page is found (and *pagep will be set to NULL).
+ */
+static bool
+filemap_huge_check_pagecache_usable(struct xa_state *xas,
+	struct page **pagep, pgoff_t hindex, pgoff_t hindex_max)
+{
+	struct page *page;
+
+	while (1) {
+		xas_set(xas, hindex);
+		page = xas_find(xas, hindex_max);
+
+		if (xas_retry(xas, page))
+			continue;
+
+		/*
+		 * A found entry is unusable if:
+		 *	+ the entry is an Xarray value, not a pointer
+		 *	+ the entry is an internal Xarray node
+		 *	+ the entry is not a compound page
+		 *	+ the order of the compound page is < HPAGE_PMD_ORDER
+		 *	+ the page index is not what we expect it to be
+		 */
+		if (!page)
+			break;
+
+		if (xa_is_value(page) || xa_is_internal(page))
+			return false;
+
+#ifdef PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
+		if ((!PageCompound(page)) || (page != compound_head(page)))
+#else
+		if (!PageCompound(page))
+#endif
+			return false;
+
+		if (compound_order(page) < HPAGE_PMD_ORDER)
+			return false;
+
+		if (page->index != hindex)
+			return false;
+
+		break;
+	}
+
+	*pagep = page;
+	return true;
+}
+
+/**
+ * filemap_huge_fault - read in file data for page fault handling to THP
+ * @vmf:	struct vm_fault containing details of the fault
+ * @pe_size:	large page size to map, currently this must be PE_SIZE_PMD
+ *
+ * filemap_huge_fault() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data to a transparent huge page during
+ * a page fault.
+ *
+ * If for any reason we can't allocate a THP, map it or add it to the page
+ * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
+ * handler to try mapping the page using a PAGESIZE page, usually via
+ * filemap_fault() if so specified in the vma operations vector.
+ *
+ * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
+ * to map the new THP.
+ *
+ * NOTE: This routine depends upon the file system's readpage routine as
+ *       specified in the address space operations vector to recognize when it
+ *	 is being passed a large page and to read the appropriate amount of data
+ *	 in full and without polluting the page cache for the large page itself
+ *	 with PAGESIZE pages to perform a buffered read or to pollute what
+ *	 would be the page cache space for any succeeding pages with PAGESIZE
+ *	 pages due to readahead.
+ *
+ *	 It is VITAL that this routine not be enabled without such filesystem
+ *	 support. As there is no way to determine how many bytes were read by
+ *	 the readpage() operation, if only a PAGESIZE page is read, this routine
+ *	 will map the THP containing only the first PAGESIZE bytes of file data
+ *	 to satisfy the fault, which is never the result desired.
+ */
+vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+		enum page_entry_size pe_size)
+{
+	struct file *filp = vmf->vma->vm_file;
+	struct address_space *mapping = filp->f_mapping;
+	struct vm_area_struct *vma = vmf->vma;
+
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR);
+	pgoff_t hindex_max = hindex + HPAGE_PMD_NR - 1;
+
+	struct page *cached_page, *hugepage;
+	struct page *new_page = NULL;
+
+	vm_fault_t ret = VM_FAULT_FALLBACK;
+	unsigned long nr;
+
+	int error;
+	bool retry_lookup = true;
+
+	XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER);
+
+	/*
+	 * Return VM_FAULT_FALLBACK if:
+	 *
+	 *	+ pe_size != PE_SIZE_PMD
+	 *	+ FAULT_FLAG_WRITE is set in vmf->flags
+	 *	+ vma isn't aligned to allow a PMD mapping
+	 *	+ PMD would extend beyond the end of the vma
+	 */
+	if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) ||
+	    (haddr < vma->vm_start ||
+	    ((haddr + HPAGE_PMD_SIZE) > vma->vm_end)))
+		return ret;
+
+retry_lookup:
+	rcu_read_lock();
+
+	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+	    hindex_max)) {
+		/* found a conflicting entry in the page cache, so fallback */
+		rcu_read_unlock();
+		return ret;
+	} else if (cached_page) {
+		/* found a valid cached page, so map it */
+		rcu_read_unlock();
+		lock_page(cached_page);
+
+		/* was the cached page truncated while waiting for the lock? */
+		if (unlikely(cached_page->mapping != mapping)) {
+			unlock_page(cached_page);
+
+			/* retry once */
+			if (retry_lookup) {
+				retry_lookup = false;
+				goto retry_lookup;
+			}
+
+			return ret;
+		}
+
+		if (unlikely(!PageUptodate(cached_page))) {
+			unlock_page(cached_page);
+			return ret;
+		}
+
+		VM_BUG_ON_PAGE(cached_page->index != hindex, cached_page);
+
+		hugepage = cached_page;
+		goto map_huge;
+	}
+
+	rcu_read_unlock();
+
+	/* allocate huge THP page in VMA */
+	new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
+		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
+
+	if (unlikely(!new_page))
+		return ret;
+
+	do {
+		xas_lock_irq(&xas);
+		xas_set(&xas, hindex);
+		xas_create_range(&xas);
+
+		if (!(xas_error(&xas)))
+			break;
+
+		xas_unlock_irq(&xas);
+
+		if (!xas_nomem(&xas, GFP_KERNEL)) {
+			/* error creating range, so free THP and fallback */
+			if (new_page)
+				put_page(new_page);
+
+			return ret;
+		}
+	} while (1);
+
+	/* i_pages is locked here */
+
+	/*
+	 * Double check that an entry did not sneak into the page cache while
+	 * creating Xarray entries for the new page.
+	 */
+	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+	    hindex_max)) {
+		/*
+		 * An unusable entry was found, so delete the newly allocated
+		 * page and fallback.
+		 */
+		put_page(new_page);
+		xas_unlock_irq(&xas);
+		return ret;
+	} else if (cached_page) {
+		/*
+		 * A valid large page was found in the page cache, so free the
+		 * newly allocated page and map the cached page instead.
+		 */
+		put_page(new_page);
+		new_page = NULL;
+		xas_unlock_irq(&xas);
+
+		lock_page(cached_page);
+
+		/* was the cached page truncated while waiting for the lock? */
+		if (unlikely(cached_page->mapping != mapping)) {
+			unlock_page(cached_page);
+
+			/* retry once */
+			if (retry_lookup) {
+				retry_lookup = false;
+				goto retry_lookup;
+			}
+
+			return ret;
+		}
+
+		if (unlikely(!PageUptodate(cached_page))) {
+			unlock_page(cached_page);
+			return ret;
+		}
+
+		VM_BUG_ON_PAGE(cached_page->index != hindex, cached_page);
+
+		hugepage = cached_page;
+		goto map_huge;
+	}
+
+	prep_transhuge_page(new_page);
+	new_page->mapping = mapping;
+	new_page->index = hindex;
+	__SetPageLocked(new_page);
+
+	count_vm_event(THP_FILE_ALLOC);
+	xas_set(&xas, hindex);
+
+	for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
+#ifdef PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
+		/*
+		 * Store pointers to both head and tail pages of a compound
+		 * page in the page cache.
+		 */
+		xas_store(&xas, new_page + nr);
+#else
+		/*
+		 * All entries for a compound page in the page cache should
+		 * point to the head page.
+		 */
+		xas_store(&xas, new_page);
+#endif
+		xas_next(&xas);
+	}
+
+	mapping->nrpages += HPAGE_PMD_NR;
+	xas_unlock_irq(&xas);
+
+	/*
+	 * The readpage() operation below is expected to fill the large
+	 * page with data without polluting the page cache with
+	 * PAGESIZE entries due to a buffered read and/or readahead().
+	 *
+	 * A filesystem's vm_operations_struct huge_fault field should
+	 * never point to this routine without such a capability, and
+	 * without it a call to this routine would eventually just
+	 * fall through to the normal fault op anyway.
+	 */
+	error = mapping->a_ops->readpage(vmf->vma->vm_file, new_page);
+
+	if (unlikely(error)) {
+		ret = VM_FAULT_SIGBUS;
+		goto delete_hugepage_from_page_cache;
+	}
+
+	if (wait_on_page_locked_killable(new_page)) {
+		ret = VM_FAULT_SIGSEGV;
+		goto delete_hugepage_from_page_cache;
+	}
+
+	if (!PageUptodate(new_page)) {
+		/* EIO */
+		ret = VM_FAULT_SIGBUS;
+		goto delete_hugepage_from_page_cache;
+	}
+
+	lock_page(new_page);
+
+	/* did the page get truncated while waiting for the lock? */
+	if (unlikely(new_page->mapping != mapping)) {
+		unlock_page(new_page);
+		goto delete_hugepage_from_page_cache;
+	}
+
+	__inc_node_page_state(new_page, NR_SHMEM_THPS);
+	__mod_node_page_state(page_pgdat(new_page),
+		NR_FILE_PAGES, HPAGE_PMD_NR);
+	__mod_node_page_state(page_pgdat(new_page),
+		NR_SHMEM, HPAGE_PMD_NR);
+
+	hugepage = new_page;
+
+map_huge:
+	/* map hugepage at the PMD level */
+
+	ret = alloc_set_pte(vmf, vmf->memcg, hugepage);
+
+	VM_BUG_ON_PAGE((!(pmd_trans_huge(*vmf->pmd))), hugepage);
+	VM_BUG_ON_PAGE(!(PageTransHuge(hugepage)), hugepage);
+
+	if (likely(!(ret & VM_FAULT_ERROR))) {
+		vmf->address = haddr;
+		vmf->page = hugepage;
+
+		page_ref_add(hugepage, HPAGE_PMD_NR);
+		count_vm_event(THP_FILE_MAPPED);
+	} else {
+		if (new_page) {
+			__mod_node_page_state(page_pgdat(new_page),
+				NR_FILE_PAGES, -HPAGE_PMD_NR);
+			__mod_node_page_state(page_pgdat(new_page),
+				NR_SHMEM, -HPAGE_PMD_NR);
+			__dec_node_page_state(new_page, NR_SHMEM_THPS);
+
+delete_hugepage_from_page_cache:
+			xas_lock_irq(&xas);
+			xas_set(&xas, hindex);
+
+			for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
+				xas_store(&xas, NULL);
+				xas_next(&xas);
+			}
+
+			new_page->mapping = NULL;
+			xas_unlock_irq(&xas);
+
+			mapping->nrpages -= HPAGE_PMD_NR;
+			unlock_page(new_page);
+			page_ref_dec(new_page);	/* decrement page cache ref */
+			put_page(new_page);	/* done with page */
+			return ret;
+		}
+	}
+
+	unlock_page(hugepage);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_huge_fault);
+#endif
+
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -2925,7 +3300,8 @@ struct page *read_cache_page(struct address_space *mapping,
 EXPORT_SYMBOL(read_cache_page);
 
 /**
- * read_cache_page_gfp - read into page cache, using specified page allocation flags.
+ * read_cache_page_gfp - read into page cache, using specified page allocation
+ *			 flags.
  * @mapping:	the page's address_space
  * @index:	the page index
  * @gfp:	the page allocator flags to use if allocating
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..ea3dbb6fa538 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -544,8 +544,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 
 	if (addr)
 		goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
+#endif
 
 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
 	if (addr)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..d8b3bce71075 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1391,6 +1391,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
+	unsigned long vm_maywrite = VM_MAYWRITE;
+
 	*populate = 0;
 
 	if (!len)
@@ -1426,10 +1428,41 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
 
-	/* Obtain the address to map to. we verify (or select) it and ensure
+	/*
+	 * Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	/*
+	 * If THP is enabled and this is a read-only executable mapping
+	 * that is MAP_PRIVATE, not MAP_FIXED, and at least a PMD page in
+	 * length, attempt to get a suitably aligned address at which to
+	 * map a PMD-sized THP page via thp_get_unmapped_area(); otherwise
+	 * call the normal get_unmapped_area() routine to obtain the
+	 * address to map to.
+	 */
+	if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
+		(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
+		(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
+		addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
+
+		if (addr && (!(addr & ~HPAGE_PMD_MASK))) {
+			/*
+			 * If we got a suitable THP mapping address, shut off
+			 * VM_MAYWRITE for the region, since it's never what
+			 * we would want.
+			 */
+			vm_maywrite = 0;
+		} else
+			addr = get_unmapped_area(file, addr, len, pgoff, flags);
+	} else {
+#endif
+		addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+	}
+#endif
+
 	if (offset_in_page(addr))
 		return addr;
 
@@ -1451,7 +1484,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * of the memory object, so we don't do any here.
 	 */
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
-			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+			mm->def_flags | VM_MAYREAD | vm_maywrite | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
 		if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..aacc6e330329 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1192,7 +1192,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+
 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
@@ -1232,7 +1232,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+
 		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a6c5d0b28321..47a19c59c9a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -930,7 +930,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under the i_pages lock, then this ordering is not required.
 	 */
-	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
+	if (unlikely(PageTransHuge(page)))
 		refcount = 1 + HPAGE_PMD_NR;
 	else
 		refcount = 2;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-02  9:23 ` [PATCH v5 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-09-03 11:57   ` Michal Hocko
  2019-09-03 12:11     ` Matthew Wilcox
  2019-09-04  3:30     ` William Kucharski
  0 siblings, 2 replies; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 11:57 UTC (permalink / raw)
  To: William Kucharski
  Cc: linux-kernel, linux-mm, linux-fsdevel, Dave Hansen, Song Liu,
	Bob Kasten, Mike Kravetz, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox

On Mon 02-09-19 03:23:40, William Kucharski wrote:
> Add an 'order' argument to __page_cache_alloc() and
> do_read_cache_page(). Ensure the allocated pages are compound pages.

Why do we need to touch all the existing callers and change them to use
order 0 when none is actually converted to a different order? This just
seems to add a lot of code churn without a good reason. If anything I
would simply add __page_cache_alloc_order and make __page_cache_alloc
call it with order 0 argument.
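
For the non-NUMA case that would be something along these lines (just a
sketch), and none of the existing callers would need to change:

	static inline struct page *__page_cache_alloc_order(gfp_t gfp,
							    unsigned int order)
	{
		if (order > 0)
			gfp |= __GFP_COMP;
		return alloc_pages(gfp, order);
	}

	static inline struct page *__page_cache_alloc(gfp_t gfp)
	{
		return __page_cache_alloc_order(gfp, 0);
	}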

Also is it so much to ask callers to provide __GFP_COMP explicitly?

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> ---
>  fs/afs/dir.c            |  2 +-
>  fs/btrfs/compression.c  |  2 +-
>  fs/cachefiles/rdwr.c    |  4 ++--
>  fs/ceph/addr.c          |  2 +-
>  fs/ceph/file.c          |  2 +-
>  include/linux/pagemap.h | 10 ++++++----
>  mm/filemap.c            | 20 +++++++++++---------
>  mm/readahead.c          |  2 +-
>  net/ceph/pagelist.c     |  4 ++--
>  net/ceph/pagevec.c      |  2 +-
>  10 files changed, 27 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index 139b4e3cc946..ca8f8e77e012 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key)
>  				afs_stat_v(dvnode, n_inval);
>  
>  			ret = -ENOMEM;
> -			req->pages[i] = __page_cache_alloc(gfp);
> +			req->pages[i] = __page_cache_alloc(gfp, 0);
>  			if (!req->pages[i])
>  				goto error;
>  			ret = add_to_page_cache_lru(req->pages[i],
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 60c47b417a4b..5280e7477b7e 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  		}
>  
>  		page = __page_cache_alloc(mapping_gfp_constraint(mapping,
> -								 ~__GFP_FS));
> +								 ~__GFP_FS), 0);
>  		if (!page)
>  			break;
>  
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index 44a3ce1e4ce4..11d30212745f 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
>  			goto backing_page_already_present;
>  
>  		if (!newpage) {
> -			newpage = __page_cache_alloc(cachefiles_gfp);
> +			newpage = __page_cache_alloc(cachefiles_gfp, 0);
>  			if (!newpage)
>  				goto nomem_monitor;
>  		}
> @@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
>  				goto backing_page_already_present;
>  
>  			if (!newpage) {
> -				newpage = __page_cache_alloc(cachefiles_gfp);
> +				newpage = __page_cache_alloc(cachefiles_gfp, 0);
>  				if (!newpage)
>  					goto nomem;
>  			}
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index b3c8b886bf64..7c1c3857fbb9 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1708,7 +1708,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
>  		if (len > PAGE_SIZE)
>  			len = PAGE_SIZE;
>  	} else {
> -		page = __page_cache_alloc(GFP_NOFS);
> +		page = __page_cache_alloc(GFP_NOFS, 0);
>  		if (!page) {
>  			err = -ENOMEM;
>  			goto out;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 685a03cc4b77..ae58d7c31aa4 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		struct page *page = NULL;
>  		loff_t i_size;
>  		if (retry_op == READ_INLINE) {
> -			page = __page_cache_alloc(GFP_KERNEL);
> +			page = __page_cache_alloc(GFP_KERNEL, 0);
>  			if (!page)
>  				return -ENOMEM;
>  		}
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c7552459a15f..92e026d9a6b7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -208,17 +208,19 @@ static inline int page_cache_add_speculative(struct page *page, int count)
>  }
>  
>  #ifdef CONFIG_NUMA
> -extern struct page *__page_cache_alloc(gfp_t gfp);
> +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
>  #else
> -static inline struct page *__page_cache_alloc(gfp_t gfp)
> +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
>  {
> -	return alloc_pages(gfp, 0);
> +	if (order > 0)
> +		gfp |= __GFP_COMP;
> +	return alloc_pages(gfp, order);
>  }
>  #endif
>  
>  static inline struct page *page_cache_alloc(struct address_space *x)
>  {
> -	return __page_cache_alloc(mapping_gfp_mask(x));
> +	return __page_cache_alloc(mapping_gfp_mask(x), 0);
>  }
>  
>  static inline gfp_t readahead_gfp_mask(struct address_space *x)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d0cf700bf201..38b46fc00855 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -954,22 +954,25 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
>  
>  #ifdef CONFIG_NUMA
> -struct page *__page_cache_alloc(gfp_t gfp)
> +struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
>  {
>  	int n;
>  	struct page *page;
>  
> +	if (order > 0)
> +		gfp |= __GFP_COMP;
> +
>  	if (cpuset_do_page_mem_spread()) {
>  		unsigned int cpuset_mems_cookie;
>  		do {
>  			cpuset_mems_cookie = read_mems_allowed_begin();
>  			n = cpuset_mem_spread_node();
> -			page = __alloc_pages_node(n, gfp, 0);
> +			page = __alloc_pages_node(n, gfp, order);
>  		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
>  
>  		return page;
>  	}
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
>  }
>  EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
> @@ -1665,7 +1668,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  		if (fgp_flags & FGP_NOFS)
>  			gfp_mask &= ~__GFP_FS;
>  
> -		page = __page_cache_alloc(gfp_mask);
> +		page = __page_cache_alloc(gfp_mask, 0);
>  		if (!page)
>  			return NULL;
>  
> @@ -2802,15 +2805,14 @@ static struct page *wait_on_page_read(struct page *page)
>  static struct page *do_read_cache_page(struct address_space *mapping,
>  				pgoff_t index,
>  				int (*filler)(void *, struct page *),
> -				void *data,
> -				gfp_t gfp)
> +				void *data, unsigned int order, gfp_t gfp)
>  {
>  	struct page *page;
>  	int err;
>  repeat:
>  	page = find_get_page(mapping, index);
>  	if (!page) {
> -		page = __page_cache_alloc(gfp);
> +		page = __page_cache_alloc(gfp, order);
>  		if (!page)
>  			return ERR_PTR(-ENOMEM);
>  		err = add_to_page_cache_lru(page, mapping, index, gfp);
> @@ -2917,7 +2919,7 @@ struct page *read_cache_page(struct address_space *mapping,
>  				int (*filler)(void *, struct page *),
>  				void *data)
>  {
> -	return do_read_cache_page(mapping, index, filler, data,
> +	return do_read_cache_page(mapping, index, filler, data, 0,
>  			mapping_gfp_mask(mapping));
>  }
>  EXPORT_SYMBOL(read_cache_page);
> @@ -2939,7 +2941,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
>  				pgoff_t index,
>  				gfp_t gfp)
>  {
> -	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
> +	return do_read_cache_page(mapping, index, NULL, NULL, 0, gfp);
>  }
>  EXPORT_SYMBOL(read_cache_page_gfp);
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2fe72cd29b47..954760a612ea 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -193,7 +193,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
>  			continue;
>  		}
>  
> -		page = __page_cache_alloc(gfp_mask);
> +		page = __page_cache_alloc(gfp_mask, 0);
>  		if (!page)
>  			break;
>  		page->index = page_offset;
> diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
> index 65e34f78b05d..0c3face908dc 100644
> --- a/net/ceph/pagelist.c
> +++ b/net/ceph/pagelist.c
> @@ -56,7 +56,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
>  	struct page *page;
>  
>  	if (!pl->num_pages_free) {
> -		page = __page_cache_alloc(GFP_NOFS);
> +		page = __page_cache_alloc(GFP_NOFS, 0);
>  	} else {
>  		page = list_first_entry(&pl->free_list, struct page, lru);
>  		list_del(&page->lru);
> @@ -107,7 +107,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
>  	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
>  
>  	while (space > pl->num_pages_free) {
> -		struct page *page = __page_cache_alloc(GFP_NOFS);
> +		struct page *page = __page_cache_alloc(GFP_NOFS, 0);
>  		if (!page)
>  			return -ENOMEM;
>  		list_add_tail(&page->lru, &pl->free_list);
> diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
> index 64305e7056a1..1d07e639216d 100644
> --- a/net/ceph/pagevec.c
> +++ b/net/ceph/pagevec.c
> @@ -45,7 +45,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
>  	if (!pages)
>  		return ERR_PTR(-ENOMEM);
>  	for (i = 0; i < num_pages; i++) {
> -		pages[i] = __page_cache_alloc(flags);
> +		pages[i] = __page_cache_alloc(flags, 0);
>  		if (pages[i] == NULL) {
>  			ceph_release_page_vector(pages, i);
>  			return ERR_PTR(-ENOMEM);
> -- 
> 2.21.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-03 11:57   ` Michal Hocko
@ 2019-09-03 12:11     ` Matthew Wilcox
  2019-09-03 12:19       ` Michal Hocko
  2019-09-04  3:30     ` William Kucharski
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-09-03 12:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue, Sep 03, 2019 at 01:57:48PM +0200, Michal Hocko wrote:
> On Mon 02-09-19 03:23:40, William Kucharski wrote:
> > Add an 'order' argument to __page_cache_alloc() and
> > do_read_cache_page(). Ensure the allocated pages are compound pages.
> 
> Why do we need to touch all the existing callers and change them to use
> order 0 when none is actually converted to a different order? This just
> seems to add a lot of code churn without a good reason. If anything I
> would simply add __page_cache_alloc_order and make __page_cache_alloc
> call it with order 0 argument.

Patch 2/2 uses a non-zero order.  I agree it's a lot of churn without
good reason; that's why I tried to add GFP_ORDER flags a few months ago.
Unfortunately, you didn't like that approach either.

> Also is it so much to ask callers to provide __GFP_COMP explicitly?

Yes, it's an unreasonable burden on the callers.  Those that pass 0 will
have the test optimised away by the compiler (for the non-NUMA case).
For the NUMA case, passing zero is going to be only a couple of extra
instructions to not set the GFP_COMP flag.
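
For the non-NUMA inline, for instance,

	page = __page_cache_alloc(gfp, 0);

reduces to a plain alloc_pages(gfp, 0), since the "order > 0" test on a
constant zero is dead code.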

> >  #ifdef CONFIG_NUMA
> > -extern struct page *__page_cache_alloc(gfp_t gfp);
> > +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
> >  #else
> > -static inline struct page *__page_cache_alloc(gfp_t gfp)
> > +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
> >  {
> > -	return alloc_pages(gfp, 0);
> > +	if (order > 0)
> > +		gfp |= __GFP_COMP;
> > +	return alloc_pages(gfp, order);
> >  }
> >  #endif


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-02  9:23 ` [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
@ 2019-09-03 12:14   ` Michal Hocko
  2019-09-03 12:22     ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 12:14 UTC (permalink / raw)
  To: William Kucharski
  Cc: linux-kernel, linux-mm, linux-fsdevel, Dave Hansen, Song Liu,
	Bob Kasten, Mike Kravetz, Chad Mynhier, Kirill A. Shutemov,
	Johannes Weiner, Matthew Wilcox

On Mon 02-09-19 03:23:41, William Kucharski wrote:
> Add filemap_huge_fault() to attempt to satisfy page
> faults on memory-mapped read-only text pages using THP when possible.

This deserves much more description of how the thing is implemented and
expected to work. For one thing it is not really clear to me why you
need CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP at all. You need a support
from the filesystem anyway. So who is going to enable/disable this
config?

I cannot really comment on fs specific parts, but filemap_huge_fault is
so convoluted that I cannot wrap my head around it. One thing stands out
though. The generic filemap_huge_fault depends on ->readpage doing the
right thing, which sounds quite questionable to me. If nothing else I
would expect ->readpages to do the job.

I am sorry to chime in at v5 without having studied the previous 4
versions (ENOTIME), but if those are deliberate decisions then they
should be described in the changelog.

> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> ---
>  include/linux/mm.h |   2 +
>  mm/Kconfig         |  15 ++
>  mm/filemap.c       | 398 +++++++++++++++++++++++++++++++++++++++++++--
>  mm/huge_memory.c   |   3 +
>  mm/mmap.c          |  39 ++++-
>  mm/rmap.c          |   4 +-
>  mm/vmscan.c        |   2 +-
>  7 files changed, 446 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..2a5311721739 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2433,6 +2433,8 @@ extern void truncate_inode_pages_final(struct address_space *);
>  
>  /* generic vm_area_ops exported for stackable file systems */
>  extern vm_fault_t filemap_fault(struct vm_fault *vmf);
> +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
> +			enum page_entry_size pe_size);
>  extern void filemap_map_pages(struct vm_fault *vmf,
>  		pgoff_t start_pgoff, pgoff_t end_pgoff);
>  extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 56cec636a1fc..2debaded0e4d 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
>  	bool
>  
> +config RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
> +
> +	help
> +	    Introduce filemap_huge_fault() to automatically map executable
> +	    read-only pages of mapped files of suitable size and alignment
> +	    using THP if possible.
> +
> +	    This is marked experimental because it is a new feature and is
> +	    dependent upon filesystems implementing readpage() in a way
> +	    that will recognize large THP pages and read file content to
> +	    them without polluting the pagecache with PAGESIZE pages due
> +	    to readahead.
> +
>  endmenu
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38b46fc00855..5947d432a4e6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -199,13 +199,12 @@ static void unaccount_page_cache_page(struct address_space *mapping,
>  	nr = hpage_nr_pages(page);
>  
>  	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
> -	if (PageSwapBacked(page)) {
> +
> +	if (PageSwapBacked(page))
>  		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
> -		if (PageTransHuge(page))
> -			__dec_node_page_state(page, NR_SHMEM_THPS);
> -	} else {
> -		VM_BUG_ON_PAGE(PageTransHuge(page), page);
> -	}
> +
> +	if (PageTransHuge(page))
> +		__dec_node_page_state(page, NR_SHMEM_THPS);
>  
>  	/*
>  	 * At this point page must be either written or cleaned by
> @@ -303,6 +302,9 @@ static void page_cache_delete_batch(struct address_space *mapping,
>  			break;
>  		if (xa_is_value(page))
>  			continue;
> +
> +		VM_BUG_ON_PAGE(xa_is_internal(page), page);
> +
>  		if (!tail_pages) {
>  			/*
>  			 * Some page got inserted in our range? Skip it. We
> @@ -315,6 +317,11 @@ static void page_cache_delete_batch(struct address_space *mapping,
>  				continue;
>  			}
>  			WARN_ON_ONCE(!PageLocked(page));
> +
> +			/*
> +			 * If a THP is in the page cache, set the succeeding
> +			 * cache entries for the PMD-sized page to NULL.
> +			 */
>  			if (PageTransHuge(page) && !PageHuge(page))
>  				tail_pages = HPAGE_PMD_NR - 1;
>  			page->mapping = NULL;
> @@ -324,8 +331,6 @@ static void page_cache_delete_batch(struct address_space *mapping,
>  			 */
>  			i++;
>  		} else {
> -			VM_BUG_ON_PAGE(page->index + HPAGE_PMD_NR - tail_pages
> -					!= pvec->pages[i]->index, page);
>  			tail_pages--;
>  		}
>  		xas_store(&xas, NULL);
> @@ -881,7 +886,10 @@ static int __add_to_page_cache_locked(struct page *page,
>  		mapping->nrpages++;
>  
>  		/* hugetlb pages do not participate in page cache accounting */
> -		if (!huge)
> +		if (PageTransHuge(page) && !huge)
> +			__mod_node_page_state(page_pgdat(page),
> +				NR_FILE_PAGES, HPAGE_PMD_NR);
> +		else
>  			__inc_node_page_state(page, NR_FILE_PAGES);
>  unlock:
>  		xas_unlock_irq(&xas);
> @@ -1663,7 +1671,8 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  no_page:
>  	if (!page && (fgp_flags & FGP_CREAT)) {
>  		int err;
> -		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
> +		if ((fgp_flags & FGP_WRITE) &&
> +			mapping_cap_account_dirty(mapping))
>  			gfp_mask |= __GFP_WRITE;
>  		if (fgp_flags & FGP_NOFS)
>  			gfp_mask &= ~__GFP_FS;
> @@ -2643,6 +2652,372 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +/*
> + * There is a change coming to store only the head page of a compound page in
> + * the page cache.
> + *
> + * When that change is present in the kernel, remove this #define
> + */
> +#define	PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
> +
> +/*
> + * Check for an entry in the page cache which would conflict with the address
> + * range we wish to map using a THP or is otherwise unusable to map a large
> + * cached page.
> + *
> + * The routine will return true if a usable page is found in the page cache
> + * (and *pagep will be set to the address of the cached page), or if no
> + * cached page is found (and *pagep will be set to NULL).
> + */
> +static bool
> +filemap_huge_check_pagecache_usable(struct xa_state *xas,
> +	struct page **pagep, pgoff_t hindex, pgoff_t hindex_max)
> +{
> +	struct page *page;
> +
> +	while (1) {
> +		xas_set(xas, hindex);
> +		page = xas_find(xas, hindex_max);
> +
> +		if (xas_retry(xas, page))
> +			continue;
> +
> +		/*
> +		 * A found entry is unusable if:
> +		 *	+ the entry is an Xarray value, not a pointer
> +		 *	+ the entry is an internal Xarray node
> +		 *	+ the entry is not a compound page
> +		 *	+ the order of the compound page is < HPAGE_PMD_ORDER
> +		 *	+ the page index is not what we expect it to be
> +		 */
> +		if (!page)
> +			break;
> +
> +		if (xa_is_value(page) || xa_is_internal(page))
> +			return false;
> +
> +#ifdef PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
> +		if ((!PageCompound(page)) || (page != compound_head(page)))
> +#else
> +		if (!PageCompound(page))
> +#endif
> +			return false;
> +
> +		if (compound_order(page) < HPAGE_PMD_ORDER)
> +			return false;
> +
> +		if (page->index != hindex)
> +			return false;
> +
> +		break;
> +	}
> +
> +	*pagep = page;
> +	return true;
> +}
> +
> +/**
> + * filemap_huge_fault - read in file data for page fault handling to THP
> + * @vmf:	struct vm_fault containing details of the fault
> + * @pe_size:	large page size to map, currently this must be PE_SIZE_PMD
> + *
> + * filemap_huge_fault() is invoked via the vma operations vector for a
> + * mapped memory region to read in file data to a transparent huge page during
> + * a page fault.
> + *
> + * If for any reason we can't allocate a THP, map it or add it to the page
> + * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
> + * handler to try mapping the page using a PAGESIZE page, usually via
> + * filemap_fault() if so specified in the vma operations vector.
> + *
> + * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
> + * to map the new THP.
> + *
> + * NOTE: This routine depends upon the file system's readpage routine as
> + *       specified in the address space operations vector to recognize when it
> + *	 is being passed a large page and to read the appropriate amount of data
> + *	 in full and without polluting the page cache for the large page itself
> + *	 with PAGESIZE pages to perform a buffered read or to pollute what
> + *	 would be the page cache space for any succeeding pages with PAGESIZE
> + *	 pages due to readahead.
> + *
> + *	 It is VITAL that this routine not be enabled without such filesystem
> + *	 support. As there is no way to determine how many bytes were read by
> + *	 the readpage() operation, if only a PAGESIZE page is read, this routine
> + *	 will map the THP containing only the first PAGESIZE bytes of file data
> + *	 to satisfy the fault, which is never the result desired.
> + */
> +vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
> +		enum page_entry_size pe_size)
> +{
> +	struct file *filp = vmf->vma->vm_file;
> +	struct address_space *mapping = filp->f_mapping;
> +	struct vm_area_struct *vma = vmf->vma;
> +
> +	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR);
> +	pgoff_t hindex_max = hindex + HPAGE_PMD_NR - 1;
> +
> +	struct page *cached_page, *hugepage;
> +	struct page *new_page = NULL;
> +
> +	vm_fault_t ret = VM_FAULT_FALLBACK;
> +	unsigned long nr;
> +
> +	int error;
> +	bool retry_lookup = true;
> +
> +	XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER);
> +
> +	/*
> +	 * Return VM_FAULT_FALLBACK if:
> +	 *
> +	 *	+ pe_size != PE_SIZE_PMD
> +	 *	+ FAULT_FLAG_WRITE is set in vmf->flags
> +	 *	+ vma isn't aligned to allow a PMD mapping
> +	 *	+ PMD would extend beyond the end of the vma
> +	 */
> +	if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) ||
> +	    (haddr < vma->vm_start ||
> +	    ((haddr + HPAGE_PMD_SIZE) > vma->vm_end)))
> +		return ret;
> +
> +retry_lookup:
> +	rcu_read_lock();
> +
> +	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
> +	    hindex_max)) {
> +		/* found a conflicting entry in the page cache, so fallback */
> +		rcu_read_unlock();
> +		return ret;
> +	} else if (cached_page) {
> +		/* found a valid cached page, so map it */
> +		rcu_read_unlock();
> +		lock_page(cached_page);
> +
> +		/* was the cached page truncated while waiting for the lock? */
> +		if (unlikely(cached_page->mapping != mapping)) {
> +			unlock_page(cached_page);
> +
> +			/* retry once */
> +			if (retry_lookup) {
> +				retry_lookup = false;
> +				goto retry_lookup;
> +			}
> +
> +			return ret;
> +		}
> +
> +		if (unlikely(!PageUptodate(cached_page))) {
> +			unlock_page(cached_page);
> +			return ret;
> +		}
> +
> +		VM_BUG_ON_PAGE(cached_page->index != hindex, cached_page);
> +
> +		hugepage = cached_page;
> +		goto map_huge;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	/* allocate huge THP page in VMA */
> +	new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
> +		__GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
> +
> +	if (unlikely(!new_page))
> +		return ret;
> +
> +	do {
> +		xas_lock_irq(&xas);
> +		xas_set(&xas, hindex);
> +		xas_create_range(&xas);
> +
> +		if (!(xas_error(&xas)))
> +			break;
> +
> +		xas_unlock_irq(&xas);
> +
> +		if (!xas_nomem(&xas, GFP_KERNEL)) {
> +			/* error creating range, so free THP and fallback */
> +			if (new_page)
> +				put_page(new_page);
> +
> +			return ret;
> +		}
> +	} while (1);
> +
> +	/* i_pages is locked here */
> +
> +	/*
> +	 * Double check that an entry did not sneak into the page cache while
> +	 * creating Xarray entries for the new page.
> +	 */
> +	if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
> +	    hindex_max)) {
> +		/*
> +		 * An unusable entry was found, so delete the newly allocated
> +		 * page and fallback.
> +		 */
> +		put_page(new_page);
> +		xas_unlock_irq(&xas);
> +		return ret;
> +	} else if (cached_page) {
> +		/*
> +		 * A valid large page was found in the page cache, so free the
> +		 * newly allocated page and map the cached page instead.
> +		 */
> +		put_page(new_page);
> +		new_page = NULL;
> +		xas_unlock_irq(&xas);
> +
> +		lock_page(cached_page);
> +
> +		/* was the cached page truncated while waiting for the lock? */
> +		if (unlikely(cached_page->mapping != mapping)) {
> +			unlock_page(cached_page);
> +
> +			/* retry once */
> +			if (retry_lookup) {
> +				retry_lookup = false;
> +				goto retry_lookup;
> +			}
> +
> +			return ret;
> +		}
> +
> +		if (unlikely(!PageUptodate(cached_page))) {
> +			unlock_page(cached_page);
> +			return ret;
> +		}
> +
> +		VM_BUG_ON_PAGE(cached_page->index != hindex, cached_page);
> +
> +		hugepage = cached_page;
> +		goto map_huge;
> +	}
> +
> +	prep_transhuge_page(new_page);
> +	new_page->mapping = mapping;
> +	new_page->index = hindex;
> +	__SetPageLocked(new_page);
> +
> +	count_vm_event(THP_FILE_ALLOC);
> +	xas_set(&xas, hindex);
> +
> +	for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
> +#ifdef PAGE_CACHE_STORE_COMPOUND_TAIL_PAGES
> +		/*
> +		 * Store pointers to both head and tail pages of a compound
> +		 * page in the page cache.
> +		 */
> +		xas_store(&xas, new_page + nr);
> +#else
> +		/*
> +		 * All entries for a compound page in the page cache should
> +		 * point to the head page.
> +		 */
> +		xas_store(&xas, new_page);
> +#endif
> +		xas_next(&xas);
> +	}
> +
> +	mapping->nrpages += HPAGE_PMD_NR;
> +	xas_unlock_irq(&xas);
> +
> +	/*
> +	 * The readpage() operation below is expected to fill the large
> +	 * page with data without polluting the page cache with
> +	 * PAGESIZE entries due to a buffered read and/or readahead().
> +	 *
> +	 * A filesystem's vm_operations_struct huge_fault field should
> +	 * never point to this routine without such a capability, and
> +	 * without it a call to this routine would eventually just
> +	 * fall through to the normal fault op anyway.
> +	 */
> +	error = mapping->a_ops->readpage(vmf->vma->vm_file, new_page);
> +
> +	if (unlikely(error)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto delete_hugepage_from_page_cache;
> +	}
> +
> +	if (wait_on_page_locked_killable(new_page)) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto delete_hugepage_from_page_cache;
> +	}
> +
> +	if (!PageUptodate(new_page)) {
> +		/* EIO */
> +		ret = VM_FAULT_SIGBUS;
> +		goto delete_hugepage_from_page_cache;
> +	}
> +
> +	lock_page(new_page);
> +
> +	/* did the page get truncated while waiting for the lock? */
> +	if (unlikely(new_page->mapping != mapping)) {
> +		unlock_page(new_page);
> +		goto delete_hugepage_from_page_cache;
> +	}
> +
> +	__inc_node_page_state(new_page, NR_SHMEM_THPS);
> +	__mod_node_page_state(page_pgdat(new_page),
> +		NR_FILE_PAGES, HPAGE_PMD_NR);
> +	__mod_node_page_state(page_pgdat(new_page),
> +		NR_SHMEM, HPAGE_PMD_NR);
> +
> +	hugepage = new_page;
> +
> +map_huge:
> +	/* map hugepage at the PMD level */
> +
> +	ret = alloc_set_pte(vmf, vmf->memcg, hugepage);
> +
> +	VM_BUG_ON_PAGE((!(pmd_trans_huge(*vmf->pmd))), hugepage);
> +	VM_BUG_ON_PAGE(!(PageTransHuge(hugepage)), hugepage);
> +
> +	if (likely(!(ret & VM_FAULT_ERROR))) {
> +		vmf->address = haddr;
> +		vmf->page = hugepage;
> +
> +		page_ref_add(hugepage, HPAGE_PMD_NR);
> +		count_vm_event(THP_FILE_MAPPED);
> +	} else {
> +		if (new_page) {
> +			__mod_node_page_state(page_pgdat(new_page),
> +				NR_FILE_PAGES, -HPAGE_PMD_NR);
> +			__mod_node_page_state(page_pgdat(new_page),
> +				NR_SHMEM, -HPAGE_PMD_NR);
> +			__dec_node_page_state(new_page, NR_SHMEM_THPS);
> +
> +delete_hugepage_from_page_cache:
> +			xas_lock_irq(&xas);
> +			xas_set(&xas, hindex);
> +
> +			for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
> +				xas_store(&xas, NULL);
> +				xas_next(&xas);
> +			}
> +
> +			new_page->mapping = NULL;
> +			xas_unlock_irq(&xas);
> +
> +			mapping->nrpages -= HPAGE_PMD_NR;
> +			unlock_page(new_page);
> +			page_ref_dec(new_page);	/* decrement page cache ref */
> +			put_page(new_page);	/* done with page */
> +			return ret;
> +		}
> +	}
> +
> +	unlock_page(hugepage);
> +	return ret;
> +}
> +EXPORT_SYMBOL(filemap_huge_fault);
> +#endif
> +
>  void filemap_map_pages(struct vm_fault *vmf,
>  		pgoff_t start_pgoff, pgoff_t end_pgoff)
>  {
> @@ -2925,7 +3300,8 @@ struct page *read_cache_page(struct address_space *mapping,
>  EXPORT_SYMBOL(read_cache_page);
>  
>  /**
> - * read_cache_page_gfp - read into page cache, using specified page allocation flags.
> + * read_cache_page_gfp - read into page cache, using specified page allocation
> + *			 flags.
>   * @mapping:	the page's address_space
>   * @index:	the page index
>   * @gfp:	the page allocator flags to use if allocating
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index de1f15969e27..ea3dbb6fa538 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -544,8 +544,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  
>  	if (addr)
>  		goto out;
> +
> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
> +#endif
>  
>  	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>  	if (addr)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7e8c3e8ae75f..d8b3bce71075 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1391,6 +1391,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	struct mm_struct *mm = current->mm;
>  	int pkey = 0;
>  
> +	unsigned long vm_maywrite = VM_MAYWRITE;
> +
>  	*populate = 0;
>  
>  	if (!len)
> @@ -1426,10 +1428,41 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	if (mm->map_count > sysctl_max_map_count)
>  		return -ENOMEM;
>  
> -	/* Obtain the address to map to. we verify (or select) it and ensure
> +	/*
> +	 * Obtain the address to map to. we verify (or select) it and ensure
>  	 * that it represents a valid section of the address space.
>  	 */
> -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	/*
> +	 * If THP is enabled, the mapping is a MAP_PRIVATE mapping of a
> +	 * read-only executable, the length is at least a PMD page, and
> +	 * either it's not a MAP_FIXED mapping or the passed address is
> +	 * properly aligned for a PMD page, attempt to get an appropriate
> +	 * address at which to map a PMD-sized THP page; otherwise call the
> +	 * normal routine.
> +	 */
> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
> +		(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> +		(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
> +		addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +		if (addr && (!(addr & ~HPAGE_PMD_MASK))) {
> +			/*
> +			 * If we got a suitable THP mapping address, shut off
> +			 * VM_MAYWRITE for the region, since it's never what
> +			 * we would want.
> +			 */
> +			vm_maywrite = 0;
> +		} else
> +			addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +	} else {
> +#endif
> +		addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> +	}
> +#endif
> +
>  	if (offset_in_page(addr))
>  		return addr;
>  
> @@ -1451,7 +1484,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	 * of the memory object, so we don't do any here.
>  	 */
>  	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
> -			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +			mm->def_flags | VM_MAYREAD | vm_maywrite | VM_MAYEXEC;
>  
>  	if (flags & MAP_LOCKED)
>  		if (!can_do_mlock())
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 003377e24232..aacc6e330329 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1192,7 +1192,7 @@ void page_add_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
>  			goto out;
> -		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +
>  		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (PageTransCompound(page) && page_mapping(page)) {
> @@ -1232,7 +1232,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
>  			goto out;
> -		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +
>  		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (!atomic_add_negative(-1, &page->_mapcount))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a6c5d0b28321..47a19c59c9a2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -930,7 +930,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  	 * Note that if SetPageDirty is always performed via set_page_dirty,
>  	 * and thus under the i_pages lock, then this ordering is not required.
>  	 */
> -	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
> +	if (unlikely(PageTransHuge(page)))
>  		refcount = 1 + HPAGE_PMD_NR;
>  	else
>  		refcount = 2;
> -- 
> 2.21.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-03 12:11     ` Matthew Wilcox
@ 2019-09-03 12:19       ` Michal Hocko
  2019-09-03 16:28         ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 12:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue 03-09-19 05:11:55, Matthew Wilcox wrote:
> On Tue, Sep 03, 2019 at 01:57:48PM +0200, Michal Hocko wrote:
> > On Mon 02-09-19 03:23:40, William Kucharski wrote:
> > > Add an 'order' argument to __page_cache_alloc() and
> > > do_read_cache_page(). Ensure the allocated pages are compound pages.
> > 
> > Why do we need to touch all the existing callers and change them to use
> > order 0 when none is actually converted to a different order? This just
> > seem to add a lot of code churn without a good reason. If anything I
> > would simply add __page_cache_alloc_order and make __page_cache_alloc
> > call it with order 0 argument.
> 
> Patch 2/2 uses a non-zero order.

It is a new caller and it can use a new function, right?

> I agree it's a lot of churn without
> good reason; that's why I tried to add GFP_ORDER flags a few months ago.
> Unfortunately, you didn't like that approach either.

Is there any future plan for all/most __page_cache_alloc callers to pass a
non-zero order argument?

> > Also is it so much to ask callers to provide __GFP_COMP explicitly?
> 
> Yes, it's an unreasonable burden on the callers.

Care to explain why? __GFP_COMP tends to be used in the kernel quite
extensively.

> Those that pass 0 will
> have the test optimised away by the compiler (for the non-NUMA case).
> For the NUMA case, passing zero is going to be only a couple of extra
> instructions to not set the GFP_COMP flag.
> 
> > >  #ifdef CONFIG_NUMA
> > > -extern struct page *__page_cache_alloc(gfp_t gfp);
> > > +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
> > >  #else
> > > -static inline struct page *__page_cache_alloc(gfp_t gfp)
> > > +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
> > >  {
> > > -	return alloc_pages(gfp, 0);
> > > +	if (order > 0)
> > > +		gfp |= __GFP_COMP;
> > > +	return alloc_pages(gfp, order);
> > >  }
> > >  #endif

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-03 12:14   ` Michal Hocko
@ 2019-09-03 12:22     ` Matthew Wilcox
  2019-09-03 12:51       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-09-03 12:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue, Sep 03, 2019 at 02:14:24PM +0200, Michal Hocko wrote:
> On Mon 02-09-19 03:23:41, William Kucharski wrote:
> > Add filemap_huge_fault() to attempt to satisfy page
> > faults on memory-mapped read-only text pages using THP when possible.
> 
> This deserves much more description of how the thing is implemented and
> expected to work. For one thing it is not really clear to me why you
> need CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP at all. You need a support
> from the filesystem anyway. So who is going to enable/disable this
> config?

There are definitely situations in which enabling this code will crash
the kernel.  But we want to get filesystems to a point where they can
start working on their support for large pages.  So our workaround is
to try to get the core pieces merged under a CONFIG_I_KNOW_WHAT_IM_DOING
flag and let people play with it.  Then continue to work on the core
to eliminate those places that are broken.

> I cannot really comment on fs specific parts but filemap_huge_fault
> sounds so convoluted that I cannot wrap my head around it. One thing
> stands out though. The generic filemap_huge_fault depends on ->readpage
> doing the right thing which sounds quite questionable to me. If nothing
> else I would expect ->readpages to do the job.

Ah, that's because you're not a filesystem person ;-)  ->readpages is
really ->readahead.  It's a crappy interface and should be completely
redesigned.

Thanks for looking!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-03 12:22     ` Matthew Wilcox
@ 2019-09-03 12:51       ` Michal Hocko
  2019-09-03 15:10         ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 12:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue 03-09-19 05:22:08, Matthew Wilcox wrote:
> On Tue, Sep 03, 2019 at 02:14:24PM +0200, Michal Hocko wrote:
> > On Mon 02-09-19 03:23:41, William Kucharski wrote:
> > > Add filemap_huge_fault() to attempt to satisfy page
> > > faults on memory-mapped read-only text pages using THP when possible.
> > 
> > This deserves much more description of how the thing is implemented and
> > expected to work. For one thing it is not really clear to me why you
> > need CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP at all. You need a support
> > from the filesystem anyway. So who is going to enable/disable this
> > config?
> 
> There are definitely situations in which enabling this code will crash
> the kernel.  But we want to get filesystems to a point where they can
> start working on their support for large pages.  So our workaround is
> to try to get the core pieces merged under a CONFIG_I_KNOW_WHAT_IM_DOING
> flag and let people play with it.  Then continue to work on the core
> to eliminate those places that are broken.

I am not sure I understand. Each fs has to opt in to the feature
anyway. If it doesn't then there should be no risk of regression, right?
I do not expect any fs would rush an implementation in while not being
sure about the correctness. So how exactly does a config option help
here?
 
> > I cannot really comment on fs specific parts but filemap_huge_fault
> > sounds convoluted so much I cannot wrap my head around it. One thing
> > stand out though. The generic filemap_huge_fault depends on ->readpage
> > doing the right thing which sounds quite questionable to me. If nothing
> > else  I would expect ->readpages to do the job.
> 
> Ah, that's because you're not a filesystem person ;-)  ->readpages is
> really ->readahead.  It's a crappy interface and should be completely
> redesigned.

OK, the interface looked like the right fit for this purpose. Thanks for
clarifying.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-03 12:51       ` Michal Hocko
@ 2019-09-03 15:10         ` Matthew Wilcox
  2019-09-03 19:15           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-09-03 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue, Sep 03, 2019 at 02:51:50PM +0200, Michal Hocko wrote:
> On Tue 03-09-19 05:22:08, Matthew Wilcox wrote:
> > On Tue, Sep 03, 2019 at 02:14:24PM +0200, Michal Hocko wrote:
> > > On Mon 02-09-19 03:23:41, William Kucharski wrote:
> > > > Add filemap_huge_fault() to attempt to satisfy page
> > > > faults on memory-mapped read-only text pages using THP when possible.
> > > 
> > > This deserves much more description of how the thing is implemented and
> > > expected to work. For one thing it is not really clear to me why you
> > > need CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP at all. You need a support
> > > from the filesystem anyway. So who is going to enable/disable this
> > > config?
> > 
> > There are definitely situations in which enabling this code will crash
> > the kernel.  But we want to get filesystems to a point where they can
> > start working on their support for large pages.  So our workaround is
> > to try to get the core pieces merged under a CONFIG_I_KNOW_WHAT_IM_DOING
> > flag and let people play with it.  Then continue to work on the core
> > to eliminate those places that are broken.
> 
> I am not sure I understand. Each fs has to opt in to the feature
> anyway. If it doesn't then there should be no risk of regression, right?
> I do not expect any fs would rush an implementation in while not being
> sure about the correctness. So how exactly does a config option help
> here.

Filesystems won't see large pages unless they've opted into them.
But there's a huge amount of page-cache work that needs to get done
before this can be enabled by default.  For example, truncate() won't
work properly.

Rather than try to do all the page cache work upfront, then wait for the
filesystems to catch up, we want to get some basics merged.  Since we've
been talking about this for so long without any movement in the kernel
towards actual support, this felt like a good way to go.

We could, of course, develop the entire thing out of tree, but that's
likely to lead to pain and anguish.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-03 12:19       ` Michal Hocko
@ 2019-09-03 16:28         ` Matthew Wilcox
  2019-09-03 19:18           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-09-03 16:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue, Sep 03, 2019 at 02:19:52PM +0200, Michal Hocko wrote:
> On Tue 03-09-19 05:11:55, Matthew Wilcox wrote:
> > On Tue, Sep 03, 2019 at 01:57:48PM +0200, Michal Hocko wrote:
> > > On Mon 02-09-19 03:23:40, William Kucharski wrote:
> > > > Add an 'order' argument to __page_cache_alloc() and
> > > > do_read_cache_page(). Ensure the allocated pages are compound pages.
> > > 
> > > Why do we need to touch all the existing callers and change them to use
> > > order 0 when none is actually converted to a different order? This just
> > > seem to add a lot of code churn without a good reason. If anything I
> > > would simply add __page_cache_alloc_order and make __page_cache_alloc
> > > call it with order 0 argument.
> > 
> > Patch 2/2 uses a non-zero order.
> 
> It is a new caller and it can use a new function right?
> 
> > I agree it's a lot of churn without
> > good reason; that's why I tried to add GFP_ORDER flags a few months ago.
> > Unfortunately, you didn't like that approach either.
> 
> Is there any future plan that all/most __page_cache_alloc will get a
> non-zero order argument?

I'm not sure about "most".  It will certainly become more common, as
far as I can tell.

> > > Also is it so much to ask callers to provide __GFP_COMP explicitly?
> > 
> > Yes, it's an unreasonable burden on the callers.
> 
> Care to exaplain why? __GFP_COMP tends to be used in the kernel quite
> extensively.

Most of the places which call this function get their gfp_t from
mapping->gfp_mask.  If we only want to allocate a single page, we
must not set __GFP_COMP.  If we want to allocate a large page, we must
set __GFP_COMP.  Rather than require individual filesystems to concern
themselves with this wart of the GFP interface, we can solve it in the
page cache.
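
As a sketch of what that looks like from the caller's side (hypothetical
helpers, assuming this series is applied and CONFIG_TRANSPARENT_HUGEPAGE
is enabled; the page cache adds __GFP_COMP itself for a non-zero order):

	/* today's common pattern: one PAGESIZE page, mask from the mapping */
	static struct page *alloc_small(struct address_space *mapping)
	{
		return __page_cache_alloc(mapping_gfp_mask(mapping), 0);
	}

	/* a THP-sized allocation: same call, just a non-zero order */
	static struct page *alloc_huge(struct address_space *mapping)
	{
		return __page_cache_alloc(mapping_gfp_mask(mapping),
					  HPAGE_PMD_ORDER);
	}

The filesystem never has to reason about when __GFP_COMP is or isn't
appropriate for the mask it already has.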


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-03 15:10         ` Matthew Wilcox
@ 2019-09-03 19:15           ` Michal Hocko
  2019-09-04  3:23             ` William Kucharski
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 19:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue 03-09-19 08:10:15, Matthew Wilcox wrote:
> On Tue, Sep 03, 2019 at 02:51:50PM +0200, Michal Hocko wrote:
> > On Tue 03-09-19 05:22:08, Matthew Wilcox wrote:
> > > On Tue, Sep 03, 2019 at 02:14:24PM +0200, Michal Hocko wrote:
> > > > On Mon 02-09-19 03:23:41, William Kucharski wrote:
> > > > > Add filemap_huge_fault() to attempt to satisfy page
> > > > > faults on memory-mapped read-only text pages using THP when possible.
> > > > 
> > > > This deserves much more description of how the thing is implemented and
> > > > expected to work. For one thing it is not really clear to me why you
> > > > need CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP at all. You need a support
> > > > from the filesystem anyway. So who is going to enable/disable this
> > > > config?
> > > 
> > > There are definitely situations in which enabling this code will crash
> > > the kernel.  But we want to get filesystems to a point where they can
> > > start working on their support for large pages.  So our workaround is
> > > to try to get the core pieces merged under a CONFIG_I_KNOW_WHAT_IM_DOING
> > > flag and let people play with it.  Then continue to work on the core
> > > to eliminate those places that are broken.
> > 
> > I am not sure I understand. Each fs has to opt in to the feature
> > anyway. If it doesn't then there should be no risk of regression, right?
> > I do not expect any fs would rush an implementation in while not being
> > sure about the correctness. So how exactly does a config option help
> > here.
> 
> Filesystems won't see large pages unless they've opted into them.
> But there's a huge amount of page-cache work that needs to get done
> before this can be enabled by default.  For example, truncate() won't
> work properly.
> 
> Rather than try to do all the page cache work upfront, then wait for the
> filesystems to catch up, we want to get some basics merged.  Since we've
> been talking about this for so long without any movement in the kernel
> towards actual support, this felt like a good way to go.
> 
> We could, of course, develop the entire thing out of tree, but that's
> likely to lead to pain and anguish.

Then I would suggest mentioning all this in the changelog so that the
overall intention is clear. It is also up to you fs developers to find a
consensus on how to move forward. I have brought that up mostly because
I really hate seeing new config options added due to a shortage of
confidence in the code. That really smells like working around the
standard code quality inclusion process.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-03 16:28         ` Matthew Wilcox
@ 2019-09-03 19:18           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 19:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner

On Tue 03-09-19 09:28:31, Matthew Wilcox wrote:
> On Tue, Sep 03, 2019 at 02:19:52PM +0200, Michal Hocko wrote:
> > On Tue 03-09-19 05:11:55, Matthew Wilcox wrote:
> > > On Tue, Sep 03, 2019 at 01:57:48PM +0200, Michal Hocko wrote:
> > > > On Mon 02-09-19 03:23:40, William Kucharski wrote:
> > > > > Add an 'order' argument to __page_cache_alloc() and
> > > > > do_read_cache_page(). Ensure the allocated pages are compound pages.
> > > > 
> > > > Why do we need to touch all the existing callers and change them to use
> > > > order 0 when none is actually converted to a different order? This just
> > > > seem to add a lot of code churn without a good reason. If anything I
> > > > would simply add __page_cache_alloc_order and make __page_cache_alloc
> > > > call it with order 0 argument.
> > > 
> > > Patch 2/2 uses a non-zero order.
> > 
> > It is a new caller and it can use a new function right?
> > 
> > > I agree it's a lot of churn without
> > > good reason; that's why I tried to add GFP_ORDER flags a few months ago.
> > > Unfortunately, you didn't like that approach either.
> > 
> > Is there any future plan that all/most __page_cache_alloc will get a
> > non-zero order argument?
> 
> I'm not sure about "most".  It will certainly become more common, as
> far as I can tell.

I would personally still go with the __page_cache_alloc_order way, but this
is up to you and other fs people to decide what suits best. I was just
surprised to see a lot of code churn when the non-zero order is only really
used in the second patch. That's why I brought it up.

> > > > Also is it so much to ask callers to provide __GFP_COMP explicitly?
> > > 
> > > Yes, it's an unreasonable burden on the callers.
> > 
> > Care to explain why? __GFP_COMP tends to be used in the kernel quite
> > extensively.
> 
> Most of the places which call this function get their gfp_t from
> mapping->gfp_mask.  If we only want to allocate a single page, we
> must not set __GFP_COMP.  If we want to allocate a large page, we must
> set __GFP_COMP.  Rather than require individual filesystems to concern
> themselves with this wart of the GFP interface, we can solve it in the
> page cache.

Fair enough.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
  2019-09-03 19:15           ` Michal Hocko
@ 2019-09-04  3:23             ` William Kucharski
  0 siblings, 0 replies; 16+ messages in thread
From: William Kucharski @ 2019-09-04  3:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-fsdevel,
	Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz, Chad Mynhier,
	Kirill A. Shutemov, Johannes Weiner



> On Sep 3, 2019, at 1:15 PM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> Then I would suggest mentioning all this in the changelog so that the
> overall intention is clear. It is also up to you fs developers to find a
> consensus on how to move forward. I have brought that up mostly because
> I really hate seeing new config options added due to shortage of
> confidence in the code. That really smells like working around standard
> code quality inclusion process.

I do mention a good deal of this in the blurb in part [0/2] of the patch,
though I don't cover the readpage/readpages() debate. Ideally readpage()
should do just that: read a page, sized according to the page it is passed,
rather than assume "page" means "PAGESIZE."
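
A minimal sketch of what I mean, assuming a filesystem whose readpage()
is large-page aware (the helper name is made up for illustration):

	/* size the I/O from the page we were handed, not from PAGE_SIZE */
	static inline size_t readpage_len(struct page *page)
	{
		return PAGE_SIZE << compound_order(page);
	}

For a base page this is still PAGESIZE; for a PMD-sized THP it is
HPAGE_PMD_SIZE, so the same readpage() path fills the whole compound page.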

I can also make the "help" text for the option more descriptive if
desired.

Thanks for your comments!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-03 11:57   ` Michal Hocko
  2019-09-03 12:11     ` Matthew Wilcox
@ 2019-09-04  3:30     ` William Kucharski
  2019-09-04  8:28       ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: William Kucharski @ 2019-09-04  3:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-fsdevel, Dave Hansen, Song Liu, Bob Kasten,
	Mike Kravetz, Chad Mynhier, Kirill A. Shutemov, Johannes Weiner,
	Matthew Wilcox



> On Sep 3, 2019, at 5:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Mon 02-09-19 03:23:40, William Kucharski wrote:
>> Add an 'order' argument to __page_cache_alloc() and
>> do_read_cache_page(). Ensure the allocated pages are compound pages.
> 
> Why do we need to touch all the existing callers and change them to use
> order 0 when none is actually converted to a different order? This just
> seem to add a lot of code churn without a good reason. If anything I
> would simply add __page_cache_alloc_order and make __page_cache_alloc
> call it with order 0 argument.

All the EXISTING code in patch [1/2] is changed to call it with an order
of 0, as you would expect.

However, new code in part [2/2] of the patch calls it with an order of
HPAGE_PMD_ORDER, as it seems cleaner to have those routines operate on
a page, regardless of the order of the page desired.

I certainly can change this as you request, but once again the question
is whether "page" should MEAN "page" regardless of the order desired,
or whether the assumption will always be "page" means base PAGESIZE.

Either approach works, but what is the semantic we want going forward?

Thanks again!



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
  2019-09-04  3:30     ` William Kucharski
@ 2019-09-04  8:28       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-09-04  8:28 UTC (permalink / raw)
  To: William Kucharski
  Cc: LKML, linux-mm, linux-fsdevel, Dave Hansen, Song Liu, Bob Kasten,
	Mike Kravetz, Chad Mynhier, Kirill A. Shutemov, Johannes Weiner,
	Matthew Wilcox

On Tue 03-09-19 21:30:30, William Kucharski wrote:
> 
> 
> > On Sep 3, 2019, at 5:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Mon 02-09-19 03:23:40, William Kucharski wrote:
> >> Add an 'order' argument to __page_cache_alloc() and
> >> do_read_cache_page(). Ensure the allocated pages are compound pages.
> > 
> > Why do we need to touch all the existing callers and change them to use
> > order 0 when none is actually converted to a different order? This just
> > seem to add a lot of code churn without a good reason. If anything I
> > would simply add __page_cache_alloc_order and make __page_cache_alloc
> > call it with order 0 argument.
> 
> All the EXISTING code in patch [1/2] is changed to call it with an order
> of 0, as you would expect.
> 
> However, new code in part [2/2] of the patch calls it with an order of
> HPAGE_PMD_ORDER, as it seems cleaner to have those routines operate on
> a page, regardless of the order of the page desired.
> 
> I certainly can change this as you request, but once again the question
> is whether "page" should MEAN "page" regardless of the order desired,
> or whether the assumption will always be "page" means base PAGESIZE.
> 
> Either approach works, but what is the semantic we want going forward?

I do not have anything against handling the page as compound, if that is
the question. All I was interested in was whether adding a new helper to
_allocate_ the compound page wouldn't be easier than touching all
existing __page_cache_alloc users.
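
Something like this untested sketch is what I had in mind, mirroring the
non-NUMA inline from patch 1/2 (the name is only a suggestion):

	static inline struct page *__page_cache_alloc_order(gfp_t gfp,
							    unsigned int order)
	{
		if (order > 0)
			gfp |= __GFP_COMP;
		return alloc_pages(gfp, order);
	}

	static inline struct page *__page_cache_alloc(gfp_t gfp)
	{
		return __page_cache_alloc_order(gfp, 0);
	}

That way only the new huge fault path would need to pass an order at all.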
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-09-04  8:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-02  9:23 [PATCH v5 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
2019-09-02  9:23 ` [PATCH v5 1/2] mm: Allow the page cache to allocate large pages William Kucharski
2019-09-03 11:57   ` Michal Hocko
2019-09-03 12:11     ` Matthew Wilcox
2019-09-03 12:19       ` Michal Hocko
2019-09-03 16:28         ` Matthew Wilcox
2019-09-03 19:18           ` Michal Hocko
2019-09-04  3:30     ` William Kucharski
2019-09-04  8:28       ` Michal Hocko
2019-09-02  9:23 ` [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
2019-09-03 12:14   ` Michal Hocko
2019-09-03 12:22     ` Matthew Wilcox
2019-09-03 12:51       ` Michal Hocko
2019-09-03 15:10         ` Matthew Wilcox
2019-09-03 19:15           ` Michal Hocko
2019-09-04  3:23             ` William Kucharski
