* [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups
@ 2021-08-17  8:03 Hugh Dickins
  2021-08-17  8:06 ` [PATCH 1/9] huge tmpfs: fix fallocate(vanilla) advance over huge pages Hugh Dickins
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

A series of huge tmpfs fixes and cleanups, suitable for 5.15, taken from
the earlier series of 16, "tmpfs: HUGEPAGE and MEM_LOCK fcntls and flags":
https://lore.kernel.org/linux-mm/2862852d-badd-7486-3a8e-c5ea9666d6fb@google.com/

The API additions in that series have not yet been fully agreed; let's
revisit those after the 5.15 merge window.  But please proceed now with
the uncontroversial preliminaries - thanks.

Still diffed against 5.14-rc3: no conflict yet with 5.14-rc6 or
linux-next or mmotm - no problems with the foliage.

Most are exactly as in the series of 16, with a few commit comments
clarified from review; most were already Reviewed-by Yang Shi (many
thanks!), and those tags are added here.

Exceptions: 6/9 introduces SGP_NOALLOC to resolve our review doubts, so
that one and 7/9 are not yet tagged; and 9/9 is now added from a posting
last year - Kirill's "What breaks?" question about shmem_enabled "always"
reminded me that I'm still carrying the i915 THP fix, as corrected by
Matthew.

1/9 huge tmpfs: fix fallocate(vanilla) advance over huge pages
2/9 huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
3/9 huge tmpfs: remove shrinklist addition from shmem_setattr()
4/9 huge tmpfs: revert shmem's use of transhuge_vma_enabled()
5/9 huge tmpfs: move shmem_huge_enabled() upwards
6/9 huge tmpfs: SGP_NOALLOC to stop collapse_file() on race
7/9 huge tmpfs: shmem_is_huge(vma, inode, index)
8/9 huge tmpfs: decide stat.st_blksize by shmem_is_huge()
9/9 shmem: shmem_writepage() split unlikely i915 THP

 include/linux/shmem_fs.h |   23 +++-
 mm/huge_memory.c         |    6 -
 mm/khugepaged.c          |    2 
 mm/shmem.c               |  229 ++++++++++++++++++-----------------------
 4 files changed, 129 insertions(+), 131 deletions(-)

Hugh



* [PATCH 1/9] huge tmpfs: fix fallocate(vanilla) advance over huge pages
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
@ 2021-08-17  8:06 ` Hugh Dickins
  2021-08-17  8:08 ` [PATCH 2/9] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE Hugh Dickins
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

shmem_fallocate() goes to a lot of trouble to leave its newly allocated
pages !Uptodate, partly to identify and undo them on failure, partly to
leave the overhead of clearing them until later.  But the huge page case
did not skip to the end of the extent: it walked through the tail pages
one by one, and appeared to work just fine; yet in doing so it cleared
and Uptodated the huge page, so there was no way to undo it on failure.

And by setting Uptodate too soon, it messed up both its nr_falloced and
nr_unswapped counts, so that the intended "time to give up" heuristic
did not work at all.

Now advance immediately to the end of the huge extent, with a comment on
why this is more than just an optimization.  But although this speeds up
huge tmpfs fallocation, it does leave the clearing until first use, and
some users may have come to appreciate slow fallocate but fast first use:
if they complain, then we can consider adding a pass to clear at the end.
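
For illustration, the path in question is reached by a vanilla (mode 0)
fallocate on a huge mount.  A minimal userspace sketch - mount point and
size chosen arbitrarily, not part of this patch:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* assumes: mount -t tmpfs -o huge=always tmpfs /mnt/huge */
		int fd = open("/mnt/huge/testfile", O_RDWR | O_CREAT, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* mode 0 fallocate: now advances huge extents in one step */
		if (fallocate(fd, 0, 0, 1L << 30) < 0)
			perror("fallocate");
		close(fd);
		return 0;
	}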

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 70d9ce294bb4..0cd5c9156457 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2736,7 +2736,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode->i_private = &shmem_falloc;
 	spin_unlock(&inode->i_lock);
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end; ) {
 		struct page *page;
 
 		/*
@@ -2759,13 +2759,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto undone;
 		}
 
+		index++;
+		/*
+		 * Here is a more important optimization than it appears:
+		 * a second SGP_FALLOC on the same huge page will clear it,
+		 * making it PageUptodate and un-undoable if we fail later.
+		 */
+		if (PageTransCompound(page)) {
+			index = round_up(index, HPAGE_PMD_NR);
+			/* Beware 32-bit wraparound */
+			if (!index)
+				index--;
+		}
+
 		/*
 		 * Inform shmem_writepage() how far we have reached.
 		 * No need for lock or barrier: we have the page lock.
 		 */
-		shmem_falloc.next++;
 		if (!PageUptodate(page))
-			shmem_falloc.nr_falloced++;
+			shmem_falloc.nr_falloced += index - shmem_falloc.next;
+		shmem_falloc.next = index;
 
 		/*
 		 * If !PageUptodate, leave it that way so that freeable pages
-- 
2.26.2




* [PATCH 2/9] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
  2021-08-17  8:06 ` [PATCH 1/9] huge tmpfs: fix fallocate(vanilla) advance over huge pages Hugh Dickins
@ 2021-08-17  8:08 ` Hugh Dickins
  2021-08-17  8:10 ` [PATCH 3/9] huge tmpfs: remove shrinklist addition from shmem_setattr() Hugh Dickins
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

A successful shmem_fallocate() guarantees that the extent has been
reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
split huge pages and free their excess beyond i_size; and by other uses
of split_huge_page() near i_size.

It's sad to add a shmem inode field just for this, but I did not find a
better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
would be cheaper, but I'm averse to unclearable flags.  The fallocend
field is not perfect either (many disjoint ranges might be fallocated),
but good enough; and gains another use later on.
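
For reference, the reservation being protected comes from a call like
this - a userspace sketch, with the 4M size chosen arbitrarily:

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Reserve pages beyond EOF without changing i_size: after this,
	 * info->fallocend records the end index, and split_huge_page()
	 * no longer frees the committed pages as if beyond EOF.
	 */
	int reserve_beyond_eof(int fd)
	{
		return fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 4L << 20);
	}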

Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shmem_fs.h | 13 +++++++++++++
 mm/huge_memory.c         |  6 ++++--
 mm/shmem.c               | 15 ++++++++++++++-
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..9b7f7ac52351 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -18,6 +18,7 @@ struct shmem_inode_info {
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
+	pgoff_t			fallocend;	/* highest fallocate endindex */
 	struct list_head        shrinklist;     /* shrinkable hpage inodes */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
@@ -119,6 +120,18 @@ static inline bool shmem_file(struct file *file)
 	return shmem_mapping(file->f_mapping);
 }
 
+/*
+ * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages
+ * beyond i_size's notion of EOF, which fallocate has committed to reserving:
+ * which split_huge_page() must therefore not delete.  This use of a single
+ * "fallocend" per inode errs on the side of not deleting a reservation when
+ * in doubt: there are plenty of cases when it preserves unreserved pages.
+ */
+static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
+{
+	return max(eof, SHMEM_I(inode)->fallocend);
+}
+
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..890fb73ac89b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2454,11 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
-		/* Some pages can be beyond i_size: drop them from page cache */
+		/* Some pages can be beyond EOF: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
 			__delete_from_page_cache(head + i, NULL);
-			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
+			if (shmem_mapping(head->mapping))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
 		} else if (!PageAnon(page)) {
@@ -2686,6 +2686,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 * head page lock is good enough to serialize the trimming.
 		 */
 		end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+		if (shmem_mapping(mapping))
+			end = shmem_fallocend(mapping->host, end);
 	}
 
 	/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 0cd5c9156457..24c9da6b41c2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -905,6 +905,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
 
+	if (info->fallocend > start && info->fallocend <= end && !unfalloc)
+		info->fallocend = start;
+
 	pagevec_init(&pvec);
 	index = start;
 	while (index < end && find_lock_entries(mapping, index, end - 1,
@@ -2667,7 +2670,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
-	pgoff_t start, index, end;
+	pgoff_t start, index, end, undo_fallocend;
 	int error;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2736,6 +2739,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode->i_private = &shmem_falloc;
 	spin_unlock(&inode->i_lock);
 
+	/*
+	 * info->fallocend is only relevant when huge pages might be
+	 * involved: to prevent split_huge_page() freeing fallocated
+	 * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size.
+	 */
+	undo_fallocend = info->fallocend;
+	if (info->fallocend < end)
+		info->fallocend = end;
+
 	for (index = start; index < end; ) {
 		struct page *page;
 
@@ -2750,6 +2762,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		else
 			error = shmem_getpage(inode, index, &page, SGP_FALLOC);
 		if (error) {
+			info->fallocend = undo_fallocend;
 			/* Remove the !PageUptodate pages we added */
 			if (index > start) {
 				shmem_undo_range(inode,
-- 
2.26.2




* [PATCH 3/9] huge tmpfs: remove shrinklist addition from shmem_setattr()
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
  2021-08-17  8:06 ` [PATCH 1/9] huge tmpfs: fix fallocate(vanilla) advance over huge pages Hugh Dickins
  2021-08-17  8:08 ` [PATCH 2/9] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE Hugh Dickins
@ 2021-08-17  8:10 ` Hugh Dickins
  2021-08-17  8:12 ` [PATCH 4/9] huge tmpfs: revert shmem's use of transhuge_vma_enabled() Hugh Dickins
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

There's a block of code in shmem_setattr() to add the inode to
shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
from before 5.7 changed truncation to do split_huge_page() for itself,
and should have been removed at that time.

I am overstating that: split_huge_page() can fail (notably if there's
an extra reference to the page at that time), so there might be value in
retrying.  But there were already retries as truncation worked through
the tails, and this addition risks repeating unsuccessful retries
indefinitely: I'd rather remove it now, and work on reducing the
chance of split_huge_page() failures separately, if we need to.

Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 24c9da6b41c2..ce3ccaac54d6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1061,7 +1061,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = d_inode(dentry);
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	int error;
 
 	error = setattr_prepare(&init_user_ns, dentry, attr);
@@ -1097,24 +1096,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 			if (oldsize > holebegin)
 				unmap_mapping_range(inode->i_mapping,
 							holebegin, 0, 1);
-
-			/*
-			 * Part of the huge page can be beyond i_size: subject
-			 * to shrink under memory pressure.
-			 */
-			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-				spin_lock(&sbinfo->shrinklist_lock);
-				/*
-				 * _careful to defend against unlocked access to
-				 * ->shrink_list in shmem_unused_huge_shrink()
-				 */
-				if (list_empty_careful(&info->shrinklist)) {
-					list_add_tail(&info->shrinklist,
-							&sbinfo->shrinklist);
-					sbinfo->shrinklist_len++;
-				}
-				spin_unlock(&sbinfo->shrinklist_lock);
-			}
 		}
 	}
 
-- 
2.26.2




* [PATCH 4/9] huge tmpfs: revert shmem's use of transhuge_vma_enabled()
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (2 preceding siblings ...)
  2021-08-17  8:10 ` [PATCH 3/9] huge tmpfs: remove shrinklist addition from shmem_setattr() Hugh Dickins
@ 2021-08-17  8:12 ` Hugh Dickins
  2021-08-17  8:14 ` [PATCH 5/9] huge tmpfs: move shmem_huge_enabled() upwards Hugh Dickins
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
as a wrapper for two very different checks: whether the app has marked
its address range not to use THPs, and whether the app is running in a
hierarchy that has been marked never to use THPs.  shmem_huge_enabled()
prefers to show those two checks explicitly, as before.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ce3ccaac54d6..c6fa6f4f2db8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4003,7 +4003,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
 	loff_t i_size;
 	pgoff_t off;
 
-	if (!transhuge_vma_enabled(vma, vma->vm_flags))
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		return false;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
-- 
2.26.2




* [PATCH 5/9] huge tmpfs: move shmem_huge_enabled() upwards
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (3 preceding siblings ...)
  2021-08-17  8:12 ` [PATCH 4/9] huge tmpfs: revert shmem's use of transhuge_vma_enabled() Hugh Dickins
@ 2021-08-17  8:14 ` Hugh Dickins
  2021-08-17  8:17 ` [PATCH 6/9] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race Hugh Dickins
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

shmem_huge_enabled() is about to be enhanced into shmem_is_huge(),
so that it can be used more widely throughout: before making functional
changes, shift it to its final position (to avoid forward declaration).

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 72 ++++++++++++++++++++++++++----------------------------
 1 file changed, 35 insertions(+), 37 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c6fa6f4f2db8..740d48ef1eb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,41 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly;
 
+bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	loff_t i_size;
+	pgoff_t off;
+
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return false;
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return true;
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return false;
+	switch (sbinfo->huge) {
+	case SHMEM_HUGE_NEVER:
+		return false;
+	case SHMEM_HUGE_ALWAYS:
+		return true;
+	case SHMEM_HUGE_WITHIN_SIZE:
+		off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >= HPAGE_PMD_SIZE &&
+				i_size >> PAGE_SHIFT >= off)
+			return true;
+		fallthrough;
+	case SHMEM_HUGE_ADVISE:
+		/* TODO: implement fadvise() hints */
+		return (vma->vm_flags & VM_HUGEPAGE);
+	default:
+		VM_BUG_ON(1);
+		return false;
+	}
+}
+
 #if defined(CONFIG_SYSFS)
 static int shmem_parse_huge(const char *str)
 {
@@ -3995,43 +4030,6 @@ struct kobj_attribute shmem_enabled_attr =
 	__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-bool shmem_huge_enabled(struct vm_area_struct *vma)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	loff_t i_size;
-	pgoff_t off;
-
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		return true;
-	if (shmem_huge == SHMEM_HUGE_DENY)
-		return false;
-	switch (sbinfo->huge) {
-		case SHMEM_HUGE_NEVER:
-			return false;
-		case SHMEM_HUGE_ALWAYS:
-			return true;
-		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
-				return true;
-			fallthrough;
-		case SHMEM_HUGE_ADVISE:
-			/* TODO: implement fadvise() hints */
-			return (vma->vm_flags & VM_HUGEPAGE);
-		default:
-			VM_BUG_ON(1);
-			return false;
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
2.26.2




* [PATCH 6/9] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (4 preceding siblings ...)
  2021-08-17  8:14 ` [PATCH 5/9] huge tmpfs: move shmem_huge_enabled() upwards Hugh Dickins
@ 2021-08-17  8:17 ` Hugh Dickins
  2021-08-17 20:14   ` Yang Shi
  2021-08-17  8:19 ` [PATCH 7/9] huge tmpfs: shmem_is_huge(vma, inode, index) Hugh Dickins
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
shmem_getpage() not to try allocating a huge page, in the very unlikely
event that a racing hole-punch removes the swapped or fallocated page as
soon as i_pages lock is dropped.

We want to consolidate shmem's huge decisions, removing SGP_HUGE and
SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
the protection in this case - Yang Shi points out that the huge page
would remain indefinitely, charged to root instead of the intended memcg.

collapse_file() should not even allocate a small page in this case: why
proceed if someone is punching a hole?  SGP_READ is almost the right flag
here, except that it optimizes away from a fallocated page, with NULL to
tell caller to fill with zeroes (like a hole); whereas collapse_file()'s
sequence relies on using a cache page.  Add SGP_NOALLOC just for this.

There are too many consecutive "if (page"s there in shmem_getpage_gfp():
group them better; and fix the outdated "bring it back from swap" comment.
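
The resulting contract, seen from the caller's side - a kernel-context
sketch, with a wrapper name made up just for illustration:

	/* hypothetical wrapper, to spell out what SGP_NOALLOC promises */
	static int shmem_get_existing_page(struct inode *inode, pgoff_t index,
					   struct page **pagep)
	{
		/*
		 * 0: *pagep is the locked cache page - swapped in, or the
		 *    fallocated page (cleared and made Uptodate on the way).
		 * -ENOENT: hole, or racing hole-punch: *pagep is NULL, and
		 *    collapse_file() fails the scan instead of allocating.
		 */
		return shmem_getpage(inode, index, pagep, SGP_NOALLOC);
	}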

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |  1 +
 mm/khugepaged.c          |  2 +-
 mm/shmem.c               | 29 +++++++++++++++++------------
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9b7f7ac52351..7d97b15a2f7a 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -94,6 +94,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_NOALLOC,	/* similar, but fail on hole or use fallocated page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_NOHUGE,	/* like SGP_CACHE, but no huge pages */
 	SGP_HUGE,	/* like SGP_CACHE, huge pages preferred */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..045cc579f724 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
 				xas_unlock_irq(&xas);
 				/* swap in or instantiate fallocated page */
 				if (shmem_getpage(mapping->host, index, &page,
-						  SGP_NOHUGE)) {
+						  SGP_NOALLOC)) {
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
diff --git a/mm/shmem.c b/mm/shmem.c
index 740d48ef1eb5..226ac3a911e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1871,26 +1871,31 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return error;
 	}
 
-	if (page)
+	if (page) {
 		hindex = page->index;
-	if (page && sgp == SGP_WRITE)
-		mark_page_accessed(page);
-
-	/* fallocated page? */
-	if (page && !PageUptodate(page)) {
+		if (sgp == SGP_WRITE)
+			mark_page_accessed(page);
+		if (PageUptodate(page))
+			goto out;
+		/* fallocated page */
 		if (sgp != SGP_READ)
 			goto clear;
 		unlock_page(page);
 		put_page(page);
-		page = NULL;
-		hindex = index;
 	}
-	if (page || sgp == SGP_READ)
-		goto out;
 
 	/*
-	 * Fast cache lookup did not find it:
-	 * bring it back from swap or allocate.
+	 * SGP_READ: succeed on hole, with NULL page, letting caller zero.
+	 * SGP_NOALLOC: fail on hole, with NULL page, letting caller fail.
+	 */
+	*pagep = NULL;
+	if (sgp == SGP_READ)
+		return 0;
+	if (sgp == SGP_NOALLOC)
+		return -ENOENT;
+
+	/*
+	 * Fast cache lookup and swap lookup did not find it: allocate.
 	 */
 
 	if (vma && userfaultfd_missing(vma)) {
-- 
2.26.2




* [PATCH 7/9] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (5 preceding siblings ...)
  2021-08-17  8:17 ` [PATCH 6/9] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race Hugh Dickins
@ 2021-08-17  8:19 ` Hugh Dickins
  2021-08-17 20:16   ` Yang Shi
  2021-08-17  8:22 ` [PATCH 8/9] huge tmpfs: decide stat.st_blksize by shmem_is_huge() Hugh Dickins
  2021-08-17  8:28 ` [PATCH 9/9] shmem: shmem_writepage() split unlikely i915 THP Hugh Dickins
  8 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
that a consistent set of checks can be applied, even when the inode is
accessed through read/write syscalls (with NULL vma) instead of mmaps
(the index argument is seldom of interest, but required by mount option
"huge=within_size").  Clean up and rearrange the checks a little.

This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes.

Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
else needs that symlink check, so leave it there in shmem_getpage_gfp().
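
A syscall path can now ask the same question with no vma at all, as 8/9
goes on to do for stat():

	/* from 8/9: the syscall view, with NULL vma and index 0 */
	if (shmem_is_huge(NULL, inode, 0))
		stat->blksize = HPAGE_PMD_SIZE;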

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |  9 +++--
 mm/shmem.c               | 84 ++++++++++++----------------------------
 2 files changed, 31 insertions(+), 62 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 7d97b15a2f7a..60c6e4eac275 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -86,7 +86,12 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(unsigned int type, bool frontswap,
 		       unsigned long *fs_pages_to_unuse);
 
-extern bool shmem_huge_enabled(struct vm_area_struct *vma);
+extern bool shmem_is_huge(struct vm_area_struct *vma,
+			  struct inode *inode, pgoff_t index);
+static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+	return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
+}
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end);
@@ -96,8 +101,6 @@ enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_NOALLOC,	/* similar, but fail on hole or use fallocated page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_NOHUGE,	/* like SGP_CACHE, but no huge pages */
-	SGP_HUGE,	/* like SGP_CACHE, huge pages preferred */
 	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
 	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
 };
diff --git a/mm/shmem.c b/mm/shmem.c
index 226ac3a911e9..56ee56b1cab6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -474,39 +474,35 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
-static int shmem_huge __read_mostly;
+static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-bool shmem_huge_enabled(struct vm_area_struct *vma)
+bool shmem_is_huge(struct vm_area_struct *vma,
+		   struct inode *inode, pgoff_t index)
 {
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	loff_t i_size;
-	pgoff_t off;
 
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		return true;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return false;
-	switch (sbinfo->huge) {
-	case SHMEM_HUGE_NEVER:
+	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return true;
+
+	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
 		return true;
 	case SHMEM_HUGE_WITHIN_SIZE:
-		off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+		index = round_up(index, HPAGE_PMD_NR);
 		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >= HPAGE_PMD_SIZE &&
-				i_size >> PAGE_SHIFT >= off)
+		if (i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index)
 			return true;
 		fallthrough;
 	case SHMEM_HUGE_ADVISE:
-		/* TODO: implement fadvise() hints */
-		return (vma->vm_flags & VM_HUGEPAGE);
+		if (vma && (vma->vm_flags & VM_HUGEPAGE))
+			return true;
+		fallthrough;
 	default:
-		VM_BUG_ON(1);
 		return false;
 	}
 }
@@ -680,6 +676,12 @@ static long shmem_unused_huge_count(struct super_block *sb,
 
 #define shmem_huge SHMEM_HUGE_DENY
 
+bool shmem_is_huge(struct vm_area_struct *vma,
+		   struct inode *inode, pgoff_t index)
+{
+	return false;
+}
+
 static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 		struct shrink_control *sc, unsigned long nr_to_split)
 {
@@ -1829,7 +1831,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct shmem_sb_info *sbinfo;
 	struct mm_struct *charge_mm;
 	struct page *page;
-	enum sgp_type sgp_huge = sgp;
 	pgoff_t hindex = index;
 	gfp_t huge_gfp;
 	int error;
@@ -1838,8 +1839,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return -EFBIG;
-	if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
-		sgp = SGP_CACHE;
 repeat:
 	if (sgp <= SGP_CACHE &&
 	    ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
@@ -1903,36 +1902,12 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	/* shmem_symlink() */
-	if (!shmem_mapping(mapping))
-		goto alloc_nohuge;
-	if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
+	/* Never use a huge page for shmem_symlink() */
+	if (S_ISLNK(inode->i_mode))
 		goto alloc_nohuge;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		goto alloc_huge;
-	switch (sbinfo->huge) {
-	case SHMEM_HUGE_NEVER:
+	if (!shmem_is_huge(vma, inode, index))
 		goto alloc_nohuge;
-	case SHMEM_HUGE_WITHIN_SIZE: {
-		loff_t i_size;
-		pgoff_t off;
-
-		off = round_up(index, HPAGE_PMD_NR);
-		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >= HPAGE_PMD_SIZE &&
-		    i_size >> PAGE_SHIFT >= off)
-			goto alloc_huge;
 
-		fallthrough;
-	}
-	case SHMEM_HUGE_ADVISE:
-		if (sgp_huge == SGP_HUGE)
-			goto alloc_huge;
-		/* TODO: implement fadvise() hints */
-		goto alloc_nohuge;
-	}
-
-alloc_huge:
 	huge_gfp = vma_thp_gfp_mask(vma);
 	huge_gfp = limit_gfp_mask(huge_gfp, gfp);
 	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
@@ -2088,7 +2063,6 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	enum sgp_type sgp;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2151,15 +2125,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 		spin_unlock(&inode->i_lock);
 	}
 
-	sgp = SGP_CACHE;
-
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		sgp = SGP_NOHUGE;
-	else if (vma->vm_flags & VM_HUGEPAGE)
-		sgp = SGP_HUGE;
-
-	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
+	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
@@ -3966,7 +3932,7 @@ int __init shmem_init(void)
 	if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
 	else
-		shmem_huge = 0; /* just in case it was patched */
+		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
 	return 0;
 
-- 
2.26.2




* [PATCH 8/9] huge tmpfs: decide stat.st_blksize by shmem_is_huge()
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (6 preceding siblings ...)
  2021-08-17  8:19 ` [PATCH 7/9] huge tmpfs: shmem_is_huge(vma, inode, index) Hugh Dickins
@ 2021-08-17  8:22 ` Hugh Dickins
  2021-08-17  8:28 ` [PATCH 9/9] shmem: shmem_writepage() split unlikely i915 THP Hugh Dickins
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, linux-kernel, linux-mm

4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge
page size if THP is on") added is_huge_enabled() to decide st_blksize:
if hugeness is to be defined per file, that will need to be replaced by
shmem_is_huge().

This does give a different answer (No) for small files on a
"huge=within_size" mount: but that can be considered a minor bugfix.
And a different answer (No) for default files on a "huge=advise" mount:
I'm reluctant to complicate it, just to reproduce the same debatable
answer as before.
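
An illustrative userspace check - file name arbitrary; expect 2097152
(HPAGE_PMD_SIZE on x86_64) when hugeness is decided, 4096 otherwise:

	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;

		/* a file on a huge=always tmpfs mount */
		if (stat("/mnt/huge/testfile", &st) == 0)
			printf("st_blksize = %ld\n", (long)st.st_blksize);
		return 0;
	}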

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 56ee56b1cab6..b60a7abff27d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -689,15 +689,6 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
-{
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-	    (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) &&
-	    shmem_huge != SHMEM_HUGE_DENY)
-		return true;
-	return false;
-}
-
 /*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
@@ -1078,7 +1069,6 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = path->dentry->d_inode;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct shmem_sb_info *sb_info = SHMEM_SB(inode->i_sb);
 
 	if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
 		spin_lock_irq(&info->lock);
@@ -1087,7 +1077,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
 	}
 	generic_fillattr(&init_user_ns, inode, stat);
 
-	if (is_huge_enabled(sb_info))
+	if (shmem_is_huge(NULL, inode, 0))
 		stat->blksize = HPAGE_PMD_SIZE;
 
 	return 0;
-- 
2.26.2




* [PATCH 9/9] shmem: shmem_writepage() split unlikely i915 THP
  2021-08-17  8:03 [PATCH 0/9] huge tmpfs: shmem_is_huge() fixes and cleanups Hugh Dickins
                   ` (7 preceding siblings ...)
  2021-08-17  8:22 ` [PATCH 8/9] huge tmpfs: decide stat.st_blksize by shmem_is_huge() Hugh Dickins
@ 2021-08-17  8:28 ` Hugh Dickins
  8 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2021-08-17  8:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Matthew Wilcox, Chris Wilson, linux-kernel, linux-mm

drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
which calls shmem_writepage() from a shrinker: that usually works well
enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
set to "always" (intended to be usable) or "force" (forces huge everywhere
for easy testing), shmem_writepage() is surprised to be called with a huge
page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find out
where the crash happens when CONFIG_DEBUG_VM is off).
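
For context, i915's shmem_writeback() drives that path roughly as below
- a condensed excerpt-style sketch, not verbatim from i915_gem_shmem.c:

	/* condensed: lock each page of the object, push it to swap */
	for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
		struct page *page = find_lock_page(mapping, i);

		if (!page)
			continue;
		/* with shmem_enabled "always", this may be a huge page */
		if (!page_mapped(page) && clear_page_dirty_for_io(page) &&
		    mapping->a_ops->writepage(page, &wbc) == 0) {
			/* ->writepage (shmem_writepage) unlocked the page */
			put_page(page);
			continue;
		}
		unlock_page(page);
		put_page(page);
	}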

LRU page reclaim always splits the shmem huge page first: I'd prefer not
to demand that of i915, so check and split compound in shmem_writepage().

Patch history: this was first sent last year:
http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
https://lore.kernel.org/linux-mm/20200919042009.bomzxmrg7%25akpm@linux-foundation.org/
Matthew Wilcox then noticed that tail pages were wrongly left clean.  This
version brackets the split with Set and Clear PageDirty as he suggested,
which works very well, even if it falls short of our aspirations.  And
recently I realized that the crash is not limited to the testing option
"force", but affects "always" too, which is more important to fix.

Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
---
 mm/shmem.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index b60a7abff27d..a1ba03f39eaa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1349,7 +1349,19 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
+	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page)) {
+		/* Ensure the subpages are still dirty */
+		SetPageDirty(page);
+		if (split_huge_page(page) < 0)
+			goto redirty;
+		ClearPageDirty(page);
+	}
+
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;
-- 
2.26.2




* Re: [PATCH 6/9] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race
  2021-08-17  8:17 ` [PATCH 6/9] huge tmpfs: SGP_NOALLOC to stop collapse_file() on race Hugh Dickins
@ 2021-08-17 20:14   ` Yang Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2021-08-17 20:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Matthew Wilcox,
	Linux Kernel Mailing List, Linux MM

On Tue, Aug 17, 2021 at 1:17 AM Hugh Dickins <hughd@google.com> wrote:
>
> khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
> shmem_getpage() not to try allocating a huge page, in the very unlikely
> event that a racing hole-punch removes the swapped or fallocated page as
> soon as i_pages lock is dropped.
>
> We want to consolidate shmem's huge decisions, removing SGP_HUGE and
> SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
> the protection in this case - Yang Shi points out that the huge page
> would remain indefinitely, charged to root instead of the intended memcg.
>
> collapse_file() should not even allocate a small page in this case: why
> proceed if someone is punching a hole?  SGP_READ is almost the right flag
> here, except that it optimizes away from a fallocated page, with NULL to
> tell caller to fill with zeroes (like a hole); whereas collapse_file()'s
> sequence relies on using a cache page.  Add SGP_NOALLOC just for this.
>
> There are too many consecutive "if (page"s there in shmem_getpage_gfp():
> group it better; and fix the outdated "bring it back from swap" comment.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>


* Re: [PATCH 7/9] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-17  8:19 ` [PATCH 7/9] huge tmpfs: shmem_is_huge(vma, inode, index) Hugh Dickins
@ 2021-08-17 20:16   ` Yang Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2021-08-17 20:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Matthew Wilcox,
	Linux Kernel Mailing List, Linux MM

On Tue, Aug 17, 2021 at 1:19 AM Hugh Dickins <hughd@google.com> wrote:
>
> Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> that a consistent set of checks can be applied, even when the inode is
> accessed through read/write syscalls (with NULL vma) instead of mmaps
> (the index argument is seldom of interest, but required by mount option
> "huge=within_size").  Clean up and rearrange the checks a little.
>
> This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes.
>
> Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
> obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
> else needs that symlink check, so leave it there in shmem_getpage_gfp().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>


